This guide shows how to read data from a data source (a flat file) and write it to a column family in Cassandra using a graphical tool. By the end of this guide you should understand how data can be read from many different data sources and written to Cassandra. The sample data describes the flow of visitors through a web site.
In order to follow along with this how-to guide you will need the following:
A single-node local cluster is sufficient for these exercises but a larger and/or remote configuration will work as well. You will need to know the address and port that Cassandra is running on and have a user id and password for the server (if applicable).
These guides were developed using the Apache Cassandra distribution version 1.0.3. You can find Apache Cassandra downloads here: http://cassandra.apache.org/download/
A desktop installation of the Kettle design tool called 'Spoon'. Download here.
The sample data file for this guide is called page_successions.txt.zip. Unzip it to obtain page_successions.txt.
Start Cassandra if it is not running.
Using the Cassandra command line interface (CLI), create a keyspace to use for this exercise.
- To start the Cassandra CLI, at a command line in the Cassandra home directory type:
- Once the Cassandra CLI has started type:
create keyspace Demo;
- Start Spoon on your desktop. Once it is running choose 'File' -> 'New' -> 'Transformation' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Transformation' option.
A completed version of this Kettle transformation, populate_cassandra_page_successions.ktr, is also available for download.
- Add a Text File Input Step: We are going to read data from a text file, so expand the 'Input' section of the Design palette and drag a 'Text file input' step onto the transformation canvas.
Notice that there are many other inputs that we could have used, such as databases (including Hive), applications, and specific file formats. Under the Hadoop section there are other inputs, including HDFS, HBase, and MapReduce.
- Select the file: Double-click on the 'Text file input' step to edit its properties. Click on the 'Browse' button on the right side of the dialog to select a file. Locate the page_successions.txt file. Click on the 'Add' button to add the file to the selected files list. The dialog should look something like this:
- Create Data Fields: Click on the 'Fields' tab. Then click the 'Get Fields' button. Click 'OK' to sample 100 lines. You will see the 'Scan results' window. When you close the 'Scan results' window you will see the fields filled in for you:
- Preview Data: Click on the 'Preview Rows' button and accept 1000 as the number of rows to preview. You will see a table of preview data read from the text file:
- Add a Cassandra Output Step: Close the preview window and click on 'OK' on the 'Text file input' window. On the design palette expand the 'Big Data' section and drag a 'Cassandra Output' step onto the transformation canvas. Your canvas should look like this:
- Connect the Input and Output Steps: Hover the mouse over the 'Text file input' step and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Cassandra Output' step. Your canvas should look like this:
- Edit the Cassandra Output Step: Double-click on the 'Cassandra Output' step to edit its properties. Enter this information:
- Cassandra host, Cassandra port, Username and Password: the connection information for your Cassandra installation.
- Keyspace: The name of the keyspace you created in step 2 above – 'Demo'.
- Column family (table): Enter 'PageSuccessions'
- Incoming field to use as the key: Click on the 'Get Fields' button to populate the drop-down list, then choose the field 'key' from the list.
- Create column family: Checked. This will create the column family if it does not exist.
- Truncate column family: Checked. This will empty the PageSuccessions column family before adding the incoming data.
- Update column family meta data: Checked. This will make the column family metadata consistent with the fields of data.
When you are done your 'Cassandra Output' window should look like this (your connection information may be different):
Click 'OK' to close the window.
- Save the Transformation: Choose 'File' -> 'Save as...' from the menu system. Save the transformation as 'populate_cassandra_page_successions.ktr' into a folder of your choice.
- Run the Transformation: Choose 'Action' -> 'Run' from the menu system or click on the green run button on the transformation toolbar. An 'Execute a transformation' window will open. Click on the 'Launch' button. An 'Execution Results' panel will open at the bottom of the Spoon window and show you the progress of the transformation as it runs. After a few seconds the transformation should finish successfully:
If any errors occurred, the transformation step that failed will be highlighted in red, and you can use the 'Logging' tab to view the error messages.
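For readers who find code easier to follow than a canvas, the transformation built above can be approximated in a few lines of Python. This is only a conceptual sketch, not how Kettle or Cassandra work internally: the keyspace is modelled as a plain dictionary, the function names `read_rows` and `write_rows` are invented for illustration, and the tab delimiter, the key values, and the second data row are assumptions rather than details taken from the sample file.

```python
import csv
from itertools import islice


def read_rows(lines, delimiter="\t", limit=None):
    """Parse delimited text (header line first) into dicts, like a file input step."""
    return list(islice(csv.DictReader(lines, delimiter=delimiter), limit))


def write_rows(keyspace, column_family, rows, key_field, create=True, truncate=True):
    """Toy model of the output step's 'create' and 'truncate' options.

    `keyspace` is a plain dict standing in for a Cassandra keyspace:
    {column_family_name: {row_key: {column_name: value}}}.
    """
    if column_family not in keyspace:
        if not create:
            raise KeyError(f"column family {column_family!r} does not exist")
        keyspace[column_family] = {}
    table = keyspace[column_family]
    if truncate:
        table.clear()  # empty the column family before loading new rows
    for row in rows:
        key = row[key_field]
        # The key field identifies the row; the remaining fields become columns.
        columns = {name: value for name, value in row.items() if name != key_field}
        table.setdefault(key, {}).update(columns)
    return table


# Hypothetical sample: only the /about -> /events count of 12 appears in this guide.
sample = [
    "key\turl\tnextUrl\tCount",
    "row1\t/about\t/events\t12",
    "row2\t/about\t/contact\t3",
]

# A pre-existing row shows the effect of the truncate option.
demo = {"PageSuccessions": {"stale": {"url": "/old"}}}
table = write_rows(demo, "PageSuccessions", read_rows(sample), key_field="key")
print(table["row1"])
```

With the truncate option on, the pre-existing `stale` row is gone after the load, which mirrors what the 'Truncate column family' checkbox is described as doing.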
- Using the Cassandra CLI, type:
You should see a result like this:
=> (column=Count, value=12, timestamp=1325632014571099)
=> (column=nextUrl, value=/events, timestamp=1325632014571101)
=> (column=url, value=/about, timestamp=1325632014571100)
Returned 3 results.
Elapsed time: 4 msec(s).
This data tells you that the number of times visitors went from the About page to the Events page was 12 during the timeframe in the data.
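If you want to check a result like this programmatically, the CLI output can be parsed into a dictionary. The following is a minimal sketch that assumes the output keeps the `=> (column=..., value=..., timestamp=...)` shape shown above; the function name `parse_cli_row` is made up for this example.

```python
import re

# Matches one column line of the CLI result shown above.
COLUMN_LINE = re.compile(
    r"=> \(column=(?P<name>[^,]+), value=(?P<value>.*), timestamp=(?P<ts>\d+)\)"
)


def parse_cli_row(output):
    """Collect column name/value pairs from CLI output, ignoring other lines."""
    columns = {}
    for line in output.splitlines():
        match = COLUMN_LINE.search(line)
        if match:
            columns[match.group("name")] = match.group("value")
    return columns


output = """=> (column=Count, value=12, timestamp=1325632014571099)
=> (column=nextUrl, value=/events, timestamp=1325632014571101)
=> (column=url, value=/about, timestamp=1325632014571100)
Returned 3 results.
Elapsed time: 4 msec(s)."""

row = parse_cli_row(output)
summary = f"{row['url']} -> {row['nextUrl']}: {int(row['Count'])} visits"
print(summary)
```

The values come back as strings, so the count is converted with `int` before it is used as a number.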
In this guide you learned how to populate a Cassandra column family using Kettle's graphical design tool. You can use this tool to load data into Cassandra from many data sources.
Other guides in this series cover how to sort and group Cassandra data, create reports, and combine data from Cassandra with data from other sources.