This guide shows how to read data from a data source (a flat file) and write it to a collection in MongoDB. By the end you should understand how data can be read from many different data sources and written to MongoDB. The sample data describes the flow of visitors to a web site.
In order to follow along with this how-to guide you will need the following:
A single-node local cluster is sufficient for these exercises but a larger and/or remote configuration will work as well. You will need to know the address and port that MongoDB is running on and have a user id and password for the server (if applicable).
These guides were developed using MongoDB version 2.0.2. You can find MongoDB downloads here: http://www.mongodb.org/downloads
Pentaho Data Integration
A desktop installation of the Kettle design tool called 'Spoon'. Download here.
The sample data file for this guide is page_successions.txt.zip.
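The archive has to be unpacked before PDI can read the text file inside it. Any unzip tool will do; as a minimal sketch, the Python standard library can extract it and show the first few lines so you know what the file looks like. The helper names `extract_sample` and `head` are made up for this illustration, and the sketch assumes the archive sits in the current directory.

```python
import zipfile
from itertools import islice
from pathlib import Path


def extract_sample(zip_path, dest_dir="."):
    """Extract every member of the archive and return the extracted paths."""
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return [dest / name for name in archive.namelist()]


def head(path, n=5):
    """Return the first n lines of a text file without reading the whole file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]
```

For example, `head(extract_sample("page_successions.txt.zip")[0])` would return the first five lines of the extracted file.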
Start MongoDB if it is not running.
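If you are not sure whether the server is up, a quick TCP check against its address and port can tell you before you start building the transformation. The sketch below is only an illustration: `mongo_is_listening` is a hypothetical helper, and it assumes the default MongoDB port 27017 unless you pass another one. A successful connection only shows that something is listening on that port; it does not validate your user id or password.

```python
import socket


def mongo_is_listening(host="localhost", port=27017, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False


if __name__ == "__main__":
    print("MongoDB reachable:", mongo_is_listening())
```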
Create a Data Transformation
Start PDI on your desktop. Once it is running choose 'File' -> 'New' -> 'Transformation' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Transformation' option.
You can also download the completed Kettle transformation, populate_mongodb_page_successions.ktr.
- Add a Text File Input Step: We are going to read data from a text file, so expand the 'Input' section of the Design palette and drag a 'Text file input' step onto the transformation canvas.
Notice that there are many other inputs that we could have used, such as a database (including Hive), applications, and specific file formats. Under the 'Big Data' section there are other inputs, including Cassandra, HBase, and MapReduce.
- Select the File: Double-click on the 'Text file input' step to edit its properties. Click on the 'Browse' button on the right side of the dialog to select a file. Locate the page_successions.txt file. Click on the 'Add' button to add the file to the selected files list. The dialog should look something like this:
- Create Data Fields: Click on the 'Fields' tab. Then click the 'Get Fields' button. Click 'OK' to sample 100 lines. You will see the 'Scan results' window. When you close the 'Scan results' window you will see the fields filled in for you:
- Preview Data: Click on the 'Preview Rows' button and accept 1000 as the number of rows to preview. You will see a table of preview data read from the text file:
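To get a feel for what sampling the file for fields involves, the idea can be approximated outside PDI: read a limited number of rows and pick, for each column, the narrowest type that fits every sampled value. This is a rough sketch and not PDI's actual algorithm. It assumes a tab-delimited file with a header row, which this guide does not state, and the column names in the example are invented for illustration.

```python
import csv
import io
from itertools import islice


def guess_type(values):
    """Pick the narrowest type name that fits every non-empty sample value."""
    present = [v for v in values if v != ""]
    if not present:
        return "String"
    for name, cast in (("Integer", int), ("Number", float)):
        try:
            for v in present:
                cast(v)
            return name
        except ValueError:
            continue
    return "String"


def sample_fields(handle, delimiter="\t", sample_size=100):
    """Read a header row plus up to sample_size rows and infer field types."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    rows = list(islice(reader, sample_size))
    fields = []
    for i, name in enumerate(header):
        column = [row[i] for row in rows if i < len(row)]
        fields.append((name, guess_type(column)))
    return fields


# Hypothetical two-column sample; the real file layout may differ.
demo = io.StringIO("url\tviews\n/home\t3\n/about\t7\n")
print(sample_fields(demo))  # [('url', 'String'), ('views', 'Integer')]
```

Sampling keeps the scan fast on large files, at the cost of possibly guessing a type that a later row does not fit, which is why it is worth previewing the data before running the transformation.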
- Add a MongoDB Output Step: Close the preview window and click on 'OK' on the 'Text file input' window. On the design palette expand the 'Big Data' section and drag a 'MongoDb Output' step onto the transformation canvas. Your canvas should look like this:
- Connect the Input and Output Steps: Hover the mouse over the 'Text file input' step and a tooltip will appear.
Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'MongoDb Output' step. Your canvas should look like this:
- Edit the MongoDB Output Step: Double-click on the 'MongoDb Output' step to edit its properties. Enter this information on the 'Configure Connection' tab:
- Host, Port, Username, and Password: the connection information for your MongoDB installation.
- Database: 'Demo'
- Collection: 'PageSuccessions'
- Truncate collection: Checked. This will empty the PageSuccessions collection before adding the incoming data.
On the 'Mongo document fields' tab click on the 'Get Fields' button to populate the table.
On the 'Create/drop indexes' tab, specify that we want to create an index on the 'url' field.
Click 'OK' to close the window.
- Save the Transformation: Choose 'File' -> 'Save as...' from the menu system. Save the transformation as 'populate_mongodb_page_successions.ktr' into a folder of your choice.
- Run the Transformation: Choose 'Action' -> 'Run' from the menu system or click on the green run button on the transformation toolbar. An 'Execute a transformation' window will open. Click on the 'Launch' button. An 'Execution Results' panel will open at the bottom of the PDI window and it will show you the progress of the transformation as it runs. After a few seconds the transformation should finish successfully:
If any errors occurred the transformation step that failed will be highlighted in red and you can use the 'Logging' tab to view error messages.
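For readers who want to see what the transformation amounts to in code, the steps above can be sketched in Python: empty the collection, insert one document per input row, then create an index on 'url'. This is an illustration under assumptions, not what PDI executes internally. It assumes the PyMongo driver is installed in a release compatible with your server version, a tab-delimited file with a header row, and no authentication; `load_rows` and `run` are hypothetical names.

```python
import csv


def load_rows(collection, rows, index_field="url", truncate=True):
    """Optionally empty the collection, insert the rows, and index one field."""
    if truncate:
        collection.delete_many({})      # same effect as 'Truncate collection'
    docs = [dict(row) for row in rows]
    if docs:                            # insert_many rejects an empty list
        collection.insert_many(docs)
    collection.create_index(index_field)
    return len(docs)


def run(path, host="localhost", port=27017):
    """Load a delimited text file into Demo.PageSuccessions."""
    # Imported here so the rest of the sketch works without PyMongo installed.
    from pymongo import MongoClient

    collection = MongoClient(host, port)["Demo"]["PageSuccessions"]
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return load_rows(collection, reader)
```

Note that `csv.DictReader` yields every value as a string, whereas the PDI step writes the field types chosen on the 'Fields' tab, so the resulting documents would not be identical.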
Check the MongoDB Collection
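The collection can also be checked from a script rather than the shell. The sketch below assumes a PyMongo collection object (or anything exposing the same three methods) and reports the document count, one sample document, and the index names. `summarize` is a hypothetical helper name.

```python
def summarize(collection):
    """Return the document count, one sample document, and the index names."""
    return {
        "count": collection.count_documents({}),
        "sample": collection.find_one(),
        "indexes": sorted(collection.index_information()),
    }


# Example use, assuming PyMongo and a local server without authentication:
#   from pymongo import MongoClient
#   print(summarize(MongoClient("localhost", 27017)["Demo"]["PageSuccessions"]))
```

After a successful run the count should match the number of rows read from the text file, and the index list should include one built on 'url' alongside the default '_id' index.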
- Using the Mongo CLI, type:
You should see a result like this:
Summary
During this guide you learned how to populate a MongoDB collection using PDI's graphical design tool. You can use this tool to load data into MongoDB from many data sources.
Other guides in this series cover how to sort and group MongoDB data, create reports, and combine data from MongoDB with data from other sources.