How to use a custom Input or Output Format in Pentaho MapReduce. In some situations you may need to use an input or output format beyond the base formats included in Hadoop. In this guide you will develop and implement a custom output format that names each output file after the year of the data it contains, instead of using the default part-00000 name. Although this guide implements a custom output format, the same steps also apply to an input format. For more information on file formats, see: http://developer.yahoo.com/hadoop/tutorial/module5.html#inputformat
In order to follow along with this how-to guide you will need the following:
- Pentaho Data Integration
- Pentaho Hadoop Distribution
The sample data file needed for this guide is:
|weblogs_parse.txt.zip||Parsed, raw weblog data|
Note: If you have already completed the Using Pentaho MapReduce to Parse Weblog Data guide the data should already be in the correct spot.
Add the file to your cluster by running the following:
This guide expands upon the Using a Custom Partitioner in Pentaho MapReduce guide. If you have completed that guide you should already have the necessary code; otherwise download aggregate_mapper.ktr, aggregate_reducer.ktr, and aggregate_mr_partition.kjb.
Start Hadoop if it is not already running.
In this task you will create an output format that takes a key consisting of client_ip, year, and month separated by tabs, and writes one file per year, named after that year.
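The file-naming rule described above can be sketched in plain Java. This is a minimal, Hadoop-free illustration and not the guide's actual class: the names YearFileNameSketch, fileNameForKey, and groupByFile, the sample keys, and the fallback to a default name are all assumptions made for the example. In a real output format built on Hadoop's old mapred API, the same logic would presumably live in an override of MultipleTextOutputFormat's generateFileNameForKeyValue(key, value, name) method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class YearFileNameSketch {

    // Returns the output file name for a key shaped like "client_ip<TAB>year<TAB>month".
    // Assumption: keys without a year field fall back to the supplied default name.
    public static String fileNameForKey(String key, String defaultName) {
        String[] fields = key.split("\t");
        return fields.length >= 2 ? fields[1] : defaultName;
    }

    // Groups keys by the file each one would be written to (sorted by file name).
    public static Map<String, List<String>> groupByFile(List<String> keys) {
        Map<String, List<String>> files = new TreeMap<>();
        for (String key : keys) {
            String fileName = fileNameForKey(key, "part-00000");
            files.computeIfAbsent(fileName, n -> new ArrayList<>()).add(key);
        }
        return files;
    }

    public static void main(String[] args) {
        // Hypothetical sample keys: client_ip, year, month separated by tabs.
        List<String> keys = Arrays.asList(
                "10.0.0.1\t2010\t07",
                "10.0.0.2\t2011\t01",
                "10.0.0.1\t2010\t08");
        // Prints the distinct file names: [2010, 2011]
        System.out.println(groupByFile(keys).keySet());
    }
}
```

Because the file name depends only on the year field, every record for a given year ends up in the same file regardless of client IP or month.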
You can download CustomFileFormats.jar, which contains the output format, if you do not want to do every step.
- Create Year Output Format Class: In a text editor create a new file named YearMultipleTextOutputFormat.java containing the following code:
- Compile the Class: Run the following command:
- Collect the Class into a JAR: Run the following command:
- Deploy the JAR to Pentaho: Pentaho validates that the input and output format classes exist before submitting the job to the cluster, so add the jar to Kettle's classpath by copying it to $KETTLE_HOME/libext/pentaho and restarting Spoon.
- Deploy the JAR to Hadoop: Add the jar to Hadoop's distributed cache by running the following commands:
In this task you will configure Pentaho MapReduce to use the custom output format.
You can download the already configured job aggregate_mr_output_format.kjb if you do not want to do every step.
- Start PDI on your desktop. Once it is running choose 'File' -> 'Open', browse to and select 'aggregate_mr_partition.kjb', then click 'OK'.
- Configure the Output Format: Double click on the 'Pentaho MapReduce' job entry. Once it is open switch to the 'Job Setup' tab and change the 'Output Format' to 'YearMultipleTextOutputFormat'.
- Add Output Format to Distributed Cache: Switch to the User Defined tab and do the following:
|Name||Value|
|mapred.cache.files||Add ,/distcache/CustomFileFormats.jar to the existing value.|
|mapred.job.classpath.files||Add :/distcache/CustomFileFormats.jar to the existing value.|
Click OK to close the window.
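The two properties above are delimiter-separated lists: mapred.cache.files uses a comma and mapred.job.classpath.files uses a colon. The following is a minimal sketch of how the new jar is appended to each. The class CachePropertySketch, the helper appendEntry, and the existing value /distcache/existing.jar are hypothetical and chosen only for illustration, so substitute whatever value your job already contains. Handling of an empty existing value is also an assumption, since the guide expects a value to be present already.

```java
public class CachePropertySketch {

    // Appends an entry to a delimiter-separated property value.
    // Assumption: an empty or missing existing value yields just the new entry,
    // with no leading separator.
    public static String appendEntry(String existing, String separator, String entry) {
        if (existing == null || existing.isEmpty()) {
            return entry;
        }
        return existing + separator + entry;
    }

    public static void main(String[] args) {
        String jar = "/distcache/CustomFileFormats.jar";
        // Hypothetical existing value; use the value already set in your job.
        String existing = "/distcache/existing.jar";

        // mapred.cache.files is comma-separated.
        System.out.println("mapred.cache.files=" + appendEntry(existing, ",", jar));
        // mapred.job.classpath.files is colon-separated.
        System.out.println("mapred.job.classpath.files=" + appendEntry(existing, ":", jar));
    }
}
```

Mixing up the two separators is an easy mistake to make, so it is worth checking each resulting value against the table before running the job.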
- Save the Job: Choose 'File' -> 'Save as...' from the menu system. Save the job as 'aggregate_mr_output_format.kjb' into a folder of your choice.
- Run the Job: Choose 'Action' -> 'Run' from the menu system or click on the green run button on the job toolbar. An 'Execute a job' window will open. Click on the 'Launch' button. An 'Execution Results' panel will open at the bottom of the PDI window and show you the progress of the job as it runs. After a few seconds the job should finish successfully.
- List the Output Files: Listing the output files should return a file named 2010 and a file named 2011.