How to Use a Custom Input or Output Format in Pentaho MapReduce

In some situations you may need an input or output format beyond the base formats included in Hadoop. In this guide you will develop and implement a custom output format that names each output file after the year of its data instead of the default part-00000 name. Although this guide implements a custom output format, the same steps also apply to a custom input format. For more information on file formats, see http://developer.yahoo.com/hadoop/tutorial/module5.html#inputformat

Prerequisites

In order to follow along with this how-to guide you will need the following:

  • Hadoop
  • Pentaho Data Integration
  • Pentaho Hadoop Distribution

Sample Files

The sample data file needed for this guide is:

File Name              Content
weblogs_parse.txt.zip  Parsed, raw weblog data

Note: If you have already completed the Using Pentaho MapReduce to Parse Weblog Data guide the data should already be in the correct spot.

Unzip weblogs_parse.txt.zip, then add the extracted weblogs_parse.txt to your cluster by running the following:

hadoop fs -mkdir /user/pdi/weblogs
hadoop fs -mkdir /user/pdi/weblogs/parse
hadoop fs -put weblogs_parse.txt /user/pdi/weblogs/parse/part-00000

Sample Code

This guide expands upon the Using a Custom Partitioner in Pentaho MapReduce guide. If you have completed that guide you should already have the necessary code; otherwise, download aggregate_mapper.ktr, aggregate_reducer.ktr, and aggregate_mr_partition.kjb.

Step-By-Step Instructions

Setup

Start Hadoop if it is not already running.

Create a Custom Output Format in Java

In this task you will create an output format that takes a key in the format client_ip <tab> year <tab> month and, for each year, writes a file named with that year's number. (A quick sanity check of the naming logic is sketched after the steps below.)

Speed Tip
You can download CustomFileFormats.jar, which contains the finished output format, if you do not want to do every step.
  1. Create Year Output Format Class: In a text editor create a new file named YearMultipleTextOutputFormat.java containing the following code:
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
    
    public class YearMultipleTextOutputFormat extends MultipleTextOutputFormat<Text,LongWritable> {
    
    	@Override
    	protected String generateFileNameForKeyValue(Text key, LongWritable value, String name) {
    		String skey = key.toString();
    		String[] splits = skey.split("\t");  //split the key on the tab delimiter
    		String year = splits[1];  //the year is the second field
    		return year;  //use the year as the file name
    	}
    }
  2. Compile the Class: Run the following command (the Hadoop core JAR name may include a version number, for example hadoop-core-1.0.3.jar; adjust the path for your distribution):
    javac -classpath ${HADOOP_HOME}/hadoop-core.jar YearMultipleTextOutputFormat.java
  3. Package the Class into a JAR: Run the following command:
    jar cvf CustomFileFormats.jar YearMultipleTextOutputFormat.class
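
If you want to sanity-check the naming logic before deploying, the minimal sketch below may help. The class name YearOutputFormatCheck and the sample key are hypothetical; compile it against the same hadoop-core JAR, in the same (default) package as YearMultipleTextOutputFormat so the protected method is visible:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    
    public class YearOutputFormatCheck {
    	public static void main(String[] args) {
    		YearMultipleTextOutputFormat fmt = new YearMultipleTextOutputFormat();
    		//key format: client_ip <tab> year <tab> month (sample key is made up)
    		Text key = new Text("127.0.0.1\t2010\t11");
    		String fileName = fmt.generateFileNameForKeyValue(key, new LongWritable(1), "part-00000");
    		System.out.println(fileName);  //expect: 2010
    	}
    }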

Deploy the JAR

  1. Deploy the JAR to Pentaho: Pentaho validates that the input and output format classes exist before submitting the job to the cluster, so add the JAR to Kettle's classpath by copying it to $KETTLE_HOME/libext/pentaho and restarting Spoon.
  2. Deploy the JAR to Hadoop: Add the jar to Hadoop's distributed cache by running the following commands:
    hadoop fs -mkdir /distcache
    hadoop fs -put CustomFileFormats.jar /distcache/

Configure Pentaho MapReduce

In this task you will configure Pentaho MapReduce to use the custom output format.

Speed Tip
You can download the already configured job aggregate_mr_output_format.kjb if you do not want to do every step.
  1. Open the Job: Start PDI on your desktop. Once it is running choose 'File' -> 'Open', browse to and select 'aggregate_mr_partition.kjb', then click 'OK'.
  2. Configure the Output Format: Double click on the 'Pentaho MapReduce' job entry. Once it is open switch to the 'Job Setup' tab and change the 'Output Format' to 'YearMultipleTextOutputFormat'.
  3. Add Output Format to Distributed Cache: Switch to the 'User Defined' tab and modify the following properties (a plain-Hadoop sketch of what these two properties do appears after these steps):

    Name                        Value
    mapred.cache.files          Add ,/distcache/CustomFileFormats.jar to the existing value.
    mapred.job.classpath.files  Add :/distcache/CustomFileFormats.jar to the existing value.

    Click 'OK' to close the window.

  4. Save the Job: Choose 'File' -> 'Save as...' from the menu system. Save the job as 'aggregate_mr_output_format.kjb' into a folder of your choice.
  5. Run the Job: Choose 'Action' -> 'Run' from the menu system or click the green run button on the job toolbar. An 'Execute a job' window will open; click the 'Launch' button. An 'Execution Results' panel will open at the bottom of the PDI window and show the progress of the job as it runs. After a few seconds the job should finish successfully.
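
For orientation, the following is a rough plain-Hadoop equivalent of the two 'User Defined' properties set in step 3. It is a sketch, not part of the guide, and the class name CacheSetupSketch is hypothetical; note that in Hadoop 1.x addFileToClassPath also registers the file in the cache, so in plain code the second call alone would suffice:

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    
    public class CacheSetupSketch {
    	public static void main(String[] args) throws Exception {
    		JobConf conf = new JobConf();
    		//appends to mapred.cache.files, like the first User Defined row
    		DistributedCache.addCacheFile(new URI("/distcache/CustomFileFormats.jar"), conf);
    		//appends to mapred.job.classpath.files, like the second row
    		DistributedCache.addFileToClassPath(new Path("/distcache/CustomFileFormats.jar"), conf);
    	}
    }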

Check Hadoop

  1. List the Output Files: Run the following command; the listing should show a file named 2010 and a file named 2011.
    hadoop fs -ls /user/pdi/weblogs/aggregate_mr
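
If you prefer to verify programmatically, here is a minimal sketch using the HDFS Java API; the class name ListAggregateOutput is hypothetical, and it assumes your cluster's core-site.xml is on the classpath so the Configuration resolves the right file system:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    
    public class ListAggregateOutput {
    	public static void main(String[] args) throws Exception {
    		FileSystem fs = FileSystem.get(new Configuration());
    		for (FileStatus status : fs.listStatus(new Path("/user/pdi/weblogs/aggregate_mr"))) {
    			System.out.println(status.getPath().getName());  //expect 2010 and 2011
    		}
    	}
    }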

