Hitachi Vantara Pentaho Community Wiki

This guide shows how to use a custom input or output format in Pentaho MapReduce. In some situations you may need an input or output format beyond the base formats included in Hadoop. In this guide you will develop and implement a custom output format that names each output file after the year of the data it contains, instead of the default part-00000 name. Although this guide implements a custom output format, the same steps can also be used for an input format.


In order to follow along with this how-to guide you will need the following:

  • Hadoop
  • Pentaho Data Integration
  • Pentaho Hadoop Distribution

Sample Files

The sample data file needed for this guide is:

File Name: weblogs_parse.txt
Content: Parsed, raw weblog data

Note: If you have already completed the Using Pentaho MapReduce to Parse Weblog Data guide the data should already be in the correct spot.

Add the file to your cluster by running the following:

hadoop fs -mkdir /user/pdi/weblogs
hadoop fs -mkdir /user/pdi/weblogs/parse
hadoop fs -put weblogs_parse.txt /user/pdi/weblogs/parse/part-00000

Sample Code

This guide expands upon the Using a Custom Partitioner in Pentaho MapReduce guide. If you have completed this guide you should already have the necessary code, otherwise download aggregate_mapper.ktr, aggregate_reducer.ktr, and aggregate_mr_partition.kjb.

Step-By-Step Instructions


Start Hadoop if it is not already running.

Create a Custom Output Format in Java

In this task you will create an output format that takes a key in the format client_ip <tab> year <tab> month and writes the records for each year to a separate file named after that year.

Speed Tip

You can download CustomFileFormats.jar containing the output format if you do not want to do every step.

  1. Create the Year Output Format Class: In a text editor create a new file named YearMultipleTextOutputFormat.java containing the following code:
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    public class YearMultipleTextOutputFormat extends MultipleTextOutputFormat<Text, LongWritable> {
    	@Override
    	protected String generateFileNameForKeyValue(Text key, LongWritable value, String name) {
    		String skey = key.toString();
    		String[] splits = skey.split("\t");  //split the key on the tab delimiter
    		String year = splits[1];  //the year is the second field
    		return year;  //use the year as the output file name
    	}
    }
  2. Compile the Class: Run the following command:
    javac -classpath ${HADOOP_HOME}/hadoop-core.jar YearMultipleTextOutputFormat.java
  3. Collect the Class into a JAR: Run the following command:
    jar cvf CustomFileFormats.jar YearMultipleTextOutputFormat.class
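
The filename rule implemented above can be checked outside Hadoop. The following is a minimal plain-Java sketch of the same key-splitting logic; the class and the extractYear helper are hypothetical, introduced here only for illustration:

```java
public class YearKeyDemo {
    // Mirrors the logic in generateFileNameForKeyValue: the key has the
    // form client_ip<TAB>year<TAB>month, and the year field becomes the
    // output file name.
    static String extractYear(String key) {
        String[] splits = key.split("\t"); // split the key on the tab delimiter
        return splits[1];                  // the year is the second field
    }

    public static void main(String[] args) {
        System.out.println(extractYear("192.168.1.1\t2010\t01")); // prints 2010
    }
}
```

A record keyed "192.168.1.1	2010	01" would therefore be written to a file named 2010.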

Deploy the JAR

  1. Deploy the JAR to Pentaho: Pentaho validates that the input and output format classes exist before submitting the job to the cluster, so add the jar to Kettle's classpath by copying it to $KETTLE_HOME/libext/pentaho and restarting Spoon. If running from a Spoon client, copy the jar to the shim's lib/client folder.
  2. Deploy the JAR to Hadoop: Add the jar to Hadoop's distributed cache by running the following commands:
    hadoop fs -mkdir /distcache
    hadoop fs -put CustomFileFormats.jar /distcache/

Configure Pentaho MapReduce

In this task you will configure Pentaho MapReduce to use the custom output format.

Speed Tip

You can download the already configured job aggregate_mr_output_format.kjb if you do not want to do every step.

  1. Start PDI on your desktop. Once it is running choose 'File' -> 'Open', browse to and select the 'aggregate_mr_partition.kjb', then click 'OK'.
  2. Configure the Output Format: Double click on the 'Pentaho MapReduce' job entry. Once it is open switch to the 'Job Setup' tab and change the 'Output Format' to 'YearMultipleTextOutputFormat'.
  3. Add Output Format to Distributed Cache: Switch to the 'User Defined' tab, add ,/distcache/CustomFileFormats.jar to the existing value of each of the distributed cache properties defined there, then click 'OK' to close the window.
  4. Save the Job: Choose 'File' -> 'Save as...' from the menu system. Save the transformation as 'aggregate_mr_output_format.kjb' into a folder of your choice.
  5. Run the Job: Choose 'Action' -> 'Run' from the menu system or click on the green run button on the job toolbar. An 'Execute a job' window will open. Click on the 'Launch' button. An 'Execution Results' panel will open at the bottom of the PDI window and show you the progress of the job as it runs. After a few seconds the job should finish successfully.

Check Hadoop

  1. List the Output Files: Listing the output files should return a file named 2010 and a file named 2011.
    hadoop fs -ls /user/pdi/weblogs/aggregate_mr
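
To see why exactly two files appear, here is a small plain-Java sketch (not part of the guide's code; the class and countByFile method are hypothetical) that applies the same year-based routing rule to a few sample keys and counts how many records each output file would receive:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class YearRoutingDemo {
    // Illustrates, outside Hadoop, how the custom output format routes
    // records: each key's year field selects the file name, so all 2010
    // rows land in a file named "2010" and all 2011 rows in "2011".
    public static Map<String, Integer> countByFile(List<String> keys) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String key : keys) {
            String file = key.split("\t")[1]; // same rule as generateFileNameForKeyValue
            counts.merge(file, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> keys = List.of(
            "192.168.1.1\t2010\t01",
            "10.0.0.5\t2010\t07",
            "172.16.0.9\t2011\t03");
        System.out.println(countByFile(keys)); // prints {2010=2, 2011=1}
    }
}
```

Because the weblog data in this guide only contains the years 2010 and 2011, the job produces exactly those two output files.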