Hitachi Vantara Pentaho Community Wiki
Using a Custom Input or Output Format in Pentaho MapReduce
  1. Deploy the JAR to Pentaho: Pentaho verifies that the JAR and its input and output format classes exist before submitting the job to the cluster, so add the JAR to Kettle's classpath by copying it to $KETTLE_HOME/libext/pentaho and restarting Spoon. If you are running from a Spoon client, copy the JAR to the shim's lib/client folder instead.
  2. Deploy the JAR to Hadoop: Add the jar to Hadoop's distributed cache by running the following commands:
    Code Block
    hadoop fs -mkdir /distcache
    hadoop fs -put CustomFileFormats.jar /distcache/
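Step 1 above can be sketched as a small shell script. All paths here are assumptions for illustration (KETTLE_HOME and the JAR location vary by install), and a stand-in JAR is staged so the copy can be exercised without a real build; step 2's hadoop fs commands need a live cluster and are left exactly as shown above.

```shell
# Sketch of step 1 (Kettle classpath); paths are assumptions, not from the tutorial.
KETTLE_HOME="${KETTLE_HOME:-/tmp/kettle-demo}"   # hypothetical install root
JAR="/tmp/CustomFileFormats.jar"

touch "$JAR"                             # stand-in for the real JAR in this sketch
mkdir -p "$KETTLE_HOME/libext/pentaho"
cp "$JAR" "$KETTLE_HOME/libext/pentaho/" # Spoon picks this up after a restart

ls "$KETTLE_HOME/libext/pentaho/"
```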


  1. Start PDI on your desktop. Once it is running choose 'File' -> 'Open', browse to and select the 'aggregate_mr_partition.kjb', then click 'OK'.
  2. Configure the Output Format: Double click on the 'Pentaho MapReduce' job entry. Once it is open switch to the 'Job Setup' tab and change the 'Output Format' to 'YearMultipleTextOutputFormat'.
  3. Add Output Format to Distributed Cache: Switch to the 'User Defined' tab and do the following:
  • In the row for the mapred.cache.files property, append ,/distcache/CustomFileFormats.jar to the existing value (entries in this property are comma-separated).
  • In the row for the mapred.job.classpath.files property, append :/distcache/CustomFileFormats.jar to the existing value (entries in this property are colon-separated).
    Click 'OK' to close the window.
  4. Save the Job: Choose 'File' -> 'Save as...' from the menu system. Save the job as 'aggregate_mr_output_format.kjb' into a folder of your choice.
  5. Run the Job: Choose 'Action' -> 'Run' from the menu system or click the green run button on the job toolbar. An 'Execute a job' window will open; click the 'Launch' button. An 'Execution Results' panel will open at the bottom of the PDI window and show the job's progress as it runs. After a few seconds the job should finish successfully.
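The custom output format configured above writes one output file per year instead of a single part file. The sample data and file names below are invented for illustration, but the routing idea is the one a class like 'YearMultipleTextOutputFormat' implements inside Hadoop: derive the output file name from each record's key.

```shell
# Local sketch of per-year output splitting (invented sample data).
mkdir -p /tmp/year-split-demo
cd /tmp/year-split-demo
printf '2009,pageA,10\n2010,pageB,7\n2009,pageC,3\n' > aggregate.txt

# Route each record to a file named after its year, the same idea a
# MultipleTextOutputFormat subclass applies per key on the cluster.
awk -F, '{ print > ($1 ".txt") }' aggregate.txt

ls   # expect 2009.txt and 2010.txt alongside aggregate.txt
```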