Hitachi Vantara Pentaho Community Wiki

Step-By-Step Instructions

Reading Compressed Files

In this task you will configure Pentaho MapReduce to read compressed files into the Map/Reduce Input.

Tip

Normally you do not need to do anything to have Pentaho MapReduce use a compressed file as the input. Pentaho MapReduce automatically decompresses input files that were compressed with any codec installed on the Hadoop cluster.
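
As a rough illustration of why no configuration is needed: Hadoop input formats typically pick a codec by matching the file name suffix against the codecs installed on the cluster. The sketch below only mimics that lookup with plain Java. The class name CodecLookupSketch, the method codecFor, and the sample file names are hypothetical, and the suffix table assumes these stock Hadoop codecs are installed on your cluster.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative only: mimics suffix-based codec selection, it is not the Hadoop API.
class CodecLookupSketch {
    // Default file suffixes of some codecs that ship with Hadoop (assumed installed).
    private static final Map<String, String> CODEC_BY_SUFFIX = new LinkedHashMap<>();
    static {
        CODEC_BY_SUFFIX.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODEC_BY_SUFFIX.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
        CODEC_BY_SUFFIX.put(".snappy", "org.apache.hadoop.io.compress.SnappyCodec");
        CODEC_BY_SUFFIX.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
    }

    // Returns the codec class name whose suffix matches, or empty for an uncompressed file.
    static Optional<String> codecFor(String fileName) {
        for (Map.Entry<String, String> entry : CODEC_BY_SUFFIX.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return Optional.of(entry.getValue());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical input file names.
        System.out.println(codecFor("weblogs.txt.gz").orElse("uncompressed"));
        System.out.println(codecFor("weblogs.txt").orElse("uncompressed"));
    }
}
```

In practice this means a compressed input file should keep the suffix its codec expects, so that it is recognized and decompressed.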

Writing Compressed Files

  1. Start PDI on your desktop. Once it is running choose 'File' -> 'Open', browse to and select your job running Pentaho MapReduce, then click 'OK'.
  2. Configure the Compression Codec: Double click on the 'Pentaho MapReduce' step, switch to the 'User Defined' tab and enter the following information:

    Name                             Value
    mapred.output.compression.codec  The compression codec to use, for example org.apache.hadoop.io.compress.SnappyCodec
    mapred.output.compress           true
    mapred.output.compression.type   BLOCK
  3. Run your job

The output from the job should be compressed using the codec you specified.
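
One way to spot-check the result is to look at the first bytes of an output file. The sketch below assumes you chose org.apache.hadoop.io.compress.GzipCodec rather than the Snappy example above, because gzip files begin with the well-known magic bytes 0x1F 0x8B. The class GzipHeaderCheck and its methods are hypothetical helpers, and the output file is simulated in memory rather than read from the cluster.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Illustrative only: checks whether a byte array starts with the gzip magic number.
class GzipHeaderCheck {
    // True if the data begins with the two gzip magic bytes 0x1F 0x8B.
    static boolean looksLikeGzip(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xFF) == 0x1F
                && (data[1] & 0xFF) == 0x8B;
    }

    // Stand-in for a gzip-compressed job output file (assumption: GzipCodec was configured).
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] plain = "2012\t1500\n".getBytes(StandardCharsets.UTF_8);
        byte[] compressed = gzip("2012\t1500\n");
        System.out.println(looksLikeGzip(compressed)); // true
        System.out.println(looksLikeGzip(plain));      // false
    }
}
```

Other codecs use different container formats, so this particular check applies only to gzip output.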

Compressing Intermediate Data

You may want to compress the intermediate data that is passed between the Pentaho Mappers and Reducers. Doing so reduces network I/O and, in some cases, improves performance.

  1. Start PDI on your desktop. Once it is running choose 'File' -> 'Open', browse to and select your job running Pentaho MapReduce, then click 'OK'.
  2. Configure the Compression Codec: Double click on the 'Pentaho MapReduce' step, switch to the 'User Defined' tab and enter the following information:

    Name                                 Value
    mapred.map.output.compression.codec  The compression codec to use, for example org.apache.hadoop.io.compress.SnappyCodec
    mapred.compress.map.output           true
  3. Run your job
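
To get a feel for how much intermediate compression can save, the following sketch compresses some simulated mapper output and compares the sizes. It is only an approximation under stated assumptions: it uses the JDK's Deflater rather than a Hadoop codec, the tab-separated key/value records are made up, and the class IntermediateCompressionSketch is hypothetical. Real savings depend on your data and the codec you configure.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Illustrative only: compares raw and compressed sizes of simulated map output.
class IntermediateCompressionSketch {
    // Builds made-up key/value records such as "2012<TAB>7", one per line.
    static byte[] sampleMapOutput(int records) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < records; i++) {
            sb.append("2012").append('\t').append(i % 10).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compresses the input with the JDK's DEFLATE implementation.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            int written = deflater.deflate(chunk);
            out.write(chunk, 0, written);
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = sampleMapOutput(10_000);
        byte[] compressed = deflate(raw);
        System.out.println("raw bytes:        " + raw.length);
        System.out.println("compressed bytes: " + compressed.length);
    }
}
```

Repetitive keys, which are common in mapper output, tend to compress well, which is why the data shuffled to the reducers can shrink noticeably. The trade-off is extra CPU time spent compressing and decompressing.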