Pentaho MapReduce

The Pentaho MapReduce job entry allows you to build MapReduce jobs using Kettle transformations as the Mapper, Combiner, and/or Reducer.

Architecture Overview

TODO

Type Mapping

To pass data between Hadoop and Kettle, we must convert between Kettle data types and Hadoop IO data types. Here is the type mapping for the built-in Kettle types:

Kettle Type                          Hadoop Type
ValueMetaInterface.TYPE_STRING       org.apache.hadoop.io.Text
ValueMetaInterface.TYPE_BIGNUMBER    org.apache.hadoop.io.Text
ValueMetaInterface.TYPE_DATE         org.apache.hadoop.io.Text
ValueMetaInterface.TYPE_INTEGER      org.apache.hadoop.io.LongWritable
ValueMetaInterface.TYPE_NUMBER       org.apache.hadoop.io.DoubleWritable
ValueMetaInterface.TYPE_BOOLEAN      org.apache.hadoop.io.BooleanWritable
ValueMetaInterface.TYPE_BINARY       org.apache.hadoop.io.BytesWritable

Defining your own Type Converter

TODO
See org.pentaho.hadoop.mapreduce.converter.spi.ITypeConverter for more info. We use the Service Locator pattern; specifically Java's ServiceLoader.

Distributed Cache

Pentaho MapReduce relies on Hadoop's Distributed Cache to distribute the Kettle environment, configuration, and plugins across the cluster. Leveraging the Distributed Cache reduces network traffic for subsequent executions, because the Kettle environment is automatically configured on each node. This also allows you to use multiple versions of Kettle against a single cluster.

How it works

Hadoop's Distributed Cache is a mechanism to distribute files into the working directory of each map and reduce task. The origin of these files is HDFS. Pentaho MapReduce will automatically configure the job to use a Kettle environment from HDFS (configured via pmr.kettle.installation.id, see Configuration options). If the desired Kettle environment does not exist, Pentaho MapReduce will take care of "installing" it in HDFS before executing the job.

The default Kettle environment installation path within HDFS is /opt/pentaho/mapreduce/$id, where $id is generally the version of Kettle the environment contains but can easily be a custom build that is tailored for a specific set of jobs.

Configuration options

Pentaho MapReduce is configured through the pentaho-mapreduce.properties file found in the plugin's base directory. Individual properties can be overridden per Pentaho MapReduce job entry by defining them in the User Defined properties tab.
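The override order can be sketched as follows. This is a minimal illustration, not the plugin's actual code: the function names parse_properties and effective_config are hypothetical, and the parser handles only simple key=value lines rather than the full Java properties syntax.

```python
def parse_properties(text):
    """Parse simple key=value lines; blank lines and # or ! comments are skipped."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def effective_config(file_props, user_defined):
    """Job entry User Defined properties take precedence over the properties file."""
    merged = dict(file_props)
    merged.update(user_defined)
    return merged


# Values from pentaho-mapreduce.properties (illustrative content).
file_props = parse_properties(
    "# pentaho-mapreduce.properties\n"
    "pmr.kettle.installation.id=4.3.0\n"
    "pmr.kettle.dfs.install.dir=/opt/pentaho/mapreduce\n"
)
# Values entered in the job entry's User Defined properties tab.
job_entry_props = {"pmr.kettle.installation.id": "custom"}

config = effective_config(file_props, job_entry_props)
print(config["pmr.kettle.installation.id"])  # custom
```

Properties that the job entry does not define keep the value from the file, so here pmr.kettle.dfs.install.dir stays at /opt/pentaho/mapreduce.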

The currently supported configuration properties are:

Property Name                    Description
pmr.kettle.installation.id       Version of Kettle to use from the Kettle HDFS installation directory.
                                 If not set, the version of Kettle used to submit the Pentaho
                                 MapReduce job is used.
pmr.kettle.dfs.install.dir       Installation path in HDFS for the Kettle environment used to execute
                                 a Pentaho MapReduce job. This is an absolute path if it starts with
                                 a /; otherwise it is a relative path anchored to the user's home
                                 directory.
pmr.libraries.archive.file       Pentaho MapReduce Kettle environment runtime archive to be preloaded
                                 into pmr.kettle.dfs.install.dir/pmr.kettle.installation.id.
pmr.kettle.additional.plugins    Comma-separated list of additional plugins (by directory name) to be
                                 installed with the Kettle environment,
                                 e.g. "steps/DummyPlugin,my-custom-plugin"
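As a rough sketch of how the installation path is put together, the following shows the relative versus absolute rule for the install directory combined with the installation id. The function name resolve_install_path and the home directory /user/alice are assumptions made for the example; the plugin's own resolution logic may differ in detail.

```python
import posixpath


def resolve_install_path(install_dir, installation_id, user_home):
    """Return the HDFS path of a Kettle environment.

    A value starting with "/" is absolute; anything else is treated as
    relative to the user's HDFS home directory (supplied by the caller).
    """
    if not install_dir.startswith("/"):
        install_dir = posixpath.join(user_home, install_dir)
    return posixpath.join(install_dir, installation_id)


# Absolute install directory: the home directory is ignored.
print(resolve_install_path("/opt/pentaho/mapreduce", "4.3.0", "/user/alice"))
# /opt/pentaho/mapreduce/4.3.0

# Relative install directory: anchored to the (hypothetical) home directory.
print(resolve_install_path("pentaho/mapreduce", "custom", "/user/alice"))
# /user/alice/pentaho/mapreduce/custom
```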

Customizing the Kettle Environment used by Pentaho MapReduce

The installation environment used by Pentaho MapReduce will be installed to pmr.kettle.dfs.install.dir/pmr.kettle.installation.id when the Pentaho MapReduce job entry is executed. If the installation already exists, no modifications will be made and the job will use the environment as is. That means any modifications made after the initial run, or any custom preloading of a Kettle environment, will be used as is by Pentaho MapReduce.
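The install-once behavior can be illustrated with a small sketch. It is not the plugin's implementation: a plain dict stands in for HDFS, and the function name ensure_installed and the file names are hypothetical.

```python
def ensure_installed(hdfs, install_path, archive_files, plugins):
    """Install the environment only if install_path is absent.

    hdfs is a plain dict standing in for HDFS: path -> list of entries.
    Returns True when a fresh install happened, False when the existing
    environment is reused as is.
    """
    if install_path in hdfs:
        return False  # an existing install is never modified
    hdfs[install_path] = (
        ["lib/" + name for name in archive_files]
        + ["plugins/" + name for name in plugins]
    )
    return True


hdfs = {}
path = "/opt/pentaho/mapreduce/custom"
first = ensure_installed(hdfs, path, ["kettle-core.jar"], ["my-custom-plugin"])
# A later run with a different archive does not touch the existing install.
second = ensure_installed(hdfs, path, ["kettle-core.jar", "extra.jar"], [])
print(first, second)  # True False
print(hdfs[path])     # ['lib/kettle-core.jar', 'plugins/my-custom-plugin']
```

The second call shows why changing the archive or plugin list has no effect once an installation with the same id is present.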

Customizing the libraries used in a fresh Kettle environment install into HDFS

The contents of the archive named by pmr.libraries.archive.file are copied into HDFS at pmr.kettle.dfs.install.dir/pmr.kettle.installation.id. To make changes for initial installations, you must edit the archive referenced by this property.

  1. Unzip pentaho-mapreduce-libraries.zip. It contains a single lib/ directory with the required Kettle dependencies.
  2. Copy additional libraries to the lib/ directory.
  3. Zip up the lib/ directory into pentaho-mapreduce-libraries-custom.zip so that the archive contains lib/ with all jars within it. You may create subdirectories within lib/; all jars found in lib/ and its subdirectories will be added to the classpath of the executing job.
  4. Edit pentaho-mapreduce.properties and set the following properties:
    pmr.kettle.installation.id=custom
    pmr.libraries.archive.file=pentaho-mapreduce-libraries-custom.zip
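Steps 1 to 3 can also be scripted. The sketch below uses only the Python standard library; the function name build_custom_archive is hypothetical, and it assumes the stock archive has the lib/ layout described in Appendix A.

```python
import os
import shutil
import tempfile
import zipfile


def build_custom_archive(original_zip, extra_jars, output_zip):
    """Repack original_zip with extra_jars added under lib/."""
    with tempfile.TemporaryDirectory() as work:
        # Step 1: unzip the stock archive (a single lib/ directory).
        with zipfile.ZipFile(original_zip) as src:
            src.extractall(work)
        lib_dir = os.path.join(work, "lib")
        os.makedirs(lib_dir, exist_ok=True)
        # Step 2: copy the additional libraries into lib/.
        for jar in extra_jars:
            shutil.copy(jar, lib_dir)
        # Step 3: zip lib/ so that every entry keeps its lib/ prefix.
        with zipfile.ZipFile(output_zip, "w", zipfile.ZIP_DEFLATED) as dst:
            for root, _dirs, files in os.walk(lib_dir):
                for name in sorted(files):
                    full = os.path.join(root, name)
                    entry = os.path.relpath(full, work).replace(os.sep, "/")
                    dst.write(full, entry)
```

A call such as build_custom_archive("pentaho-mapreduce-libraries.zip", ["my-custom-code.jar"], "pentaho-mapreduce-libraries-custom.zip") would then produce the archive referenced in step 4. Jars placed in subdirectories of lib/ are kept in place because the walk is recursive.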
    

The next time you execute Pentaho MapReduce the custom Kettle environment will be copied into HDFS at pmr.kettle.dfs.install.dir/custom and used when executing the job. You can switch between Kettle environments by specifying the pmr.kettle.installation.id property as a User Defined property per Pentaho MapReduce job entry or globally in the pentaho-mapreduce.properties file*.

*Note: The currently configured archive file and additional plugins are used to "install" the environment into HDFS only if the installation referenced by pmr.kettle.installation.id does not already exist.

Customizing an existing Kettle environment in HDFS

You can customize an existing Kettle environment install in HDFS by copying jars and plugins into HDFS. This can be done manually (hadoop fs -copyFromLocal <localsrc> ... <dst>) or with the Hadoop Copy Files job entry.

See Appendix B for the supported directory structure in HDFS.

Adding JDBC drivers to the Kettle environment

JDBC drivers and their required dependencies must be placed in the installation directory's lib/ directory.
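For example, the copy command for a driver jar could be assembled like this. The sketch only builds the argument list; the function name copy_to_lib_command and the jar name my-jdbc-driver.jar are hypothetical, and the install directory and id are the example values used elsewhere on this page.

```python
import posixpath


def copy_to_lib_command(local_jar, install_dir, installation_id):
    """Build the argument list for copying local_jar into the environment's lib/."""
    destination = posixpath.join(install_dir, installation_id, "lib") + "/"
    return ["hadoop", "fs", "-copyFromLocal", local_jar, destination]


cmd = copy_to_lib_command("my-jdbc-driver.jar", "/opt/pentaho/mapreduce", "custom")
print(" ".join(cmd))
# hadoop fs -copyFromLocal my-jdbc-driver.jar /opt/pentaho/mapreduce/custom/lib/
```

On a machine with the Hadoop client installed, the list can be passed to subprocess.run to perform the copy. Repeat it for each dependency the driver requires.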

Upgrading from the Pentaho Hadoop Distribution (PHD)

The PHD is no longer required and can be safely removed. If you have modified your Pentaho Hadoop Distribution installation, you may wish to preserve these files so that the new Distributed Cache mechanism can take advantage of them. To do so, follow the instructions above under Customizing the Kettle Environment used by Pentaho MapReduce.

If you're using a version of the Pentaho Hadoop Distribution (PHD) that allows you to configure the installation directory via mapred-site.xml, perform the following on all TaskTracker nodes:

  1. Remove the pentaho.* properties from your mapred-site.xml
  2. Remove the directories those properties referenced
  3. Restart the TaskTracker process
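Steps 1 and 2 can be prepared with a short script. The sketch below assumes the standard Hadoop configuration layout (a configuration element containing property elements with name and value children). The function name strip_pentaho_properties and the property pentaho.example.dir are hypothetical, and the output does not preserve XML comments or the XML declaration, so review the result before replacing a real mapred-site.xml.

```python
import xml.etree.ElementTree as ET


def strip_pentaho_properties(xml_text):
    """Remove <property> elements whose <name> starts with "pentaho.".

    Returns (cleaned_xml, removed), where removed maps each dropped
    property name to its value, e.g. the directories to delete in step 2.
    """
    root = ET.fromstring(xml_text)
    removed = {}
    for prop in list(root.findall("property")):
        name = (prop.findtext("name") or "").strip()
        if name.startswith("pentaho."):
            removed[name] = (prop.findtext("value") or "").strip()
            root.remove(prop)
    return ET.tostring(root, encoding="unicode"), removed


sample = """<configuration>
  <property><name>mapred.job.tracker</name><value>host:9001</value></property>
  <property><name>pentaho.example.dir</name><value>/opt/phd</value></property>
</configuration>"""

cleaned, removed = strip_pentaho_properties(sample)
print(removed)  # {'pentaho.example.dir': '/opt/phd'}
```

The removed mapping lists the values that the deleted properties pointed at, which are the candidates for removal in step 2.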

Appendix A: pentaho-mapreduce-libraries.zip structure

pentaho-mapreduce-libraries.zip/
  `- lib/
      +- kettle-core-{version}.jar
      +- kettle-engine-{version}.jar
      `- .. (all other required Kettle dependencies and optional jars)

Appendix B: Example Kettle environment installation directory structure within DFS

/opt/pentaho/mapreduce/
  +- 4.3.0/
  |   +- lib/
  |   |   +- kettle-core-{version}.jar
  |   |   +- kettle-engine-{version}.jar
  |   |   `- .. (all other required Kettle dependencies and optional jars - including JDBC drivers)
  |   `- plugins/
  |       +- pentaho-big-data-plugin/
  |       `- .. (additional optional plugins)
  `- custom/
      +- lib/
      |   +- kettle-core-{version}.jar
      |   +- kettle-engine-{version}.jar
      |   +- my-custom-code.jar
      |   `- .. (all other required Kettle dependencies and optional jars - including JDBC drivers)
      `- plugins/
          +- pentaho-big-data-plugin/
          |   ..
          `- my-custom-plugin/
              ..

This documentation is maintained by the Pentaho community, and members are encouraged to create new pages in the appropriate spaces, or edit existing pages that need to be corrected or updated.