How to set up and configure Pentaho (Kettle, Pentaho Data Integration, Pentaho Business Analytics Suite) for your specific Hadoop distribution.
|Pre-6.0 Documentation Warning|
This documentation is for Pentaho 5.2, 5.3, and 5.4. You can find documentation for Pentaho 6.x here: Set Up Pentaho to Connect to a Hadoop Cluster. Other Pentaho documentation can be found at Pentaho Help: https://help.pentaho.com/Documentation.
Older Hadoop Configuration documentation can be found here:
Pentaho supports different versions of Hadoop distributions from several vendors such as Cloudera, Hortonworks, and MapR. To support this many versions, Pentaho uses an abstraction layer, called a shim, that connects to the different Hadoop distributions. A shim is a small library that intercepts API calls and redirects or handles them, or changes the calling parameters. Periodically, Pentaho develops new shims as vendors develop new Hadoop distributions and versions. These big data shims are tested and certified by Pentaho engineers. The following steps will help you get Pentaho set up to work with your Hadoop cluster.
|Shim support policy|
Pentaho provides Enterprise Edition support for distributions from Cloudera, Hortonworks, and MapR. Pentaho intends to support new distributions and versions from these vendors as soon as possible after release. This support is delivered via monthly service packs.
Due to the rapid pace of development and the frequency of releases from the distribution vendors, Pentaho can only test and fully support the last two major releases from each vendor. Previous shims should not stop working as new ones are released, but full support is limited to the two most recent.
Our support matrix in our documentation lists the currently-supported Hadoop distributions. Make sure that the Hadoop distribution you want to use is supported:
NOTE: Pentaho is pre-configured for Apache Hadoop 0.20.2. If you are using this distribution and version, no further configuration is required.
- If the Hadoop distribution that you want to use is not listed, visit our 5.1/5.2 and 5.0 pages; a previous version of our software might support older Hadoop distributions. Downloads appear there as well.
- You can request that Pentaho develop a shim for a distribution by contacting sales or filing a Jira ticket. If you are an EE subscription customer, you can also contact Pentaho Support.
- You can develop a shim yourself. See the Hadoop Configurations page for more information.
Note: For instructions on preparing the shim to connect to a Kerberos cluster, see our Mindtouch documentation here: https://help.pentaho.com/Documentation/5.4/0P0/0W0/030/040.
These steps apply for the PDI and BA Servers as well as the Spoon, Report Designer, and Metadata Editor design tools.
Specify which Hadoop distribution (shim) you want to make active. You must do this for each Pentaho application that needs access to the Hadoop cluster. Only one Hadoop distribution can be active at a time, so each time you change the distribution or version, you need to reset the active Hadoop distribution.
- Stop the application (e.g. Spoon, DI Server, Report Designer, BA Server, Metadata Editor) if it is running.
- Navigate to the pentaho-big-data-plugin folder. Its location is different for each application:
- DI Server - data-integration-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin
- BA Server - biserver-ee/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin
- Spoon - data-integration/plugins/pentaho-big-data-plugin
- Report Designer - report-designer/plugins/pentaho-big-data-plugin
- Metadata Editor - metadata-editor/plugins/pentaho-big-data-plugin
- Edit the plugin.properties file.
- Set the active.hadoop.configuration property to match the name of the shim you want to make active. For example, if the name of the shim is cdh42, the line would read: active.hadoop.configuration=cdh42.
- Save and close the plugin.properties file.
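The edit above can also be scripted. The following is a minimal sketch for a Spoon install; the demo directory, the placeholder starting value, and the shim name cdh42 are assumptions you would replace with your actual plugin path and shim folder name:

```shell
#!/bin/sh
# Demo setup: a stand-in plugin.properties file. In a real install this file
# already exists under <application>/plugins/pentaho-big-data-plugin, so you
# would skip this block and set PLUGIN_DIR to that path instead.
PLUGIN_DIR="./pentaho-big-data-plugin"
mkdir -p "$PLUGIN_DIR"
printf 'active.hadoop.configuration=hadoop-20\n' > "$PLUGIN_DIR/plugin.properties"

# The shim folder name you want to activate (hypothetical example).
SHIM="cdh42"

# Point the active.hadoop.configuration property at the chosen shim,
# keeping a .bak copy of the original file.
sed -i.bak \
  "s/^active\.hadoop\.configuration=.*/active.hadoop.configuration=${SHIM}/" \
  "$PLUGIN_DIR/plugin.properties"

# Confirm the change took effect.
cat "$PLUGIN_DIR/plugin.properties"
```

Remember that the application must be stopped before the edit and restarted afterward for the new shim to be picked up.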
YARN shims require additional configuration. See Additional Configuration for YARN Shims for more detail.
MapR users need to do further configuration as described in Additional Configuration for MapR Shims.
CDH 5 users who want to use MapReduce 1 instead of MapReduce 2 should follow the instructions in Additional Configuration for Using MR1 with CDH 5.
Support for the following Hadoop configurations is planned for an upcoming patch or future release.
- HDP 2.3
Now that you've configured Pentaho for your Hadoop distribution, there are many things you can do. Here are a few links to get you started!
- Check out how to load data in a Hadoop cluster.
- Learn how to transform data within a cluster.
- Read about how to extract data from a cluster.
- View information on how to report data in Hadoop.
- Learn more about Pentaho MapReduce.
- Explore the Pentaho Infocenter to learn more about Pentaho software.
Want to switch gears and read something a little different? Check out these articles on the evolution of Hadoop.
- Part I and Part II of Genealogy of Elephants
- A brief history of Apache Hadoop branches and releases