How to set up and configure Pentaho (Kettle, Pentaho Data Integration, Pentaho Business Analytics Suite) for your specific Hadoop distribution.
Pentaho supports different versions of Hadoop distributions from several vendors such as Cloudera, Hortonworks, and MapR. To support this many versions, Pentaho uses an abstraction layer, called a shim, that connects to the different Hadoop distributions. A shim is a small library that intercepts API calls and redirects or handles them, or changes the calling parameters. Periodically, Pentaho develops new shims as vendors develop new Hadoop distributions and versions. These big data shims are tested and certified by Pentaho engineers. The following steps will help you get Pentaho set up to work with your Hadoop cluster.
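To make the shim idea concrete, here is a minimal sketch of the pattern in Python. It is not Pentaho's implementation (Pentaho's shims are Java libraries); every class, method, and shim name below is hypothetical and exists only to show how one common interface can redirect calls, or change their parameters, for two differently shaped cluster APIs.

```python
# Illustrative sketch of the shim pattern only; all names here are hypothetical.

class LegacyCluster:
    """Stand-in for an older distribution's client: takes a plain path."""

    def read(self, path):
        return f"legacy:{path}"


class NewerCluster:
    """Stand-in for a newer distribution's client: takes a URI and a keyword flag."""

    def open_file(self, uri, *, buffered=True):
        return f"newer:{uri}:buffered={buffered}"


class LegacyShim:
    """Redirects the common read_file() call to the legacy API unchanged."""

    def __init__(self, cluster):
        self._cluster = cluster

    def read_file(self, path):
        return self._cluster.read(path)


class NewerShim:
    """Changes the calling parameters before forwarding to the newer API."""

    def __init__(self, cluster):
        self._cluster = cluster

    def read_file(self, path):
        # The newer client expects a URI, so the shim converts the plain path.
        return self._cluster.open_file("hdfs://" + path, buffered=True)


# Registry of available shims, keyed by a hypothetical shim name.
SHIMS = {
    "legacy": lambda: LegacyShim(LegacyCluster()),
    "newer": lambda: NewerShim(NewerCluster()),
}


def load_shim(name):
    """Return the shim registered under `name`; callers only ever use read_file()."""
    return SHIMS[name]()
```

Calling code asks for a shim by name and uses the same `read_file()` call whichever distribution sits behind it, which is the role the big data shims play between Pentaho and the cluster.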
Our support matrix in our documentation lists the currently-supported Hadoop distributions. Make sure that the Hadoop distribution you want to use is supported:
- Support Matrix for 5.2 (Support for CDH 5.2 was added in the December 2015 service pack. See the Pentaho Support Portal to obtain the service pack.)
- Support Matrix for 5.3
- Support Matrix for 5.4
NOTE: Pentaho is pre-configured for Apache Hadoop 0.20.2. If you are using this distribution and version, no further configuration is required.
- If the Hadoop distribution that you want to use is not listed, visit our 5.1/5.2 and 5.0 pages; a previous version of our software might support older Hadoop distributions. Downloads are available there as well.
- You can request that Pentaho develop a shim for a distribution by contacting sales or filing a Jira ticket. If you are an EE subscription customer, you can also contact Pentaho Support.
- You can develop a shim yourself. See the Hadoop Configurations page for more information.
Note: For instructions on preparing the shim to connect to a Kerberos cluster, see our Mindtouch documentation here: https://help.pentaho.com/Documentation/5.4/0P0/0W0/030/040.
Set Active Hadoop Distribution
YARN shims require additional configuration. See Additional Configuration for YARN Shims for more detail.
MapR users need to do further configuration as described in Additional Configuration for MapR Shims.
CDH 5 users who want to use MapReduce 1 instead of MapReduce 2 should follow the instructions in Additional Configuration for using MR1 with CDH5.
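As a rough illustration of setting the active Hadoop distribution, the sketch below rewrites one property in the text of a Java-style properties file. It rests on assumptions this page does not state: that the active shim is selected by an `active.hadoop.configuration` entry in the big data plugin's `plugin.properties` file, and that shim identifiers look like `cdh53`. Check both against the documentation for your Pentaho version before relying on them. The helper name `set_active_shim` is hypothetical.

```python
def set_active_shim(properties_text, shim_id, key="active.hadoop.configuration"):
    """Return properties_text with `key` set to `shim_id`.

    The default key name is an assumption, not taken from this page.
    Comment lines are left untouched; the entry is appended if it is absent.
    """
    lines = properties_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip comments, even ones that mention the key
        name, sep, _ = stripped.partition("=")
        if sep and name.strip() == key:
            lines[i] = f"{key}={shim_id}"
            replaced = True
    if not replaced:
        lines.append(f"{key}={shim_id}")
    return "\n".join(lines) + "\n"


# Example input; the values shown are illustrative, not copied from a real install.
sample = (
    "# big data plugin settings\n"
    "active.hadoop.configuration=hadoop-20\n"
    "hadoop.configurations.path=hadoop-configurations\n"
)
updated = set_active_shim(sample, "cdh53")
```

The function works on text rather than a file path so it can be tried without touching an installation. To apply it, you would read the properties file, pass its contents through the function, write the result back, and then restart the Pentaho tool so the change is picked up.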
Future Release Roadmap
Support for the following Hadoop configurations is planned for an upcoming patch or future release.
- HDP 2.2
- CDH 5.3
Now that you've configured Pentaho for your Hadoop distribution, there are many things you can do. Here are a few links to get you started!