Hitachi Vantara Pentaho Community Wiki
5.0 Configuring Pentaho for your Hadoop Distro and Version

How to set up and configure Pentaho (Kettle, Pentaho Data Integration, Pentaho Business Analytics Suite) for your specific Hadoop distribution.

This page applies to Pentaho Suite version 5.0 only. For 5.1, go here.
For Kettle 4.4 (or Pentaho BA suite 4.8), see this page.

Pentaho supports different versions of Hadoop distributions from many vendors, such as Apache, Cloudera, DataStax, Hortonworks, Intel, and MapR. How can Pentaho support so many Hadoop distributions? Pentaho uses an abstraction layer, called a shim, that connects to the different Hadoop distributions. A shim is a small library that intercepts API calls and either redirects them, handles them itself, or changes the calling parameters. Pentaho periodically develops new shims as vendors release new Hadoop distributions and versions. These big data shims are tested and certified by Pentaho engineers. The following steps will help you set up Pentaho to work with your Hadoop cluster.
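
The shim idea described above can be illustrated with a minimal sketch. This is not Pentaho's actual shim code; the class names, the `list_directory` method, and the two fake client libraries are all hypothetical and exist only to show how one common interface can redirect a call or change its parameters depending on the distribution.

```python
from abc import ABC, abstractmethod
from typing import List


class HadoopShim(ABC):
    """Hypothetical common interface that the calling application uses."""

    @abstractmethod
    def list_directory(self, path: str) -> List[str]:
        ...


class FakeClientA:
    """Stand-in for one distribution's client library (illustrative only)."""

    def ls(self, path):
        return [path + "/part-00000"]


class FakeClientB:
    """Stand-in for another distribution whose call takes different parameters."""

    def list_status(self, uri, recursive):
        return [{"name": uri + "/part-00000"}]


class ShimA(HadoopShim):
    def __init__(self):
        self._client = FakeClientA()

    def list_directory(self, path):
        # Redirect the call to the vendor library unchanged.
        return self._client.ls(path)


class ShimB(HadoopShim):
    def __init__(self):
        self._client = FakeClientB()

    def list_directory(self, path):
        # Change the calling parameters, then normalize the result.
        entries = self._client.list_status(path, recursive=False)
        return [entry["name"] for entry in entries]


# Hypothetical registry: the application picks a shim by name.
SHIMS = {"shim-a": ShimA, "shim-b": ShimB}


def load_shim(name):
    return SHIMS[name]()
```

The calling code only ever sees `HadoopShim`, so switching distributions means switching the shim name, not rewriting the caller.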

Determine the proper shim for your Hadoop Distro and version

Pentaho is pre-configured for Apache Hadoop 0.20.2. If you are using this distribution and version, no further configuration is required.

In the following table, click the tab of the Hadoop distribution that you are interested in, then locate the version of the distribution you want to use. Note the name of the corresponding shim and the minimum version of the Pentaho software that supports it.

For example, if you want to use Cloudera's CDH 4.2.1, click the Cloudera tab, then look in the Hadoop Version column. CDH4.2.x is supported by the cdh42 shim. You need Pentaho Business Analytics (or Pentaho Data Integration) version 5.0 or later installed to use this shim.


Pentaho Shim Support Matrix

    Apache

    | Hadoop Version | Shim | Pentaho Suite Ver+ | Download | Notes |
    |---|---|---|---|---|
    | 0.20.x | hadoop-20 | 5.0 | included | |
    | 1.0.x | NS* | | | Planned, see PDI-10984 |
    | 1.1.x | NS* | | | Not likely to be done in favor of 1.2.x, see PDI-9964 |
    | 1.2.x | NS* | | | Possibly in patch post 5.0 but not committed, see http://jira.pentaho.com/browse/PDI-10393 |
    | 2.x.x | NS* | | | Distro is Alpha |

    Go to Apache releases

    Cloudera

    | Hadoop Version | Shim | Pentaho Suite Ver+ | Download | Notes |
    |---|---|---|---|---|
    | CDH4.0, 4.0.1, 4.1, 4.1.1 | cdh4 | 5.0 | download | The cdh42 shim also supports this configuration |
    | CDH4.1.2 | cdh412 | 5.0 | download | The cdh42 shim also supports this configuration |
    | CDH4.1.3 | cdh413 | 5.0 | download | The cdh42 shim also supports this configuration |
    | CDH4.2.x | cdh42 | 5.0 | included | Backward compatible with all earlier cdh4.x distros |
    | CDH4.3 - CDH4.6 | cdh42 | 5.0 | included | |
    | CDH4.7 | ++cdh42 | 5.0 | included | ++Not yet QA tested, but minor releases rarely have issues, see PDI-12313 |
    | CDH5 | cdh50 | **5.0.4 | included with 5.0.6 | |

    Go to Cloudera releases

    NOTE: the cdh42 shim supports all versions of CDH from 4.0 through 4.6.x

    DataStax

    | Hadoop Version | Shim | Pentaho Suite Ver+ | Download | Notes |
    |---|---|---|---|---|
    | DSE 3.0.x | NS* | | | Possibly in patch post 5.0 but not committed, see PDI-8036 |
    | DSE 2.2.x | NS* | | | No current plans to support |

    Go to DataStax releases

    Hortonworks

    | Hadoop Version | Shim | Pentaho Suite Ver+ | Download | Notes |
    |---|---|---|---|---|
    | HDP 1.2.x | hdp12 | 4.8 + BD Plugin 1.3.2+ | download | |
    | HDP 1.3.x | hdp13 | 4.8 + BD Plugin 1.3.2+ | included | |
    | HDP 1.3 for Win | NS* | | | On hold; testing and support are waiting for customer demand. Vote here: PDI-10266 |
    | HDP 2.0 | hdp20 | **5.0.4 | included | |
    | HDP 2.1 | NS* | | | Support planned for 5.1, see PDI-11582 |

    Go to Hortonworks releases

    Intel

    | Hadoop Version | Shim | Pentaho Suite Ver+ | Download | Notes |
    |---|---|---|---|---|
    | IDH 2.3 | idh23 | 4.8 + BD Plugin 1.3.2+ | download | |

    Go to Intel releases

    MapR

    | Hadoop Version | Shim | Pentaho Suite Ver+ | Download | Notes |
    |---|---|---|---|---|
    | 1.1.3, 1.2.0 | mapr | 4.8+ | download | |
    | 2.0.x | NS* | | | No support planned, see PDI-9648 |
    | 2.1.x | mapr21 | 4.8 + BD Plugin 1.3.2+ | included | |
    | 3.0.x | mapr30 | **5.0.4 | included with 5.0.4 | |
    | 3.1.x | mapr31 | **5.0.4 | included with 5.0.6 | This shim is EE only and must be downloaded from Pentaho Support |
    | 4.0.x | NS* | | | MapR 4.0 is in Beta; support planned, see PDI-12091 |

    Go to MapR releases


    * NS - Not supported. See Hadoop Configurations for information on how to create or modify a shim to support your configuration.

    + Pentaho Suite Ver is the earliest version of the Pentaho suite that supports this shim. Subsequent Pentaho versions will also support this shim unless otherwise noted.

    ** 5.0.4 - Only supported with Big Data Plugin 5.0.4 or later. EE customers can upgrade to 5.0.4 by going to support.pentaho.com. CE users can upgrade by following the Upgrade Hadoop in Community Edition to 5.0.4 instructions.
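
As an illustration of how the matrix is read, the Cloudera rows can be expressed as a small lookup. This is only a sketch: the function name `cdh_shim` is hypothetical, the rules are transcribed from the Cloudera rows on this page, and versions that do not appear in the table return `None`. Note that the table marks cdh42 on CDH4.7 as not yet QA tested and cdh50 as requiring Big Data Plugin 5.0.4 or later.

```python
def cdh_shim(version):
    """Return the shim name listed for a CDH version string, or None.

    Hypothetical helper; rules are copied from the Cloudera rows of the
    support matrix on this page and cover nothing else.
    """
    parts = [int(p) for p in version.split(".")]
    parts += [0] * (3 - len(parts))  # "4.1" is treated as 4.1.0
    major, minor, patch = parts[:3]

    if major == 5:
        return "cdh50"  # needs Big Data Plugin 5.0.4 or later
    if major != 4:
        return None
    if (minor, patch) in {(0, 0), (0, 1), (1, 0), (1, 1)}:
        return "cdh4"  # the table notes that cdh42 also supports these
    if (minor, patch) == (1, 2):
        return "cdh412"
    if (minor, patch) == (1, 3):
        return "cdh413"
    if 2 <= minor <= 7:
        return "cdh42"  # 4.7 is listed as not yet QA tested
    return None


if __name__ == "__main__":
    for v in ("4.2.1", "4.1.2", "5", "3.5"):
        print(v, "->", cdh_shim(v))
```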


    If the Hadoop distribution you want is supported but not installed by default

    To download the shim from our support site, click the Download link for the shim in the table above. For a list of all available shims, go to 5.0.0 shims (for versions 5.0.0 to 5.0.3) or 5.0.4 shims. We recommend that you upgrade to 5.0.4 if possible.

    Go to Install Hadoop Distribution Shim for instructions on how to install the shim.

    If the Hadoop distribution you want is not listed as supported in the table

    • Look at the table of Jira cases below. The shim you want might be scheduled for development but not yet released.
    • You can request that Pentaho develop one by filling out a Jira ticket.
    • You can develop it yourself. See the Hadoop Configuration page for more information.

    Set Active Hadoop Distribution

    These steps apply to DI and BA Servers as well as the design tools Spoon, Report Designer, and Metadata Editor.

    Specify which Hadoop distribution (shim) you want to make active. You must do this for each Pentaho application that needs access to the Hadoop cluster. Only one distribution can be active at a time, so each time you change the distribution or version, you need to reset the active Hadoop distribution.

    1. Stop the application (e.g. Spoon, DI Server, Report Designer, BA Server, Metadata Editor) if it is running.
    2. Navigate to the pentaho-big-data-plugin folder. The location of this folder is different for each application:
      • DI Server - data-integration-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin
      • BA Server - biserver-ee/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin
      • Spoon - data-integration/plugins/pentaho-big-data-plugin
      • Report Designer - report-designer/plugins/pentaho-big-data-plugin
      • Metadata Editor - metadata-editor/plugins/pentaho-big-data-plugin
    3. Edit the plugin.properties file.
    4. Set the active.hadoop.configuration property to the name of the shim you want to make active. For example, if the name of the shim is cdh42, the line would look like this: active.hadoop.configuration=cdh42
    5. Save and close the plugin.properties file.
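
Steps 3 to 5 above can also be scripted. The following is a minimal sketch, not an official Pentaho tool: the function name `set_active_shim` is hypothetical, and it assumes plugin.properties is a standard Java properties file in which the property appears at most once on its own line. Stop the application before running it, as in step 1.

```python
import re
from pathlib import Path

PROPERTY = "active.hadoop.configuration"


def set_active_shim(properties_file, shim):
    """Set active.hadoop.configuration in a plugin.properties file.

    Hypothetical helper. Replaces the existing assignment if there is one,
    otherwise appends a new line. All other lines are left untouched.
    """
    path = Path(properties_file)
    # latin-1 round-trips every byte, which suits Java properties files.
    lines = path.read_text(encoding="latin-1").splitlines()
    pattern = re.compile(r"^\s*" + re.escape(PROPERTY) + r"\s*[=:]")

    new_line = PROPERTY + "=" + shim
    replaced = False
    result = []
    for line in lines:
        if pattern.match(line):
            result.append(new_line)
            replaced = True
        else:
            result.append(line)
    if not replaced:
        result.append(new_line)

    path.write_text("\n".join(result) + "\n", encoding="latin-1")
```

For Spoon, a call might look like `set_active_shim("data-integration/plugins/pentaho-big-data-plugin/plugin.properties", "cdh42")`, using the folder from step 2. Commented-out lines are not matched, so an inactive example entry in the file is left as it is.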

    YARN shims require additional configuration.  See Additional Configuration for YARN Shims for more detail. 

    MapR users need to do further configuration as described in Additional Configuration for MapR Shims.

    CDH 5 users who want to use MapReduce 1 instead of MapReduce 2 should follow the instructions in Additional Configuration for using MR1 with CDH5.
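
Because the active distribution is set separately for each application, it can be useful to check that all of them point at the same shim. The sketch below reads the property from each plugin folder listed in step 2. It assumes, purely for illustration, that the applications are installed under one common root directory; `report_active_shims` and `read_active_shim` are hypothetical names.

```python
from pathlib import Path

# Plugin folders as listed in step 2, relative to an assumed install root.
PLUGIN_DIRS = {
    "DI Server": "data-integration-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin",
    "BA Server": "biserver-ee/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin",
    "Spoon": "data-integration/plugins/pentaho-big-data-plugin",
    "Report Designer": "report-designer/plugins/pentaho-big-data-plugin",
    "Metadata Editor": "metadata-editor/plugins/pentaho-big-data-plugin",
}


def read_active_shim(properties_file):
    """Return the value of active.hadoop.configuration, or None if unset."""
    text = Path(properties_file).read_text(encoding="latin-1")
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(("#", "!")):
            continue  # skip properties-file comments
        key, sep, value = stripped.partition("=")
        if sep and key.strip() == "active.hadoop.configuration":
            return value.strip()
    return None


def report_active_shims(install_root):
    """Map each installed application to its active shim name."""
    report = {}
    for app, relative_dir in PLUGIN_DIRS.items():
        props = Path(install_root) / relative_dir / "plugin.properties"
        if props.is_file():
            report[app] = read_active_shim(props)
    return report
```

Applications that are not installed under the given root are simply left out of the report.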

    Open JIRA Cases for Hadoop Distribution Support


    View these issues in JIRA

    Next Steps

    Now that you've configured Pentaho for your Hadoop distribution, there are many things you can do.  Here are a few links to get you started!

    Want to switch gears and read something a little different? Check out these articles on the evolution of Hadoop.
