Hitachi Vantara Pentaho Community Wiki

Configure Pentaho for Cloudera and Other Hadoop Versions

Outdated Material

This page contains outdated material for an old Kettle version (4.3) and has been archived. If you found it via search, verify that it covers what you need.

Client Configuration

These instructions are for Hadoop distributions other than MapR. If you are using MapR, go to the Configure Pentaho for MapR page. If you are using Cloudera CDH4 MRv1, go to the Configure Pentaho for Cloudera CDH4 page.

Kettle Client

  1. Download and extract Kettle CE from the Downloads page.
    The Kettle Client comes pre-configured for Apache Hadoop 0.20.2. If you are using this distro and version, no further configuration is required.
  2. Configure PDI Client for a different version of Hadoop
    1. Delete $PDI_HOME/libext/bigdata/hadoop-0.20.2-core.jar
    2. Replace the deleted core jar with the one from your cluster. For example, if you are using Cloudera CDH3u3:
      Copy $HADOOP_HOME/hadoop-core-0.20.2-cdh3u3.jar to $PDI_HOME/libext/bigdata
    3. For Hadoop 0.20.205 you also need Apache Commons Configuration in your set of PDI libraries. In that case, copy commons-configuration-1.7.jar to $PDI_HOME/libext/commons
    4. For Cloudera CDH3 Update 3 you also need to copy $HADOOP_HOME/lib/guava-r09-jarjar.jar to $PDI_HOME/libext/bigdata.
  3. Apply the Hadoop client configuration by placing the core-site.xml, hdfs-site.xml, and mapred-site.xml files in the $PDI_HOME directory.
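The Kettle Client steps above can be sketched as a shell session. This is a sketch only: temporary sandbox directories stand in for a real PDI install and Hadoop distribution, and the CDH3u3 jar names follow the example in step 2.

```shell
# Sandbox stand-ins for a real install -- substitute your actual
# $PDI_HOME and $HADOOP_HOME paths in practice.
PDI_HOME=$(mktemp -d)
HADOOP_HOME=$(mktemp -d)
mkdir -p "$PDI_HOME/libext/bigdata" "$HADOOP_HOME/lib" "$HADOOP_HOME/conf"
touch "$PDI_HOME/libext/bigdata/hadoop-0.20.2-core.jar"   # stock jar shipped with Kettle
touch "$HADOOP_HOME/hadoop-core-0.20.2-cdh3u3.jar"        # core jar from the cluster (CDH3u3)
touch "$HADOOP_HOME/lib/guava-r09-jarjar.jar"
touch "$HADOOP_HOME/conf/core-site.xml" "$HADOOP_HOME/conf/hdfs-site.xml" \
      "$HADOOP_HOME/conf/mapred-site.xml"

# Step 2.1: delete the stock Apache Hadoop 0.20.2 core jar
rm "$PDI_HOME/libext/bigdata/hadoop-0.20.2-core.jar"

# Step 2.2: copy in the core jar from your cluster (CDH3u3 shown)
cp "$HADOOP_HOME/hadoop-core-0.20.2-cdh3u3.jar" "$PDI_HOME/libext/bigdata/"

# Step 2.4: CDH3 Update 3 also needs the guava jarjar
cp "$HADOOP_HOME/lib/guava-r09-jarjar.jar" "$PDI_HOME/libext/bigdata/"

# Step 3: place the client configuration files in $PDI_HOME
cp "$HADOOP_HOME/conf/core-site.xml" "$HADOOP_HOME/conf/hdfs-site.xml" \
   "$HADOOP_HOME/conf/mapred-site.xml" "$PDI_HOME/"
```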

Pentaho Report Designer (PRD)

  1. Download and extract PRD from the Downloads page.
    PRD comes pre-configured for Apache Hadoop 0.20.2. If you are using this distro and version, no further configuration is required.
  2. Configure PRD for a different version of Hadoop
    1. Delete $PRD_HOME/lib/bigdata/hadoop-0.20.2-core.jar
    2. Copy $HADOOP_HOME/hadoop-core.jar from your distribution into $PRD_HOME/lib/bigdata
    3. For Hadoop 0.20.205 you also need Apache Commons Configuration in your set of PRD libraries. In that case, copy commons-configuration-1.7.jar to $PRD_HOME/lib/bigdata
    4. For Cloudera CDH3 Update 3 you also need to copy $HADOOP_HOME/lib/guava-r09-jarjar.jar to $PRD_HOME/lib.
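The same swap for PRD, sketched for the Hadoop 0.20.205 case. Sandbox directories again stand in for real installs, and the hadoop-core-0.20.205.0.jar and commons-configuration source locations are assumptions based on the step descriptions; adjust them to match your distribution.

```shell
# Sandbox stand-ins (replace with your real $PRD_HOME and $HADOOP_HOME).
PRD_HOME=$(mktemp -d)
HADOOP_HOME=$(mktemp -d)
mkdir -p "$PRD_HOME/lib/bigdata" "$HADOOP_HOME/lib"
touch "$PRD_HOME/lib/bigdata/hadoop-0.20.2-core.jar"      # stock jar shipped with PRD
touch "$HADOOP_HOME/hadoop-core-0.20.205.0.jar"           # assumed 0.20.205 jar name
touch "$HADOOP_HOME/lib/commons-configuration-1.7.jar"    # assumed location

# Step 2.1: delete the stock core jar
rm "$PRD_HOME/lib/bigdata/hadoop-0.20.2-core.jar"

# Step 2.2: copy in the core jar from your distribution
cp "$HADOOP_HOME/hadoop-core-0.20.205.0.jar" "$PRD_HOME/lib/bigdata/"

# Step 2.3: Hadoop 0.20.205 also needs Apache Commons Configuration
cp "$HADOOP_HOME/lib/commons-configuration-1.7.jar" "$PRD_HOME/lib/bigdata/"
```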

Pentaho BI Server

  1. Download and extract the BI Server from the Downloads page.
    The BI Server comes pre-configured for Apache Hadoop 0.20.2. If you are using this distro and version, no further configuration is required.
  2. Configure BI Server for a different version of Hadoop
    1. Delete $BI_SERVER_HOME/tomcat/webapps/pentaho/WEB-INF/lib/hadoop-0.20.2-core.jar
    2. Copy $HADOOP_HOME/hadoop-core.jar from your distribution into $BI_SERVER_HOME/tomcat/webapps/pentaho/WEB-INF/lib/
    3. For Hadoop 0.20.205 you also need Apache Commons Configuration in the BI Server's set of libraries. In that case, copy commons-configuration-1.7.jar to $BI_SERVER_HOME/tomcat/webapps/pentaho/WEB-INF/lib
    4. For Cloudera CDH3 Update 3 you also need to copy $HADOOP_HOME/lib/guava-r09-jarjar.jar to $BI_SERVER_HOME/tomcat/webapps/pentaho/WEB-INF/lib.
  3. Place the Hadoop configuration files (hdfs-site.xml, core-site.xml, mapred-site.xml) into $BI_SERVER_HOME/tomcat/webapps/pentaho/WEB-INF/classes
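The BI Server steps can be sketched the same way; the jar swap targets the webapp's WEB-INF/lib and the cluster configuration files go into WEB-INF/classes. Sandbox directories stand in for real installs, and the $HADOOP_HOME/conf location for the site files is an assumption.

```shell
# Sandbox stand-ins (replace with your real $BI_SERVER_HOME and $HADOOP_HOME).
BI_SERVER_HOME=$(mktemp -d)
HADOOP_HOME=$(mktemp -d)
WEBINF="$BI_SERVER_HOME/tomcat/webapps/pentaho/WEB-INF"
mkdir -p "$WEBINF/lib" "$WEBINF/classes" "$HADOOP_HOME/conf"
touch "$WEBINF/lib/hadoop-0.20.2-core.jar"                # stock jar shipped with the BI Server
touch "$HADOOP_HOME/hadoop-core.jar"                      # core jar from your distribution
touch "$HADOOP_HOME/conf/core-site.xml" "$HADOOP_HOME/conf/hdfs-site.xml" \
      "$HADOOP_HOME/conf/mapred-site.xml"

# Step 2: swap the core jar in the webapp's lib directory
rm "$WEBINF/lib/hadoop-0.20.2-core.jar"
cp "$HADOOP_HOME/hadoop-core.jar" "$WEBINF/lib/"

# Step 3: put the cluster's client configuration on the webapp classpath
cp "$HADOOP_HOME/conf/core-site.xml" "$HADOOP_HOME/conf/hdfs-site.xml" \
   "$HADOOP_HOME/conf/mapred-site.xml" "$WEBINF/classes/"
```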

Known Issues

  1. When using the HBase Input or Output steps from within a Pentaho MapReduce job, you must have the HBase jar on the HADOOP_CLASSPATH of each node running a TaskTracker.
  2. When using CDH3u1 and above, the Hive JDBC driver fails to retrieve the last row. You must replace the Hive JDBC driver in Kettle's libext/bigdata/JDBC directory and PRD's lib/jdbc directory with the hive-jdbc-0.7.0-pentaho-SNAPSHOT.jar driver.
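For known issue 1, one common way to extend HADOOP_CLASSPATH is via hadoop-env.sh on each TaskTracker node. The HBase jar path below is purely an assumption for illustration; use the actual jar location on your nodes.

```shell
# Hypothetical line for $HADOOP_HOME/conf/hadoop-env.sh on each TaskTracker
# node. The /usr/lib/hbase/hbase.jar path is an assumed example location.
# Prepends the HBase jar while preserving any existing classpath entries.
export HADOOP_CLASSPATH="/usr/lib/hbase/hbase.jar${HADOOP_CLASSPATH:+:$HADOOP_CLASSPATH}"
```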