This section contains a series of how-tos that demonstrate the integration between Pentaho and Hadoop using a sample weblog dataset.
The how-tos are organized by topic with each set explaining various techniques for loading, transforming, extracting and reporting on data within a Hadoop cluster. You are encouraged to perform the how-tos in order as the output of one is sometimes used as the input of another. However, if you would like to jump to a how-to in the middle of the flow, instructions for preparing input data are provided.
Hadoop Topics
- Configuring Pentaho for your Hadoop Distro and Version — How to set up and configure Kettle for your specific Hadoop distribution.
- Loading Data into a Hadoop Cluster — How to load data into HDFS (the Hadoop Distributed File System), Hive, and HBase.
- Loading Data into HDFS — How to use a PDI job to move a file into HDFS.
- Loading Data into Hive — How to use a PDI job to load a data file into a Hive table.
- Loading Data into HBase — How to use a PDI transformation that sources data from a flat file and writes to an HBase table.
- Transforming Data within a Hadoop Cluster — How to transform data within the Hadoop cluster using Pentaho MapReduce, Hive, and Pig.
- Using Pentaho MapReduce to Parse Weblog Data — How to use Pentaho MapReduce to convert raw weblog data into parsed, delimited records.
- Using Pentaho MapReduce to Generate an Aggregate Dataset — How to use Pentaho MapReduce to transform and summarize detailed data into an aggregate dataset.
- Transforming Data within Hive — How to read data from a Hive table, transform it, and write it to a Hive table within the workflow of a PDI job.
- Transforming Data with Pig — How to invoke a Pig script from a PDI job.
- Extracting Data from the Hadoop Cluster — How to extract data from Hadoop using HDFS, Hive, and HBase.
- Extracting Data from HDFS to Load an RDBMS — How to use a PDI transformation to extract data from HDFS and load it into an RDBMS table.
- Extracting Data from Hive to Load an RDBMS — How to use a PDI transformation to extract data from Hive and load it into an RDBMS table.
- Extracting Data from HBase to Load an RDBMS — How to use a PDI transformation to extract data from HBase and load it into an RDBMS table.
- Extracting Data from Snappy Compressed Files — How to configure client-side PDI so that files compressed using the Snappy codec can be decompressed using the Hadoop file input or Text file input step.
- Reporting on Data in Hadoop — How to report on data that is resident within the Hadoop cluster.
- Reporting on HDFS File Data — How to create a report that sources data from an HDFS file.
- Reporting on HBase Data — How to create a report that sources data from HBase.
- Reporting on Hive Data — How to create a report that sources data from Hive.
- Unit Test Pentaho MapReduce Transformation — How to unit test the mapper and reducer transformations that make up a Pentaho MapReduce job.
- Simple Chrome Extension to browse HDFS volumes — How to add a Chrome Omnibox extension to support HDFS browsing.
- Advanced Pentaho MapReduce — Advanced how-tos for developing Pentaho MapReduce.
- Using Compression with Pentaho MapReduce — How to use compression with Pentaho MapReduce.
- Using a Custom Partitioner in Pentaho MapReduce — How to use a custom partitioner in Pentaho MapReduce.
- Using a Custom Input or Output Format in Pentaho MapReduce — How to use a custom Input or Output Format in Pentaho MapReduce.
- Processing HBase data in Pentaho MapReduce using TableInputFormat — How to use HBase TableInputFormat in Pentaho MapReduce.
Pre-Requisites
To perform all of the how-tos in this section, you will need the components listed below. Because not every how-to uses every component (e.g. HBase, Hive, Report Designer), each how-to identifies its specific requirements. This section lists all of the components, with some additional configuration and installation tips.
Hadoop
A single-node local cluster is sufficient for these exercises, but a larger and/or remote configuration will also work. You will need to know the addresses and ports for Hadoop. For an example of how to set up a single node, see: http://hadoop.apache.org/common/docs/current/single_node_setup.html
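Before starting, it can help to confirm that the Hadoop addresses and ports you collected actually accept connections from the machine where PDI runs. The following is a minimal sketch in Python, not part of Pentaho or Hadoop. The helper names `parse_endpoint`, `is_reachable`, and `check_endpoints` are hypothetical, and any host or port you pass in is your own value, since ports vary by distribution and configuration.

```python
import socket


def parse_endpoint(endpoint, default_port=None):
    """Split a 'host:port' string into a (host, port) tuple.

    If no port is given, fall back to default_port (hypothetical helper).
    """
    host, sep, port = endpoint.rpartition(":")
    if not sep:
        if default_port is None:
            raise ValueError("no port given for endpoint: %r" % endpoint)
        return endpoint, default_port
    return host, int(port)


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_endpoints(endpoints):
    """Map each service label to True/False depending on reachability.

    endpoints is a dict such as {"label": "host:port"}; the labels and
    addresses are whatever your own cluster uses.
    """
    results = {}
    for label, endpoint in endpoints.items():
        host, port = parse_endpoint(endpoint)
        results[label] = is_reachable(host, port)
    return results
```

For example, `check_endpoints({"namenode": "localhost:8020"})` returns a dictionary with one boolean per label. The address shown is only a placeholder. A successful TCP connection shows that something is listening on the port, not that the service behind it is correctly configured.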
Pentaho Data Integration
PDI will be the primary development environment for the how-tos. You will need version 4.3 or above. You can find instructions to install PDI for Hadoop in the Configure Pentaho for Cloudera and Other Hadoop Versions guide.
Pentaho Hadoop Distribution
A Hadoop node distribution of the Pentaho Data Integration (PDI) tool. Pentaho Hadoop Distribution (referred to as PHD from this point on) allows you to execute Pentaho MapReduce jobs on the Hadoop cluster.
You can find instructions to download and install the software in the Configure Pentaho for Cloudera and Other Hadoop Versions guide.
Pentaho Report Designer
A desktop installation of the Pentaho Report Designer tool, with the PDI jars in the lib directory.
You can find instructions to download and install Report Designer in the Configure Pentaho for Cloudera and Other Hadoop Versions guide.
Hive
A supported version of Hive. Hive is a MapReduce abstraction layer that provides SQL-like access to Hadoop data.
You can find a Hive Getting Started guide here: https://cwiki.apache.org/confluence/display/Hive/GettingStarted
HBase
A MapR supported version of HBase. HBase is a NoSQL database that leverages Hadoop storage.
Sample Data
The how-tos in this guide were built with sample weblog data. The following files are used and/or generated by the how-tos in this guide. Each how-to explains which file(s) it requires.
| File Name | Content |
| weblogs_rebuild.txt.zip | Unparsed, raw weblog data |
| weblogs_parse.txt.zip | Tab-delimited, parsed weblog data |
| weblogs_hive.txt.zip | Tab-delimited, aggregated weblog data for a Hive weblogs_agg table |
| weblogs_aggregate.txt.zip | Tab-delimited, aggregated weblog data |
| weblogs_hbase.txt.zip | Prepared data for HBase load |
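If you want to inspect one of the zipped sample files before using it in a how-to, the following Python sketch reads the first few rows directly from the archive. It is an illustration only: the function name `preview_rows` is hypothetical, and it assumes that the archive's first non-directory member is the data file and that the file is UTF-8 text. The tab delimiter matches the files described above as tab-delimited; the raw weblog file is not delimited, so each of its lines will usually come back as a single field.

```python
import csv
import io
import itertools
import zipfile


def preview_rows(zip_path, limit=5, delimiter="\t"):
    """Return up to `limit` rows from the first file inside a zip archive.

    Assumption: the first non-directory member of the archive is the
    data file, encoded as UTF-8 text.
    """
    with zipfile.ZipFile(zip_path) as archive:
        members = [name for name in archive.namelist() if not name.endswith("/")]
        if not members:
            return []
        with archive.open(members[0]) as raw:
            # Wrap the binary stream so the csv module receives text lines.
            text = io.TextIOWrapper(raw, encoding="utf-8", errors="replace", newline="")
            reader = csv.reader(text, delimiter=delimiter, quoting=csv.QUOTE_NONE)
            return list(itertools.islice(reader, limit))
```

Calling `preview_rows("weblogs_parse.txt.zip", limit=3)` from the directory that holds the download returns a list of up to three rows, each a list of field strings.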