This section contains a series of How-Tos that demonstrate the integration between Pentaho and MapR using a sample weblog dataset.
The how-tos are organized by topic, with each set explaining various techniques for loading, transforming, extracting, and reporting on data within a MapR cluster. You are encouraged to perform the how-tos in order, as the output of one is sometimes used as the input of another. However, if you would like to jump to a how-to in the middle of the flow, instructions for preparing its input data are provided.
- Loading Data into a MapR Cluster — How to load data into CLDB (MapR’s distributed file system), Hive and HBase.
- Loading Data into the MapR filesystem — How to use a PDI job to move a file into the MapR filesystem.
- Loading Data into MapR Hive — How to use a PDI job to load a data file into a Hive table.
- Loading Data into MapR HBase — How to use a PDI transformation that sources data from a flat file and writes to an HBase table.
- Transforming Data within a MapR Cluster — How to leverage the massively parallel, fault tolerant MapR processing engine to transform resident cluster data.
- Using Pentaho MapReduce to Parse Weblog Data in MapR — How to use Pentaho MapReduce to convert raw weblog data into parsed, delimited records.
- Using Pentaho MapReduce to Generate an Aggregate Dataset in MapR — How to use Pentaho MapReduce to transform and summarize detailed data into an aggregate dataset.
- Transforming Data within Hive in MapR — How to read data from a Hive table, transform it, and write it to a Hive table within the workflow of a PDI job.
- Transforming Data with Pig in MapR — How to invoke a Pig script from a PDI job.
- Extracting Data from the MapR Cluster — How to extract data from the MapR cluster and load it into an RDBMS table.
- Extracting Data from CLDB to Load an RDBMS — How to use a PDI transformation to extract data from MapR CLDB and load it into an RDBMS table.
- Extracting Data from Hive to Load an RDBMS in MapR — How to use a PDI transformation to extract data from Hive and load it into an RDBMS table.
- Extracting Data from HBase to Load an RDBMS in MapR — How to use a PDI transformation to extract data from HBase and load it into an RDBMS table.
- Reporting on Data in the MapR Cluster — How to report on data that is resident within the MapR cluster.
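As a conceptual illustration of the weblog-parsing step listed above, the sketch below turns one raw log line into a tab-delimited record. Note that the actual how-to performs this work with a PDI transformation running under Pentaho MapReduce, not with Python, and the Common Log Format field layout shown here is an assumption — the real weblogs_rebuild.txt layout may differ.

```python
import re

# Common Log Format: IP, identity, user, timestamp, request, status, bytes.
# This layout is an assumption for illustration; the sample data may differ.
LOG_PATTERN = re.compile(
    r'(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+)'
)

def parse_line(line):
    """Turn one raw weblog line into a tab-delimited record, or None if unparsable."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    ip, _ident, user, ts, method, path, proto, status, size = m.groups()
    return "\t".join([ip, user, ts, method, path, proto, status, size])

raw = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
print(parse_line(raw))
```

In the actual how-to, the equivalent logic lives in the mapper transformation of a Pentaho MapReduce job, so it runs in parallel across the MapR cluster.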
To perform all of the how-tos in this section, you will need the components listed below. Not every how-to uses every component (e.g. HBase, Hive, Report Designer), so each how-to identifies its specific requirements. The rest of this section enumerates all of the components, with some additional configuration and installation tips.
A MapR cluster. A single-node local cluster is sufficient for these exercises, but a larger or remote configuration will also work. You will need to know the addresses and ports of your MapR services.
These guides were developed using the MapR M3 distribution version 1.2. You can find MapR downloads here: http://mapr.com/download
A Hadoop node distribution of the Pentaho Data Integration (PDI) tool. The Pentaho Hadoop Distribution (referred to as PHD from this point on) allows you to execute Pentaho MapReduce jobs on the MapR cluster. Downloads are available here and configuration instructions are here.
Pentaho Report Designer (PRD) is a desktop tool for creating highly formatted reports that can be exported to many popular formats. Reports created with PRD can be published to a Pentaho BI Server so they can be accessed using a browser. Downloads are available here and configuration instructions are here.
A MapR supported version of Hive. Hive is a Map/Reduce abstraction layer that provides SQL-like access to MapR data.
You can find instructions to install Hive for MapR here: http://mapr.com/doc/display/MapR/Hive
A MapR supported version of HBase. HBase is a NoSQL database that leverages the MapR filesystem.
You can find instructions to install HBase for MapR here: http://mapr.com/doc/display/MapR/HBase
The how-tos in this guide were built with sample weblog data. The following files are used and/or generated by the how-tos in this guide; each how-to explains which file(s) it requires.
| File | Description |
| --- | --- |
| weblogs_rebuild.txt.zip | Unparsed, raw weblog data |
| weblogs_parse.txt.zip | Tab-delimited, parsed weblog data |
| weblogs_hive.txt.zip | Tab-delimited, aggregated weblog data for a Hive weblogs_agg table |
| weblogs_aggregate.txt.zip | Tab-delimited, aggregated weblog data |
| weblogs_hbase.txt.zip | Prepared data for HBase load |
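To illustrate how an aggregate dataset relates to parsed weblog records, here is a minimal sketch of the kind of roll-up the aggregation how-to performs. The column layout and grouping keys below are assumptions for illustration only; the actual how-tos define their own fields inside PDI transformations.

```python
from collections import defaultdict

# Hypothetical parsed records: ip \t year \t month \t page.
# This column layout is an assumption, not taken from the sample files.
parsed = [
    "10.0.0.1\t2012\t01\t/home",
    "10.0.0.1\t2012\t01\t/products",
    "10.0.0.2\t2012\t02\t/home",
]

def aggregate(records):
    """Count page hits per (ip, year, month) — the shape of an aggregate dataset."""
    counts = defaultdict(int)
    for rec in records:
        ip, year, month, _page = rec.split("\t")
        counts[(ip, year, month)] += 1
    return counts

agg = aggregate(parsed)
for (ip, year, month), n in sorted(agg.items()):
    print(f"{ip}\t{year}\t{month}\t{n}")
```

In the cluster, this grouping and counting would be expressed as the reducer side of a Pentaho MapReduce job (or as a Hive query), so that the summarization runs where the data lives.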