Hitachi Vantara Pentaho Community Wiki

Processing HBase data in Pentaho MapReduce using TableInputFormat

How to use HBase TableInputFormat in Pentaho MapReduce.

This guide explains how to configure Pentaho MapReduce to use the TableInputFormat for reading data from HBase and how to configure a map-reduce transformation to process that data using the HBaseRowDecoder step.

Prerequisites

To follow this guide, you will need the following:

  • HBase
  • Hadoop configured to access HBase
  • Pentaho Data Integration

Step-By-Step Instructions

Using HBaseRowDecoder

The HBaseRowDecoder step is designed specifically for use in map-reduce transformations, where it decodes the key and value data output by the TableInputFormat. The key is the HBase row key, and the value is an HBase "Result" object containing all the column values for that row.

First, configure a Pentaho MapReduce input step so that both the incoming key and value fields have the type "Serializable".

Next, specify the incoming row key and HBase result fields in the HBaseRowDecoder step.

Finally, define or load a mapping using the Mapping editor tab.

Once defined (or loaded), this mapping is stored in the transformation metadata.
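
To make the role of the mapping more concrete, the following is a minimal conceptual sketch in Python. It is not Pentaho or HBase code. It models the HBase "Result" as a plain dictionary keyed by "family:qualifier" with raw byte values, and it uses a hypothetical mapping that gives each column an output field name and a type. The names decode_row, decode_value and MAPPING, the column names, and the choice of encodings are all assumptions made for illustration.

```python
import struct

# Hypothetical mapping: HBase column -> (output field name, type name).
MAPPING = {
    "info:name": ("name", "String"),
    "info:age": ("age", "Integer"),
}


def decode_value(raw, type_name):
    """Decode raw bytes according to the type declared in the mapping."""
    if type_name == "String":
        return raw.decode("utf-8")
    if type_name == "Integer":
        # Assumes a 4-byte big-endian signed integer encoding.
        return struct.unpack(">i", raw)[0]
    raise ValueError(f"unsupported type: {type_name}")


def decode_row(row_key, result, mapping):
    """Turn one (row key, result) pair into a flat record of named fields."""
    record = {"key": row_key.decode("utf-8")}
    for column, (alias, type_name) in mapping.items():
        raw = result.get(column)
        # A column that is absent from the row becomes a null field.
        record[alias] = None if raw is None else decode_value(raw, type_name)
    return record


row = decode_row(
    b"user-1",
    {"info:name": b"Ada", "info:age": struct.pack(">i", 36)},
    MAPPING,
)
print(row)
```

The point of the sketch is that the mapping, rather than the data itself, determines how each column's bytes are interpreted, which is why it has to be defined or loaded before the decoder can produce typed fields.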

Configure the Pentaho MapReduce Job Entry Step

To ensure that input splits are created using the TableInputFormat, configure the Input Format and Input Path fields of the Job Setup tab as shown in the following screenshot.

The following table shows various properties that can be supplied in the User Defined tab of the step in order to configure the scan performed by the TableInputFormat. Entries shown in bold are mandatory.

Property                                Description
hbase.mapred.inputtable                 Name of the HBase table to read from
hbase.mapred.tablecolumns               Space-delimited list of columns in ColFam:ColName format (ColName can be omitted to read all columns from a family)
hbase.mapreduce.scan.cachedrows         Number of rows for caching that will be passed to scanners
hbase.mapreduce.scan.timestamp          Timestamp used to filter for column values with a specific timestamp
hbase.mapreduce.scan.timerange.start    Starting timestamp of the time range to filter on
hbase.mapreduce.scan.timerange.end      Ending timestamp of the time range to filter on
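
Because these properties are entered as plain strings, it can help to check them before pasting them into the User Defined tab. The following is a minimal Python sketch of that idea, not part of Pentaho or HBase. The helper names parse_columns and build_scan_properties, the table name "users", and the column names are hypothetical. The sketch also assumes that the table name is required, that a column family may be written with or without a trailing colon, and that timestamps are epoch milliseconds.

```python
def parse_columns(spec):
    """Split a space-delimited column list into (family, qualifier) pairs."""
    columns = []
    for item in spec.split():
        family, _, qualifier = item.partition(":")
        if not family:
            raise ValueError(f"missing column family in {item!r}")
        # An omitted ColName means "all columns in this family".
        columns.append((family, qualifier or None))
    return columns


def build_scan_properties(table, columns=None, cached_rows=None, time_range=None):
    """Assemble the property name/value pairs as strings."""
    if not table:
        raise ValueError("hbase.mapred.inputtable must name a table")
    props = {"hbase.mapred.inputtable": table}
    if columns:
        parse_columns(columns)  # raises ValueError if an entry is malformed
        props["hbase.mapred.tablecolumns"] = columns
    if cached_rows is not None:
        if cached_rows <= 0:
            raise ValueError("cached row count must be positive")
        props["hbase.mapreduce.scan.cachedrows"] = str(cached_rows)
    if time_range is not None:
        start, end = time_range
        # This sketch rejects empty or inverted ranges up front.
        if start >= end:
            raise ValueError("time range start must be before end")
        props["hbase.mapreduce.scan.timerange.start"] = str(start)
        props["hbase.mapreduce.scan.timerange.end"] = str(end)
    return props


props = build_scan_properties(
    "users", columns="info:name stats", cached_rows=500, time_range=(1000, 2000)
)
for name, value in props.items():
    print(f"{name} = {value}")
```

Each printed name and value pair corresponds to one row that would be entered in the User Defined tab.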
