Hitachi Vantara Pentaho Community Wiki
Transforming Data within Hive in MapR
How to read data from a Hive table, transform it, and write it to a Hive table within the workflow of a PDI job.


In order to follow along with this how-to guide you will need the following:

  • MapR
  • Pentaho Data Integration
  • Hive

Sample Files

The source data for this guide will reside in a Hive table called weblogs. If you have previously completed the Loading Data into MapR Hive guide, then you can skip to #Create a Database Connection to Hive. If not, you will need the following data file and must perform the #Create a Hive Table instructions before proceeding.
The sample data file needed for the #Create a Hive Table instructions is:

File Name: weblogs_parse.txt (tab-delimited, parsed weblog data)


Upload the sample file with the following commands:

Code Block
hadoop fs -mkdir /weblogs
hadoop fs -mkdir /weblogs/parse
hadoop fs -put weblogs_parse.txt /weblogs/parse/part-00000

Step-By-Step Instructions


  1. Start MapR if it is not already running.
  2. Start the Hive Server if it is not already running.

Create a Hive Table

From the Hive shell, create the weblogs table:
Code Block

create table weblogs (
    client_ip    string,
    full_request_date string,
    day    string,
    month    string,
    month_num int,
    year    string,
    hour    string,
    minute    string,
    second    string,
    timezone    string,
    http_verb    string,
    uri    string,
    http_status_code    string,
    bytes_returned        string,
    referrer        string,
    user_agent    string)
row format delimited
fields terminated by '\t';
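The table's sixteen tab-separated columns correspond one-to-one to the fields in each line of weblogs_parse.txt. As a rough illustration only (the sample line below is invented, not taken from the real data file), splitting a line on tabs should yield exactly these fields:

```python
# Column names from the weblogs table DDL above, in order.
COLUMNS = [
    "client_ip", "full_request_date", "day", "month", "month_num",
    "year", "hour", "minute", "second", "timezone", "http_verb",
    "uri", "http_status_code", "bytes_returned", "referrer", "user_agent",
]

def parse_weblog_line(line):
    """Split one tab-delimited weblog line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError("expected %d fields, got %d" % (len(COLUMNS), len(fields)))
    return dict(zip(COLUMNS, fields))

# Hypothetical sample line matching the table layout.
sample = "\t".join([
    "192.168.0.1", "2012-01-10 12:01:48", "10", "Jan", "1", "2012",
    "12", "01", "48", "GMT", "GET", "/index.html", "200", "1024",
    "-", "Mozilla/5.0",
])
row = parse_weblog_line(sample)
print(row["client_ip"], row["month"], row["year"])
# 192.168.0.1 Jan 2012
```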


Load the sample data by copying the file into the table's warehouse directory:

Code Block

hadoop fs -cp /weblogs/parse/part-00000 /user/hive/warehouse/weblogs/


Create a Database Connection to Hive

If you already have a shared Hive Database Connection defined within PDI then this task may be skipped.


  1. Connection Name: Enter 'Hive'
  2. Connection Type: Select 'Hadoop Hive'
  3. Host Name and Port Number: Enter your connection information. For a local single-node cluster, use 'localhost' and port '10000'.
  4. Database Name: Enter 'Default'
  5. Click 'Test' to test the connection.
  6. If the test is successful, click 'OK' to close the Database Connection window.
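Behind the dialog, the host, port, and database fields combine into a Hive JDBC URL. A minimal sketch of that mapping, assuming HiveServer2 and its 'jdbc:hive2' scheme (older Hive Server installations use 'jdbc:hive' instead):

```python
def hive_jdbc_url(host, port, database, scheme="hive2"):
    """Build a Hive JDBC URL from the Database Connection dialog fields.

    Assumes the hive2 scheme (HiveServer2); Hive stores database names
    in lowercase, so 'Default' becomes 'default' in the URL.
    """
    return "jdbc:%s://%s:%d/%s" % (scheme, host, port, database.lower())

# Values from the dialog above: localhost, port 10000, database 'Default'.
url = hive_jdbc_url("localhost", 10000, "Default")
print(url)
# jdbc:hive2://localhost:10000/default
```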

Create a Job to Aggregate Web Log Data into a Hive Table

In this task you will create a job that runs a Hive script to build an aggregate table, weblogs_agg, using the detailed data found in the Hive weblogs table. The new Hive weblogs_agg table will contain a count of page views for each IP address by month and year.

Speed Tip: You can download the Kettle job aggregate_hive.kjb already completed.


Code Block

create table weblogs_agg as
select
  client_ip
, year
, month
, month_num
, count(*) as pageviews
from weblogs
group by client_ip, year, month, month_num;
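The query groups the detail rows by IP address, year, and month, then counts the rows in each group as pageviews. The same aggregation, sketched in plain Python over a few hypothetical rows, for anyone who wants to sanity-check the logic:

```python
from collections import Counter

# Hypothetical detail rows: (client_ip, year, month, month_num).
weblogs = [
    ("192.168.0.1", "2012", "Jan", 1),
    ("192.168.0.1", "2012", "Jan", 1),
    ("10.0.0.5",    "2012", "Feb", 2),
]

# Equivalent of: group by client_ip, year, month, month_num; count(*) as pageviews
pageviews = Counter(weblogs)
for (ip, year, month, month_num), count in sorted(pageviews.items()):
    print(ip, year, month, month_num, count)
```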


Check Hive

Verify the new weblogs_agg table by querying a few rows from the Hive shell:


Code Block

select * from weblogs_agg limit 10;



In this guide you learned how to transform data within Hive as part of a PDI job flow.
