Transforming Data within Hive in MapR



How to read data from a Hive table, transform it, and write it to a Hive table within the workflow of a PDI job.

Prerequisites

In order to follow along with this how-to guide, you will need the following:

  • MapR
  • Pentaho Data Integration
  • Hive

Sample Files

The source data for this guide will reside in a Hive table called weblogs. If you have previously completed the Loading Data into MapR Hive guide, then you can skip to Create a Database Connection to Hive. If not, you will need the following data file and should perform the Create a Hive Table instructions before proceeding.
The sample data file needed for the Create a Hive Table instructions is:

File Name: weblogs_parse.txt.zip
Content: Tab-delimited, parsed weblog data

...

Code Block
hadoop fs -mkdir /weblogs
hadoop fs -mkdir /weblogs/parse
hadoop fs -put weblogs_parse.txt /weblogs/parse/part-00000
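
The put command above assumes weblogs_parse.txt has already been extracted from the downloaded weblogs_parse.txt.zip. A minimal sketch of that step, assuming the archive sits in the current directory and the unzip utility is installed:

Code Block
# extract weblogs_parse.txt into the current directory
unzip weblogs_parse.txt.zip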

Step-By-Step Instructions

Setup

Start MapR if it is not already running.
Start Hive Server if it is not already running.
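
The exact start-up commands depend on how MapR and Hive were installed. As a rough sketch, assuming a package-based MapR node managed by warden and the HiveServer1 service (which listens on port 10000 by default):

Code Block
# Start the MapR services on this node (warden manages the MapR daemons)
sudo service mapr-warden start

# Start HiveServer1 in the background; it listens on port 10000 by default
nohup hive --service hiveserver &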


Create a Hive Table

...

Code Block

create table weblogs (
    client_ip           string,
    full_request_date   string,
    day                 string,
    month               string,
    month_num           int,
    year                string,
    hour                string,
    minute              string,
    second              string,
    timezone            string,
    http_verb           string,
    uri                 string,
    http_status_code    string,
    bytes_returned      string,
    referrer            string,
    user_agent          string)
row format delimited
fields terminated by '\t';

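One way to execute this statement is from the Hive command-line shell. A minimal sketch, assuming the hive CLI is on your PATH and the DDL above has been saved to a file named create_weblogs.hql (a hypothetical file name):

Code Block
# Run the DDL from a file, or start an interactive shell with 'hive' and paste it in
hive -f create_weblogs.hql
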
...

Code Block

hadoop fs -cp /weblogs/parse/part-00000 /user/hive/warehouse/weblogs/

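To confirm that the file landed in the table's warehouse directory, a quick listing with the standard hadoop fs -ls command can help:

Code Block
hadoop fs -ls /user/hive/warehouse/weblogs/
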
...

Create a Database Connection to Hive

If you already have a shared Hive Database Connection defined within PDI then this task may be skipped.

...

  1. Connection Name: Enter 'Hive'
  2. Connection Type: Select 'Hadoop Hive'
  3. Host Name and Port Number: Your connection information. For local single node clusters use 'localhost' and port '10000'.
  4. Database Name: Enter 'Default'
  5. Click 'Test' to test the connection.
  6. If the test is successful, click 'OK' to close the Database Connection window. A sketch of the JDBC URL that these settings produce follows this list.
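
Behind the scenes, PDI assembles these settings into a Hive JDBC URL. A rough sketch of what the values above produce, assuming the HiveServer1-style driver used by the 'Hadoop Hive' connection type:

Code Block
jdbc:hive://localhost:10000/default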

Create a Job to Aggregate Web Log Data into a Hive Table

In this task you will create a job that runs a Hive script to build an aggregate table, weblogs_agg, using the detailed data found in the Hive weblogs table. The new Hive weblogs_agg table will contain a count of page views for each IP address by month and year.

Speed Tip: You can download the Kettle Job aggregate_hive.kjb already completed.

...

Code Block

create table weblogs_agg
as
select
  client_ip
, year
, month
, month_num
, count(*) as pageviews
from weblogs
group by client_ip, year, month, month_num

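If you want to sanity-check the statement outside of PDI first, you can paste it into the Hive shell and then inspect the schema of the resulting table with a standard describe statement:

Code Block
describe weblogs_agg;
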
...

Check Hive

...

Code Block

select * from weblogs_agg limit 10;

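A row count is another quick confirmation that the aggregate was built; the exact number depends on your source data:

Code Block
select count(*) from weblogs_agg;
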
...

Summary

In this guide you learned how to transform data within Hive as part of a PDI job flow.
