This guide explains how to read data from a Hive table, transform it, and write it back to a Hive table within the workflow of a PDI (Pentaho Data Integration) job.
In order to follow along with this how-to guide you will need the following:
- Pentaho Data Integration
The source data for this guide will reside in a Hive table called weblogs. If you have previously completed the Loading Data into MapR Hive guide, you can skip to #Create a Database Connection to Hive. If not, you will need the following data file and must perform the #Create a Hive Table instructions before proceeding.
The sample data file needed for the #Create a Hive Table instructions is:
Tab-delimited, parsed weblog data
hadoop fs -mkdir /weblogs
hadoop fs -mkdir /weblogs/parse
hadoop fs -put weblogs_parse.txt /weblogs/parse/part-00000
Start MapR if it is not already running.
Start Hive Server if it is not already running.
Create a Hive Table
create table weblogs (
    client_ip string,
    full_request_date string,
    day string,
    month string,
    month_num int,
    year string,
    hour string,
    minute string,
    second string,
    timezone string,
    http_verb string,
    uri string,
    http_status_code string,
    bytes_returned string,
    referrer string,
    user_agent string)
row format delimited
fields terminated by '\t';
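As a sanity check on the schema above, the sketch below shows how one tab-delimited line of the parsed weblog file maps onto the 16 columns declared in the weblogs table. The sample line is hypothetical, not taken from the actual data file.

```python
# The 16 column names, in the order declared in the Hive DDL above.
columns = [
    "client_ip", "full_request_date", "day", "month", "month_num",
    "year", "hour", "minute", "second", "timezone", "http_verb",
    "uri", "http_status_code", "bytes_returned", "referrer", "user_agent",
]

# Hypothetical tab-delimited line, shaped like the part-00000 records.
sample_line = "\t".join([
    "192.168.0.1", "01/Jan/2012:00:00:01 -0500", "01", "Jan", "1",
    "2012", "00", "00", "01", "-0500", "GET",
    "/index.html", "200", "1024", "-", "Mozilla/5.0",
])

# Hive splits each line on '\t' (fields terminated by '\t') and assigns
# the fields to columns positionally, much like this:
record = dict(zip(columns, sample_line.split("\t")))
print(record["client_ip"], record["http_status_code"])
```

If a line does not split into exactly 16 fields, Hive pads the missing trailing columns with NULL rather than rejecting the row.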
hadoop fs -cp /weblogs/parse/part-00000 /user/hive/warehouse/weblogs/
Create a Database Connection to Hive
If you already have a shared Hive Database Connection defined within PDI then this task may be skipped.
- Connection Name: Enter 'Hive'
- Connection Type: Select 'Hadoop Hive'
- Host Name and Port Number: Your connection information. For local single node clusters use 'localhost' and port '10000'.
- Database Name: Enter 'default'
When you are done, your Database Connection window should contain the settings listed above.
Click 'Test' to test the connection.
If the test is successful click 'OK' to close the Database Connection window.
Create a Job to Aggregate Web Log Data into a Hive Table
In this task you will create a job that runs a Hive script to build an aggregate table, weblogs_agg, using the detailed data found in the Hive weblogs table. The new Hive weblogs_agg table will contain a count of page views for each IP address by month and year.
You can download the completed Kettle job, aggregate_hive.kjb.
create table weblogs_agg
as
select
    client_ip,
    year,
    month,
    month_num,
    count(*) as pageviews
from weblogs
group by
    client_ip, year, month, month_num;
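To illustrate what the aggregation above computes, here is the same group-by-and-count logic sketched in plain Python over a few hypothetical rows (the real work is done by Hive, not by this script):

```python
from collections import Counter

# Hypothetical detail rows, shaped like the weblogs table.
rows = [
    {"client_ip": "192.168.0.1", "year": "2012", "month": "Jan", "month_num": 1},
    {"client_ip": "192.168.0.1", "year": "2012", "month": "Jan", "month_num": 1},
    {"client_ip": "10.0.0.2",    "year": "2012", "month": "Feb", "month_num": 2},
]

# Equivalent of: select client_ip, year, month, month_num, count(*)
#                from weblogs group by client_ip, year, month, month_num
pageviews = Counter(
    (r["client_ip"], r["year"], r["month"], r["month_num"]) for r in rows
)

for (ip, year, month, month_num), count in sorted(pageviews.items()):
    print(ip, year, month, month_num, count)
```

Each distinct (client_ip, year, month, month_num) combination becomes one row of weblogs_agg, with the count of matching detail rows as its pageviews value.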
select * from weblogs_agg limit 10;
In this guide you learned how to transform data within Hive as part of a PDI job flow.