Access Keys:
Skip to content (Access Key - 0)

How to use a PDI transformation to extract data from Hive and load it into a RDBMS table. The new RDBMS table will contain the count of page views by IP address and month.

Note
For brevity's sake, this transformation will only contain two steps: Table Input and a Table Output. In practice, the full expressiveness of the PDI transformation semantic is available. Further, PDI supports bulk loading many RDBMS and that would be a viable, and common, alternative to using a Table Output approach.

Prerequisites

In order follow along with this how-to guide you will need the following:

  • Hadoop
  • Pentaho Data Integration
  • Hive

Sample Files

The source data for this guide will reside in a Hive table called weblogs. If you have previously completed the "Loading Data into Hive" guide, then you can skip to "Create a Database Connection to Hive". You do not have to load the following sample data.
The sample data file needed for the "Create a Hive Table" instructions is:

File Name Content
weblogs_parse.txt.zip Tab-delimited, parsed weblog data


NOTE: If you have previously completed the "Using Pentaho MapReduce to Parse Weblog Data" guide, then the necessary files will already be in the proper location.
This file should be placed in the /weblogs/parse directory of the HDFS using the following commands.

hadoop fs –mkdir /user/pdi/weblogs
hadoop fs –mkdir /user/pdi/weblogs/parse
hadoop fs –put weblogs_parse.txt /user/pdi/weblogs/parse/part-00000

Step-By-Step Instructions

Setup

Start Hadoop if it is not already running.
Start Hive Server if it is not already running.

Create a Hive Table

NOTE: This task may be skipped if you have completed the "Loading Data into Hive" guide.

  1. Open the Hive Shell: Open the Hive shell so you can manually create a Hive table by entering 'hive' at the command line.

  2. Create the Table in Hive: You need a hive table to load the data to, so enter the following in the hive shell.
    create table weblogs (
        client_ip    string,
        full_request_date string,
        day    string,
        month    string,
        month_num int,
        year    string,
        hour    string,
        minute    string,
        second    string,
        timezone    string,
        http_verb    string,
        uri    string,
        http_status_code    string,
        bytes_returned        string,
        referrer        string,
        user_agent    string)
    row format delimited
    fields terminated by '\t';
    
  3. Close the Hive Shell: You are done with the Hive Shell for now, so close it by entering 'quit;' in the Hive Shell.

  4. Load the Table: Load the Hive table by running the following commands:
    hadoop fs –put part-00000.txt /user/hive/warehouse/weblogs/

Create a Database Connection to Hive

If you already have a shared Hive Database Connection defined within PDI then this task may be skipped.

  1. Start PDI on your desktop. Once it is running choose 'File' -> 'New' -> 'Job' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Job' option.

  2. Create a New Connection: In the View Palette right click on 'Database connections' and select 'New'.


  3. Configure the Connection: In the Database Connections window enter the following:
    1. Connection Name: Enter 'Hive'
    2. Connection Type: Select 'Hadoop Hive'
    3. Host Name and Port Number: Your connection information. For local single node clusters use 'localhost' and port '10000'.
    4. Database Name: Enter 'Default'
      When you are done your window should look like:

      Click 'Test' to test the connection.
      If the test is successful click 'OK' to close the Database Connection window.
  4. Share the Hive Database Connection: You will want to use your Hive database connection in future transformations, so share the connection by expanding 'Database Connections' in the View Palette, right clicking on 'Hive', and selecting 'Share'.
    Sharing the connection will prevent you from having to recreate the connection every time you want to access the Hive database in a transformation.

Create a RDBMS Connection

In this task you will create a connection to a RDBMS that you will use throughout this guide. This task uses a MySQL connection, but you may use any database you wish that has a JDBC driver.

  1. Start PDI on your desktop. Once it is running choose 'File' -> 'New' -> 'Transformation' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Transformation' option.

  2. Create a Database Connection: You need to create a database connection to your RDBMS, so right click on the 'Database connections' in the View palette and select New. If you already have a database connection for your RDBMS database you may skip this step.
    The Database Connection window will appear. Enter the following information:
    1. Connection Name: Enter 'RDBMS'
    2. Connection Type: Select 'MySQL'
    3. Host Name and Port Number: Your connection information for the MySQL Server. For a local MySQL database Host Name is 'localhost' and Port Number is '3306'.
    4. Database Name: Enter your database name. For a local MySQL database use 'test'.  If you have not already created a database named 'test' on MySQL, please do so now.
    5. User Name and Password: Your database username and password.
      When you are done your window should look like:

      Notice that there are lots of connection types that you could have used.
      Click 'Test' to verify your connection is working properly. If the test fails verify your RDBMS server is running and you have entered the correct connection information.

      Click 'OK' to close the Database Connection window.

  3. Share the RDBMS Connection: You will want to use your RDBMS connection in future transformations, so share the connection by expanding 'Database Connections' in the View Palette, right clicking on 'RDBMS', and selecting 'Share'.
    Sharing the connection will prevent you from having to recreate the connection every time you want to access the RDBMS in a transformation.

Create a Transformation to Extract Data from Hive

In this task you will create a transformation to extract data from Hive and load into a RDBMS table.

Speed Tip
You can download the Kettle Transformation hive_to_rdbms.ktr already completed
  1. Start PDI on your desktop. Once it is running choose 'File' -> 'New' -> 'Transformation' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Transformation' option.
  2. Add a Table Input Step: You are going to extract data from Hive using HiveQL, so expand the 'Input' section of the Design palette and drag a 'Table Input' node onto the transformation canvas. Your transformation should look like:


  3. Edit the Table Input Step: Double-click on the 'Table Input' node to edit its properties. Enter this information:
    1. Connection: Select 'Hive'
    2. SQL: Enter the following
      select client_ip, year, month, month_num, count(*) as pageviews
      from weblogs
      group by client_ip, year, month, month_num


      When you are done your 'Map/Reduce Input' window should look like this:

      Click 'OK' to close the window.

  4. Add a Table Output Step: You are going to write the data to a database, so expand the 'Output' section of the Design palette and drag a 'Table Output' node onto the transformation canvas. Your transformation should look like:

    Notice that there are lots of other output formats you can use such as Text file, XML, and Microsoft Excel.

  5. Connect the Input and Output steps: Hover the mouse over the 'Table Input' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Table Output' node. Your canvas should look like this:


  6. Edit the Table Output Step: Double-click on the 'Table Output' node to edit its properties. Do the following:
  7. Connection: Select 'RDBMS'
  8. Target Table: Enter 'aggregate_hive'
  9. Check 'Truncate table' so you can re-run this transformation.
  10. Click the 'SQL' button to create the table in your target database.
  11. Click the 'Execute' button to run the SQL.

    The 'Results of the SQL statements' window will appear telling you if the SQL was succesffully executed or give you troubleshooting information if the SQL failed.
  12. Click 'OK' to close the window.
  13. Click 'Close' to close the 'Simple SQL editor' window.
    When you are done your window should look like:

    Click 'OK' to close the window.

  14. Save the Transformation: Choose 'File' -> 'Save as...' from the menu system. Save the transformation as 'hive_to_rdbms.ktr' into a folder of your choice.

  15. Run the Transformation: Choose 'Action' -> 'Run' from the menu system or click on the green run button on the transformation toolbar. A 'Execute a transformation' window will open. Click on the 'Launch' button. An 'Execution Results' panel will open at the bottom of the PDI window and it will show you the progress of the transformation as it runs. After several seconds the transformation should finish successfully:
    If any errors occurred the job step that failed will be highlighted in red and you can use the 'Logging' tab to view error messages.

    Check RDBMS for Hive Aggregated Table

  16. Explore the Database: Choose 'Tools' -> 'Database' -> 'Explore' from the menu system.

  17. Select the Connection: In the 'Make your selection' window select 'RDBMS' and click 'OK'.


  18. Preview the Table: Expand RDBMS -> Tables. Right click on 'aggregate_hive' and select 'Preview first 100'

Summary

During this guide you learned how to create a PDI transformation that sources data from Hive and loads into an RDBMS table.


This documentation is maintained by the Pentaho community, and members are encouraged to create new pages in the appropriate spaces, or edit existing pages that need to be corrected or updated.

Please do not leave comments on Wiki pages asking for help. They will be deleted. Use the forums instead.

Adaptavist Theme Builder (4.2.0) Powered by Atlassian Confluence 3.3.3, the Enterprise Wiki