Access Keys:
Skip to content (Access Key - 0)

How to create a report that sources data from Hive.

Prerequisites

In order follow along with this how-to guide you will need the following:

  • Hadoop
  • Pentaho Data Integration
  • Hive
  • Report Designer
  • RDBMS Database

Sample Files

The sample data file needed for this guide is:

File Name Content
weblogs_hive.zip
Tab-delimited, aggregated weblog data for a Hive weblogs_agg table


NOTE: If you have already completed the Transforming Data within Hive guide, then the necessary Hive table, weblog_aggs, will already be created and populated.

Step-By-Step Instructions

Setup

Start Hadoop if it is not already running.
Start Hive Server if it is not already running.

Create a Hive Table

NOTE: This task may be skipped if you have completed Transforming Data within Hive guide.

  1. Open the Hive Shell: Open the Hive shell so you can manually create a Hive table by entering 'hive' at the command line.
  2. Create the Table in Hive: You need a hive table to load the data to, so enter the following in the hive shell.
    create table weblogs_agg (
        client_ip    string,
        year    string,
        month    string,
        month_num    int,
        pageviews    bigint
    )
    row format delimited;
    
  3. Close the Hive Shell: You are done with the Hive Shell for now, so close it by entering 'quit;' in the Hive Shell.
  4. Load the Table: Load the Hive table by running the following commands:
    hadoop fs –put weblog_hive.txt /user/hive/warehouse/weblogs_agg/
    
    

Create a Hive Report

In this task you will create a report that uses JDBC and HiveQL to report on data from Hive.

  1. Start Report Designer on your desktop. Once it is running choose 'File' -> 'Report Wizard' -> from the menu system.
    Speed Tip
    You can download the Report definition HiveReport.prpt already completed
  2. Select a Template: Report Wizard will automatically layout and do some basic formatting of your report, so select a template of your choice in the select box. When you are done your screen should look like:

    Click 'Next' to go to the next screen.

  3. Create a Data Source: You need to create the Hive query to select data for this report, so click the plus button and do the following:
    1. Choose Type 'JDBC'
    2. Select 'Sample Query' in the Available Queries box and delete it by clicking the .
    3. Above the Connection box create a new connection by clicking the .
    4. Create a Database Connection: Click the above the connection box and do the following:
      1. Connection Name: Enter 'Hive'.
      2. Connection Type: Select 'Hadoop Hive'.
      3. Host Name and Port Number: Your Hive connection information. For local single node clusters use 'localhost' and port '10000'.
      4. Database Name: Enter 'default'
      5. Leave User Name and Password empty.
      6. Test the connection by pressing the 'Test' button.

        When you are done your window should look like:

        Click 'OK' to close the window.
    5. In the Connections box select 'Hive'.
    6. Create a New Query: You need a query to select the data from Hive for the report, so click the above the 'Available Queries' box and do the following:
      1. Query Name: Enter 'Page Views'.
      2. Query: Enter the following
        Select year, month,month_num, sum(pageviews) as pageviews
        From weblogs_agg
        Group by year, month,month_num
        Order by year, month_num
        


        When you are done your window should look like:

    7. Create a Second Query: You need a second query to get the list of unique IP Addresses so you can use them in your report parameters, so click the above the 'Available Queries' box and do the following:
      1. Query Name: Enter 'IP Addresses'.
      2. Query: Enter the following
        Select distinct client_ip from weblogs_agg
        

        When you are done your window should look like:

        Click 'OK' to close the window.

    8. Highlight the 'Page Views' query in the Report Design Wizard window.
      Your Report Design Wizard window should now look like:

      Click 'Next' to go to the next screen.

  4. Set up Report Layout: You need to tell Report Design Wizard how to lay out your report, so do the following:
    1. Group Items By: Add 'year'
    2. Selected Items: Add 'month' and 'pageviews' in that order.
      When you are done your window should look like:

      Click 'Finish' to complete to the design wizard.

  5. Create a Parameter: You need to create a parameter so you will be able to select a specific IP address when running the report, so in the menu system select 'Data' -> 'Add Parameter' and do the following:
    1. Select the 'JDBC (Hive)' data source.
    2. Name: Enter 'paramIPAddress'.
    3. Label: Enter 'IP Address'
    4. Value Type: Select 'String'
    5. Check 'Mandatory'
    6. Display Type: 'Drop Down'
    7. Query: Select 'IP Addresses'
    8. Value: Select 'client_ip'
    9. Display Name: Select 'client_ip'
      When you are done your window should look like:

      Click 'OK' to close the window.

  6. Add Parameter to Query: You need to add the parameter you just created to your Page Views query, so in the 'Data' pane expand 'JDBC: Hive', right click on 'Page Views' and select 'Edit Query'.

    Then do the following:
    1. In the Query box add the following line between the From and Group by lines:
      Where client_ip = ${paramIPAddress}
      Note there must be a space before the $.
    2. When you are done your entire query will look like:
      Select year, month,month_num, sum(pageviews) as pageviews
      From weblogs_agg
      Where client_ip = ${paramIPAddress}
      Group by year, month,month_num
      Order by year, month_num
      

      Click 'OK' to close the window.

  7. Save the Report: Choose 'File' -> 'Save as...' from the menu system. Save the report as 'hive_report.prpt' into a folder of your choice.

  8. Preview the Report: Choose 'File' -> 'Preview' -> 'Print Preview' from the menu system. A 'Print Preview' window will open. Select an 'IP Address' of your choice from the drop down. After a few seconds the report results will appear:


  9. Do any desired formatting to the report.

Summary

In this guide you learned how to construct a report that sources data from Hive.


This documentation is maintained by the Pentaho community, and members are encouraged to create new pages in the appropriate spaces, or edit existing pages that need to be corrected or updated.

Please do not leave comments on Wiki pages asking for help. They will be deleted. Use the forums instead.

Adaptavist Theme Builder (4.2.0) Powered by Atlassian Confluence 3.3.3, the Enterprise Wiki