Access Keys:
Skip to content (Access Key - 0)

How to create a report that sources data from a HDFS file. You will use weblog data to build a report that displays a count of pageviews by month for an IP Address.

In order to report on HDFS file data in a parameterized fashion, you will need to perform the following steps:

  • Create a PDI Transformation that sources the picklist of distinct IP Address to be used for a report parameter
  • Create a PDI Transformation that sources a set of weblog data for a selected IP Address. This will be the primary data source for the report
  • Create a Report that uses the PDI transformations for parameter list and report data.

Prerequisites

In order follow along with this how-to guide you will need the following:

  • Hadoop
  • Pentaho Data Integration
  • Report Designer

Sample Files

The sample data file needed for this guide is:

File Name Content
weblogs_aggregate.txt.zip Tab-delimited, aggregated weblog data


NOTE: If you have already completed the Using Pentaho MapReduce to Generate an Aggregate Dataset guide, then the necessary files will already be in the proper location.
This file should be placed into the HDFS at /weblogs/aggregate_mr using the following commands.

hadoop fs \-mkdir /user/pdi/weblogs
hadoop fs \-mkdir /user/pdi/weblogs/aggregate_mr
hadoop fs \-put weblogs_aggregate.txt /user/pdi/weblogs/aggregate_mr/part-00000

Step-By-Step Instructions

Setup

Start Hadoop if it is not already running.

Create a Parameter Picklist PDI Transformation

In this task you will create a PDI transformation to get a list of distinct IP Addresses from HDFS data. This transformation will later be used as the data source for a report parameter picklist.

  1. Start PDI on your desktop. Once it is running choose 'File' -> 'New' -> 'Transformation' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Transformation' option.
    Speed Tip
    You can download the Kettle Transform hdfs_ip_list.ktr already completed
  2. Add a Hadoop File Input Step: You are going to read data from a HDFS file, so expand the 'Big Data' section of the Design palette and drag a 'Hadoop File Input' node onto the transformation canvas. Your transformation should look like:


  3. Edit the Hadoop File Input Step: Double-click on the 'Hadoop File Input' node to edit its properties. Enter this information:
    1. File or directory: Enter 'hdfs://<NAMENODE>:<PORT>/user/pdi/weblogs/aggregate_mr'
    2. Regular Expression: Enter 'part.*'
    3. Click the 'Add' button.
      When you are done your window should look like this:


  4. Define the File Content: Switch to the 'Content' tab and enter the following:
    1. Clear the Separator field and click the 'Insert TAB' button.
    2. Uncheck 'Header'
    3. Format: Select 'Unix'
      When you are done your screen should look like:



  5. Define the Fields: Switch to the 'Fields' tab and enter the following:
    Name Type Length
    client_ip String 15
    year Integer  
    month_num Integer  
    pageviews Integer  


    When you are done your window should look like:


    Click 'OK' to close the window.

  6. Add a Sort Rows Step: You need to sort the HDFS file, so expand the 'Transform' section of the Design palette and drag a 'Sort rows' node onto the transformation canvas. Your transformation should look like:


  7. Connect the Input and Sort Steps: Hover the mouse over the 'Hadoop File Input' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Sort rows' node. Your canvas should look like this:


  8. Edit the Sort Step: Double-click on the 'Sort rows' node to edit its properties. Enter this information:
    1. Check 'Only pass unique rows? (verifies keys only)'
    2. Fields: Add 'client_ip' sorted in ascending order.
      When you are done your window should look like this:

      Click 'OK' to close the window.
  9. Add a Dummy Step: You need a component for the report to select it's data from, so expand the 'Flow' section of the Design palette and drag a 'Dummy (do nothing)' node onto the transformation canvas. Your transformation should look like:


  10. Connect the Sort and Dummy steps: Hover the mouse over the 'Sort rows' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Dummy (do nothing)' node. Your canvas should look like this:


  11. Edit the Dummy Step: Double-click on the 'Dummy (do nothing)' node to edit its properties. Set the Step name to 'Output'. When you are done your window should look like:

    Click 'OK' to close the window.

  12. Save the Transformation: Choose 'File' -> 'Save as...' from the menu system. Save the transformation as 'hdfs_ip_list.ktr' into a folder of your choice.

Create a HDFS Query PDI Transformation

In this task you will create a PDI transformation to get the data for your report from HDFS. This transformation will later be used as the primary datasource for the report.

  1. Start PDI on your desktop. Once it is running choose 'File' -> 'New' -> 'Transformation' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Transformation' option.
    Speed Tip
    You can download the Kettle Transform hdfs_to_report.ktr already completed
  2. Add a Parameter: Right click on the transformation and select 'Transformation settings'.


    Then do the following:
    1. Switch to the 'Parameters' tab.
    2. Parameter: 'paramIPAddress'
    3. Default Value: '127.0.0.1'

      When you are done your window should look like:


      Click 'OK' to close the window.

  3. Add a Hadoop File Input Step: You are going to read data from a HDFS file, so expand the 'Big Data' section of the Design palette and drag a 'Hadoop File Input' node onto the transformation canvas. Your transformation should look like:

  4. Edit the Hadoop File Input Step: Double-click on the 'Hadoop File Input' node to edit its properties. Enter this information:
    1. File or directory: Enter 'hdfs://<NAMENODE>:<PORT>/user/pdi/weblogs/aggregate_mr'
    2. Regular Expression: Enter 'part.*'
    3. Click the 'Add' button.
      When you are done your window should look like this:

  5. Define the File Content: Switch to the 'Content' tab and enter the following:
    1. Clear the Separator field and click the 'Insert TAB' button.
    2. Uncheck 'Header'
    3. Format: Select 'Unix'

      When you are done your screen should look like:

  6. Define the Fields: Switch to the 'Fields' tab and enter the following:
    Name Type Length
    client_ip String 15
    year Integer  
    month_num Integer  
    pageviews Integer  


    When you are done your window should look like:


    Click 'OK' to close the window.

  7. Add a Get Variables Step: You need to add the parameter you created earlier to your stream, so expand the 'Job' section of the Design palette and drag a 'Get Variables' node onto the transformation canvas. Your transformation should look like:

  8. Connect the Input and Get Variables steps: Hover the mouse over the 'Hadoop File Input' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Get Variables' node. Your canvas should look like this:

  9. Edit the Get Variables Step: Double-click on the 'Get Variables' node to edit its properties. Enter this information:
    1. Name: Enter 'selectedIP'
    2. Variable: Enter '${paramIPAddress}'
    3. Type: Select 'String'.
      When you are done your window should look like this:


      Click 'OK' to close the window.
  10. Add a Filter rows Step: You want to filter for only the rows that match the selected IP Address, so expand the 'Flow' section of the Design palette and drag a 'Filter rows' node onto the transformation canvas. Your transformation should look like:

  11. Connect the Get Variables and Filter Rows steps: Hover the mouse over the 'Get Variables' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Filter rows' node. Your canvas should look like this:

  12. Edit the Filter Rows Step: Double-click on the 'Filter Rows' node to edit its properties. Do the following:
    1. Click the <field> box to the left of the = and select 'client_ip'
    2. Click the <field> box to the right of the = and select 'selectedIP'
      When you are done your window should look like this:


      Click 'OK' to close the window.

  13. Add a Sort Rows Step: You want the rows in sorted order by year, so expand the 'Transform' section of the Design palette and drag a 'Sort rows' node onto the transformation canvas. Your transformation should look like:

  14. Connect the Filter Rows and Sort Rows steps: Hover the mouse over the 'Filter rows' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Sort rows' node. When you release the mouse and a tooltip appears select the 'Result is TRUE' option. Your canvas should look like this:

  15. Edit the Sort Rows Step: Double-click on the 'Sort rows' node to edit its properties. Do the following:
    1. Fieldname: Enter 'year'
    2. Ascending: Select 'Y'
      When you are done your window should look like this:


      Click 'OK' to close the window.

  16. Add a Row Denormaliser Step: You want to roll up the records for each year into a single row with a field for every month, so expand the 'Transform' section of the Design palette and drag a 'Row Denormaliser' node onto the transformation canvas. Your transformation should look like:

  17. Connect the Sort Rows and Denormaliser steps: Hover the mouse over the 'Sort rows' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Row Denormaliser' node. Your canvas should look like this:

  18. Edit the Denormaliser Step: Double-click on the 'Row denormaliser' node to edit its properties. Do the following:
    1. The key field: Select 'month_num'
    2. The fields that make up the grouping: Add 'client_ip' and 'year'
    3. Target fields: Enter the following
      Target fieldname Value fieldname Key Value Type
      January Pageviews 1 Integer
      February Pageviews 2 Integer
      March Pageviews 3 Integer
      April Pageviews 4 Integer
      May Pageviews 5 Integer
      June Pageviews 6 Integer
      July Pageviews 7 Integer
      August Pageviews 8 Integer
      September Pageviews 9 Integer
      October Pageviews 10 Integer
      November Pageviews 11 Integer
      December Pageviews 12 Integer


      When you are done your window should look like this:


      Click 'OK' to close the window.

  19. Add two Dummy Steps: Expand the 'Flow' section of the Design palette and drag a 'Dummy (do nothing)' node onto the transformation canvas. Repeat to add a second 'Dummy (do nothing)' step. Your transformation should look like:

  20. Connect the Denormaliser and Dummy steps: Hover the mouse over the 'Row denormaliser' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the first 'Dummy (do nothing)' node. Your canvas should look like this:

  21. Edit the Dummy Step: Double-click on the 'Dummy (do nothing)' node you just connected to edit its properties. Change the Step name to 'Output'.

  22. Connect the Filter Rows and Dummy steps: Hover the mouse over the 'Filter rows' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the second 'Dummy (do nothing)' node. When you release the mouse and a tooltip appears select the 'Result is FALSE' option. Your canvas should look like this:

  23. Save the Transformation: Choose 'File' -> 'Save as...' from the menu system. Save the transformation as 'hdfs_to_report.ktr' into a folder of your choice.

Create a Report

In this task you will create a report that uses PDI transformations to source data from HDFS files.

Start Report Designer on your desktop. Once it is running choose 'File' -> 'Report Wizard' -> from the menu system.

Speed Tip
You can download the Report definition HDFS to Report.prpt already completed
  1. Select a Template: Report Wizard will automatically layout and do some basic formatting of your report, so select a template of your choice in the select box. When you are done your screen should look like:


    Click 'Next' to go to the next screen.

  2. Create a Data Source: You need to create the HDFS query to select data for this report, so click the plus button and do the following:
    1. Choose Type 'Pentaho Data Integration'

    2. Create a New Query: Click the button to create a new query and do the following:
      1. Name: Enter 'Page Views'
      2. File: Select the hdfs_to_report.ktr transformation you just created.
      3. Steps: Select 'Output'

        When you are done your window should look like:

    3. Create a Second Query: Click the button to create a new query and do the following:
      1. Name: Enter 'IP List'
      2. File: Select the hdfs_ip_list.ktr transformation you just created.
      3. Steps: Select 'Output'

        When you are done your window should look like:


        Click 'OK' to close the window.
    4. Highlight the 'Page Views' query in the Report Design Wizard window.
      Your Report Design Wizard window should now look like:


      Click 'Next' to go to the next screen.
  3. Set up Report Layout: You need to tell Report Design Wizard how to lay out your report, so do the following:
    1. Selected Items: Add the following in this order
      1. Year
      2. January
      3. February
      4. March
      5. April
      6. May
      7. June
      8. July
      9. August
      10. September
      11. October
      12. November
      13. December

        When you are done your window should look like:


        Click 'Finish' to complete to the design wizard.

  4. Create a Parameter: You need to create a parameter so you will be able to select a specific IP address when running the report, so in the menu system select 'Data' -> 'Add Parameter' and do the following:
    1. Select the 'Pentaho Data Integration' data source.
    2. Name: Enter 'paramIPAddress'.
    3. Label: Enter 'IP Address'
    4. Value Type: Select 'String'
    5. Check 'Mandatory'
    6. Display Type: 'Drop Down'
    7. Query: Select 'IP List'
    8. Value: Select 'client_ip'
    9. Display Name: Select 'client_ip'

      When you are done your window should look like:


      Click 'OK' to close the window.

  5. Add Parameter to Query: You need to add the parameter you just created to your Page Views query, so in the 'Data' pane expand 'Pentaho Data Integration', right click on 'Page Views' and select 'Edit Query'.


    Then do the following:
    1. Click on the 'Edit Parameter' button.
    2. Under 'Transformation Parameter' click the .
    3. DataRow Column: Select 'paramIPAddress'
    4. Transfomration Parameter: Select 'paramIPAddress'

      When you are done your window should look like:


      Click 'OK' to close the window.

  6. Save the Report: Choose 'File' -> 'Save as...' from the menu system. Save the report as 'hdfs_report.prpt' into a folder of your choice.
  7. Preview the Report: Choose 'File' -> 'Preview' -> 'Print Preview' from the menu system. A 'Print Preview' window will open. Select an 'IP Address' of your choice from the drop down. After a few seconds the report results will appear:

  8. Do any desired formatting to the report.

Summary

In this guide you learned how to construct a report that sources data from a HDFS file.


This documentation is maintained by the Pentaho community, and members are encouraged to create new pages in the appropriate spaces, or edit existing pages that need to be corrected or updated.

Please do not leave comments on Wiki pages asking for help. They will be deleted. Use the forums instead.

Adaptavist Theme Builder (4.2.0) Powered by Atlassian Confluence 3.3.3, the Enterprise Wiki