Hitachi Vantara Pentaho Community Wiki
Child pages
  • Loading Data into HBase


...

Using a PDI transformation that sources data from a flat file and writes to an HBase table.

Note: For brevity's sake, we will use a prepared dataset and a simple transformation. In practice, you can use the full power of PDI's transformation functionality to transform and prepare your data for HBase loads.

...

The sample data file needed for this guide is:

File Name | Content
weblogs_hbase.txt.zip | Prepared data for HBase load

Step-By-Step Instructions

...

Create a HBase Table

  1. Open the HBase Shell: Connect to the HBase shell by entering 'hbase shell' in an SSH terminal.
  2. Create the Table in HBase: Enter the following in the HBase shell.

    create 'weblogs', 'pageviews'

    This creates the weblogs table with a single column family named pageviews.

  3. Close the HBase Shell: You are done with the HBase shell for now, so close it by entering 'quit'.


Create a Transformation to Load Data into HBase

In this task you will load a file into HBase.

Speed Tip: Downloading the completed Kettle transformation load_hbase.ktr will save time, as it is already configured to load the HBase data.

  1. Start PDI on your desktop. Once it is running choose 'File' -> 'New' -> 'Transformation' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Transformation' option.
  2. Add an Input Step: You need to tell PDI where to get the data from, so expand the 'Input' section of the Design palette and drag a 'Text File Input' step onto the transformation canvas.
  3. Edit the Input Step: Double-click on the 'Text File Input' node to edit its properties. Enter this information:
    1. File or Directory: Browse to the weblogs_hbase.txt file.
    2. Click the 'Add' button.
      When you are done your window should look like this:
      [Screenshot]
  4. Configure File Content: Switch to the 'Content' tab and do the following:
    1. Separator: Clear and click the 'Insert TAB' button.
    2. Header: Check the 'Header' checkbox.
    3. Format: Select 'Unix'.
      When you are done your window should look like this:
      [Screenshot]
  5. Configure the Input Fields: Switch to the 'Fields' tab and do the following:
    1. Click the 'Get Fields' button.
    2. When prompted for 'Number of sample lines', enter 100 and click 'OK'.
    3. Change the 'Type' for the 'key' field to 'String' and the 'Length' to 20.
      When you are done your window should look like this:
      [Screenshot]
      Click 'OK' to close the window.
  6. Add a HBase Output Step: You are going to store your data in HBase, so expand the 'Big Data' section of the Design palette and drag a 'HBase Output' node onto the transformation canvas. Your transformation should look like this:
    [Screenshot]
  7. Connect the Input and Output Steps (if they are not already connected): Hover the mouse over the 'Text file input' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'HBase Output' node. Your canvas should look like this:
    [Screenshot]
  8. Edit the HBase Output Step: Double-click on the 'HBase Output' node to edit its properties. Enter this information:
  9. Zookeeper host(s): A comma-separated list of your HBase Zookeeper hosts. For local single-node clusters use 'localhost'.
  10. Zookeeper port: The port for your Zookeeper hosts. By default this is '2181'.
    a. Select the cluster in the drop-down menu.
    b. Click 'Get table names' and select 'weblogs' from the drop-down menu.
    Note: If the table or mapping names are not present (the drop-down menu is empty), you will need to create them under the 'Create/Edit mappings' tab and save the mapping. They will then appear in the drop-down menu.
    When you are done your window should look like this:
    [Screenshot]


  11. Create a HBase Mapping: You need to tell Pentaho how to store the data in HBase, so switch to the 'Create/Edit mappings' tab and do the following:
    1. HBase table name: Select 'weblogs'.
    2. Mapping name: Enter 'pageviews'.
    3. Click the 'Get incoming fields' button.
    4. For the alias 'key', change the 'Key' column to 'Y', empty the 'Column family' and 'Column name' fields, and set the 'Type' field to 'String'.
    5. Click the 'Save mapping' button.
      When you are done your window should look like this:
      [Screenshot]
  12. Finish Configuring the Connection: You need to tell the HBase output to use the mapping you just created, so switch back to the 'Configure connection' tab and do the following:
    1. Click the 'Get table names' button.
    2. HBase table name: Select 'weblogs'.
    3. Click the 'Get mappings for the specified table' button.
    4. Mapping name: Select 'pageviews'.
      When you are done your window should look like this:
      [Screenshot]
      Click 'OK' to close the window.
  13. Save the Transformation: Choose 'File' -> 'Save as...' from the menu system. Save the transformation as 'load_hbase.ktr' into a folder of your choice.
  14. Run the Transformation: Choose 'Action' -> 'Run' from the menu system or click on the green run button on the transformation toolbar. An 'Execute a transformation' window will open. Click on the 'Launch' button. An 'Execution Results' panel will open at the bottom of the PDI window and show you the progress of the transformation as it runs. After several seconds the transformation should finish successfully.
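To picture what the input settings and the HBase mapping configured in the steps above amount to, here is a minimal Python sketch. It is not how PDI works internally; it only illustrates the idea that the tab-separated file has a header row, the 'key' field becomes the HBase row key, and the remaining fields become cells in the 'pageviews' column family. The function name `rows_to_puts`, the sample column names, and the choice of the field name as the column name are assumptions made for illustration.

```python
import csv
import io

COLUMN_FAMILY = "pageviews"  # the single column family created in the HBase shell


def rows_to_puts(tsv_text, key_field="key", family=COLUMN_FAMILY):
    """Turn tab-separated text with a header row into (row_key, cells) pairs.

    Hypothetical helper: it mimics the mapping conceptually and does not
    talk to HBase at all.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    puts = []
    for record in reader:
        row_key = record[key_field]  # kept as a string, as set in the 'Fields' tab
        cells = {
            f"{family}:{name}": value  # assumed: column name = incoming field name
            for name, value in record.items()
            if name != key_field
        }
        puts.append((row_key, cells))
    return puts


if __name__ == "__main__":
    # Made-up sample data; the real file's columns may differ.
    sample = "key\tjan\tfeb\nrow-a\t5\t7\nrow-b\t2\t0\n"
    for row_key, cells in rows_to_puts(sample):
        print(row_key, cells)
```

Each pair corresponds roughly to one row written by the 'HBase Output' step: one row key plus one cell per non-key field.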

If any errors occurred, the transformation step that failed will be highlighted in red and you can use the 'Logging' tab to view the error messages.
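When troubleshooting a failed run, one thing worth ruling out is whether the Zookeeper host and port entered in the 'HBase Output' step can be reached from the machine running PDI. The following is a small sketch of such a check using only the Python standard library. The function name `zookeeper_reachable` is hypothetical, and a successful TCP connection only shows that something is listening on that port, not that HBase itself is healthy.

```python
import socket


def zookeeper_reachable(host="localhost", port=2181, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    The defaults mirror the values mentioned in this guide
    ('localhost' and port 2181); adjust them for your cluster.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    print("Zookeeper reachable:", zookeeper_reachable())
```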

Check HBase

  1. Open the HBase Shell: Open the HBase shell so you can check that your table loaded by entering 'hbase shell' at the command line.
  2. Scan the Table: You want to scan the table to ensure data loaded, so run the following command.

    scan 'weblogs', {LIMIT => 10}
  3. Close the HBase Shell: You are done with the HBase Shell for now, so close it by entering 'quit' in the HBase Shell.
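Beyond scanning a few rows, you may want to compare the number of rows in HBase (for example, as reported by `count 'weblogs'` in the HBase shell) with what the source file should have produced. The sketch below counts the distinct values of the 'key' field in the tab-separated input file. It counts distinct keys rather than lines because rows that share a key end up in the same HBase row. The function name `count_distinct_keys` and the file path in the usage line are assumptions for illustration.

```python
import csv


def count_distinct_keys(path, key_field="key"):
    """Count distinct values of key_field in a tab-separated file with a header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return len({record[key_field] for record in reader})


if __name__ == "__main__":
    # Assumed location of the extracted sample file.
    print(count_distinct_keys("weblogs_hbase.txt"))
```

If the two numbers differ, checking the 'Logging' tab in PDI for rejected rows is a reasonable next step.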

...