This guide shows how to unit test the mapper and reducer transformations that make up a Pentaho MapReduce job. Unit testing mapper and reducer code before it runs on a cluster can save significant development time, and using the PDI development environment to debug and performance test that code is more productive than poring over Hadoop logs. Pentaho recommends that you unit test your Pentaho MapReduce transformations locally before running them on the cluster.
The general technique is to stub the input data to the mapper (or reducer) transformation with a Text File Input step or generated rows of data, then execute the transformation in preview mode to confirm that it processes the data correctly. In the steps that follow, you will create a stub file. Alternatively, when the key field is not important or the original file already contains the key field, you may be able to read your original file from Hadoop via a Hadoop File Input step.
In order to follow along with this guide you will need the following:
- Pentaho Data Integration
This guide uses the weblog_parse_mapper.ktr from the Using Pentaho MapReduce to Parse Weblog Data in MapR guide. If you have completed that guide, you should already have this mapper. Otherwise, download it from that guide.
In this task you will create a test file that you will use to unit test your transformation.
- In a text editor, create a new file in key-tab-value format (a key, a tab character, then the value), as your transformation would receive it. For reducer transformations, the keys must be in sorted order and each line should contain only one value. Repeat the key on multiple lines for multiple values.
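The format rules above can also be applied by a short script rather than by hand. The following Python sketch is one possible way to do that; the function name `build_stub_lines`, the output file name, and the sample pairs are hypothetical and are not the test data used by this guide.

```python
def build_stub_lines(pairs, for_reducer=False):
    """Format (key, value) pairs as key<TAB>value lines, one value per line.

    For a reducer stub the lines are sorted by key, and a key with several
    values is simply repeated on consecutive lines.
    """
    rows = [(str(key), str(value)) for key, value in pairs]
    if for_reducer:
        # sorted() is stable, so values keep their relative order within a key.
        rows = sorted(rows, key=lambda row: row[0])
    return [f"{key}\t{value}" for key, value in rows]


if __name__ == "__main__":
    # Hypothetical placeholder pairs, not the rows used by the original guide.
    sample = [("b", "second"), ("a", "first"), ("b", "third")]
    with open("stub_input.txt", "w", encoding="utf-8") as handle:
        handle.write("\n".join(build_stub_lines(sample, for_reducer=True)) + "\n")
```

Leaving `for_reducer` at its default keeps the pairs in the order given, which is usually what a mapper stub needs.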
For this guide use:
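Whichever rows you put in the test file, it can help to confirm that they follow the key-tab-value layout before loading them into PDI. The Python sketch below is an illustrative check only. `check_stub_lines` is a hypothetical helper, not part of PDI, and it assumes that plain string comparison is an acceptable stand-in for the key ordering your reducer would see.

```python
def check_stub_lines(lines, for_reducer=False):
    """Return a list of problems found in key<TAB>value stub lines."""
    problems = []
    previous_key = None
    for number, line in enumerate(lines, start=1):
        parts = line.rstrip("\r\n").split("\t")
        if len(parts) != 2:
            problems.append(
                f"line {number}: expected 2 tab-separated fields, found {len(parts)}"
            )
            continue
        key = parts[0]
        # Assumption: keys are compared as plain strings.
        if for_reducer and previous_key is not None and key < previous_key:
            problems.append(f"line {number}: key {key!r} is out of sorted order")
        previous_key = key
    return problems
```

An empty result means every line has exactly one key and one value and, for reducer input, that the keys never go backwards. To check a file, pass the open file object, for example `check_stub_lines(open("stub_input.txt", encoding="utf-8"))`, where the file name is a placeholder.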
In this task you will unit test the transformation. The same steps may be used for both mapper and reducer transformations.
- Start PDI on your desktop. Once it is running, choose 'File' -> 'Open', browse to and select 'weblog_parse_mapper.ktr', then click 'OK'.
- Add a Text File Input Step: You will provide the mapper with an alternate input step, so expand the 'Input' section of the Design palette and drag a 'Text File Input' step onto the canvas.
NOTE: You could also use a Hadoop File Input step to pull the test file from the Hadoop cluster.
- Connect the Text File Input and Regex Evaluation Steps: Hover the mouse over the 'Text File Input' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Regex Evaluation' node. Click 'OK' on the warning message.
- Edit the Text File Input Step: Double-click on the 'Text File Input' node to edit its properties. Enter this information:
- File or directory: Browse to the test file you created earlier.
- Click the Add button.
- Configure the File Content: Switch to the 'Content' tab and enter the following information:
- Separator: Clear the field, then click 'Insert TAB'
- Uncheck 'Header'
- Format: Select 'Mixed'
- Configure the Fields: Switch to the 'Fields' tab and enter the following information:
- Create a field with Name 'key' and Type 'String'
- Create a field with Name 'value' and Type 'String'
- Disable the Map/Reduce Input Hop: Disable the hop between 'Map/Reduce Input' and 'Regex Evaluation' by clicking on it. The hop will turn gray.
- Run the Unit Test: Select the 'User Defined Java Expression' step, then right-click it and select 'Preview'. When the 'Transformation debug dialog' appears, click 'Quick Launch'. The results of the transformation will appear in the 'Examine preview data' window.
Click 'Close' to close the window.
- Disable the Text File Input Hop: Disable the hop between 'Text File Input' and 'Regex Evaluation' by clicking on it. It will turn gray.
- Enable the Map/Reduce Input Hop: Enable the hop between 'Map/Reduce Input' and 'Regex Evaluation' by clicking on it. It will turn black.
- Save the Transformation
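Because the test hop must be disabled and the Map/Reduce Input hop re-enabled before the transformation goes back to the cluster, a quick check of the saved file can catch a forgotten step. The Python sketch below assumes that the .ktr file is XML in which each hop is a `hop` element with `from`, `to`, and `enabled` (Y or N) children; verify that against your own file before relying on it. `hop_states` is a hypothetical helper, not part of PDI.

```python
import xml.etree.ElementTree as ET


def hop_states(ktr_xml):
    """Map (from_step, to_step) to True/False for each hop in .ktr XML text."""
    root = ET.fromstring(ktr_xml)
    states = {}
    for hop in root.iter("hop"):
        source = (hop.findtext("from") or "").strip()
        target = (hop.findtext("to") or "").strip()
        # Assumption: a hop is enabled when its <enabled> child is "Y".
        enabled = (hop.findtext("enabled") or "").strip().upper() == "Y"
        states[(source, target)] = enabled
    return states


if __name__ == "__main__":
    # Inline sample mirroring the hop states expected after the cleanup steps.
    sample = """<transformation><order>
      <hop><from>Map/Reduce Input</from><to>Regex Evaluation</to><enabled>Y</enabled></hop>
      <hop><from>Text File Input</from><to>Regex Evaluation</to><enabled>N</enabled></hop>
    </order></transformation>"""
    for (source, target), enabled in hop_states(sample).items():
        print(f"{source} -> {target}: {'enabled' if enabled else 'disabled'}")
```

To inspect a saved transformation, read the file's text and pass it to `hop_states`, then confirm that the 'Map/Reduce Input' hop is reported as enabled and the 'Text File Input' hop as disabled.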
In this guide you learned how to unit test Pentaho MapReduce transformations. Unit testing your transformations this way is recommended because debugging with Hadoop logs is both complex and time-consuming.