Using the Knowledge Flow Plugin
...

If a flow definition is loaded, then the definition file will be loaded (sourced) from disk every time the transformation is executed. If, on the other hand, the flow definition file is imported, it will be stored in either the transformation's XML configuration file (.ktr file) or the repository (if one is being used).


A third option is to design a new Knowledge Flow process from scratch using the embedded Knowledge Flow editor. In this case, the new flow definition will be stored in the .ktr file/repository. This is the approach we will take for this demonstration.

4.2.1 Creating a Knowledge Flow Process Using the Embedded Editor

Clicking the "Show embedded KnowledgeFlow editor" button will cause a new "KnowledgeFlow" tab to appear on the dialog.

...

Finally, add a "TextViewer" step to the layout and connect it to the "Logistic" step by right-clicking on "Logistic" and selecting "text" from the list of connections. The "TextViewer" step will receive the textual description of the model structure learned by the logistic regression.
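The text that flows over the "text" connection is simply the model description that Weka's Logistic classifier prints once it has been trained. As a point of reference, here is a minimal standalone sketch of the same thing in plain Weka code (assuming Weka is on the classpath; the ARFF file name is illustrative):

```java
import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LogisticTextDemo {
    public static void main(String[] args) throws Exception {
        // Load a data set (file name is illustrative).
        Instances data = new DataSource("input.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // class is the last attribute

        // Train logistic regression, as the "Logistic" step does.
        Logistic logistic = new Logistic();
        logistic.buildClassifier(data);

        // toString() produces the textual model description that the
        // TextViewer step receives over its "text" connection.
        System.out.println(logistic);
    }
}
```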


4.2.2 Linking the Knowledge Flow Data Mining Process to the Kettle Transformation

Now we can return to the "KnowledgeFlow file" tab in the Knowledge Flow Kettle step's configuration dialog and establish how data is to be passed into and out of the Knowledge Flow process that we've just designed. First click the "Get changes from KnowledgeFlow editor" button. This will extract the flow from the editor and populate the drop-down boxes with the applicable step and connection names. To specify that incoming data should be passed into the Knowledge Flow process, select the "Inject data into KnowledgeFlow" checkbox and choose "KettleInject" in the "Inject step name" field. The "Inject connection name" field will be automatically filled in for you with the value "dataSet".



The choices for output include either passing the incoming data rows through to downstream Kettle steps or picking up output from the Knowledge Flow process and passing that on instead. In this example we will do the latter by picking up output from the "TextViewer" step in the Knowledge Flow process. Note that the "SerializedModelSaver" step writes to disk and does not produce output that we can pass on inside of a Kettle transformation. Select "TextViewer" in the "Output step name" field and "text" in the "Output connection name" field. Make sure to leave "Pass rows through" unchecked.



4.2.3 Choosing Fields and Configuring Sampling

The second tab of the Knowledge Flow Kettle plugin's configuration dialog allows you to specify which of the incoming data fields are to be passed into the data mining process and whether or not to downsample the incoming data stream.

...

The next two text fields relate to sampling the incoming Kettle data. The Knowledge Flow Kettle step has built-in reservoir sampling (similar to that of the separate Reservoir Sampling plugin step). In batch training mode (incremental training is discussed in the "Advanced Features" section below), the "Sample/cache size (rows)" text field allows you to specify how many incoming Kettle rows should be randomly sampled and passed on to the Knowledge Flow data mining process. Reservoir sampling ensures that each row has an equal probability of ending up in the sample (uniform sampling). The "Random seed" text field provides a seed value for the random sampling process; changing this will result in a different random sample of the data. Entering a 0 (zero) in the "Sample/cache size" field tells the step that all the incoming data should be passed on to the data mining process (i.e. no sampling is to be performed). For the purposes of this example, make sure that you either enter a zero in this field or change the value to something greater than the default of 100 rows.
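For the curious, reservoir sampling itself is simple to state. Below is a minimal sketch of the classic algorithm (Algorithm R), not the step's actual implementation, illustrating the uniform-sampling guarantee and the role played by the random seed:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class ReservoirSample {

    /** Returns a uniform random sample of up to k items from the stream. */
    public static <T> List<T> sample(Iterable<T> stream, int k, long seed) {
        List<T> reservoir = new ArrayList<>(k);
        Random rng = new Random(seed); // plays the role of the "Random seed" field
        long seen = 0;
        for (T row : stream) {
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k rows.
                reservoir.add(row);
            } else {
                // Replace a random slot with probability k/seen, which keeps
                // every row seen so far equally likely to be in the sample.
                long j = (long) (rng.nextDouble() * seen); // index in [0, seen)
                if (j < k) {
                    reservoir.set((int) j, row);
                }
            }
        }
        return reservoir;
    }
}
```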

 

The "Set class attribute" allows you to indicate that a class or target attribute is to be set on the data set created for the data mining process. Select this checkbox and then select the "class" field from the "Class attribute" drop-down box.

4.3 Running the Transformation

Before running the transformation, add a "Text file output" step to save the textual description of the logistic regression model. Note that instead of saving this model output, we could just as easily have it placed in a report by using this transformation as part of an action sequence running on the Pentaho BI server.

Now you can save the transformation and run it. Depending on the speed of your computer, the Knowledge Flow process may take up to a minute or so to train the logistic model. The binary serialized model and the textual model description can be found in the same directory that the .ktr transformation file was saved to.
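The binary model written by the "SerializedModelSaver" step can later be deserialized and reused outside of Kettle. A minimal sketch using Weka's SerializationHelper (the file name is illustrative and depends on how the step was configured; depending on the Weka version, the file may also contain the dataset header serialized after the model):

```java
import weka.classifiers.functions.Logistic;
import weka.core.SerializationHelper;

public class LoadModelDemo {
    public static void main(String[] args) throws Exception {
        // Read the first object in the file, which is the model itself
        // (file name is illustrative).
        Logistic model = (Logistic) SerializationHelper.read("Logistic.model");

        // Same textual description the TextViewer/Text file output captured.
        System.out.println(model);
    }
}
```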