The Knowledge Flow plugin is an enterprise edition tool that allows entire data mining processes to be run as part of a Kettle (PDI) ETL transformation. There are a number of use cases for combining ETL and data mining, such as:
- Automated batch training/refreshing of predictive models
- Including data mining results in reports
- Access to data mining data pre-processing techniques in ETL transformations
This document describes the training/refreshing use case; when combined with the Weka Scoring plugin for deploying predictive models, it provides a fully automated predictive analytics solution.
The Knowledge Flow plugin requires Kettle 3.1 or higher and Weka 3.6 or higher. Due to SWT-AWT problems under Mac OS X, OS X users will require the Eclipse Cocoa 64 bit SWT libraries (version 3.5) in order to use the plugin. These libraries can easily be dropped in to replace the ones included in the Kettle Mac application (Kettle.app/Contents/Resources/Java/libswt/osx).
Before starting Kettle's Spoon UI, the Knowledge Flow Kettle plugin must be installed in either the plugins/steps directory in your Kettle distribution or in $HOME/.kettle/plugins/steps. Unpack the Knowledge Flow archive and copy the contents of the KFDeploy directory to a new subdirectory of $HOME/.kettle/plugins/steps. Copy the "weka.jar" file from your Weka distribution to the same subdirectory of $HOME/.kettle/plugins/steps.
The Knowledge Flow Kettle plugin also requires a small plugin to be installed in the Weka Knowledge Flow application. This plugin provides a special data source component for the Weka Knowledge Flow that accepts incoming data sets from Kettle. Copy the contents of the "KettleInject" directory to a subdirectory in $HOME/.knowledgeFlow/plugins. If the $HOME/.knowledgeFlow/plugins directory does not exist, you will need to create it manually.
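The two installation steps above can be sketched as shell commands. The archive directory names (KFDeploy, KettleInject) come from the text; the subdirectory names and the assumption that you run this from the directory where the archives were unpacked are illustrative and should be adjusted for your system.

```shell
# Destination directories named in the text (subdirectory names are assumptions)
STEP_DIR="$HOME/.kettle/plugins/steps/KnowledgeFlow"
KF_PLUGIN_DIR="$HOME/.knowledgeFlow/plugins/KettleInject"

mkdir -p "$STEP_DIR" "$KF_PLUGIN_DIR"

# Copy the unpacked archive contents, if present in the current directory
if [ -d KFDeploy ]; then
  cp -R KFDeploy/. "$STEP_DIR"
fi
if [ -d KettleInject ]; then
  cp -R KettleInject/. "$KF_PLUGIN_DIR"
fi

# weka.jar from your Weka distribution goes alongside the step plugin
if [ -f weka.jar ]; then
  cp weka.jar "$STEP_DIR"
fi
```

After this, both the Spoon step and the KettleInject data source should be picked up the next time the applications start.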
Once installed correctly, you will find the Kettle Knowledge Flow step in the "Transform" folder in the Spoon user interface.
As a simple example, we will use the Knowledge Flow step to create and export a predictive model for the "pendigits.csv" data set (docs/data/pendigits.csv). This data set is also used in the "Using the Weka Scoring Plugin" documentation.
First construct a simple Kettle transformation that links a CSV input step to the Knowledge Flow step. Next configure the input step to load the "pendigits.csv" file. Make sure that the Delimiter text box contains a "," and then click "Get Fields" to make the CSV input step analyze a few lines of the file and determine the types of the fields.
All the fields in the "pendigits.csv" file are integers. However, the problem is a discrete classification task and Weka will need the "class" field to be declared as a nominal attribute. In the CSV input step's configuration dialog, change the type of the "class" field from "Integer" to "String."
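A quick command-line sanity check can confirm what "Get Fields" will see. The snippet below is a sketch that writes a tiny stand-in file (the real pendigits.csv has 16 integer input fields before "class") rather than assuming the data file is present; it verifies that the "," delimiter yields the expected field count and that the last column holds the integer-coded class labels that need to be re-typed as String:

```shell
# Tiny stand-in for pendigits.csv (illustrative values only)
cat > /tmp/pendigits_sample.csv <<'EOF'
x1,x2,class
47,100,8
0,89,2
EOF

# Field count with "," as the delimiter (CSV input's "Get Fields" does a similar scan)
NFIELDS=$(head -1 /tmp/pendigits_sample.csv | awk -F',' '{print NF}')
echo "fields: $NFIELDS"

# Distinct class values: integers on disk, but they must become nominal labels for Weka
CLASSES=$(tail -n +2 /tmp/pendigits_sample.csv | awk -F',' '{print $NF}' | sort -u | tr '\n' ' ')
echo "class values: $CLASSES"
```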
The Knowledge Flow step's configuration dialog is made up of three tabs (although only two are visible when the dialog is first opened). The first tab, "KnowledgeFlow file," enables existing Knowledge Flow flow definition files to be loaded or imported from disk. It also allows you to configure how incoming data from the transformation is connected to the Knowledge Flow process and how the output is handled.
If a flow definition is loaded, the definition file is sourced from disk every time the transformation is executed. If, on the other hand, the flow definition file is imported, it is stored in either the transformation's XML configuration file (.ktr file) or the repository (if one is being used).
A third option is to design a new Knowledge Flow process from scratch using the embedded Knowledge Flow editor. In this case the new flow definition will be stored in the .ktr file/repository. This is the approach we will take for the purposes of demonstration. Clicking the "Show embedded KnowledgeFlow editor" button will cause a new "KnowledgeFlow" tab to appear on the dialog.
You may need to enlarge the size of the Knowledge Flow step's dialog in order to fully see the embedded editor.
To begin with, we will need an entry point into the data mining process for data from the Kettle transformation. Select the "Plugins" tab of the embedded editor and place a "KettleInject" step onto the layout canvas. If there is no "Plugins" tab visible, or there is no "KettleInject" step available from the "Plugins" tab, you will need to review the installation process described earlier.
Next, connect a "TrainingSetMaker" step to the "KettleInject" step by right clicking over "KettleInject" and selecting "dataSet" from the list of connections.
Now add a logistic regression classifier to the flow and connect it by right clicking over "TrainingSetMaker" and selecting "trainingSet" from the list of connections.
Next, add a "SerializedModelSaver" step and connect it by right clicking over "Logistic" and selecting "batchClassifier" from the list of connections.
Now configure the "SerializedModelSaver" to specify a location to save the trained model to. Either double click the icon or right click over it and select "Configure..." from the pop-up menu. If you are using Weka version 3.7.x, the Knowledge Flow supports environment variables, and Kettle's internal variables are available. In the screenshot below, we are saving the trained classifier to a Kettle internal variable that resolves to the directory that the Kettle transformation has been saved to (note: this only makes sense if a repository is not being used). You can always specify an absolute path to a directory on your file system and, in fact, this is necessary if you are using Weka version 3.6.x.
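For reference, the same train-and-serialize flow (Logistic learner in, saved model out) can be exercised outside Kettle with Weka's standard classifier command-line options: -t names the training file and -d writes the serialized model. The jar and data paths below are assumptions (an ARFF version of the data with a nominal class is assumed, since Logistic requires one), so the script only prints the command when they are absent:

```shell
WEKA_JAR="$HOME/weka/weka.jar"            # assumption: adjust to your Weka install
DATA="docs/data/pendigits.arff"           # assumption: ARFF file with a nominal class attribute
MODEL_OUT="/tmp/pendigits_logistic.model"

# weka.classifiers.functions.Logistic is the same learner used in the flow;
# -t = training file, -d = file to dump the serialized model to
CMD="java -cp $WEKA_JAR weka.classifiers.functions.Logistic -t $DATA -d $MODEL_OUT"

if [ -f "$WEKA_JAR" ] && [ -f "$DATA" ]; then
  $CMD
else
  echo "dry run: $CMD"
fi
```

The resulting .model file is the same kind of serialized classifier that the SerializedModelSaver step writes, and it can be loaded by the Weka Scoring plugin for deployment.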