1 Introduction

The Knowledge Flow plugin is an enterprise edition tool that allows entire data mining processes to be run as part of a Kettle (PDI) ETL transformation. There are a number of use cases for combining ETL and data mining in this way.

Training/refreshing of predictive models is the use case described in this document. When combined with the Weka Scoring plugin for deploying predictive models, it can provide a fully automated predictive analytics solution.

2 Requirements

The Knowledge Flow plugin requires Kettle 3.1 or higher and Weka 3.6 or higher. Due to SWT-AWT problems under Mac OS X, OS X users will need the Eclipse Cocoa 64-bit SWT libraries (version 3.5) in order to use the plugin. These libraries can simply be dropped in to replace the ones included in the Kettle Mac application (Kettle.app/Contents/Resources/Java/libswt/osx).
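In practice this is a drop-in replacement of the SWT jar inside the application bundle. A sketch of the location involved (the exact jar name may vary between Kettle versions):

    Kettle.app/Contents/Resources/Java/libswt/osx/
        swt.jar    <- replace with the Eclipse Cocoa 64-bit SWT 3.5 jar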

3 Installation

The Knowledge Flow Kettle plugin must be installed before Kettle's Spoon UI is started, in either the plugins/steps directory of your Kettle distribution or in $HOME/.kettle/plugins/steps. Unpack the Knowledge Flow archive and copy the contents of the KFDeploy directory to a new subdirectory of $HOME/.kettle/plugins/steps. Then copy the "weka.jar" file from your Weka distribution into the same subdirectory.
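Assuming the new subdirectory is named kf (the name is arbitrary; any name will do), the result should look something like this:

    $HOME/.kettle/plugins/steps/kf/
        <contents of the KFDeploy directory>
        weka.jar            (copied from your Weka distribution)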

The Knowledge Flow Kettle plugin also requires a small plugin to be installed in the Weka Knowledge Flow application. This plugin provides a special data source component for the Weka Knowledge Flow that accepts incoming data sets from Kettle. Copy the contents of the "KettleInject" directory to a subdirectory of $HOME/.knowledgeFlow/plugins. If the $HOME/.knowledgeFlow/plugins directory does not exist, create it manually.
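Again assuming an arbitrary subdirectory name, here kettleInject, the Weka side of the installation ends up as:

    $HOME/.knowledgeFlow/plugins/kettleInject/
        <contents of the KettleInject directory>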

Once the plugin is installed correctly, the Kettle Knowledge Flow step appears in the "Transform" folder of the Spoon user interface.

4 Using the Knowledge Flow Plugin

As a simple example, we will use the Knowledge Flow step to create and export a predictive model for the "pendigits.csv" data set (docs/data/pendigits.csv). This data set is also used in the "Using the Weka Scoring Plugin" documentation.

First, construct a simple Kettle transformation that links a CSV input step to the Knowledge Flow step. Next, configure the input step to load the "pendigits.csv" file. Make sure that the Delimiter text box contains a "," and then click "Get Fields" so that the CSV input step analyzes the first few lines of the file and determines the types of the fields.

All the fields in the "pendigits.csv" file are integers. However, the problem is a discrete classification task, and Weka needs the "class" field to be declared as a nominal attribute. In the CSV input step's configuration dialog, change the type of the "class" field from "Integer" to "String".
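The same conversion can be illustrated programmatically with the standard Weka API, which makes the requirement concrete: a numeric class would be treated as a regression target, while a nominal class defines a discrete classification task. The following is a minimal sketch, not part of the plugin itself; the file path is the one used in this example and the class name PrepPendigits is just a placeholder:

    import java.io.File;

    import weka.core.Instances;
    import weka.core.converters.CSVLoader;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.NumericToNominal;

    public class PrepPendigits {
        public static void main(String[] args) throws Exception {
            // Load the CSV file; the loader infers all pendigits fields as numeric.
            CSVLoader loader = new CSVLoader();
            loader.setSource(new File("docs/data/pendigits.csv"));
            Instances data = loader.getDataSet();

            // Convert the last attribute ("class") from numeric to nominal.
            NumericToNominal toNominal = new NumericToNominal();
            toNominal.setAttributeIndices("last");
            toNominal.setInputFormat(data);
            Instances converted = Filter.useFilter(data, toNominal);

            // Mark it as the class attribute and confirm it is now nominal.
            converted.setClassIndex(converted.numAttributes() - 1);
            System.out.println(converted.classAttribute());
        }
    }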