The Weka scoring plugin is a tool that allows classification and clustering models created with Weka to be used to "score" new data as part of a Kettle transform. "Scoring" simply means attaching a prediction to an incoming row of data. The plugin can handle all types of classifiers and clusterers that can be constructed in Weka. As of Weka version 3.6.0, it can also handle certain types of models expressed in the Predictive Model Markup Language (PMML). The plugin can attach a predicted label (classification/clustering), a number (regression) or a probability distribution (classification/clustering) to a row of data.
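To make the idea of "attaching a prediction to a row" concrete, here is a minimal sketch in plain Java. It does not use the Weka or Kettle APIs: the names ScoringSketch, Model, thresholdModel and score are hypothetical and exist only for illustration. The toy model returns a probability distribution over two labels, and scoring appends the most probable label to a copy of the row.

```java
import java.util.Arrays;

// Illustrative only: none of these names come from Weka or Kettle.
class ScoringSketch {

    /** Stand-in for a trained model: maps numeric inputs to a class distribution. */
    interface Model {
        String[] labels();
        double[] distribution(double[] inputs);
    }

    /** A toy two-class model: "high" is likely when the first input exceeds the cutoff. */
    static Model thresholdModel(final double cutoff) {
        return new Model() {
            public String[] labels() {
                return new String[] {"low", "high"};
            }
            public double[] distribution(double[] inputs) {
                double pHigh = inputs[0] > cutoff ? 0.9 : 0.2;
                return new double[] {1.0 - pHigh, pHigh};
            }
        };
    }

    /** Returns a copy of the row with the most probable label appended as a new field. */
    static Object[] score(Object[] row, double[] inputs, Model model) {
        double[] dist = model.distribution(inputs);
        int best = 0;
        for (int i = 1; i < dist.length; i++) {
            if (dist[i] > dist[best]) {
                best = i;
            }
        }
        Object[] scored = Arrays.copyOf(row, row.length + 1);
        scored[row.length] = model.labels()[best];
        return scored;
    }

    public static void main(String[] args) {
        Object[] row = {"sensor-7", 72.5};
        Object[] scored = score(row, new double[] {72.5}, thresholdModel(50.0));
        System.out.println(Arrays.toString(scored)); // prints [sensor-7, 72.5, high]
    }
}
```

Appending the whole distribution instead of a single label would simply add one output field per class.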
2 Getting Started
In order to use the Weka scoring plugin, a model must first be created in Weka and then exported as a serialized Java object to a file. Alternatively, an existing supported PMML model can be used. The model can then be loaded by the plugin and applied to fresh data. This section briefly describes how to create and export a model from Weka.
2.1 Starting the Weka Explorer
Assuming you have Weka 3.6.x or 3.7.x installed, launch the Weka environment by double-clicking on the weka.jar file or by selecting it from the Start menu (under Windows). Once the UI is visible, click the Explorer button.
Launching Weka from a command line instead (for example, "java -Xmx1024m -jar weka.jar") allows you to control how much memory is made available to the Java virtual machine through the use of the "-Xmx" flag.
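The effect of the "-Xmx" flag can be checked from inside any Java program. The following sketch is not part of Weka (the class name MaxHeapSketch is hypothetical). It prints the maximum heap size the running JVM will attempt to use, as reported by the standard Runtime API.

```java
// Illustrative only: a small check of the heap limit set by -Xmx.
class MaxHeapSketch {

    /** Maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Running it with different "-Xmx" values should show the reported figure change accordingly; the exact number can differ slightly from the value passed on the command line.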
2.2 Loading Data into the Explorer
Data can be imported into the Explorer from files (ARFF, CSV or C4.5 format) or from databases. In this example we will load data from a file in Weka's native ARFF (Attribute Relation File Format) format. Click on "Open File" and select the "pendigits.arff" file (this file is located in the docs/data directory in the Weka scoring plugin archive). The file will be loaded, and summary statistics for the attributes will be shown in the Preprocess panel.
2.3 Building a Classifier
In the Classifier panel of the Explorer, first choose a learning scheme to apply to the training data. In this example you will use a decision tree learner (J48).
2.4 Exporting the Trained Classifier
You can export any classifier that you have trained in the Classifier panel by right-clicking on its entry in the Results History. Trained models are stored on disk as serialized Java objects. Save this model to a file called "J48" (a ".model" extension will be added for you).
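For readers unfamiliar with serialized Java objects, the sketch below shows the general mechanism using only the standard library. It is not the code Weka uses: ToyModel stands in for a trained classifier, and ModelFileSketch, save, load and roundTrip are hypothetical names chosen for illustration.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Illustrative only: a toy model written to and read from a ".model" file.
class ModelFileSketch {

    /** Hypothetical stand-in for a trained classifier. */
    static class ToyModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final double threshold;

        ToyModel(double threshold) {
            this.threshold = threshold;
        }
    }

    /** Writes the object to the file using standard Java serialization. */
    static void save(Serializable model, File file) {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeObject(model);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads back whatever object was serialized into the file. */
    static Object load(File file) {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Model class is not on the classpath", e);
        }
    }

    /** Saves the model to a temporary ".model" file and loads it again. */
    static Object roundTrip(Serializable model) {
        try {
            File file = File.createTempFile("J48", ".model");
            file.deleteOnExit();
            save(model, file);
            return load(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ToyModel restored = (ToyModel) roundTrip(new ToyModel(0.5));
        System.out.println(restored.threshold); // prints 0.5
    }
}
```

One practical consequence of this format is that the classes of the saved object must be available on the classpath of whatever program loads the file.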
3 Using the Weka Classifier in Kettle
Using the trained model in Kettle to score new data is simply a matter of configuring the Weka scoring plugin to load and apply the model file you created in the previous section.
3.1 A Simple Example
As a simple demonstration of how to use the scoring plugin, you will use the model you created in Weka to score the same data that it was trained on. First start Spoon, and then construct a simple transform that links a CSV input step to the Weka scoring step.
4 Advanced Features
4.1 Storing Models in Kettle XML Configuration File or Repository
When a transform is executed, the Weka scoring plugin will load the model from disk using the path specified in the File tab. It is also possible to store the model in Kettle's XML configuration file or the repository and use this instead when the transform is executed. To do this, first load a model into the Weka scoring step as described previously. Once you are satisfied that the fields have been mapped correctly and the model is correct, you can clear the "Load/import model" text box and click the "OK" button. When the transform is saved, the model will be stored in the XML configuration file or the repository (if one is being used).
4.2 Updating Incremental Models on the Incoming Data Stream
If the Weka model being used is an incremental one (that is, one that can be updated/trained on a row-by-row basis), then updating on the incoming data can be turned on by selecting the "Update model" checkbox. Furthermore, the updated model can be saved to a file after the transform completes by providing a file name in the "Save updated model" text box.
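As a rough illustration of what "updated on a row-by-row basis" means, the sketch below implements a toy incremental learner in plain Java. It is not a Weka classifier: IncrementalSketch and its methods are hypothetical names. It keeps a running mean of a single numeric field per class label and predicts the label whose mean is closest, so each labelled row can refine the model without revisiting earlier rows.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a toy nearest-mean learner that is trained one row at a time.
class IncrementalSketch {

    // label -> {row count, running mean of the numeric field}
    private final Map<String, double[]> stats = new HashMap<>();

    /** Folds one labelled row into the model without storing the row. */
    void update(double value, String label) {
        double[] s = stats.computeIfAbsent(label, k -> new double[2]);
        s[0] += 1.0;
        s[1] += (value - s[1]) / s[0]; // incremental mean update
    }

    /** Predicts the label with the closest running mean, or null before any update. */
    String predict(double value) {
        String best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[]> e : stats.entrySet()) {
            double distance = Math.abs(value - e.getValue()[1]);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        IncrementalSketch model = new IncrementalSketch();
        model.update(1.0, "a");
        model.update(3.0, "a");
        model.update(10.0, "b");
        System.out.println(model.predict(2.5)); // prints a
    }
}
```

In a streaming setting the usual pattern is to score each row with the current model first and then update the model with that row once its true label is known.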
4.3 Using PMML Models
As of Weka version 3.6.0, certain types of models expressed in the Predictive Model Markup Language (PMML) can be imported by Weka and by the Weka scoring plugin. Version 3.6.0 can import Regression, General regression and Neural Network models. Version 3.7.0 adds Tree models. Loading a PMML model file into the Weka scoring step is simply a matter of browsing to the location of the XML file that contains the model. Selecting "PMML model file" in the file type drop-down box of the file chooser dialog will show all files with a ".xml" extension. Once loaded, the model can be used in exactly the same way as a Weka native model.
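Before loading a PMML file, it can be handy to check which kind of model it contains. The sketch below does this with the standard Java XML parser; it is not how Weka or the plugin reads PMML, and PmmlKindSketch and modelKind are hypothetical names. It assumes the model element is a direct child of the root element and carries no namespace prefix, and it looks for the PMML element names that correspond to the four model types listed above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Illustrative only: reports which supported model element a PMML document contains.
class PmmlKindSketch {

    // PMML element names for the model types mentioned in this section.
    static final List<String> SUPPORTED = Arrays.asList(
            "RegressionModel", "GeneralRegressionModel", "NeuralNetwork", "TreeModel");

    /** Returns the name of the first supported model element, or null if none is found. */
    static String modelKind(String pmmlXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(pmmlXml.getBytes(StandardCharsets.UTF_8)));
            // Assumption: the model element is a direct child of the root element.
            for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.ELEMENT_NODE && SUPPORTED.contains(n.getNodeName())) {
                    return n.getNodeName();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a readable XML document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<PMML version=\"3.2\"><Header/><DataDictionary/><TreeModel/></PMML>";
        System.out.println(modelKind(xml)); // prints TreeModel
    }
}
```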
More information on PMML support in Weka, including a roadmap for development, can be found here.
5 Tips and Tricks
5.1 Maximizing Throughput
In order to maximize the data throughput in the Weka scoring step, it is necessary to understand how Weka represents attribute values internally. All values encapsulated in Weka's Instance class are stored as primitive Java doubles. This is the case for integer, real and nominal (discrete) values. For nominal values, the double holds the index of the discrete value. Maximum speed, when the scoring step converts Kettle rows to Weka Instances, will be achieved when the incoming Kettle rows contain values that are Numbers (Doubles) for all numeric fields and Strings for all discrete fields. Integers or Booleans will trigger a conversion to Double. When there are a lot of fields, this conversion will affect performance.
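The representation described above can be sketched in a few lines of plain Java. This is not Weka's Instance class or the plugin's conversion code: InstanceEncodingSketch and encode are hypothetical names. Numeric fields are widened to double, and each nominal field is replaced by the index of its label in that field's list of allowed values.

```java
import java.util.Arrays;

// Illustrative only: turns a mixed row into the all-double form described above.
class InstanceEncodingSketch {

    /**
     * Encodes a row as doubles. nominalLabels[i] is null for a numeric field,
     * or the ordered list of allowed labels for a nominal field.
     */
    static double[] encode(Object[] row, String[][] nominalLabels) {
        double[] values = new double[row.length];
        for (int i = 0; i < row.length; i++) {
            if (nominalLabels[i] == null) {
                // Numeric field: any Number (Double, Long, ...) is converted to double.
                values[i] = ((Number) row[i]).doubleValue();
            } else {
                // Nominal field: store the index of the label, held as a double.
                int index = Arrays.asList(nominalLabels[i]).indexOf(row[i]);
                if (index < 0) {
                    throw new IllegalArgumentException("Unknown label: " + row[i]);
                }
                values[i] = index;
            }
        }
        return values;
    }

    public static void main(String[] args) {
        Object[] row = {47.0, 3L, "b"};
        String[][] labels = {null, null, {"a", "b", "c"}};
        System.out.println(Arrays.toString(encode(row, labels))); // prints [47.0, 3.0, 1.0]
    }
}
```

The second field shows the kind of per-value conversion the text warns about: a Long has to be turned into a double for every row, whereas a field that already arrives as a Double can be used directly.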
5.2 A Note About the CSV Input Step
Kettle's CSV input step has an option called "Lazy conversion". By default, lazy conversion is turned on. This means that fields are read from the CSV file into byte arrays. Construction of objects that represent the correct type for a particular field only occurs when a Kettle step requests a particular value from a row of data. This makes reading rows of data from the CSV file extremely fast, and little overhead is imposed as long as downstream steps are not requesting a lot of values from the data rows. In the case of the Weka scoring step, most, if not all, values in a row will need to be accessed in order to construct an Instance to pass to the classifier. When there are many attributes/fields this can result in a performance hit. Better performance can be achieved by turning off "Lazy conversion" in the CSV input step. This has the effect of making the CSV input step construct the correct type of object for a field as data is read in.
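The trade-off can be illustrated with a small sketch in plain Java. It is not Kettle's implementation: LazyRowSketch and its methods are hypothetical names. Each field is kept as raw bytes and only parsed into a number when a caller asks for it, and a counter records how many conversions have taken place.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: fields stay as bytes until a value is requested.
class LazyRowSketch {

    private final byte[][] rawFields;
    private int conversionCount = 0;

    /** Stores each field as bytes; the line is split as a String here only for brevity. */
    LazyRowSketch(String csvLine) {
        String[] parts = csvLine.split(",");
        rawFields = new byte[parts.length][];
        for (int i = 0; i < parts.length; i++) {
            rawFields[i] = parts[i].getBytes(StandardCharsets.UTF_8);
        }
    }

    /** Parses the requested field on demand (on every request in this sketch). */
    double getDouble(int index) {
        conversionCount++;
        return Double.parseDouble(new String(rawFields[index], StandardCharsets.UTF_8).trim());
    }

    int conversionCount() {
        return conversionCount;
    }

    int size() {
        return rawFields.length;
    }

    public static void main(String[] args) {
        LazyRowSketch row = new LazyRowSketch("47,100,27.5");
        System.out.println(row.conversionCount()); // prints 0: nothing parsed yet
        for (int i = 0; i < row.size(); i++) {
            row.getDouble(i); // a step that reads every field pays for every conversion
        }
        System.out.println(row.conversionCount()); // prints 3
    }
}
```

A step that reads only one field out of many pays for a single conversion, which is where lazy reading helps. A step that reads every field, as the scoring step does, pays for all of them anyway.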