

Bayesian logistic regression and discriminative multinomial naive Bayes for text classification

Two fast and powerful techniques for text classification problems. They often outperform SMO (support vector machine) without parameter tuning (weka.classifiers.bayes.BayesianLogisticRegression, weka.classifiers.bayes.DMNBtext). See:

Alexander Genkin, David D. Lewis, David Madigan (2004). Large-scale bayesian logistic regression for text categorization (

Jiang Su, Harry Zhang, Charles X. Ling, Stan Matwin (2008). Discriminative Parameter Learning for Bayesian Networks. In: ICML 2008.

Functional trees

João Gama's tree learner that incorporates oblique splits and functions at the leaves (weka.classifiers.trees.FT). See:

João Gama (2004). Functional Trees. Machine Learning, Vol. 55(3), Kluwer Academic Press.

Decision table/naive Bayes hybrid classifier

A semi-naive Bayesian ranking method that combines decision tables with naive Bayes (weka.classifiers.rules.DTNB). See:

Mark Hall, Eibe Frank (2008). Combining Naive Bayes and Decision Tables. In: Proceedings of the 21st Florida Artificial Intelligence Society Conference. Miami, Florida. AAAI Press.



A clustering algorithm for transactional data (weka.clusterers.CLOPE). See:

Yiling Yang, Xudong Guan, Jinyuan You (2002). CLOPE: a fast and effective clustering algorithm for transactional data. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 682-687.


Clustering using the sequential information bottleneck algorithm (weka.clusterers.sIB). See:

Noam Slonim, Nir Friedman, Naftali Tishby (2002). Unsupervised document classification using sequential information maximization. In: Proceedings of the 25th International ACM SIGIR Conference on Research and Development in Information Retrieval, 129-136.

Attribute selection

Cost-sensitive attribute and subset evaluation

Makes attribute and subset evaluators cost-sensitive by re-weighting or resampling the input data according to a supplied cost matrix (weka.attributeSelection.CostSensitiveAttributeEval, weka.attributeSelection.CostSensitiveSubsetEval).
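
The re-weighting idea can be illustrated with a minimal Python sketch. It is not Weka's code: the function name reweight_by_cost, the cost_matrix[actual][predicted] layout, and the choice of the row sum as the per-class factor are all assumptions made for illustration.

```python
def reweight_by_cost(labels, cost_matrix, weights=None):
    """Scale each instance weight by the total misclassification cost of its class.

    Hypothetical helper: labels are class indices, and cost_matrix is assumed
    to be indexed as cost_matrix[actual][predicted].
    """
    if weights is None:
        weights = [1.0] * len(labels)

    # Assumed convention: the factor for a class is the sum of its cost row.
    factors = [sum(row) for row in cost_matrix]
    scaled = [w * factors[y] for w, y in zip(weights, labels)]

    # Rescale so the total weight of the data set stays the same.
    total_before = sum(weights)
    total_after = sum(scaled)
    if total_after == 0:
        raise ValueError("cost matrix assigns zero cost to every class present")
    return [s * total_before / total_after for s in scaled]


# Misclassifying class 1 is five times as costly as misclassifying class 0.
example_weights = reweight_by_cost([0, 0, 0, 1], [[0, 1], [5, 0]])
```

Under these assumptions the single class-1 instance ends up with five times the weight of each class-0 instance, while the total weight of the data set is unchanged.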

Filtered attribute and subset evaluation

Apply a filter (or set of filters) to the input data before applying attribute selection (weka.attributeSelection.FilteredAttributeEval, weka.attributeSelection.FilteredSubsetEval).

Latent semantic analysis

Perform SVD-based latent semantic analysis via the attribute selection interface, or transform data using LSA via the AttributeSelection filter (weka.attributeSelection.LatentSemanticAnalysis, weka.filters.supervised.attribute.AttributeSelection). 

Improved output

Output has been improved for naive Bayes, logistic regression and k-means clustering. 



Plugin support

The KnowledgeFlow now offers the ability to easily add new components via a plugin mechanism. Plugins are installed in a directory called .knowledgeFlow/plugins in the user's home directory and are dynamically loaded by the KnowledgeFlow at runtime.

Headless execution

Flows can now be executed from outside of the GUI KnowledgeFlow environment. weka.gui.beans.FlowRunner can be executed from the command line, or used programmatically, to run multiple flows in parallel.

Instance weights

Instance weights have long been used internally by meta classifiers (e.g. boosting methods), but until now they could only be specified in data files by using the XML-based XRFF (eXtensible attribute-Relation File Format) format. Now it is possible to specify instance weights in standard ARFF files.

A weight can be associated with an instance in a standard ARFF file by appending it to the end of the line for that instance and enclosing the value in curly braces. For example:

0, X, 0, Y, "class A", {5}

For a sparse instance, this example would look like:

{1 X, 3 Y, 4 "class A"}, {5}

Any instance without a weight value specified is assumed to have a weight of 1 for backwards compatibility.
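
As a rough illustration of the syntax above, the following Python sketch splits the optional trailing weight off a data line and falls back to a weight of 1 when none is given. It is not Weka's ARFF reader: the function names are hypothetical, and the parsing is deliberately simplified (for instance, sparse values containing commas are not handled).

```python
import csv
import io
import re

# A trailing ", {number}" at the end of a data line, as in the examples above.
_WEIGHT_RE = re.compile(r",\s*\{\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)\s*\}\s*$")


def split_weight(line):
    """Return (instance_text, weight); the weight defaults to 1.0."""
    line = line.strip()
    match = _WEIGHT_RE.search(line)
    if match is None:
        return line, 1.0
    return line[:match.start()], float(match.group(1))


def parse_dense(text):
    """Parse comma-separated values, honouring double-quoted strings."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [value.strip() for value in next(reader)]


def parse_sparse(text):
    """Parse '{index value, ...}' into a dict of attribute index -> value."""
    inner = text.strip()[1:-1]
    if not inner.strip():
        return {}
    values = {}
    for item in next(csv.reader(io.StringIO(inner), skipinitialspace=True)):
        index, value = item.strip().split(None, 1)
        values[int(index)] = value.strip('"')
    return values


def parse_line(line):
    """Parse one dense or sparse data line into (values, weight)."""
    text, weight = split_weight(line)
    if text.startswith("{"):
        return parse_sparse(text), weight
    return parse_dense(text), weight


dense = parse_line('0, X, 0, Y, "class A", {5}')
sparse = parse_line('{1 X, 3 Y, 4 "class A"}, {5}')
unweighted = parse_line('0, X, 0, Y, "class A"')
```

The sketch treats every value as a string; a real reader would convert values according to the attribute declarations in the ARFF header.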

Running an experiment using clusterers

Using the advanced mode of the Experimenter it is now possible to run experiments on clustering algorithms as well as classifiers. The main evaluation metric for this type of experiment is the log likelihood of the clusters found by each clusterer.
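
To give a feel for what a log-likelihood score measures, here is a small Python sketch that scores one-dimensional data under a mixture of Gaussian clusters. It is an illustration only, not the Experimenter's implementation: the function names, the (prior, mean, std) cluster representation, and the choice to average over instances are assumptions.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def log_likelihood(points, clusters):
    """Average log likelihood of points under a mixture of Gaussians.

    clusters is a list of (prior, mean, std) tuples (hypothetical layout);
    the priors are expected to sum to 1.
    """
    total = 0.0
    for x in points:
        density = sum(p * gaussian_pdf(x, m, s) for p, m, s in clusters)
        total += math.log(density)
    return total / len(points)


points = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]
two_clusters = [(0.5, 0.0, 0.2), (0.5, 5.0, 0.2)]  # matches the two groups
one_cluster = [(1.0, 2.5, 0.2)]                    # sits between the groups

good_score = log_likelihood(points, two_clusters)
poor_score = log_likelihood(points, one_cluster)
```

A clustering that describes the data well assigns it a higher density, so good_score is larger than poor_score here; comparing clusterers by this kind of score is the idea behind the metric.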
