Hitachi Vantara Pentaho Community Wiki

Classifiers

Random Trees and Random Forests (also applies to version 3.6.1)

RandomTree is now much faster. Code taken originally from REPTree to avoid re-sorting of data has been removed: it is not beneficial in this case because it sorts *all* attributes before the tree is built; in the new version, only the local data for the randomly selected attributes is sorted. This also means that RandomForest is much faster (it can be as much as an order of magnitude faster on UCI datasets). The memory footprint has also been reduced. RandomTree now also has an option to perform backfitting, so that unbiased probability estimates can be obtained by backfitting a hold-out set.
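The backfitting idea can be illustrated with a minimal Python sketch. It is not Weka's implementation: the dictionary-based tree, the `route` and `backfit` helpers, and the toy hold-out data are all assumptions made for illustration. The tree structure is taken as already grown on the training data, and only the class distributions at the leaves are re-estimated from instances the tree has never seen.

```python
from collections import Counter

def route(tree, x):
    """Follow the split tests from the root until a leaf id is reached."""
    node = tree
    while "leaf" not in node:
        go_left = x[node["attr"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]

def backfit(tree, holdout, n_classes):
    """Re-estimate leaf class probabilities from hold-out (x, y) pairs."""
    counts = {}
    for x, y in holdout:
        counts.setdefault(route(tree, x), Counter())[y] += 1
    probs = {}
    for leaf, c in counts.items():
        total = sum(c.values())
        probs[leaf] = [c[k] / total for k in range(n_classes)]
    return probs

# Hypothetical one-split tree, assumed to have been grown on separate training data.
tree = {"attr": 0, "threshold": 2.5,
        "left": {"leaf": "A"}, "right": {"leaf": "B"}}
holdout = [((1.0,), 0), ((2.0,), 0), ((2.4,), 1), ((3.0,), 1), ((4.0,), 1)]
leaf_probs = backfit(tree, holdout, n_classes=2)
```

Leaves that receive no hold-out instances are simply absent from the result here; a real implementation would need a fallback for them.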

Attribute selection

Tabu search

Tabu search, a metaheuristic neighborhood search method, applied to feature selection (weka.attributeSelection.TabuSearch). See:

Abdel-Rahman Hedar, Jue Wang, Masao Fukushima: Tabu search for attribute reduction in rough set theory. Soft Comput. 12(9): 909-918 (2008). 

Thanks to Adrian Pino for this contribution.
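For readers unfamiliar with the technique, the following Python sketch shows a generic tabu search over attribute subsets. It is an illustration only and does not reproduce weka.attributeSelection.TabuSearch or the algorithm in the cited paper: the `tabu_search` function, the single-attribute flip neighborhood, the fixed-length tabu list, and the toy `score` function are all assumptions.

```python
from collections import deque

def tabu_search(n_attrs, score, iterations=20, tabu_size=2):
    """Search attribute subsets by flipping one attribute per step.

    Recently flipped attributes are 'tabu' and cannot be flipped again
    until they drop off the fixed-length tabu list.
    """
    current = frozenset()
    best, best_score = current, score(current)
    tabu = deque(maxlen=tabu_size)
    for _ in range(iterations):
        candidates = []
        for a in range(n_attrs):
            if a in tabu:
                continue
            neighbor = current ^ {a}  # add a if absent, remove it if present
            candidates.append((score(neighbor), a, neighbor))
        if not candidates:
            break
        # Move to the best non-tabu neighbor, even if it is worse than current;
        # ties are broken in favor of the lowest attribute index.
        s, a, current = max(candidates, key=lambda c: (c[0], -c[1]))
        tabu.append(a)
        if s > best_score:
            best, best_score = current, s
    return best, best_score

# Toy merit function (an assumption): attributes 0 and 2 are useful,
# and every selected attribute carries a small cost.
def score(subset):
    return len(subset & {0, 2}) - 0.1 * len(subset)

best, best_score = tabu_search(4, score)
```

Accepting the best non-tabu move even when it lowers the score is what lets the search leave local optima, while the best subset seen so far is tracked separately.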

Wrapper subset evaluator

The Wrapper subset evaluator (weka.attributeSelection.WrapperSubsetEval) now supports evaluation metrics other than error rate (classification) and RMSE (regression). The additional metrics are: MAE, F-measure, AUC, RMSE (probabilities), and MAE (probabilities).
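As a reminder of what some of these metrics measure, here is a small Python sketch of MAE, RMSE, and the F-measure for a single class. The function names and the toy data are assumptions for illustration; this is not how WrapperSubsetEval computes them internally.

```python
import math

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

def f_measure(actual, predicted, positive):
    """Harmonic mean of precision and recall for the class `positive`."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

numeric_mae = mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
numeric_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
f_yes = f_measure(["y", "y", "n", "n"], ["y", "n", "y", "n"], "y")
```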

Association rules

Apriori

Apriori (weka.associations.Apriori) can now make use of market basket-type data in sparse instances format. In this case, zeros (which are not stored explicitly in the sparse format) represent the absence of items from baskets. Previously, market basket data had to be encoded using Weka's missing value indicator to mark absent items. Sparse data allows larger data sets to be loaded and processed by Apriori.
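The benefit of the sparse encoding can be seen in a short Python sketch. It is not Apriori itself, only the support-counting step on a toy data set, and the dictionary representation of a sparse instance, the item names, and the `support_counts` helper are assumptions. Each basket stores only the items it contains, so absent items (the implicit zeros) cost no memory and are never visited.

```python
from collections import Counter
from itertools import combinations

# Hypothetical item names, indexed by attribute position.
items = ["bread", "milk", "butter"]

# Sparse baskets: attribute index -> 1 for items present; zeros are not stored.
baskets = [
    {0: 1, 2: 1},
    {0: 1, 1: 1, 2: 1},
    {1: 1},
    {0: 1, 2: 1},
]

def support_counts(baskets, size):
    """Count how many baskets contain each itemset of the given size."""
    counts = Counter()
    for basket in baskets:
        present = sorted(basket)  # only the stored (non-zero) entries
        for itemset in combinations(present, size):
            counts[itemset] += 1
    return counts

pairs = support_counts(baskets, 2)
bread_butter = pairs[(0, 2)]  # baskets containing both bread and butter
```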

Filters

EM Imputation of missing values

EMImputation (weka.filters.unsupervised.attribute.EMImputation) replaces missing values in a data set by using Expectation Maximization with a multivariate normal model. This is a sophisticated alternative to Weka's standard imputation using means/modes. See:

Schafer, J.L. Analysis of Incomplete Multivariate Data, New York: Chapman and Hall, 1997. 

Thanks to Amri Napolitano for this contribution. 
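To give a feel for the approach, the sketch below runs EM on the simplest possible case: two numeric attributes, where x is fully observed and some y values are missing. It is a deliberately reduced illustration and not the EMImputation filter: the bivariate restriction, the `em_impute_y` name, the fixed iteration count, and the starting values are all assumptions. The E-step fills in the expected y (and expected y squared) given x under the current normal model, and the M-step re-estimates the means and (co)variances from those expectations.

```python
def em_impute_y(xs, ys, iterations=200):
    """Impute missing entries (None) in ys under a bivariate normal model.

    Assumes xs is fully observed and not constant.
    """
    n = len(xs)
    observed = [y for y in ys if y is not None]
    mu_x = sum(xs) / n
    s_xx = sum((x - mu_x) ** 2 for x in xs) / n
    # Starting values taken from the observed y values only.
    mu_y = sum(observed) / len(observed)
    s_yy = sum((y - mu_y) ** 2 for y in observed) / len(observed)
    s_xy = 0.0
    for _ in range(iterations):
        beta = s_xy / s_xx
        resid_var = max(s_yy - beta * s_xy, 0.0)  # Var(y | x)
        # E-step: expected y and y^2 for each instance.
        ey = [y if y is not None else mu_y + beta * (x - mu_x)
              for x, y in zip(xs, ys)]
        ey2 = [e * e + (0.0 if y is not None else resid_var)
               for e, y in zip(ey, ys)]
        # M-step: update the parameters from the expected statistics.
        mu_y = sum(ey) / n
        s_yy = sum(ey2) / n - mu_y ** 2
        s_xy = sum((x - mu_x) * (e - mu_y) for x, e in zip(xs, ey)) / n
    beta = s_xy / s_xx
    return [y if y is not None else mu_y + beta * (x - mu_x)
            for x, y in zip(xs, ys)]

# Toy data where y = 2x exactly on the complete rows.
filled = em_impute_y([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, None])
```

On this toy data the imputed value approaches 8, the value implied by the linear relationship in the complete rows, whereas mean imputation would have filled in 4.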

Sort nominal labels

The SortLabels filter (weka.filters.unsupervised.attribute.SortLabels) sorts the labels of a nominal attribute.
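Sorting labels is more than a cosmetic change when nominal values are stored as indices into the label list, because every stored index has to be remapped consistently. The Python sketch below illustrates this; the `sort_labels` helper, the plain ascending case-sensitive order, and the assumption that labels are unique are choices made for the example, not a description of the filter's options.

```python
def sort_labels(labels, values):
    """Sort a nominal attribute's labels and remap the stored indices.

    labels: unique label strings in their original order.
    values: per-instance indices into `labels`.
    """
    sorted_labels = sorted(labels)
    new_index = {label: i for i, label in enumerate(sorted_labels)}
    remapped = [new_index[labels[v]] for v in values]
    return sorted_labels, remapped

labels = ["medium", "low", "high"]
values = [0, 1, 2, 1]  # medium, low, high, low
new_labels, new_values = sort_labels(labels, values)
```

After the call, each instance still refers to the same label as before; only the numbering has changed.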

PMML import

Import of PMML TreeModel is now supported.

Data converters

Support has been added in Weka 3.7 for reading and writing MATLAB's ASCII file format (single matrix per file only).
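The format in question is plain text with one matrix row per line and whitespace-separated numbers. A minimal Python sketch of a reader follows; it is not Weka's converter, and the `parse_ascii_matrix` name and the strictness about equal row lengths are assumptions.

```python
def parse_ascii_matrix(text):
    """Parse whitespace-separated numeric rows into a list of float lists."""
    rows = [[float(token) for token in line.split()]
            for line in text.splitlines() if line.strip()]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all rows must have the same number of columns")
    return rows

sample = """
   1.0000000e+00   2.0000000e+00   3.0000000e+00
   4.0000000e+00   5.0000000e+00   6.0000000e+00
"""
matrix = parse_ascii_matrix(sample)
```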

Knowledge Flow

The Knowledge Flow now has GUI support for environment variables. Both system and Java variables can be used in file paths and other settings for all data sources and data sinks (including the serialized model saver component).
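The kind of substitution involved can be sketched in Python as follows. The `${NAME}` placeholder syntax, the `substitute` helper, and the variable names are assumptions made for the example; consult the Knowledge Flow itself for the exact syntax it accepts.

```python
import re

def substitute(path, variables):
    """Replace each ${NAME} placeholder in path with its value."""
    def replace(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError("undefined variable: " + name)
        return variables[name]
    return re.sub(r"\$\{([^}]+)\}", replace, path)

# Hypothetical variable and file names.
resolved = substitute("${DATA_DIR}/iris.arff", {"DATA_DIR": "/tmp/data"})
```

Keeping paths in this form lets the same flow run unchanged on machines where the data lives in different directories.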

Groovy scripting

Weka now has support for basic editing and execution of Groovy scripts. Groovy is a dynamic language that allows you to quickly experiment with using Weka's core classes programmatically. See also the Groovy scripting plugin for the Knowledge Flow.

Instance weights in the Explorer's Preprocess panel

Since instance weights can now be specified in standard ARFF and XML-based ARFF formats, the Preprocess panel of the Explorer has been upgraded to reflect information on weights and take weights into account for displayed statistics and histograms. 
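Taking weights into account means that each instance contributes to a statistic in proportion to its weight rather than counting once. The Python sketch below shows a weighted mean and weighted label counts (the quantities behind a histogram); the helper names and toy data are assumptions and not the Explorer's code.

```python
from collections import defaultdict

def weighted_mean(values, weights):
    """Mean in which each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def weighted_counts(labels, weights):
    """Sum of instance weights per nominal label."""
    counts = defaultdict(float)
    for label, weight in zip(labels, weights):
        counts[label] += weight
    return dict(counts)

values = [1.0, 2.0, 3.0]
labels = ["yes", "no", "yes"]
weights = [1.0, 1.0, 2.0]

mean = weighted_mean(values, weights)
histogram = weighted_counts(labels, weights)
```

With all weights equal to 1, both functions reduce to the ordinary mean and plain counts.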

