Quick Start and Overview
Pentaho Data Mining, based on the Weka project, is a comprehensive set of tools for machine learning and data mining. Its broad suite of classification, regression, association rule, and clustering algorithms can help you understand your business better and improve future performance through predictive analytics.
There are three versions of Weka:
- Weka 3.4 - stable branch that was created in 2003 to correspond with what is described in the 2nd edition of the Witten and Frank Data Mining book (published 2005). This branch is feature frozen and receives only bug fixes. It is also reaching end of life.
- Weka 3.6 - stable branch that was created in mid 2008 to correspond with what is described in the 3rd edition of the Witten, Frank and Hall Data Mining book (published January 2011). This branch is feature frozen and receives only bug fixes.
- Weka 3.7 - development branch. This is a continuation of the 3.6 code line that receives both bug fixes and new features.
- Pentaho Data Mining Home Page (News, Downloads, Forums, Bug tracking etc.)
- Pentaho Data Mining Forum
- FAQ
- Pentaho Data Mining Screenshots
- Video Tutorial: Pentaho Data Mining Overview and Use Case
- A collection of videos on using Weka from dataminingtools.net
- A collection of Weka videos contributed by Bill Claster
- An introductory article on data mining with Weka by Michael Abernethy at IBM developerWorks
- Tutorial slides on Weka from dataminingtools.net
Documentation
Pentaho Data Mining (Weka)
- English documentation for Weka 3.6.8 (stable, book 3rd ed. version)
- English documentation for Weka 3.7.7 (development version)
- English documentation for Weka 3.4.19 (book 2nd ed. version): Explorer guide, Experimenter tutorial
- Wiki at wikispaces.com
- What's new in Weka 3.7.8
- What's new in Weka 3.7.7
- What's new in Weka 3.7.6
- What's new in Weka 3.7.5
- What's new in Weka 3.7.4
- What's new in Weka 3.7.3
- What's new in Weka 3.7.2
- What's new in Weka 3.7.1
- What's new in Weka 3.7.0
- What's new in Weka 3.6.0
- What's new in Weka 3.5.8
- Data Mining Algorithms and Tools in Weka
- A white paper on deploying Weka models with Pentaho
- Time Series Analysis and Forecasting with Weka
- R-Project integration
- Handling large data sets with Weka
- Cost/Benefit tool for analysis of direct mail applications
- Running and using Weka server instances
A companion book has been written to accompany Weka: Data Mining: Practical Machine Learning Tools and Techniques (Third Edition).
Plugins for Pentaho Data Integration (Kettle)
- Using the Weka Scoring Plugin (download)
- Using the Reservoir Sampling Plugin (download)
- Using the ARFF Output Plugin (download)
- Using the Univariate Statistics Plugin (download)
- Using the Knowledge Flow Plugin (enterprise edition)
- Time Series Analysis and Forecasting with Weka (available as a PDI Spoon perspective as well as a Weka plugin)
- Weka time series forecasting plugin for PDI 4 (enterprise edition)
- 3D Visualization Perspective for PDI 4 (download)
Developing with Weka
Awards and Publications
- Ian H. Witten, Eibe Frank, and Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Burlington, MA, 3rd edition, 2011.
- Remco R. Bouckaert, Eibe Frank, Mark A. Hall, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. WEKA - experiences with a Java open-source project. Journal of Machine Learning Research, 11:2533-2541, 2010.
- Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1), 2009.
- ACM SIGKDD Service Award 2005
Under Development/Roadmap
- UI/functionality refresh for Weka's Knowledge Flow (multiple layouts in separate tabs, tree list of components, parallel or serial execution of separate paths in a single flow, cut/copy and paste, "notes" on the flow layout, plugin "perspectives"...)
- PMML Support in Weka
- NoSQL (Cassandra and HBase) data sources and data sinks for the Knowledge Flow and Explorer
- Incremental dictionary creation and vectorization (StringToWordVector filter) for text documents
- Incremental SGD (SVM, logistic regression) and naive Bayes multinomial classifiers for learning directly from text (string attributes in Weka)
- Cluster visualization
Archived
- Support for parallelism in ensemble learning (Bagging, Vote, RandomCommittee etc.)
- KnowledgeFlow plugin for Kettle (ETL + Data Mining)
- HotSpot algorithm for automatic segmentation/profiling
- Groovy scripting component for the KnowledgeFlow
- Exporting visualizations from Knowledge Flow processes