Quick Start and Overview
Pentaho Data Mining, based on the Weka project, is a comprehensive set of tools for machine learning and data mining. Its broad suite of classification, regression, association rule, and clustering algorithms can help you understand your business better and improve future performance through predictive analytics.
- Pentaho Data Mining Home Page (Weka download links can be found here)
- Pentaho Data Mining Forum
- FAQ
- Pentaho Data Mining Screenshots
- Video Tutorial: Pentaho Data Mining Overview and Use Case
- A collection of videos on using Weka from dataminingtools.net
- A collection of Weka videos contributed by Bill Claster
- An introductory article on data mining with Weka by Michael Abernethy at IBM developerWorks
- Tutorial slides on Weka from dataminingtools.net
Documentation
Pentaho Data Mining (Weka)
- English documentation for Weka 3.6.3 (stable version)
- English documentation for Weka 3.7.2 (development version)
- English documentation for Weka 3.4.17 (book version): Explorer guide, Experimenter tutorial
- Data Mining Algorithms and Tools in Weka
- Handling large data sets with Weka
- Wiki at wikispaces.com
- What's new in Weka 3.7.2
- What's new in Weka 3.7.1
- What's new in Weka 3.7.0
- What's new in Weka 3.6.0
- What's new in Weka 3.5.8
A book has been written to accompany Weka: Data Mining: Practical Machine Learning Tools and Techniques (Second Edition).
Plugins for Pentaho Data Integration (Kettle)
- Using the Weka Scoring Plugin (download)
- Using the Reservoir Sampling Plugin (download)
- Using the ARFF Output Plugin (download)
- Using the Univariate Statistics Plugin (download)
- Using the Knowledge Flow Plugin (enterprise edition)
- A white paper on deploying Weka models with Pentaho
Developing with Weka
Awards and Publications
- ACM SIGKDD Service Award 2005
- Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1).
Under Development
- Weka "Lite" - package management system for Weka
- Data mining component for the BI platform
- PMML Support in Weka
- Groovy scripting component for the KnowledgeFlow
- Cost/Benefit tool for analysis of direct mail applications
- Support for parallelism in ensemble learning (Bagging, Vote, RandomCommittee, etc.)
Archived
- KnowledgeFlow plugin for Kettle (ETL + Data Mining)
- HotSpot algorithm for automatic segmentation/profiling