Weka 3.7.2 moves away from a single monolithic executable jar file to a modular package-based system. Although Weka's single jar file is only about 6 MB in size, it had become bloated in terms of the number of algorithms and options available. The plan is to have a stripped-down "core" jar file that contains all the infrastructure plus a handful of the best-known algorithms from each of the main learning categories. All other algorithms will be available to the user as downloads via a package management system.
The main benefits of this approach are twofold. From the user's perspective, Weka is less overwhelming (in terms of what is available initially) and easier to get started with. From the Weka maintainers' perspective, maintenance becomes less of a burden because it is made explicit which packages are external contributions and which come from the Weka team. Community members seeking help with an algorithm can either ask on the Weka forums (Pentaho or the Weka mailing list) or contact the author of the package in question.
Packages in 3.7.2 are hosted by either the Weka team (for internal code) or the author (for contributed code). The Weka team maintains a repository of metadata on all the available packages (not unlike the CRAN system used for the R statistical software). Both command-line and graphical package management clients are available. The package management system subsumes the existing plugin mechanisms in Weka (visualization plugins in the Explorer and the Knowledge Flow's plugin system). To alleviate library duplication, packages are able to depend on other packages (as well as on a given version of the core system). The package management software takes care of resolving dependencies and detecting conflicts. This approach makes it easy for contributors to Weka to make use of external libraries. In the past we have avoided the use of external libraries because of the added complication they introduce to the maintenance, installation and use of Weka. Under the new system, it is the responsibility of the contributor to make sure that their package(s) stay compatible with changes to external libraries (if used).
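The dependency handling described above can be pictured as a depth-first walk over a graph of packages. The following is only a minimal sketch of that idea and is not taken from Weka's package manager: the class PackageResolverSketch, its methods and the package names in the usage note are hypothetical, version constraints are ignored, and a simple cycle check stands in for real conflict detection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; this is not Weka's package manager code.
final class PackageResolverSketch {
    // Maps a package name to the names of the packages it depends on.
    private final Map<String, List<String>> dependencies = new HashMap<>();

    void addPackage(String name, String... dependsOn) {
        dependencies.put(name, Arrays.asList(dependsOn));
    }

    // Returns an install order in which every dependency precedes its dependents.
    List<String> installOrder(String target) {
        List<String> order = new ArrayList<>();
        visit(target, new HashSet<>(), new HashSet<>(), order);
        return order;
    }

    private void visit(String name, Set<String> done, Set<String> inProgress,
                       List<String> order) {
        if (done.contains(name)) {
            return; // already scheduled via another dependent
        }
        if (!dependencies.containsKey(name)) {
            throw new IllegalArgumentException("Unknown package: " + name);
        }
        if (!inProgress.add(name)) {
            // We re-entered a package that is still being resolved.
            throw new IllegalStateException("Cyclic dependency involving: " + name);
        }
        for (String dependency : dependencies.get(name)) {
            visit(dependency, done, inProgress, order);
        }
        inProgress.remove(name);
        done.add(name);
        order.add(name);
    }
}
```

For example, if a hypothetical package "pkgC" depends on "pkgB" and "libA", and "pkgB" depends on "libA", then installOrder("pkgC") lists "libA" once, ahead of both packages that need it, which is how a shared library avoids being duplicated.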
More information on using the package management system, the structure of packages, and how to contribute a package is available from the Weka Wiki on Wikispaces.
scatterPlot3D is a new package for 3.7.2 that adds a 3D scatter plot visualization to the Explorer. It requires Java 3D to be installed before use.
massiveOnlineAnalysis is another new package for 3.7.2 that adds the data stream learning techniques of the MOA tool to Weka. MOA includes incremental learning algorithms that can process (potentially) infinite amounts of data. The user has the option to specify bounds on memory consumption by a learning technique. Algorithms such as Hoeffding trees, Hoeffding trees with naive Bayes models at the leaves, Hoeffding option trees, bagging, boosting and adaptive variants thereof are available. General information on handling large data sets with Weka can be found at the Handling Large Data Sets with Weka page.
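Hoeffding trees take their name from the Hoeffding bound, which estimates how many instances a leaf must see before a split decision made on a sample is likely to match the decision that would be made on the full stream. The sketch below shows only that textbook bound and the resulting split test; the class and method names are hypothetical, and MOA's own implementation may differ in detail (for example in how it breaks ties).

```java
// Hypothetical illustration of the textbook Hoeffding bound; not MOA code.
final class HoeffdingBoundSketch {
    // range: range R of the split criterion (1.0 for information gain on two classes)
    // delta: allowed probability that the chosen split is not the true best one
    // n:     number of instances observed at the leaf so far
    static double epsilon(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    // Split once the observed advantage of the best attribute over the
    // runner-up exceeds the bound.
    static boolean shouldSplit(double bestGain, double secondBestGain,
                               double range, double delta, long n) {
        return bestGain - secondBestGain > epsilon(range, delta, n);
    }
}
```

Because epsilon shrinks as n grows, a leaf that cannot separate two close candidates simply waits for more data, which is what lets the tree learn from a stream without storing it.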
Another new package for 3.7.2 adds a 3D visualization of association rules to the Associations panel of the Explorer. It requires Java 3D to be installed before use.
GridSearch (now in a package called "gridSearch") is now multi-threaded to take advantage of multi-core machines.
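A grid search parallelizes naturally because each point on the grid can be evaluated independently of the others. The sketch below illustrates that idea with a plain Java thread pool; it is not the code of the gridSearch package, and the class name, method signature and scoring function are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.DoubleBinaryOperator;

// Hypothetical illustration only; this is not the gridSearch package's code.
final class ParallelGridSearchSketch {
    // Evaluates score(x, y) for every grid point on a fixed-size thread pool.
    // Returns {bestX, bestY, bestScore} (higher is better), or null for an empty grid.
    static double[] search(double[] xs, double[] ys, DoubleBinaryOperator score,
                           int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<double[]>> pending = new ArrayList<>();
            for (double x : xs) {
                for (double y : ys) {
                    // Each grid point becomes one independent task.
                    pending.add(pool.submit(
                        () -> new double[] {x, y, score.applyAsDouble(x, y)}));
                }
            }
            double[] best = null;
            for (Future<double[]> future : pending) {
                double[] result = future.get(); // blocks until that task finishes
                if (best == null || result[2] > best[2]) {
                    best = result;
                }
            }
            return best;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Grid evaluation failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

In a real search the scoring function would train and evaluate a classifier for the given pair of parameter values, so it must be safe to call from several threads at once.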
SGD is a simplified stochastic gradient descent algorithm for fast learning of linear support vector machines (binary class), logistic regression (binary class) and linear regression. It can be trained incrementally, making it suitable for processing large data sets.
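To make the incremental-training point concrete, the sketch below shows a stochastic gradient descent update for a linear support vector machine (hinge loss with L2 regularization). It is a generic textbook formulation rather than Weka's SGD class; the class name, the fixed learning rate and the parameter names are assumptions made for the example.

```java
// Hypothetical illustration of SGD for a linear SVM; not Weka's SGD class.
final class SimpleSgdSvmSketch {
    private final double[] weights;
    private double bias;
    private final double learningRate;
    private final double lambda; // L2 regularization strength

    SimpleSgdSvmSketch(int numAttributes, double learningRate, double lambda) {
        this.weights = new double[numAttributes];
        this.learningRate = learningRate;
        this.lambda = lambda;
    }

    // Updates the model from a single instance; label must be +1 or -1.
    void update(double[] x, int label) {
        double margin = label * rawOutput(x);
        for (int i = 0; i < weights.length; i++) {
            weights[i] *= 1.0 - learningRate * lambda; // shrink (regularization)
            if (margin < 1.0) {
                weights[i] += learningRate * label * x[i]; // hinge-loss gradient step
            }
        }
        if (margin < 1.0) {
            bias += learningRate * label;
        }
    }

    double rawOutput(double[] x) {
        double sum = bias;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * x[i];
        }
        return sum;
    }

    int predict(double[] x) {
        return rawOutput(x) >= 0 ? 1 : -1;
    }
}
```

Because update() touches only the current instance and the weight vector, memory use does not grow with the number of training instances, which is what makes this style of learner suitable for large data sets.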
Import of PMML support vector machine models is now supported. See the PMML support in Weka page for more information on PMML support.
Denormalize is a filter for flattening transactional data, making it suitable for processing by Weka's association rule learners.
MathExpression can now reference the values of attributes other than the one being processed.
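As an illustration of what flattening transactional data means, the sketch below groups rows of the form (transaction ID, item) into one basket per transaction and then turns a basket into a row of binary indicator values, which is the shape association rule learners typically expect. It is not the Denormalize filter itself; the class, the method names and the two-column input layout are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; this is not Weka's Denormalize filter.
final class DenormalizeSketch {
    // Each input row is {transactionId, item}.
    // Returns transactionId -> items, keeping transactions in first-seen order.
    static Map<String, Set<String>> flatten(List<String[]> rows) {
        Map<String, Set<String>> baskets = new LinkedHashMap<>();
        for (String[] row : rows) {
            baskets.computeIfAbsent(row[0], id -> new TreeSet<>()).add(row[1]);
        }
        return baskets;
    }

    // Converts one basket into a fixed-width row: one true/false cell per item column.
    static boolean[] toBinaryRow(Set<String> basket, List<String> itemColumns) {
        boolean[] row = new boolean[itemColumns.size()];
        for (int i = 0; i < row.length; i++) {
            row[i] = basket.contains(itemColumns.get(i));
        }
        return row;
    }
}
```

With the rows (t1, bread), (t1, milk) and (t2, milk), flatten() produces two baskets, and toBinaryRow() over the columns bread and milk marks both items as present for t1 and only milk for t2.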
Association rule learning
FPGrowth now has a special command-line-only option that enables the two passes over the data required by the algorithm to be done incrementally, reading one instance at a time from disk.
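The two passes of FP-growth lend themselves to streaming: the first pass only counts how many transactions contain each item, and the second re-reads each transaction, keeps its frequent items and orders them by descending support before they go into the FP-tree. The sketch below shows that reading pattern without building the tree. It is not Weka's FPGrowth code; the class name, the method names and the one-transaction-per-line, whitespace-separated input format are assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not Weka's FPGrowth implementation.
final class TwoPassScanSketch {
    // Pass 1: count, one line at a time, how many transactions contain each item.
    static Map<String, Integer> countItems(Reader source) {
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                // A set ensures an item repeated within one transaction counts once.
                for (String item : new HashSet<>(Arrays.asList(line.trim().split("\\s+")))) {
                    if (!item.isEmpty()) {
                        counts.merge(item, 1, Integer::sum);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    // Pass 2 helper: keep the frequent items of one transaction, ordered by
    // descending support (ties broken alphabetically).
    static List<String> orderedFrequentItems(String line, Map<String, Integer> counts,
                                             int minSupport) {
        List<String> kept = new ArrayList<>();
        for (String item : line.trim().split("\\s+")) {
            if (counts.getOrDefault(item, 0) >= minSupport && !kept.contains(item)) {
                kept.add(item);
            }
        }
        kept.sort((a, b) -> {
            int bySupport = Integer.compare(counts.get(b), counts.get(a));
            return bySupport != 0 ? bySupport : a.compareTo(b);
        });
        return kept;
    }
}
```

Only the item counts (and, in a full implementation, the FP-tree) are held in memory; the transactions themselves are read from the source twice and never stored.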
The scatter plot matrix visualization now has a "fast scroll" feature that allows good scrolling performance when many data points are being visualized (consumes more memory than regular scrolling).