The wekaServer package is a new package for the development branch of Weka, released in conjunction with Weka 3.7.5 in late October 2011. Inspired by the excellent Carte slave servers for Kettle, Weka server instances follow a similar design: they run in the lightweight Jetty web server and are driven by servlets. Like Carte, Weka server instances can be monitored via a web browser. Unlike Carte, they are general-purpose compute engines that pass information via serialized Java objects and can run anything that implements a Task interface.
Weka has had an RMI-based distributed execution environment for experiments for over 10 years. This approach has the advantage of allowing client-side classes to be downloaded into the server instances (much as applets are), so a developer/researcher can run experiments involving new algorithms that are not part of the Weka distribution without having to install them on each server. Unfortunately, Java's security model has made it tricky for folks to get these RMI servers up and running successfully. The new wekaServer package is trivial to get up and running, but lacks a mechanism for downloading new classes into the server. However, the package management system makes deploying additional classes to server instances quite easy.
The features of wekaServer include:
- general purpose task executors
- a cluster can be made by having a server register with a master server as a "slave"
- a server can be both a master and a slave
- simple load balancing
- server-side scheduling of tasks
- tasks are saved to disk before and after execution
- tasks can return a result object
- clients can talk to just one master for task submission, access to results, status and log info regardless of where a given task gets executed
- GUI Knowledge Flow perspective and command-line interface for remote execution/scheduling of Knowledge Flow processes
- Explorer plugin allowing distributed cross-validation (each fold is a separate task), train/test split etc. of a classifier
The package can be installed through Weka's package manager.
Once installed, the server can be started from the command line:
The default port for Weka servers is 8085, so the port option can be omitted if this port is available.
To start a server that registers with another as a slave:
By default, a server is started with two "execution slots." This means that the server can run up to two tasks simultaneously, with further tasks queued or handed off to slave servers (if any). It makes sense to set the number of execution slots equal to the number of processors or processor cores available on the machine that the server is running on, e.g.:
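The slot idea can be illustrated in plain Java. The sketch below is not wekaServer code: the class and method names are hypothetical, and a fixed-size thread pool stands in for the server's execution slots. It shows how to read the core count that the text suggests using as the slot count, and that tasks beyond the slot count wait in a queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; wekaServer's own scheduler is not shown here.
final class ExecutionSlotsSketch {

    /** One slot per processor core reported by the JVM. */
    static int suggestedSlots() {
        return Runtime.getRuntime().availableProcessors();
    }

    /**
     * Runs taskCount dummy tasks on a pool with the given number of slots and
     * returns the highest number of tasks seen running at the same time.
     */
    static int peakConcurrency(int slots, int taskCount) {
        ExecutorService pool = Executors.newFixedThreadPool(slots);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < taskCount; i++) {
                futures.add(pool.submit(() -> {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    try {
                        Thread.sleep(50); // stand-in for real work
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        running.decrementAndGet();
                    }
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for every task to finish
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("Suggested slots: " + suggestedSlots());
        System.out.println("Peak concurrency with 2 slots, 6 tasks: " + peakConcurrency(2, 6));
    }
}
```

With two slots and six tasks, the peak never exceeds two; the remaining tasks simply wait their turn, which mirrors the queuing behaviour described above.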
The server periodically checks completed tasks and purges them if a certain time has elapsed since they were last executed (1 hour by default). The length of time that must elapse before a finished task is considered "stale" and gets purged can be set with the "staleTime" flag. This argument takes a value in milliseconds. Purging can be disabled completely by specifying a negative value. E.g.
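Because the stale time is given in milliseconds, it is easy to get the value wrong by a factor of 1000. The following sketch (hypothetical class and method names, not part of wekaServer) converts a human-friendly duration to milliseconds and models the purge rule as described above. Treating a task as stale once the elapsed time is greater than or equal to the threshold is an assumption; the server may use a strict comparison.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper; it models the rule in the text, not wekaServer internals.
final class StaleTimeSketch {

    /** Converts a duration into the millisecond value the text describes. */
    static long staleTimeMillis(long amount, TimeUnit unit) {
        return unit.toMillis(amount);
    }

    /**
     * Returns true if a task last executed at lastExecutedMillis should be purged.
     * A negative staleTimeMillis disables purging altogether.
     */
    static boolean isStale(long lastExecutedMillis, long nowMillis, long staleTimeMillis) {
        if (staleTimeMillis < 0) {
            return false; // purging disabled
        }
        return nowMillis - lastExecutedMillis >= staleTimeMillis; // ">=" is an assumption
    }

    public static void main(String[] args) {
        long oneHour = staleTimeMillis(1, TimeUnit.HOURS); // the documented default
        long sixHours = staleTimeMillis(6, TimeUnit.HOURS);
        System.out.println("Default stale time in ms: " + oneHour);
        System.out.println("Six hours in ms: " + sixHours);
    }
}
```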
Once running, a server can be monitored using a standard web browser.
By default, no authentication is required to access a server. Basic authentication can be used if desired by placing a "weka.pwd" file in $HOME/wekafiles/server. Plaintext or obfuscated passwords can be used. The format of the file is "username: password".
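As a small illustration of the file layout just described, the sketch below builds a "username: password" line and writes it to wekafiles/server/weka.pwd under a given home directory. The class and method names are hypothetical, the credentials are placeholders, and only the plaintext form is shown; the obfuscated form is not covered here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for creating the password file described in the text.
final class WekaPwdSketch {

    /** Formats one credential entry as "username: password". */
    static String credentialLine(String user, String password) {
        return user + ": " + password;
    }

    /** Writes weka.pwd under <homeDir>/wekafiles/server and returns its path. */
    static Path writePasswordFile(Path homeDir, String user, String password) {
        Path serverDir = homeDir.resolve("wekafiles").resolve("server");
        Path pwdFile = serverDir.resolve("weka.pwd");
        try {
            Files.createDirectories(serverDir);
            String content = credentialLine(user, password) + System.lineSeparator();
            Files.write(pwdFile, content.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pwdFile;
    }
}
```

Calling `WekaPwdSketch.writePasswordFile(Paths.get(System.getProperty("user.home")), "alice", "secret")` would place the file where the server looks for it. Since the password is stored in plaintext in this form, restricting the file's permissions is sensible.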
The wekaServer package includes a Knowledge Flow perspective for remote execution/scheduling of Knowledge Flow data mining processes and a plugin for the Explorer's Classifier panel.
The Knowledge Flow perspective allows multiple flows to be launched/scheduled simultaneously. Logging and status information is retrieved at a user-specified time interval from the server and appears in the UI in the same way as when run locally.
The Explorer classifier panel plugin provides an additional button that enables the current classifier and evaluation configuration to be sent to a remote server for execution. The classifier, all evaluation statistics and visualization data are learned/computed on the server. In the case of cross-validation, each fold is converted to a task and sent to the server. The number of fold tasks that are executed in parallel depends on how many slaves are available and how many execution slots have been configured for each.
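To get a feel for how slots and slaves translate into parallelism, the following back-of-the-envelope sketch may help. It is an idealised model with hypothetical names, not wekaServer's scheduling logic: it assumes every fold task takes about the same time, that the cluster is otherwise idle, and that the master runs tasks in its own slots as well.

```java
// Hypothetical capacity estimate; real scheduling also depends on load and task duration.
final class FoldSchedulingSketch {

    /** Sums the execution slots configured on every server in the cluster. */
    static int totalSlots(int[] slotsPerServer) {
        int total = 0;
        for (int slots : slotsPerServer) {
            if (slots < 1) {
                throw new IllegalArgumentException("each server needs at least one slot");
            }
            total += slots;
        }
        return total;
    }

    /** Rounds of execution needed when at most totalSlots fold tasks run at once. */
    static int rounds(int folds, int totalSlots) {
        if (folds < 0 || totalSlots < 1) {
            throw new IllegalArgumentException("folds must be >= 0 and totalSlots >= 1");
        }
        return (folds + totalSlots - 1) / totalSlots; // ceiling division
    }

    public static void main(String[] args) {
        int[] cluster = {2, 4, 4}; // hypothetical: master with 2 slots, two slaves with 4 each
        int capacity = totalSlots(cluster);
        System.out.println("Fold tasks that can run in parallel: " + capacity);
        System.out.println("Rounds for 10-fold cross-validation: " + rounds(10, capacity));
    }
}
```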
6. Performance Considerations
Data mining can be a computationally demanding and resource-hungry activity, so it makes sense to run Weka servers on high-spec 64-bit machines with as much memory as possible. The more tasks that run simultaneously, the more memory is used. The server and client tasks have been designed to conserve memory whenever possible. For example, cross-validation fold tasks used by the Explorer compress their training and test folds before being transmitted to the server. Upon arrival at the server, the data is offloaded to disk so as not to consume memory if the task gets queued. Similarly, any result objects (e.g. models resulting from learning) are stored on disk until the client fetches them.
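The general technique of compressing a serialized payload before sending it can be sketched with the standard library. The text does not say which compression scheme wekaServer uses, so GZIP here is an assumption, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration: serialize an object and GZIP it in one pass.
final class CompressedPayloadSketch {

    /** Serializes the payload and compresses the resulting bytes. */
    static byte[] compress(Serializable payload) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            out.writeObject(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray(); // complete only after the streams are closed
    }

    /** Reverses compress(): decompresses and deserializes the payload. */
    static Object decompress(byte[] data) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new GZIPInputStream(new ByteArrayInputStream(data)))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        double[][] fold = new double[1000][20]; // stand-in for a training fold
        byte[] packed = compress(fold);
        System.out.println("Raw values: " + (1000 * 20) + ", compressed bytes: " + packed.length);
    }
}
```

The same byte array could be written to a file on arrival and read back only when the task leaves the queue, which is the memory-saving pattern described above.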
When setting up a cluster using heterogeneous machines, differences in processor speed can be accommodated to a certain extent by specifying a load adjust factor. The default value of this factor is 1, which should be used for the fastest machine in the cluster. Slower machines should have a value greater than 1. One way to set a reasonable load adjust factor is to measure the time taken to run the same task on the fastest machine and on a slower machine. The factor for the slower machine can then be set to time_slow/time_fast.
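The ratio is simple enough to compute by hand, but a short sketch makes the arithmetic concrete. The class and method names are hypothetical and the timings are made-up example values, not measurements.

```java
// Hypothetical helper for the time_slow/time_fast rule described in the text.
final class LoadAdjustSketch {

    /** Returns time_slow / time_fast for the same task timed on two machines. */
    static double loadAdjustFactor(double timeSlowSeconds, double timeFastSeconds) {
        if (timeSlowSeconds <= 0 || timeFastSeconds <= 0) {
            throw new IllegalArgumentException("timings must be positive");
        }
        return timeSlowSeconds / timeFastSeconds;
    }

    public static void main(String[] args) {
        double fast = 60.0; // example: seconds on the fastest machine (factor 1)
        double slow = 90.0; // example: seconds on a slower machine
        System.out.println("Load adjust factor for the slower machine: "
                + loadAdjustFactor(slow, fast));
    }
}
```

With these example timings the slower machine gets a factor of 1.5, while the fastest machine keeps the default of 1.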