Hitachi Vantara Pentaho Community Wiki
SimpleKMeans

Package

weka.clusterers

Synopsis

Cluster data using the k-means algorithm. The distance function can be either Euclidean (the default) or Manhattan. If the Manhattan distance is used, centroids are computed as the component-wise median rather than the mean. For more information, see:

D. Arthur, S. Vassilvitskii: k-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027-1035, 2007.
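The difference between the two centroid rules described in the synopsis can be illustrated with a minimal, self-contained sketch. It does not use Weka and is not Weka's implementation; the class name `CentroidSketch`, its method names, and the convention of averaging the two middle values for an even number of points are assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of weka.clusterers.
public class CentroidSketch {

    // Component-wise mean: the centroid rule paired with Euclidean distance.
    public static double[] meanCentroid(double[][] points) {
        int dims = points[0].length;
        double[] centroid = new double[dims];
        for (double[] p : points) {
            for (int d = 0; d < dims; d++) {
                centroid[d] += p[d];
            }
        }
        for (int d = 0; d < dims; d++) {
            centroid[d] /= points.length;
        }
        return centroid;
    }

    // Component-wise median: the centroid rule paired with Manhattan distance.
    // Assumption: for an even count, the two middle values are averaged.
    public static double[] medianCentroid(double[][] points) {
        int dims = points[0].length;
        int n = points.length;
        double[] centroid = new double[dims];
        for (int d = 0; d < dims; d++) {
            double[] column = new double[n];
            for (int i = 0; i < n; i++) {
                column[i] = points[i][d];
            }
            Arrays.sort(column);
            centroid[d] = (n % 2 == 1)
                    ? column[n / 2]
                    : (column[n / 2 - 1] + column[n / 2]) / 2.0;
        }
        return centroid;
    }

    public static void main(String[] args) {
        // The third point is an outlier in the second dimension.
        double[][] cluster = { {1, 1}, {2, 1}, {3, 10} };
        System.out.println(Arrays.toString(meanCentroid(cluster)));   // [2.0, 4.0]
        System.out.println(Arrays.toString(medianCentroid(cluster))); // [2.0, 1.0]
    }
}
```

In this example the outlier pulls the mean in the second dimension up to 4.0, while the median stays at 1.0, which is one reason the Manhattan variant can be less sensitive to outliers.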

Options

The table below describes the options available for SimpleKMeans.

| Option | Description |
| --- | --- |
| displayStdDevs | Display standard deviations of numeric attributes and counts of nominal attributes. |
| distanceFunction | The distance function used to compare instances (default: weka.core.EuclideanDistance). |
| dontReplaceMissingValues | If set, missing values are not replaced. By default, missing values are replaced globally with the mean (numeric attributes) or mode (nominal attributes). |
| fastDistanceCalc | Use cut-off values to speed up distance calculation. Enabling this also suppresses the calculation and output of the within-cluster sum of squared errors (or sum of distances). |
| initializeUsingKMeansPlusPlusMethod | Initialize cluster centers using the probabilistic farthest-first method of the k-means++ algorithm. |
| maxIterations | The maximum number of iterations. |
| numClusters | The number of clusters. |
| preserveInstancesOrder | Preserve the order of instances. |
| seed | The random number seed to be used. |
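To make the initializeUsingKMeansPlusPlusMethod and seed options more concrete, the following is a rough sketch of k-means++ style seeding: the first center is picked uniformly at random, and each later center is picked with probability proportional to its squared distance from the nearest center chosen so far. It is a standalone illustration, not Weka's code; the class `KMeansPlusPlusSketch`, the method `chooseCenters`, and the use of squared Euclidean distance are assumptions.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration class; not part of weka.clusterers.
public class KMeansPlusPlusSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Returns the indices of k points chosen as initial centers.
    // The same seed always yields the same centers.
    public static int[] chooseCenters(double[][] points, int k, long seed) {
        int n = points.length;
        if (k < 1 || k > n) {
            throw new IllegalArgumentException("k must be between 1 and the number of points");
        }
        Random rng = new Random(seed);
        int[] centers = new int[k];
        centers[0] = rng.nextInt(n); // first center: uniform at random

        // minDist[i] = squared distance from point i to its nearest chosen center.
        double[] minDist = new double[n];
        for (int i = 0; i < n; i++) {
            minDist[i] = squaredDistance(points[i], points[centers[0]]);
        }

        for (int c = 1; c < k; c++) {
            double total = 0.0;
            for (double d : minDist) {
                total += d;
            }
            double r = rng.nextDouble() * total;

            // Walk the cumulative weights; points at distance 0 can never be picked.
            int chosen = -1;
            double cumulative = 0.0;
            for (int i = 0; i < n; i++) {
                if (minDist[i] == 0.0) {
                    continue;
                }
                cumulative += minDist[i];
                chosen = i; // last candidate doubles as a floating-point rounding fallback
                if (cumulative > r) {
                    break;
                }
            }
            if (chosen < 0) {
                throw new IllegalArgumentException("fewer than k distinct points");
            }
            centers[c] = chosen;
            for (int i = 0; i < n; i++) {
                minDist[i] = Math.min(minDist[i], squaredDistance(points[i], points[chosen]));
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = { {0, 0}, {0, 0.1}, {10, 10}, {10, 10.1} };
        System.out.println(Arrays.toString(chooseCenters(points, 2, 10L)));
    }
}
```

Because far-away points carry more weight, the second center in the example is likely, though not guaranteed, to come from the other group of points. Changing the seed changes which points are drawn.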

Capabilities

The table below describes the capabilities of SimpleKMeans.

| Capability | Supported |
| --- | --- |
| Class | No class |
| Attributes | Nominal attributes, Numeric attributes, Missing values, Binary attributes, Empty nominal attributes, Unary attributes |
| Minimum number of instances | 1 |
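Missing values are listed among the supported attribute capabilities, and SimpleKMeans can replace them globally with the mean or mode. A minimal sketch of that kind of replacement is shown here. It is not Weka's implementation: the class `MissingValueSketch`, the use of `Double.NaN` and `null` as missing-value markers, the 0.0 fallback for an all-missing numeric column, and the tie-breaking rule for the mode are all assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not part of weka.clusterers.
public class MissingValueSketch {

    // Numeric attribute: replace NaN entries with the mean of the observed values.
    // Assumption: an entirely missing column falls back to 0.0.
    public static double[] fillWithMean(double[] column) {
        double sum = 0.0;
        int count = 0;
        for (double v : column) {
            if (!Double.isNaN(v)) {
                sum += v;
                count++;
            }
        }
        double mean = count > 0 ? sum / count : 0.0;
        double[] filled = column.clone();
        for (int i = 0; i < filled.length; i++) {
            if (Double.isNaN(filled[i])) {
                filled[i] = mean;
            }
        }
        return filled;
    }

    // Nominal attribute: replace null entries with the most frequent observed label.
    // Ties go to the label that reaches the highest count first; if every entry
    // is missing, the nulls are left in place.
    public static String[] fillWithMode(String[] column) {
        Map<String, Integer> counts = new HashMap<>();
        String mode = null;
        int best = 0;
        for (String v : column) {
            if (v == null) {
                continue;
            }
            int c = counts.merge(v, 1, Integer::sum);
            if (c > best) {
                best = c;
                mode = v;
            }
        }
        String[] filled = column.clone();
        for (int i = 0; i < filled.length; i++) {
            if (filled[i] == null) {
                filled[i] = mode;
            }
        }
        return filled;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fillWithMean(new double[] {1.0, Double.NaN, 3.0})));
        System.out.println(Arrays.toString(fillWithMode(new String[] {"a", null, "b", "a"})));
    }
}
```

The statistics are computed once over the whole column, which is what "globally" is taken to mean here, rather than per cluster.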
