Hitachi Vantara Pentaho Community Wiki
StringToWordVector

Package

weka.filters.unsupervised.attribute

Synopsis

Converts String attributes into a set of attributes representing word occurrence (depending on the tokenizer) information from the text contained in the strings. The set of words (attributes) is determined by the first batch filtered (typically training data).
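The behaviour described in the synopsis can be illustrated with a minimal sketch. It is not the Weka implementation: `tokenize`, `build_dictionary` and `to_vector` are hypothetical helper names, and the whitespace tokenizer is an assumption standing in for whatever tokenizer the filter is configured with. The point it shows is that the dictionary is fixed by the first batch, so words seen only in later batches are ignored.

```python
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace splitting stands in for the configured tokenizer.
    return text.split()


def build_dictionary(first_batch):
    # The set of words (attributes) is determined by the first batch only.
    words = set()
    for doc in first_batch:
        words.update(tokenize(doc))
    return sorted(words)


def to_vector(doc, dictionary, output_word_counts=False):
    # Map a document onto the fixed dictionary; unknown words are dropped.
    counts = Counter(tokenize(doc))
    if output_word_counts:
        return [counts[w] for w in dictionary]
    return [1 if counts[w] > 0 else 0 for w in dictionary]


train = ["the cat sat", "the dog sat down"]
dictionary = build_dictionary(train)

# "bird" is not in the dictionary built from the first batch, so it is ignored.
presence = to_vector("the bird sat sat", dictionary)
counts = to_vector("the bird sat sat", dictionary, output_word_counts=True)
```

Here `presence` holds 0/1 word-occurrence values, while `counts` holds raw frequencies, mirroring the difference that the outputWordCounts option controls.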

Options

The table below describes the options available for StringToWordVector.

Option

Description

IDFTransform

Sets whether the word frequencies in a document should be transformed into:
fij*log(num of Docs/num of Docs with word i)
where fij is the frequency of word i in document (instance) j.

TFTransform

Sets whether the word frequencies should be transformed into:
log(1+fij)
where fij is the frequency of word i in document (instance) j.

attributeIndices

Specify the range of attributes to act on. This is a comma-separated list of attribute indices, with "first" and "last" as valid values. Specify an inclusive range with "-". For example: "first-3,5,6-10,last".

attributeNamePrefix

Prefix for the created attribute names. (default: "")

doNotOperateOnPerClassBasis

If this is set, the maximum number of words and the minimum term frequency are not enforced on a per-class basis but are based on the documents in all the classes (even if a class attribute is set).

invertSelection

Set attribute selection mode. If false, only selected attributes in the range will be worked on; if true, only non-selected attributes will be processed.

lowerCaseTokens

If set, all word tokens are converted to lower case before being added to the dictionary.

minTermFreq

Sets the minimum term frequency. This is enforced on a per-class basis.

normalizeDocLength

Sets whether the word frequencies for a document (instance) should be normalized.

outputWordCounts

Output word counts rather than a boolean 0 or 1 (indicating the absence or presence of a word).

periodicPruning

Specify the rate (x% of the input dataset) at which to periodically prune the dictionary. The wordsToKeep option prunes only after the full dictionary has been created, and there may not be enough memory for that approach.

stemmer

The stemming algorithm to use on the words.

stopwords

The file containing the stopwords (if this is a directory then the default ones are used).

tokenizer

The tokenizing algorithm to use on the strings.

useStoplist

Ignores all the words that are on the stoplist, if set to true.

wordsToKeep

The number of words (per class if there is a class attribute assigned) to attempt to keep.
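The two formulas given for TFTransform and IDFTransform in the options above can be written out as a short sketch. The function names are hypothetical, the natural logarithm is an assumption (the option descriptions do not state the base), and the code only evaluates the formulas as written rather than reproducing the filter itself.

```python
import math


def tf_transform(f_ij):
    # log(1 + fij), where fij is the frequency of word i in document j.
    return math.log(1 + f_ij)


def idf_transform(f_ij, num_docs, num_docs_with_word):
    # fij * log(num of Docs / num of Docs with word i).
    return f_ij * math.log(num_docs / num_docs_with_word)


# A word that occurs twice in a document and appears in 2 of 3 documents.
tf_value = tf_transform(2)
idf_value = idf_transform(2, num_docs=3, num_docs_with_word=2)

# A word that appears in every document gets weight 0 under the IDF formula,
# because log(N / N) = 0.
common_word_value = idf_transform(5, num_docs=4, num_docs_with_word=4)
```

Note that a word absent from a document (fij = 0) stays at 0 under both formulas, so sparsity of the word vector is preserved.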

Capabilities

The table below describes the capabilities of StringToWordVector.

Capability

Supported

Class

No class, Relational class, Unary class, Binary class, Numeric class, Empty nominal class, Date class, Missing class values, Nominal class, String class

Attributes

Relational attributes, Empty nominal attributes, Date attributes, Binary attributes, String attributes, Missing values, Nominal attributes, Unary attributes, Numeric attributes

Min # of instances

0
