Hitachi Vantara Pentaho Community Wiki




Converts String attributes into a set of attributes representing word occurrence information from the text contained in the strings. Which words are produced depends on the tokenizer. The set of words (attributes) is determined by the first batch filtered (typically training data).
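The behaviour described above can be illustrated with a small sketch in Python. This is not the Weka implementation: the class `WordVectorizer`, its parameters, and the regular-expression tokenizer are hypothetical stand-ins chosen for illustration. It shows only that the dictionary is fixed by the first batch, so words that first appear in later batches produce no attribute.

```python
import re


def tokenize(text):
    """Split on non-word characters (a stand-in for the configurable tokenizer)."""
    return [t for t in re.split(r"\W+", text) if t]


class WordVectorizer:
    """Toy illustration only: the dictionary is frozen after the first batch."""

    def __init__(self, lower_case_tokens=False, output_word_counts=False):
        self.lower_case_tokens = lower_case_tokens
        self.output_word_counts = output_word_counts
        self.dictionary = None  # word -> column index, set by the first batch

    def _tokens(self, text):
        tokens = tokenize(text)
        return [t.lower() for t in tokens] if self.lower_case_tokens else tokens

    def transform(self, batch):
        # The first batch seen (typically training data) determines the words.
        if self.dictionary is None:
            words = sorted({t for doc in batch for t in self._tokens(doc)})
            self.dictionary = {w: i for i, w in enumerate(words)}
        rows = []
        for doc in batch:
            row = [0] * len(self.dictionary)
            for t in self._tokens(doc):
                idx = self.dictionary.get(t)
                if idx is None:
                    continue  # word not in the dictionary: ignored
                row[idx] = row[idx] + 1 if self.output_word_counts else 1
            rows.append(row)
        return rows


vec = WordVectorizer(lower_case_tokens=True)
train_rows = vec.transform(["The cat sat", "the dog sat down"])
test_rows = vec.transform(["The bird sat"])  # "bird" is not in the dictionary
```

Here `train_rows` has one column per word in the first batch, and `test_rows` reuses those same columns, so "bird" contributes nothing.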


The table below describes the options available for StringToWordVector.

Option Description
IDFTransform Sets whether the word frequencies in a document should be transformed into:
fij*log(num of Docs/num of Docs with word i)
where fij is the frequency of word i in document (instance) j.
TFTransform Sets whether the word frequencies should be transformed into:
log(1+fij)
where fij is the frequency of word i in document (instance) j.
attributeIndices Specifies the range of attributes to act on. This is a comma-separated list of attribute indices, with "first" and "last" as valid values. Specify an inclusive range with "-". E.g.: "first-3,5,6-10,last".
attributeNamePrefix Prefix for the created attribute names. (default: "")
doNotOperateOnPerClassBasis If this is set, the maximum number of words and the minimum term frequency are not enforced on a per-class basis but are based on the documents in all the classes (even if a class attribute is set).
invertSelection Set attribute selection mode. If false, only selected attributes in the range will be worked on; if true, only non-selected attributes will be processed.
lowerCaseTokens If set, all word tokens are converted to lower case before being added to the dictionary.
minTermFreq Sets the minimum term frequency. This is enforced on a per-class basis.
normalizeDocLength Sets whether the word frequencies for a document (instance) should be normalized.
outputWordCounts Output word counts rather than boolean 0 or 1 (indicating absence or presence of a word).
periodicPruning Specifies the rate (x% of the input dataset) at which to periodically prune the dictionary. By contrast, wordsToKeep prunes only after the full dictionary has been created, and there may not be enough memory for that approach.
stemmer The stemming algorithm to use on the words.
stopwords The file containing the stopwords (if this is a directory then the default ones are used).
tokenizer The tokenizing algorithm to use on the strings.
useStoplist If set to true, ignores all words that are on the stoplist.
wordsToKeep The number of words (per class if there is a class attribute assigned) to attempt to keep.
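As a rough illustration of the IDFTransform and TFTransform options in the table above, the sketch below applies the two formulas to a tiny hand-made set of word counts. The function names and the sample data are hypothetical. The TF formula is taken as log(1 + fij), which is how Weka documents this option, and the natural logarithm is assumed. How the filter combines the two when both are enabled is not stated on this page and is not modelled here.

```python
import math


def tf_transform(f_ij):
    """TFTransform as assumed here: log(1 + fij)."""
    return math.log(1 + f_ij)


def idf_transform(f_ij, num_docs, num_docs_with_word):
    """IDFTransform: fij * log(num of Docs / num of Docs with word i)."""
    return f_ij * math.log(num_docs / num_docs_with_word)


# Hypothetical raw word counts, one dict per document (instance).
counts = [
    {"cat": 2, "sat": 1},
    {"dog": 1, "sat": 1},
    {"cat": 1, "dog": 3, "sat": 1},
]

num_docs = len(counts)

# Document frequency: in how many documents each word occurs.
doc_freq = {}
for doc in counts:
    for word in doc:
        doc_freq[word] = doc_freq.get(word, 0) + 1

tf_weighted = [{w: tf_transform(f) for w, f in doc.items()} for doc in counts]
idf_weighted = [
    {w: idf_transform(f, num_docs, doc_freq[w]) for w, f in doc.items()}
    for doc in counts
]
```

In this data "sat" occurs in every document, so its IDF-weighted value is zero everywhere: a word present in all documents carries no discriminating information under this weighting.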
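The attributeIndices and invertSelection options can be read as follows. The sketch below is a hypothetical parser, not Weka's own range handling. It assumes that indices in the range string are 1-based and returns 0-based positions, as is usual in Python.

```python
def parse_attribute_indices(spec, num_attributes, invert_selection=False):
    """Return sorted 0-based indices selected by a 1-based range string."""

    def to_index(token):
        token = token.strip()
        if token == "first":
            return 1
        if token == "last":
            return num_attributes
        return int(token)

    selected = set()
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-", 1)
            start, end = to_index(low), to_index(high)
        else:
            start = end = to_index(part)
        if not 1 <= start <= end <= num_attributes:
            raise ValueError(f"range out of bounds: {part!r}")
        # Ranges are inclusive on both ends; shift to 0-based positions.
        selected.update(range(start - 1, end))

    if invert_selection:
        # Process only the attributes that were NOT named in the range.
        selected = set(range(num_attributes)) - selected
    return sorted(selected)


selected = parse_attribute_indices("first-3,5,6-10,last", 12)
inverted = parse_attribute_indices("first-3,5,6-10,last", 12, invert_selection=True)
```

With 12 attributes, the example string from the table selects attributes 1 to 3, 5 to 10, and 12. With invertSelection set, only attributes 4 and 11 remain.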


The table below describes the capabilities of StringToWordVector.

Capability Supported
Class No class, Relational class, Unary class, Binary class, Numeric class, Empty nominal class, Date class, Missing class values, Nominal class, String class
Attributes Relational attributes, Empty nominal attributes, Date attributes, Binary attributes, String attributes, Missing values, Nominal attributes, Unary attributes, Numeric attributes
Min # of instances 0

