Hitachi Vantara Pentaho Community Wiki
Skip to end of metadata
Go to start of metadata

Description

The Fuzzy Match step finds strings that potentially match using duplicate-detecting algorithms that calculate the similarity of two streams of data. This step returns matching values as a separated list as specified by user-defined minimal or maximal values.

General Tab

The General tab enables you to define the source transformation step, field, and which algorithm to use to match similar strings of data.

Option

Definition

Step name

Name of this step as it appears in the transformation workspace

Lookup step

Identifies the step that contains the fields to match

Lookup field

Identifies the field to match

Main stream field

Identifies the primary stream to match the Lookup field with

Algorithm

Identifies which string-matching algorithm to use---options include Levenshtein, Damerau-Levenshtein, Needleman Wunsch, Jaro, Jaro Winkler, Pair letters similarity, Metaphone, Double Metaphone, SoundEx, or Refined SoundEx

Case sensitive

Identifies if streams can or cannot differ based on the use of uppercase and lowercase letters---only for use with the Levenshtein algorithms

Get closer value

When checked, returns a single result with the highest similarity score---when unchecked, returns all matches that satisfy the minimal and maximal value setting as a separated list, separated by the values separator

Minimum value

Identifies the lowest possible similarity score

Maximal value

Identifies the highest possible similarity score

Values separator

Identifies the string that separate the matches. Only available for specific algorithms and when the Get closer value option is unchecked.

Algorithm Definitions

Within the Algorithm field, there are several options available to compare and match strings.

  • Levenshtein and Damerau-Levenshtein---calculate the distance between two strings by looking at how many edit steps are needed to get from one string to another. The former only looks at inserts, deletes, and replacements. The latter adds transposition. The score indicates the minimum number of changes needed. For instance, the difference between John and Jan would be two; to turn the name John into Jan you need one step to replace the O with an A, and another step to delete the H.
  • Needleman Wunsch---calculates the similarity of two sequences and is mainly used in bioinformatics. The algorithm calculates a gap penalty. The aforementioned example would have a score of negative two.
  • Jaro and Jaro Winkler---calculate a similarity index between two strings. The result is a fraction between zero, indicating no similarity, and one, indicating an identical match.
  • Pair letters similarity---dissects the two strings in pairs and calculates the similarity of the two strings by dividing the number of common pairs by the sum of the pairs from both strings.
  • Metaphone, Double Metaphone, SoundEx, and Refined SoundEx---are phonetic algorithms, which try to match strings based on how they would sound. Each is based on the English language and would not be useful to compare other languages.
    • The Metaphone algorithm returns an encoded value based on the English pronunciation of a given word. The encoded value of the names John and Jan would return the value JN for both names.
    • The Double Metaphone algorithm has fundamental design improvements over its predecessor and uses a more complex ruleset for coding. It can return a primary and a secondary encoded value for a string. The names John and Jan each return metaphone key values of JN and AN.
    • The Soundex algorthim returns a single encoded value for a name that consists of a letter followed by three numerical digits. The letter is the first letter of the name, and the digits encode the remaining consonants.
    • The Refined SoundEx algorithm is an improvement over its predecessor. Encoded values for this algorithm are six digits long, the initial character is encoded, and multiple possible encodings can be returned for a single name. Using this algorithm, the name John returns the values 160000 and 460000, as does the name Jan.

Fields Tab

The Fields tab enables you to define how to return the results of a comparison.

Option

Definition

Match field

Defines the name of the column that contains the comparison value

Value field

Defines the similarity score for which to return a value

You can also specify the list of additional fields to retrieve from the lookup stream.

  • No labels