Added by Matt Casters, last edited by Matt Casters on Jun 27, 2010  (view change)

Labels:

Enter labels to add to this page:
Wait Image 
Looking for a label? Just start typing.

Description

This step allows you to store information in a junk-dimension table, and can possibly also be used to maintain Kimball pure Type 1 dimensions.

In short what it will do is:

  1. Lookup combination of business key field1... fieldn from the input stream in a dimension table;
  2. If this combination of business key fields exists, return its technical key (surrogate id);
  3. If this combination of business key doesn't exist yet, insert a row with the new key fields and return its (new) technical key;
  4. Put all input fields on the output stream including the returned technical key, but remove all business key fields if "remove lookup fields" is true.

So what this step does is create/maintain a technical key out of data with business keys. After passing through this step all of the remaining data changes for the dimension table can be made as updates, as either a row for the business key already existed or was created. This step will only maintain the key information, you must update the non-key information in the dimension table, e.g. by putting an update step (based on technical key) after the combination update/lookup step.

Kettle will store the information in a table where the primary key is the combination of the business key fields in the table. Because this process can be very slow in case you have a large number of fields, Kettle also supports a "hash code" field representing all fields in the dimension. This can speed up lookup performance dramatically while limiting the fields to index to 1.

Options

Option Description
Step name Name of the step.

Note: This name has to be unique in a single transformation.

Connection Name of the database connection on which the dimension table resides.
Target schema This allows you to specify a schema name to improve precision in the quoting and allow for table-names with dots '.' in it.
Target table Name of the dimension table.
Commit size Setting this to 10 will generate a commit every 10 inserts or updates.
Cache size in rows This is the cache size in number of rows that will be held in memory to speed up lookups by reducing the number of round trips to the database.

Note: Please note that only the last version of a dimension entry is kept in memory. If there are more entries passing than what can be kept in memory, the technical keys with the highest values are kept in memory in the hope that these are the most relevant.

A cache size of 0 caches as many rows as possible and until your JVM runs out of memory. Use this option wisely with dimensions that can't grown too large.
A cache size of -1 means that caching is disabled.
Key fields Specify the names of the keys in the stream and in the dimension table. This will enable the step to do the lookup.
Technical key field This indicates the primary key of the dimension. It is also referred to as Surrogate Key.
Creation of technical key Specify howthe technical key is generated, options which are not available for your connection will be grayed out:
  • Use table maximum + 1: A new technical key will be created from the maximum key in the table. Note that the new maximum is always cached, so that the maximum does not need to be calculated for each new row.
  • Use sequence: Specify the sequence name if you want to use a database sequence on the table connection to generate the technical key (typical for Oracle e.g.).
  • Use auto increment field: Use an auto increment field in the database table to generate the technical key (typical for DB2 e.g.).
Remove lookup fields? Enable this option if you want to remove all the lookup fields from the input stream in the output. The only extra field added is then the technical key.
Use hashcode This option allows you to generate a hash code, representing all values in the key fields in a numerical form (a signed 64 bit integer). This hash code has to be stored in the table.

Important: This hash code is NOT unique. As such it makes no sense to place a unique index on it.

Table daterange end Specify the names of the dimension entries end range.
Get Fields button Fills in all the available fields on the input stream, except for the keys you specified.
SQL button Generates the SQL to build the dimension and allows you to execute this SQL.

Remarks

  • The Combination Lookup/Update step assumes that the dimension table it maintains is not updated concurrently by other transformations/applications. When you use e.g. the "Table Max + 1" method to create the technical keys the step will not always go to the database to retrieve the next highest technical key. The technical will be cached locally, so if multiple transformations would update the dimension table simultaneously you will most likely get errors on duplicate technical keys. Using a sequence or an auto increment technical key to generate the technical key it is still not advised to concurrently do updates to a dimension table because of possible conflicts between transformations.
  • It is assumed that the technical key is the primary key of the dimension table or at least has a unique index on it. It's not 100% required but if a technical key exists multiple times in the dimension table the result for the Combination Lookup/Update step is unreliable.

Perhaps the remarks can be clarified in terms of what must be avoided or what is OK?

Specifically:

1. In refernence to "... is not updated concurrently" - is it OK to have two of these steps within the same transformation where both reference the same dimension table?

2. In reference to "... by other transformations" - does this include sub-transformations (aka Mapping) called by a parent transformation? How about a parent transformation invoking two different mappings, each containing a step that references the same dimension table?

I am hoping that more than one of these steps can be specified anywhere within a transformation or any sub-transformations (all that reference the same dimention table, that is) - otherwise it would force highly sequential processing of different data types that happen to share a common dimension.

Thanks in advance for clarifying to me and other readers!

Comment: Posted by Jonathan Owen at Jun 19, 2009 21:07