Added by Matt Casters, last edited by Etienne Neveu on Jan 14, 2010  (view change) show comment

Labels:

Enter labels to add to this page:
Wait Image 
Looking for a label? Just start typing.

Description

This step type allows you to implement Ralph Kimball's slowly changing dimension for both types: Type I (update) and Type II (insert).
Not only can you use this step to update a dimension table, it may also be used to look up values in a dimension.

In our dimension implementation each entry in the dimension table has the following properties:

Option Description
Technical key This is the primary key of the dimension.
Version field Shows the version of the dimension entry (a revision number).
Start of date range This is the fieldname containing the validity starting date.
End of date range This is the fieldname containing the validity ending date.
Keys These are the keys used in your source systems. For example: customer numbers, product id, etc.
Fields These fields contain the actual information of a dimension.

As a result of the lookup or update operation of this step type, a field is added to the stream containing the technical key of the dimension. In case the field is not found, the value of the dimension entry for not found (0 or 1, based on the type of database) is returned.

Note: This dimension entry is added automatically to the dimension table when the update is first run. If you have "NOT NULL" fields in your table, adding this empty row and then the entire step will fail! So make sure that you have a record with the ID field = 0 or 1 in your table if you don't want PDI to insert a potentially invalid empty record.

In version 3.2.0, we  added the a number of optional fields (in the "Fields" tab) that are automatically managed by the step.  You can specify the table field name in the "Dimension Field" column.  These are the optional fields:

  • Date of last insert or update (without stream field as source) : adds and manges a Date field
  • Date of last insert (without stream field as source) : adds and manges a Date field
  • Date of last update (without stream field as source) : adds and manges a Date field
  • Last version (without stream field as source) : adds and manges a Boolean field. (converted into Char(1) or boolean database data type depending on your database connection settings and availability of such data type)

Functionality

As the name of the step suggests, the functionality of the step falls into 2 categories, Lookup and Update...

Lookup

In read-only mode (update option is disabled), the step only performs lookups in a slowly changing dimension.  The step will perform a lookup in the dimension table on the specified database connection and in the specified schema. To do the lookup it uses not only the specified natural keys (with an "equals" condition) but also the specified "Stream datefield" (see below).  The condition that is applied is:

"Start or table date range" >= "Stream datefield"  AND "End or table date range" < "Stream datefield"

In case no "Stream datefield" is specified we use the current system date to find the correct dimension version record.
Note: If you use an "alternative start date" (Since version 3.2) the SQL clause described above will differ slightly

In case no row is found, we return the "unknown" key.  This will be 0 or 1 depending on whether or not you selected an auto-increment field for the technical key field).  Please note that we don't make a difference between "Unknown", "Not found", "Empty", "Illegal format", etc.  These nuances can be added manually however.  Nothing prevents you from flushing out these types before the data hits this step with a Filter, regular expression, etc.  We suggest you manually add values -1, -2, -3, etc for these special dimension entry cases, just like you would add the specific details of the "Unknown" row prior to population of the dimension table.

Important : because we use SQL to look up the technical key in the dimension table, take care of the following:

  • Please do not use NULL values for your natural key(s).  Null values can't be compared and are not indexes by most databases : even if we would support null values in keys (something that doesn't make a lot of sense anyway) it would most likely cause severe lookup performance problems.
  • Try to be aware of data conversion issues that occur if you have data types in your input streams that are different from the data types in your natural key(s).  If you are have Strings in the steps input and in the database you use an Integer for example, make sure you are capable of converting the String to number.  See it as a best practice to do this before this step to make sure it works as planned.  Another typical example of problems is with floating point number comparisons.  Stay away from those.  We recommend you use sane data types like Integer or long integers.  Stay away from Double, Decimal or catch-all data types like Oracle's Number (without length or precision, it implicitly uses precision 38 causing us to use the slower BigNumber data type)

Update

In update mode (update option is enabled) the step first performs a lookup of the dimension entry as described in the "Lookup" section above.  The result of the lookup is different though.  Not only the technical key is retrieved from the query, but also the dimension attribute fields.  A field-by-field comparison then follows.  The result can be one of the following situations:

  • The record was not found, we insert a new record in the table.
  • The record was found and any of the following is true:
    • One or more attributes were different and had an "Insert" (Kimball Type II) setting: we insert a new dimension record version
    • One or more attributes were different and had a "Punch through" (Kimbal Type I) setting: we update these attributes in all the dimension record versions
    • One or more attributes were different and had an "Update" setting: we update these attributes in the last dimension record version
    • All the attributes (fields) were identical : we don't update or insert anything.

Please note that if you mix Insert, Punch Through and Update options in this step, that this algorithm acts like a Hybrid Slowly Changing Dimension. (it's no longer just Type I or II, it's a combination)

Options

The following table provides a more detailed description of the options for the Dimension Lookup/Update step:

Option Description
Step name Name of the step.

Note: This name has to be unique in a single transformation.

Update the dimension? Check this option if you want to update the dimension based on the information in the input stream. If this option is not enabled, the dimension only does lookups and only adds the technical key field to the streams.
Connection Name of the database connection on which the dimension table resides.
Target schema This allows you to specify a schema name to improve precision in the quoting and allow for table-names with dots '.' in it.
Target table Name of the dimension table.
Commit size Setting this to 10 will generate a commit every 10 inserts or updates.
Caching
  • Enable the cache?  Enable this option if you want to enable data caching in this step. (since version 3.2, set a cache size of >=0 in previous versions or -1 to disable caching)
  • Pre-load cache? Since version 3.2 this allows you to read the complete contents of a dimension table prior to performing lookups (updates are not yet supported) with the sole intention of speeding up lookup performance.  Performance is increased by the elimination of the round trips to the database and by the sorted list lookup algorithm.
  • Cache size in rows :  This is the cache size in number of rows that will be held in memory to speed up lookups by reducing the number of round trips to the database.

    Note: Please note that only the last version of a dimension entry is kept in memory (unless pre-load is enabled). If there are more entries passing than what can be kept in memory, the technical keys with the highest values are kept in memory in the hope that these are the most relevant. 
    Important: A cache size of 0 caches as many rows as possible and until your JVM runs out of memory. Use this option wisely with dimensions that can't grown too large.
    A cache size of -1 means that caching is disabled.

Keys tab Specify the names of the keys in the stream and in the dimension table. This will enable the step to do the lookup.
Fields tab For each of the fields you must have in the dimension, you can specify whether you want the values to be updated (for all versions, this is a Type I operation) or you want to have the values inserted into the dimension as a new version. In the example we used in the screenshot the birth date is something that's not variable in time, so if the birth date changes, it means that it was wrong in previous versions. It's only logical then, that the previous values are corrected in all versions of the dimension entry.
Technical key field This indicates the primary key of the dimension. It is also referred to as Surrogate Key. Use the new name option to rename the technical key after a lookup. For example, if you need to lookup different types of products like ORIGINAL_PRODUCT_TK, REPLACEMENT_PRODUCT_TK, ... 

Note: Please note that renaming technical keys is only possible during lookup mode, not when running in update.

Creation of technical key Specify how the technical key is generated, options which are not available for your connection will be grayed out:
  • Use table maximum + 1: A new technical key will be created from the maximum key in the table. Note that the new maximum is always cached, so that the maximum does not need to be calculated for each new row.
  • Use sequence: Specify the sequence name if you want to use a database sequence on the table connection to generate the technical key (typical for Oracle e.g.).
  • Use auto increment field: Use an auto increment field in the database table to generate the technical key (typical for DB2 e.g.).
Version field Specifies the name of the field to store the version (revision number) in.
Stream Datefield If you have the date at which the dimension entry was last changed, you can specify the name of that field here. It allows the dimension entry to be accurately described for what the date range concerns. If you don't have such a date, the system date will be taken.
When the dimension entries are looked up (Update the dimension is not selected) the date field entered into the stream datefield is used to select the appropriate dimension version based on the date from and date to dates in the dimension record.
Date range start field Specify the names of the dimension entries start range.
Use an alternative start date?
When enabled, you can choose an alternative to the "Min. Year"/01/01 00:00:00 date that is used.  You can use any of the following:
  • System date : use the system date as a variable date/time
  • Start date of transformation : use the system date, taken at start of the transformation for the start date
  • Empty (null) value
  • Column value : you can select a column from which we will take the value.
    Please note that it is possible to create a non-conformed dimension with these options.  Use them wisely.  Not all possibilities make sense!
Table daterange end Specify the names of the dimension entries end range.
Get Fields button Fills in all the available fields on the input stream, except for the keys you specified.
SQL button Generates the SQL to build the dimension and allows you to execute this SQL.

Remarks

  • For the Stream date field: Consider adding an extra date field from System Info if you don't want the date ranges to be different all the time. For example if you have extracts from a source system being done every night at midnight, consider adding date "Yesterday 23:59:59" as a field to the stream by using a Join step.

Important: this needs to be a Date field. We isolate functionality and as such require you to do date type conversions in advance.

  • For the "Date range start and end fields": You can only enter a year in these fields, not a timestamp. If you enter a year YYYY (e.g. 2100), it will be used as timestamp "YYYY-01-01 00:00:00.00" in the dimension table.

Hi ,

I have a table account_dim . With Dimension Lookup-Update ,i check if the account name coming from the source matches with account name field in dimension table. I am able to create new keys.

There are other field like email address which  in the source table need to be inserted in the target . They are ignored . How do insert them.

Thanks

Sushant

Comment: Posted by sushantlikhar at Dec 16, 2008 15:49

Hi Sushant,

We need to add new columns back in the table as well. Then, we can have one-to-one mapping between dimension and stream fields.

Thanks,

Rahul

Comment: Posted by Rahul Kumar Jha at Jul 09, 2009 05:32

Some how when I run the transformation I made (read from a table and adds on a different table the dimensions) a line with the value 0 all along the row gets inserted.

Is the same like and only the tk changes. Same version same natural key. All is 0.

Any idea why this might be happening?

Comment: Posted by Leonardo Honores at Apr 01, 2010 07:50