Added by Matt Casters, last edited by Jens Bleuel on Feb 11, 2014  (view change)

Labels:

Enter labels to add to this page:
Wait Image 
Looking for a label? Just start typing.

Description

This step gets a unique ID from the slave server of your choice.

The step works by asking a slave server for the a range of values.  In other words, the step reserves a range of IDs for a certain given central slave server sequence.  It then increments by one until there are no more values in the range, asks another range, and so forth.

This means that the returned IDs can be sequential, however this is not guaranteed.  The only thing you know is that after each execution for each input row you get a unique ID that has a higher value than the last one returned.

The last value returned + the size of the range (increment) are stored in a database table to persist the value over several runs of a transformation. 

Context

When processing large amounts of data on multiple distinct systems (for example when running a clustered transformation) you sometimes still need to generate a unique numeric ID for each one. Since a GUID is not guaranteed to be unique and consumes a lot more memory and space compared to a numeric integer (long) ID, you may prefer to use numerical IDs.

A unique ID identifies a piece of information, which is useful for: looking up data, support cases, incremental processing, error handling (recovering from where you left off by looking up the last inserted ID) and so on. Typically you would use a database to generate such an ID, for example using an Oracle sequence. However, there are cases where you don't have a database available (such as when you add a unique ID to a text/log file), when the database doesn't support sequences (column databases, typically), or when retrieving the ID from a database is slow (going back and forth to the database for each row severely limits performance). In all these situations, you need a high-performance way of providing unique IDs to a PDI transformation. The Get ID From Slave Server step is designed to fit this need.

Note: The "old way" of generating a unique ID locally was with the Add Sequence step. This is fast, but not unique across multiple systems. Getting a unique ID from a remote database sequence (also possible with the Add Sequence step) will work, but it will be slow and won't scale. Get ID From Slave Server uses a hybrid approach by getting value ranges from a central server which keeps track of the ranges for the various slave sequences, with the ability to use them locally.

Assuming you have (or are now configuring) Carte slave servers set up in PDI, this step retrieves a unique ID from the Carte slave server of your choice. It works by asking a slave server for a range of values. In other words, the step reserves a range of IDs for a certain given central slave server sequence. It then increments by one until there are no more values in the range, asks another range, and so forth. This means that the returned IDs can be sequential, however this is not guaranteed. The only thing you know is that after each execution for each input row you get a unique ID that has a higher value than the last one returned. The last value returned plus the size of the range (increment) are stored in a database table to persist the value over several runs of a transformation.

Options

Option Definition
Step name The name of this step as it appears in the transformation workspace.
Name of value The name of the (Integer type) output field (sequence or ID)
Slave server The slave server to get the unique ID range from. This can be specified using a variable
Sequence name The name of the sequence as defined on the slave server (see below). The sequence name can be specified using a variable
Increment or batch name The number of IDs to generate before a new value (range) is obtained from the slave server. The higher you make the increment size, the faster the step can run. However, it will deplete the available IDs (1x10^15) faster. For example, if you take 1,000,000,000 as the increment, you can reserve 1,000,000 times a range of IDs. If you only use up a few of those ID each time a transformation runs, you will run out of IDs anyway after 1M executions. Don't make this value too high; keep it in the 1000-10000 range to achieve speeds of over 100,000 rows/sec with this step. This parameter can be specified using a variable

Slave server configuration

You need to start your slave server with extra connection and sequence information configured in an XML file. The extra connection and sequences blocks are what make the sequences work in the following example:

<slave_config>

  <slaveserver>
    <name>master1</name>
    <hostname>localhost</hostname>
    <port>8282</port>
    <master>Y</master>
  </slaveserver>

  <connection>
    <name>MySQL</name>
    <server>localhost</server>
    <type>MYSQL</type>
    <access>Native</access>
    <database>test</database>
    <port>3306</port>
    <username>matt</username>
    <password>Encrypted 2be98afc86aa7f2e4cb79ce10df90acde</password>
  </connection>

  <sequences>

   <sequence>
    <name>test</name>
    <start>0</start>
    <connection>MySQL</connection>
    <schema/>
    <table>SEQ_TABLE</table>
    <sequence_field>SEQ_NAME</sequence_field>
    <value_field>SEQ_VALUE</value_field>
   </sequence>

  </sequences>

The start tag is optional and will default to 0 if you leave it out of the definition. You can define as many sequences as you like.

Servlet Information

Once the configuration files are changed as shown above, slave servers receive a new servlet that can be called as follows (authentication required):http://hostname:port/kettle/nextSequence/?name=SequenceName&increment=NumberOfIDsToReserveIn
case no increment is specified, 10000 IDs are reserved, for example:http://localhost:8282/kettle/nextSequence/?name=test
The servlet will return a simple piece of XML containing both the start of the range as well as the number of IDs reserved, or the increment:

<seq><value>570000</value><increment>10000</increment></seq>

Continuing with this example, the following row will be present in the SEQ_TABLE table:

mysql> select * from SEQ_TABLE where SEQ_NAME='test';
SEQ_NAME SEQ_VALUE
test 580000

Note: You must create the sequence table with the sequence field (<sequence_field>) as a primary key. This will guarantee uniqueness and allow for fast lookup of the sequence value (<value_field>).

Automatic Loading and Creation

It can be a burden to maintain all your sequences in an XML file. Because of this, it is also possible to automatically load all the sequences from a database table. You can use the following construct to do it:

<autosequence>
    <connection>MySQL</connection>
    <schema/>
    <start>1234</start>
    <table>SEQ_TABLE</table>
    <sequence_field>SEQ_NAME</sequence_field>
    <value_field>SEQ_VALUE</value_field>

    <autocreate>N</autocreate>
</autosequence>

The <autocreate> tag allows any sequence name to be specified in the step without error. In that case, the sequence with the name specified will be created automatically with the start value from the <autosequence> specification.

Note: You should disable auto-creation of slave sequences in a production environment.