This step gets a unique ID from the slave server of your choice.
The step works by asking a slave server for the a range of values. In other words, the step reserves a range of IDs for a certain given central slave server sequence. It then increments by one until there are no more values in the range, asks another range, and so forth.
This means that the returned IDs can be sequential, however this is not guaranteed. The only thing you know is that after each execution for each input row you get a unique ID that has a higher value than the last one returned.
The last value returned + the size of the range (increment) are stored in a database table to persist the value over several runs of a transformation.
Since version: 4.2.0GA
When processing large amounts of data on multiple distinct systems (for example when running a clustered transformation) you sometimes still need to generate a unique numeric ID for each. Since a GUID is not guaranteed to be unique and consumes a lot more memory and space compared to a numeric integer (long) ID, you may prefer to use numerical IDs.
A unique ID identifies a piece of information, which is useful for: looking up data, support cases, incremental processing, error handling (recovering from where you left off by looking up the last inserted ID) and so on. Typically you would use a database to generate such an ID, for example using an Oracle sequence. However, there are cases where you don't have a database available (such as when you add a unique ID to a text/log file), when the database doesn't support sequences (column databases, typically) or when retrieving the ID from a database is slow (going back and forth to the database for each row severely limits performance). In all these situations, you need a high-performance way of providing unique IDs to a PDI transformation. The Get ID From Slave Server step is designed to fit this need.
The "old way" of generating a unique ID locally was with the Add Sequence step. This is fast, but not unique across multiple systems. Getting a unique ID from a remote database sequence (also possible with the Add Sequence step) will work, but it will be slow and won't scale. Get ID From Slave Server uses a hybrid approach by getting value ranges from a central server which keeps track of the ranges for the various slave sequences, with the ability to use them locally.
- Name of value : the name of the (Integer type) output field (sequence or ID)
- Slave server: the slave server to get the unique ID range from. This can be specified using a variable.
- Sequence name: the name of the sequence as defined on the slave server (see below). The sequence name can be specified using a variable.
- Increment or batch size: the number of IDs to generate before a new value (range) is obtained from the slave server. The higher you make the increment size, the faster the step can run. However, it will deplete the available IDs (1x10^15) faster. For example, if you take 1,000,000,000 as the increment you can reserve 1,000,000 times a range of IDs. If you only use up a few of those ID each times a transformation runs, you will run out of IDs anyway after 1M executions. So make sure not to make this value too high. Keep the increment value in the 1000-10000 range and you should be able to achieve speeds of over 100,000 rows/sec with the step. This parameter can be specified using a variable.
You need to start your slave server with extra connection and sequence information configured in an XML file. The extra connection and sequences blocks are what make the sequences work in the following example:
As you can see, the definition of the sequence is fairly simple. You can copy the connection XML from Spoon by right clicking on the database connection name in the left hand tree.
The "start" tag is optional and will default to 0 in case you leave it out of the definition.
You can define as many sequences as you like.
Slave servers received a new servlet that can be called as follows (authentication required):
In case no increment is specified, 10000 IDs are reserved, for example:
The servlet will return a simple piece of XML containing both the start of the range as well as the number of IDs reserved or the increment:
In our example, the following row will be present in the SEQ_TABLE table:
Please make sure to create the sequence table with the sequence field (<sequence_field>) as a primary key. This will guarantee uniqueness and allow for fast lookup of the sequence value (<value_field>)
It can be a burden to maintain all your sequences in an XML file. Because of this it is also possible to automatically load all the sequences from a database table. You can use the following construct to do it:
The <autocreate> tag allows any sequence name to be specified in the step without error. In that case the sequence with the name specified will be created automatically with the start value from the <autosequence> specification.
We recommend that you disable auto-creation of slave sequences in a production environment.