Configuring log tables for concurrent access

It can be surprisingly tricky to use a database to generate a unique ID.  The problem is mostly caused by inadequate or incorrect locking in the various databases.  To make matters worse, every database behaves differently, and even within a single database product (MySQL, for example) you can choose between storage engines that behave differently when it comes to locking tables.

Note: We are working on an easier solution for future releases, see PDI-5501

The problem

All that Kettle wants to do is generate a unique integer ID at the very start of the transformation or job.  That "batch" ID is then used in various places throughout the life of the transformation or job.  Unfortunately, generating a unique ID is not handled in any standard fashion by the various database vendors.  Some databases don't offer any support for it, some offer sequences, and others have special identity or auto-increment columns that automatically get a value assigned upon insertion of a record.

Whatever the case, early on Kettle opted to stay on the safe path and simply look in the log table for the maximum ID, then add one to it.  This works fine until more than one transformation or job runs at the same time.  At that point, two different database connections can read the very same maximum ID value, which in turn gives you duplicate values in the log tables.
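
To make the race concrete, here is a sketch of what effectively happens (the table name TRANS_LOG is illustrative; ID_BATCH is the usual batch ID column in a transformation log table):

-- Two connections start a transformation at the same moment:
SELECT MAX(ID_BATCH) + 1 FROM TRANS_LOG;  -- connection A reads 1001
SELECT MAX(ID_BATCH) + 1 FROM TRANS_LOG;  -- connection B also reads 1001
-- Both connections now log with ID_BATCH = 1001: a duplicate batch ID.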

The solution

Over the years, many attempts were made to properly lock the logging tables, but for certain databases this has proven to be quite tricky to get right.  To finally arrive at a working solution that is also backward compatible, the Pentaho engineers came up with the following options...

Use an identity column

You can use an identity or auto-increment column if your database supports this.  To enable support, create a new table like this (MySQL example):

create table LOG_COUNTER(ID INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY, NAME VARCHAR(100)) engine=InnoDB;

Insert a record:

insert into LOG_COUNTER(NAME) values('LOG_TABLES');
commit;

In the logging database connection in Pentaho Data Integration (Spoon), add the following line in the Options panel:

Parameter: AUTOINCREMENT_SQL_FOR_BATCH_ID
Value: UPDATE LOG_COUNTER SET ID=LAST_INSERT_ID(ID+1)

This tells PDI to take a value from the LOG_COUNTER table every time a new batch ID needs to be generated for a transformation or job log table.
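
The reason this is safe under concurrency is that MySQL's LAST_INSERT_ID(expr) form stores the expression value per connection, so two clients running the UPDATE at the same time each get their own value back.  As an illustration (not the exact statements PDI issues), the idiom works like this:

UPDATE LOG_COUNTER SET ID=LAST_INSERT_ID(ID+1);  -- atomically increments the shared counter
SELECT LAST_INSERT_ID();                         -- returns this connection's new value as the batch ID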

Use a SEQUENCE column

You can use a database sequence if your database supports this.  To enable support for this, create a sequence like this (Oracle example):

CREATE SEQUENCE LOGGINGSEQ START WITH 10000;

In the logging database connection in Pentaho Data Integration (Spoon), add the following line in the Options panel:

Parameter: SEQUENCE_FOR_BATCH_ID
Value: LOGGINGSEQ

This tells PDI to take a value from the LOGGINGSEQ sequence every time a new batch ID needs to be generated for a transformation or job log table.
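
Under the hood this amounts to fetching the next value from the sequence, which the database guarantees to be unique across concurrent sessions.  For illustration, the equivalent manual statement on Oracle would be:

SELECT LOGGINGSEQ.NEXTVAL FROM DUAL;  -- each call returns a new, unique batch ID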


  1. Jun 28, 2011

    Pedro Alves says:

    Here's a shell script to update all ktr's to use this method:

    find . -iname '*ktr' -exec grep -l '<trans-log-table><connection>[^<]' '{}' + | xargs grep -L 'AUTOINCREMENT_SQL_FOR_BATCH_ID' | xargs perl -pi -e 's#<attributes>#<attributes><attribute><code>EXTRA_OPTION_MYSQL.AUTOINCREMENT_SQL_FOR_BATCH_ID</code><attribute>UPDATE LOG_COUNTER SET ID=LAST_INSERT_ID(ID+1)</attribute></attribute>#'

  2. Oct 05, 2011

    David Kari says:

    Thanks for the posting.  We have been using logging since 4.0, but have not seen this issue until we upgraded to 4.2 last week.  Then this showed up so I was happy to find your posting.

    The above solution worked perfectly for our MySQL DBs.

    Also thanks Pedro for the script.  I had to modify it for my use because we have whitespace in our pathnames.  My version is shown below.

    # For Transformations
    find pdi_repository -iname '*ktr' -print0 | xargs -0 grep -l -Z '<trans-log-table><connection>[^<]' | xargs -0 grep -L -Z 'AUTOINCREMENT_SQL_FOR_BATCH_ID' | xargs -0 perl -pi -e 's#<attributes>#<attributes><attribute><code>EXTRA_OPTION_MYSQL.AUTOINCREMENT_SQL_FOR_BATCH_ID</code><attribute>UPDATE etl_log_counter SET ID=LAST_INSERT_ID(ID+1)</attribute></attribute>#'

    # For JOBS

    find pdi_repository -iname '*kjb' -print0 | xargs -0 grep -l -Z '<job-log-table><connection>[^<]' | xargs -0 grep -L -Z 'AUTOINCREMENT_SQL_FOR_BATCH_ID' | xargs -0 perl -pi -e 's#<attributes>#<attributes><attribute><code>EXTRA_OPTION_MYSQL.AUTOINCREMENT_SQL_FOR_BATCH_ID</code><attribute>UPDATE etl_log_counter SET ID=LAST_INSERT_ID(ID+1)</attribute></attribute>#'

  3. Jun 19, 2012

    Desislava Georgieva says:

    Thanks for the great posting! It's exactly what I need.

    I've just tried updating my DB2 logging connection with the SEQUENCE_FOR_BATCH_ID parameter, just as described, but I get an error when I start a transformation afterwards.

    > Error connecting to database: (using class com.ibm.db2.jcc.DB2Driver)
    > Spoon - ERROR (version 4.2.0-GA, build 15694 from 2011-08-24 11.46.44 by buildguy) : [jcc][10165][10051][3.51.126] Invalid database URL syntax: jdbc:db2://xHOSTx:xPORTx/xDBx:SEQUENCE_FOR_BATCH_ID=LOGGINGSEQ. ERRORCODE=-4461, SQLSTATE=42815

    What did I do wrong?

  4. Aug 30, 2012

    Pablo Quiroga says:

    This is problematic for us.  We need to build a set of transformations and jobs that are compatible with both Oracle and MS SQL Server, without duplicating them.  We don't want to maintain two sets of files, and these get rolled out to multiple customers, each running one of the RDBMSs just mentioned.

    Everything else in the transformations and jobs works very well, except this logging thing.  And logging to the DB is a requirement.  Any other workaround/solution you can suggest?

    We're on version 4.3.0-GA build 16753 from 12-04-18.

    -pq

  5. Apr 25, 2013

    Jens Bleuel says:

    We are working on an easier solution for future releases, see PDI-5501

  6. May 15, 2013

    Markus Schwab says:

    I tried the latter since I'm logging to PostgreSQL.  Using a file-based repo, I added "<attribute><code>EXTRA_OPTION_POSTGRESQL.SEQUENCE_FOR_BATCH_ID</code><attribute>translogseq</attribute></attribute>" to the .kdb file of the connection.  I created the sequence, having it start at a number higher than the max id_batch in the trans_log table.

    Now whenever I run a small transformation in order to test this, it does not use the sequence, but just increments the id_batch as it did before.  If I add the above line to the connection in the .ktr, it works.  Wasn't the .kdb file supposed to override whatever's stored in the .ktr?

  7. Nov 05, 2013

    Jens Bleuel says:

    From PDI-10911: For MS-SQL 2008-R2 there is no solution. MS-SQL 2008-R2 doesn't have SEQUENCEs. They are supported in MS-SQL 2012.

