Added by Matt Casters, last edited by Jens Bleuel on Apr 25, 2013

Configuring log tables for concurrent access

It can be surprisingly tricky to use a database to generate a unique ID.  The problem is mostly caused by inadequate or incorrect locking in the various databases.  To make matters worse, every database behaves differently, and even within a single database you can use different storage engines (as in MySQL) that behave differently when it comes to locking tables.

Note: We are working on an easier solution for future releases, see PDI-5501

The problem

All that Kettle wants to do is generate a unique integer ID at the very start of a transformation or job.  That "batch" ID is then used in various places throughout the life of the transformation or job.  Unfortunately, generating a unique ID is not handled in any standard fashion by the various database vendors.  Some databases don't offer any support for it, some offer sequences, and others have special identity or auto-increment columns that are automatically assigned a value upon insertion of a record.

Whatever the case, Kettle opted early on to stay on the safe path: simply look in the log table for the maximum ID and add one to it.  This works fine until more than one transformation or job runs at the same time.  At that point, two different database connections can read the very same maximum ID value, which in turn gives you duplicate values in the log tables.
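
In other words, each connection performs a read-then-increment with no lock between the two steps. A minimal sketch of the race, using the default ID_BATCH column of a transformation log table (the exact table name is illustrative):

-- Two connections, A and B, each starting a transformation at the same time:
SELECT MAX(ID_BATCH) FROM TRANS_LOG;  -- A reads 41 ... and B also reads 41
-- Both add one, so both log with batch ID 42: a duplicate in the log table.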

The solution

Over the years, many attempts were made to properly lock the logging tables, but for certain databases this has proven quite tricky to get right.  To finally arrive at a working solution that is also backward compatible, the Pentaho engineers came up with the following options...

Use an identity column

You can use an identity or auto-increment column if your database supports this.  To enable support, create a new table like this (MySQL example):

create table LOG_COUNTER(ID INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY, NAME VARCHAR(100)) engine=InnoDB;

Insert a record:

insert into LOG_COUNTER(NAME) values('LOG_TABLES');
commit;
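
To verify the counter row before pointing PDI at it, a quick check (the exact ID depends on how often the counter has already been used):

SELECT ID, NAME FROM LOG_COUNTER;
-- expected: one row, e.g. ID = 1, NAME = 'LOG_TABLES'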

In the logging database connection in Pentaho Data Integration (Spoon), add the following line in the Options pane:

Parameter: AUTOINCREMENT_SQL_FOR_BATCH_ID
Value: UPDATE LOG_COUNTER SET ID=LAST_INSERT_ID(ID+1)

This tells PDI to take a value from the LOG_COUNTER table every time a new batch ID needs to be generated for a transformation or job log table.
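
The mechanism relies on MySQL's LAST_INSERT_ID(expr) form: passing an expression stores the result as this connection's "last insert ID", so the counter is incremented and read back without an explicit table lock. Conceptually, this is what happens per new batch ID (a sketch of the mechanism, not necessarily the literal statements PDI issues):

UPDATE LOG_COUNTER SET ID = LAST_INSERT_ID(ID + 1);
-- the UPDATE takes a row lock (InnoDB), so concurrent callers queue up briefly

SELECT LAST_INSERT_ID();
-- returns the value this connection just set; LAST_INSERT_ID() is
-- per-connection, so simultaneous callers each get their own batch ID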

Use a SEQUENCE

You can use a database sequence if your database supports sequences.  To enable support, create a sequence like this (Oracle example):

CREATE SEQUENCE LOGGINGSEQ START WITH 10000;

In the logging database connection in Pentaho Data Integration (Spoon), add the following line in the Options pane:

Parameter: SEQUENCE_FOR_BATCH_ID
Value: LOGGINGSEQ

This tells PDI to take the next value from the LOGGINGSEQ sequence every time a new batch ID needs to be generated for a transformation or job log table.
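
To confirm the sequence behaves as expected before wiring it into PDI, you can draw a few values by hand (the starting value follows the CREATE SEQUENCE statement above):

SELECT LOGGINGSEQ.NEXTVAL FROM DUAL;  -- e.g. 10000
SELECT LOGGINGSEQ.NEXTVAL FROM DUAL;  -- e.g. 10001
-- each NEXTVAL call is atomic, so concurrent sessions never receive the same value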

Here's a shell script to update all .ktr files to use this method:

find . -iname '*.ktr' -exec grep -l '<trans-log-table><connection>[^<]' '{}' + | xargs grep -L 'AUTOINCREMENT_SQL_FOR_BATCH_ID' | xargs perl -pi -e 's#<attributes>#<attributes><attribute><code>EXTRA_OPTION_MYSQL.AUTOINCREMENT_SQL_FOR_BATCH_ID</code><attribute>UPDATE LOG_COUNTER SET ID=LAST_INSERT_ID(ID+1)</attribute></attribute>#'

Comment: Posted by Pedro Alves at Jun 28, 2011 07:20

Thanks for the posting.  We had been using logging since 4.0, but did not see this issue until we upgraded to 4.2 last week.  Then it showed up, so I was happy to find your posting.

The above solution worked perfectly for our MySQL DBs.

Also thanks Pedro for the script.  I had to modify it for my use because we have whitespace in our pathnames.  My version is shown below.

# For Transformations
find pdi_repository -iname '*.ktr' -print0 | xargs -0 grep -l -Z '<trans-log-table><connection>[^<]' | xargs -0 grep -L -Z 'AUTOINCREMENT_SQL_FOR_BATCH_ID' | xargs -0 perl -pi -e 's#<attributes>#<attributes><attribute><code>EXTRA_OPTION_MYSQL.AUTOINCREMENT_SQL_FOR_BATCH_ID</code><attribute>UPDATE etl_log_counter SET ID=LAST_INSERT_ID(ID+1)</attribute></attribute>#'

# For Jobs
find pdi_repository -iname '*.kjb' -print0 | xargs -0 grep -l -Z '<job-log-table><connection>[^<]' | xargs -0 grep -L -Z 'AUTOINCREMENT_SQL_FOR_BATCH_ID' | xargs -0 perl -pi -e 's#<attributes>#<attributes><attribute><code>EXTRA_OPTION_MYSQL.AUTOINCREMENT_SQL_FOR_BATCH_ID</code><attribute>UPDATE etl_log_counter SET ID=LAST_INSERT_ID(ID+1)</attribute></attribute>#'

Comment: Posted by David Kari at Oct 05, 2011 16:58

Thanks for the great posting! It's exactly what I need.

I've just tried updating my DB2 logging connection with the SEQUENCE_FOR_BATCH_ID parameter, just as described, but I get an error when I start a transformation afterwards.

> Error connecting to database: (using class com.ibm.db2.jcc.DB2Driver)
> Spoon - ERROR (version 4.2.0-GA, build 15694 from 2011-08-24 11.46.44 by buildguy) : [jcc][10165][10051][3.51.126] Invalid database URL syntax: jdbc:db2://xHOSTx:xPORTx/xDBx:SEQUENCE_FOR_BATCH_ID=LOGGINGSEQ. ERRORCODE=-4461, SQLSTATE=42815

What did I do wrong?

Comment: Posted by Desislava Georgieva at Jun 19, 2012 11:21

This is problematic for us. We need to build a set of transformations and jobs that are compatible with both Oracle and MS SQL Server, without duplicating them.  We don't want to maintain two sets of files, and these get rolled out to multiple customers that have one of the RDBMSs just mentioned.

Everything else in the transformations and jobs works very well, except this logging thing. And logging to a DB is a requirement.  Is there any other workaround/solution you can suggest?

We're on version 4.3.0-GA build 16753 from 12-04-18.

-pq

Comment: Posted by Pablo Quiroga at Aug 30, 2012 23:37

We are working on an easier solution for future releases, see PDI-5501

Comment: Posted by Jens Bleuel at Apr 25, 2013 04:40

I tried the latter since I'm logging to PostgreSQL. Using a file-based repo, I added "<attribute><code>EXTRA_OPTION_POSTGRESQL.SEQUENCE_FOR_BATCH_ID</code><attribute>translogseq</attribute></attribute>" to the .kdb file of the connection. I created the sequence, having it start at a number higher than the max id_batch in the trans_log table.

Now whenever I run a small transformation to test this, it does not use the sequence, but just increments the id_batch as it did before. If I add the above line to the connection in the .ktr, it works. Wasn't the .kdb file supposed to override whatever's stored in the .ktr?

Comment: Posted by Markus Schwab at May 15, 2013 12:37

From PDI-10911: for MS-SQL 2008 R2 there is no solution, because it doesn't have SEQUENCEs; they are supported in MS-SQL 2012.
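
For reference, on SQL Server 2012 and later the equivalent of the Oracle example above would be (a sketch; see Microsoft's SEQUENCE documentation):

CREATE SEQUENCE LOGGINGSEQ START WITH 10000 INCREMENT BY 1;
SELECT NEXT VALUE FOR LOGGINGSEQ;  -- draws the next value, e.g. 10000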

Comment: Posted by Jens Bleuel at Nov 05, 2013 11:23