Mar 31, 2008
Submitted by Matt Casters, Chief of Data Integration, Pentaho
- Table of Contents
- Launch several copies of a step
- Manage thread priorities
- Remove fields in Select Value
- Get Variables
- Text File Input
- Lazy conversion
- Join rows
- Review the big picture: database, commit size, row set size and other factors
- Step Performance Monitoring
To improve the parallel aspect again you can always start multiple copies of the one big step. (see below)
Another trick is to use B=NVL(A,A) in a Calculator step where B is forced to be a copy of A. In version 3.1, we are adding an explicit "create copy of field A" function in the Calculator.
Consider doing conversions between data types (dates, numeric data, etc) in a "Select Values" step (version 3.0.2 or higher). You can do this in the Meta-data tab of the step.
If you have variables that can be declared once at the start of the transformation, make sure you put them in a separate script and mark that script as a startup script (right click on the script name in the tab).
There are 2 main reasons why launching multiple copies of a step might result in better performance:
- The step has to deal with network latencies and launching multiple copies can reduce the average latency. If you have a low network latency of say 5ms and you need to do a round trip to the database, the maximum performance you get is 200 (x5) rows per second, even if the database is running smoothly. You can try to reduce the round trips with caching, but if not, you can try to run multiple copies. Example: a database lookup or table output
- Please consider the drawback of running multiple steps in parallel, since they consume more CPU threads.
In versions 3.0.2 and higher, this feature that is found in the "Transformation Settings" dialog (Misc tab) improves performance by reducing the locking overhead in certain situations. This feature is enabled by default for new transformations that are created in recent versions, but for older transformations this can be different.
Reference JIRA case: (fixed in 3.0.2 or 3.1.0)
If you don't need to do this, don't do it. It's an expensive step as the engine needs to reconstruct the complete row.
It's almost always faster to add fields to a row compared to deleting fields from a row.
Apparently this is cause for a slowdown as well in recent versions if you use it in a high-volume stream. (accepting input)
To solve it, take the "Get Variables" step out of the transformation (right click, detach). Then join it back in with a "Join Rows (cart prod)" step. Make sure to specify the main step to read from in the "Join Rows" step. Set it to the step that orriganally provided the "Get Variables" step with data.
It might make a big difference if you use the new "CSV Input" or "Fixed Input" steps as they have been re-written for optimal performance.
If you have a fixed width (field/row) input file, you can even read data in parallel. (multiple copies)
These new steps have been rewritten with Non-blocking I/O (NIO) features. Typically, the larger the NIO buffer you specify in the step, the better your read performance will be.
In these cases where you are reading data from a text file and you write the data back to a text file, you can use Lazy conversion to speed up the process. The principle behind it is that it delays data conversion with the hope that it doesn't take place at all. (reading from a file and writing it back comes to mind)
Beyond helping with data conversion, lazy conversion also helps to keep the data in "binary" storage form. This in turn will help the internal Kettle engine to do faster data serialization. (sort, clustering, etc)
The Lazy Conversion option is available in the "CSV Input" and "Fixed input" text file reading steps.
You need to specify the main step to read from. This will prevent the step from doing any spooling to disk when this is not needed.
If you are joining with a set of data that can fit into memory, make sure that the cache size (in rows of data) is large enough. This will avoid (slow) spooling to disk.
Always think of the whole environment that influences your performance. There can be limitating factors within the transformation but also in the outside world or Pentaho Data Integration. You can play with the optimal commit size for instance, but this is depending on many factors and we can only give you some background and encourage you to test with. It is depending at least on the following: your database, your tables, indexes, the JDBC driver, your hardware, speed of the LAN connection to the database, the row size of data and your transformation itself.
You can try with different commit sizes and also change the number of rows in row sets in your transformation settings or even the buffers in your JDBC drivers or database.
In a lot of cases a smaller row set size actually improves performance since it forces rows through all steps of a (parallel executing) transformation.
As of version 3.1.0 of Pentaho Data Integration, we are introducing a way to track the performance of individual steps in a transformation. This is an important tool that allows you to fine-tune performance of transformation because that performance is determined by the slowest step in the transformation. Further information can be found here.