Problems starting spoon
Q: When I start Spoon I get the following error: "Could not find the main class. Program will exit". How do I start Spoon?
A: You can get this message for several reasons, but the root cause is always that kettle.jar is not on the classpath.
- Assuming you downloaded the binary version of Pentaho Data Integration: check whether you extracted the zip file while maintaining the directory structure: under the main directory there should be a directory called "lib" that contains a file called kettle.jar. If this is not the case, re-extract the zip file in the proper way.
- If you fetched the sources of Pentaho Data Integration and compiled them yourself, you are probably executing the spoon script from the wrong directory. The source distribution has a directory called "bin" that contains the scripts, but if you compile the proper way, the "distribution"-ready Pentaho Data Integration will be in a directory called "distrib". You should start the spoon script from that directory.
What's the difference between transformations and jobs?
Q: In Spoon I can make jobs and transformations, what's the difference between the two?
A: Transformations are about moving and transforming rows from source to target. Jobs are more about high-level flow control: executing transformations, sending mails on failure, FTPing files, and so on.
Rule on mixing row 'types' on a hop in a transformation
Q: In the manuals I read that row types may not be mixed, what does that mean?
A: Not mixing rows means that every row sent over a single hop needs to have the same structure: the same fieldnames, types, and order of fields. So something like "add an extra field if a condition is true for a row, but not otherwise" will not work, because it would produce rows of different types depending on the condition. You can switch on "Enable safe mode" to explicitly check for this at runtime.
As of v2.5.0, checking for mixed rows is also done automatically at design/verify time, but "Enable safe mode" still needs to be switched on to check it at runtime (as this check causes a slight processing overhead).
A way to look at this is that a hop is, in some respects, very similar to a database table: you cannot store rows of different types in a database table either. The underlying reason is that PDI wants to be able to apply uniform, consistent transformations to your data, and having variable rows would make this much more complex.
Technically, most of the steps use an optimization technique that translates column names into indexes into a row (e.g. the column with name "sid" is column 4); having variable rows would make this impossible (and would break most of the current steps).
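A rough sketch of this index optimization (illustrative Python, not PDI's actual Java internals; all names here are made up) shows why a row with a different layout breaks it:

```python
# Hypothetical sketch: a step resolves a field name to a positional index
# once, at "design time", then uses that index for every row that follows.

def make_getter(field_names, wanted):
    """Resolve 'wanted' to a row index once; return a fast per-row accessor."""
    idx = field_names.index(wanted)      # e.g. "sid" -> index 2
    return lambda row: row[idx]          # per-row access by index, not by name

field_names = ["name", "age", "sid"]
get_sid = make_getter(field_names, "sid")

print(get_sid(["alice", 30, "A-1"]))     # "A-1" - correct
# A row with a different structure silently yields the wrong value:
print(get_sid(["B-2", "bob", 25]))       # 25 - not a sid at all
```

Because the index is resolved once, a row whose fields are in a different order does not raise an error; it just produces wrong data, which is why PDI forbids mixing row types on a hop.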
About duplicate fieldnames in a transformation
Q: Why can't I duplicate fieldnames in a single row?
A: You can't. PDI will complain in most cases if you have duplicate fieldnames. Before PDI v2.5.0 you were able to force duplicate fields, but even then only the first of the duplicate fields could ever be used.
About empty strings and NULL
PDI follows Oracle in its handling of empty strings and NULLs: they are considered to be the same. If you find a step that doesn't follow this convention, let us know, since it's probably a bug.
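As a sketch of what this convention means in practice (illustrative Python, not PDI code), a step following the Oracle rule would treat "" and NULL as interchangeable, e.g. by normalizing on read:

```python
# Illustrative only: under Oracle-style semantics, the empty string and NULL
# are the same value, so "" can be mapped to None before any comparison.

def normalize(value):
    """Map empty strings to None so "" and NULL compare equal downstream."""
    return None if value == "" else value

assert normalize("") is None       # empty string behaves like NULL
assert normalize(None) is None     # NULL stays NULL
assert normalize("x") == "x"       # real values pass through unchanged
```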
How to copy/duplicate a field in a transformation?
Q: How do you duplicate a field in a row in a transformation?
A: Several solutions exist:
- Use a "Select Values" step, renaming a field while also selecting the original one. The result is that the original field is duplicated under another name: selecting fieldA once unchanged, and twice more renamed to fieldB and fieldC, will duplicate fieldA to fieldB and fieldC.
- Use a "Calculator" step with e.g. the NVL(A,B) operation. Creating fieldB and fieldC as NVL(fieldA, fieldA) has the same effect as the first solution: 3 fields in the output which are copies of each other: fieldA, fieldB, and fieldC.
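The NVL trick works because NVL(A, B) returns B only when A is NULL, so NVL(fieldA, fieldA) always yields a copy of fieldA. A minimal sketch of those semantics (illustrative Python, not the Calculator step itself):

```python
# NVL(A, B): return B if A is NULL (None here), otherwise A (Oracle-style).
def nvl(a, b):
    return b if a is None else a

row = {"fieldA": "hello"}
# Defining fieldB and fieldC as NVL(fieldA, fieldA) duplicates fieldA:
row["fieldB"] = nvl(row["fieldA"], row["fieldA"])
row["fieldC"] = nvl(row["fieldA"], row["fieldA"])
```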
How to do a database join with PDI?
Q: How do you do a database join with PDI?
A: If you want to join 2 tables from the same database, use a "Table Input" step and do the join in SQL itself; within 1 database this is the fastest solution.
If you want to join 2 tables that are not in the same database, you can use the "Database Join" step, but you have to realize that PDI will launch a query for each input row coming from the main stream. You can find more information on this in 1 of the weekly tips.
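To make the performance implication concrete, here is a rough sketch of what the "Database Join" step effectively does (illustrative Python with a fake in-memory "database"; the field names and query are assumptions, not PDI internals):

```python
# Sketch: one lookup query is fired per incoming row from the main stream,
# so N main-stream rows mean N round trips to the second database.

def database_join(main_rows, run_query):
    for row in main_rows:
        for match in run_query(row["customer_id"]):  # one query per row!
            yield {**row, **match}                   # merge looked-up fields

# Fake "second database" standing in for the other connection:
orders = {1: [{"order": "A"}], 2: [{"order": "B"}, {"order": "C"}]}
fake_query = lambda cid: orders.get(cid, [])

rows = [{"customer_id": 1}, {"customer_id": 2}]
print(list(database_join(rows, fake_query)))
```

With a same-database join, the database engine does all of this in a single query instead, which is why "Table Input" with a SQL join is faster there.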
How to sequentialize transformations?
Q: By default all steps in a transformation run in parallel; how can I make 1 row get processed completely to the end before the next row is processed?
A: This is not possible. One of the basic principles of PDI transformations is that all of the steps run in parallel, so you can't sequentialize them. This would require architectural changes to PDI, and sequential processing would also result in very slow processing.
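The streaming model can be pictured roughly like this (illustrative Python, not PDI code): each step is its own worker, hops are queues between them, and rows flow onward as soon as a step produces them, so "finish row 1 before starting row 2" is simply not how the engine is structured.

```python
# Sketch: a "step" runs concurrently, reading rows from its input hop and
# writing results to its output hop as they arrive. None marks end of stream.
import queue
import threading

def step(transform, inbox, outbox):
    while (row := inbox.get()) is not None:
        outbox.put(transform(row))   # row moves on immediately
    outbox.put(None)                 # propagate end-of-stream

hop1, hop2 = queue.Queue(), queue.Queue()
threading.Thread(target=step, args=(lambda r: r * 2, hop1, hop2)).start()

for n in [1, 2, 3]:                  # feed rows into the first hop
    hop1.put(n)
hop1.put(None)

results = []
while (row := hop2.get()) is not None:
    results.append(row)
print(results)                       # [2, 4, 6]
```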