This step allows you to enter a user-defined Java class that drives the functionality of a complete step. In essence, it lets you program your own plugin inside a step.
The goal of the "User Defined Java Class" step is not to allow a user to do full-scale Java development inside of a step. Obviously we have a whole plugin system available to help with that part. (see: The PDI SDK)
The goal is to allow users to define methods and logic with as little code as possible, executed as fast as possible. For this we use the Janino project libraries, which compile Java code in the form of classes at runtime.
The first thing to know is that Janino, and as a consequence this step, doesn't need the complete Java class, only the class body: the imports, constructors and methods you need. In other words, you never write the class declaration itself. The developers of this step selected this approach over the definition of the full class because it hides a lot of technical details and methods from the user.
Kettle adds the following imports:
If you need others you need to include them yourself at the very top of your code, for example:
Another thing to note is that Janino, essentially a Java byte-code generator, only supports a subset of the Java 1.5 specification. To see a complete list of the features and limitations, please go to the Janino homepage. At the time of writing the most apparent limitation is the absence of generics.
Again, if you need to do a lot of Java development we advise you to do this in a Java IDE like Eclipse, not inside this step. You can always expose your Java code to this step by packaging it in a jar file and placing that library on the classpath of Kettle (try the libext/ folder).
Most of the time, working with input and output fields is the most important thing you'll be doing in your UDJC code. As such, there are a number of ways to handle the manipulation of fields. To start with let's look at the description of the input row:
The "inputRowMeta" object contains the metadata of the input row. This includes all the fields, their data types, lengths, names, format masks and more. You can use it to look up input fields. For example, if you want to look for a field named "customer" you use the following code:
Because looking up field names is slow if you need to do it for every row that passes through a transformation, we advise you to look up field names in advance in a first block like this (in the processRow() method):
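The lookup-once idea can be illustrated outside of Kettle. The following is a minimal stand-alone sketch, not the step's actual code: the class FieldIndexCache and its method yearOf() are hypothetical stand-ins, and a plain list of field names takes the place of the input row metadata. It only shows the pattern of resolving a field name to an index on the first row and reusing that index for every later row.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a step: NOT the Kettle API, only the lookup-once pattern.
class FieldIndexCache {
    private boolean first = true;            // mirrors the idea of a "first row" flag
    private int yearIndex = -1;              // resolved once, reused for every row
    private final List<String> fieldNames;   // stands in for the input row metadata

    FieldIndexCache(List<String> fieldNames) {
        this.fieldNames = fieldNames;
    }

    // Returns the "year" value of one row; the name lookup happens on the first call only.
    Object yearOf(Object[] row) {
        if (first) {
            yearIndex = fieldNames.indexOf("year");   // slow lookup by name, done once
            if (yearIndex < 0) {
                throw new IllegalStateException("Field 'year' not found in the input row");
            }
            first = false;
        }
        return row[yearIndex];                        // fast positional access
    }

    public static void main(String[] args) {
        FieldIndexCache cache = new FieldIndexCache(Arrays.asList("customer", "year"));
        System.out.println(cache.yearOf(new Object[] {"ACME", 2010L}));
        System.out.println(cache.yearOf(new Object[] {"Initech", 2011L}));
    }
}
```

Failing loudly when the field is missing, as the sketch does, is usually preferable to carrying a negative index into the per-row code.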
To get your hands on the Integer value contained in field "year" you can then use the following construct:
To make this process easier you can use a shortcut in this form:
This method will also take into account the index based optimization mentioned above.
The Java data types that you get from previous steps always correspond to the Kettle data types as described on the PDI Rows Of Data page.
You can define all the new fields you want in the output of the step in the "Fields" section of the step's dialog:
Doing this will automatically calculate the layout of the output row metadata and store it in "data.outputRowMeta". That in turn allows you to create the output row. In case the step writes as many (or fewer) rows as it reads, you can simply resize the row you get on input:
or more memorable:
If rows are being copied, make sure to create separate copies so that subsequent steps do not end up modifying the same object several times at once:
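Why separate copies matter can be shown with plain arrays. This is a minimal sketch under the assumption that a row is simply an Object[]; the class RowCopySketch and the method resizedCopy() are hypothetical names, and java.util.Arrays.copyOf stands in for whatever row utility the step provides.

```java
import java.util.Arrays;

// Hypothetical illustration: rows as plain Object[] arrays, not the Kettle row utilities.
class RowCopySketch {
    // Returns a NEW array of the output size: input values first, null slots for new fields.
    static Object[] resizedCopy(Object[] row, int outputSize) {
        return Arrays.copyOf(row, outputSize);
    }

    public static void main(String[] args) {
        Object[] input = {"ACME", 2010L};

        // Two output rows derived from one input row, each with its own array.
        Object[] a = resizedCopy(input, 3);
        Object[] b = resizedCopy(input, 3);
        a[2] = "first";
        b[2] = "second";

        System.out.println(Arrays.toString(a));   // the two rows keep distinct values
        System.out.println(Arrays.toString(b));
    }
}
```

Note that the copy is shallow: the array is new, but the values inside it are shared. That is harmless for immutable values such as String or Long, and needs more care for mutable ones.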
Similar to accessing input fields, output fields can be addressed through the index in the output row or using the field helper.
Using the index you can set a value like this:
or like this with the shortcut:
The Java data types that you pass on to next steps can't be just anything: they always need to correspond to the Kettle data types as described on the PDI Rows Of Data page.
Because it is not good practice to hard-code string values like field names (for example "customer" in the paragraph above), we allow the usage of parameters in this step:
In this example, taken from your Kettle distribution file "samples/transformations/User Defined Java Class - Calculate the date of Easter.ktr", we have a parameter called YEAR that is referenced with the getParameter() method, for example:
At runtime this will return the "year" String value.
The processRow() method is the heart of the step. This method is called by the transformation in a tight loop and will continue until false is returned. A very simple example that calculates firstname+" "+lastname and stores it into a "name" field is this:
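The control flow described here (one call per row, repeated until false is returned) can be simulated without Kettle. The following is a minimal stand-alone sketch, not the step's real code: NameStepSketch is a hypothetical class, a queue and a list stand in for the input and output hops, and the field positions 0 (firstname), 1 (lastname) and 2 (name) are assumptions made for the example. It is written as ordinary Java and uses generics, which the source notes Janino does not support, so it is meant to be run outside the step.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for a step: a queue and a list replace the real hops.
class NameStepSketch {
    private final Deque<Object[]> input = new ArrayDeque<>();
    final List<Object[]> output = new ArrayList<>();

    NameStepSketch(List<Object[]> rows) {
        input.addAll(rows);
    }

    // Handles exactly one row; returning false tells the caller to stop looping.
    boolean processRow() {
        Object[] r = input.poll();            // null once the input is exhausted
        if (r == null) {
            return false;                     // no more rows: we are done
        }
        Object[] out = Arrays.copyOf(r, 3);   // room for the new "name" field
        out[2] = r[0] + " " + r[1];           // firstname + " " + lastname
        output.add(out);                      // pass the row on
        return true;                          // ask to be called again
    }

    // Plays the role of the transformation: call processRow() in a tight loop.
    void run() {
        while (processRow()) {
            // each iteration processes one row
        }
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] {"John", "Doe"});
        rows.add(new Object[] {"Jane", "Roe"});
        NameStepSketch step = new NameStepSketch(rows);
        step.run();
        for (Object[] row : step.output) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```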
The getRow() method must be called before the first get(Fields.In, FIELD_NAME) call. This avoids problems with unexpected field ordering in the data obtained from the previous step (such as a Mapping input specification).
Look in the samples/transformations folder of your Kettle/PDI distribution for files starting with "User Defined Java Class" like "User Defined Java Class - Calculate the date of Easter.ktr".
The getRow() method returns the first available row from any incoming stream, whether that is a regular input stream or an info stream. Info steps are normally used precisely because their row metadata differs from that of the main input, so rows from the two kinds of stream must not be mixed up.
The adopted approach is therefore to read all data from the info stream before calling the getRow() method. (See example or issues: PDI-8738 and PDI-8740)
When getting parameters that point to transformation parameters, the UDJC behaves differently depending on where the getVariable function is called: in the init() method everything works fine, but in the initializer of a class member variable the variable is not resolved, by design. (see PDI-8963)
You need to implement logging yourself, because only you know which counters are worth reporting: lines read, written, output, updated and so on. Other steps log like so:
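A common approach is to log a progress line at a fixed row interval rather than for every row. The following is a minimal stand-alone sketch of that idea, not the code used by other steps: FeedbackLogSketch, shouldLog() and rowRead() are hypothetical names, the interval is an assumed setting, and a list of strings stands in for the step's log channel.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in: logs a progress line every "interval" rows.
class FeedbackLogSketch {
    private final long interval;                 // assumed feedback interval, in rows
    private long linesRead = 0;                  // counter maintained by this sketch
    final List<String> log = new ArrayList<>();  // stands in for the step's log channel

    FeedbackLogSketch(long interval) {
        this.interval = interval;
    }

    // True when the row count has reached a multiple of the interval.
    boolean shouldLog(long lines) {
        return interval > 0 && lines > 0 && lines % interval == 0;
    }

    // Call once per row read; appends a log line only at interval boundaries.
    void rowRead() {
        linesRead++;
        if (shouldLog(linesRead)) {
            log.add("Lines read: " + linesRead);
        }
    }

    public static void main(String[] args) {
        FeedbackLogSketch step = new FeedbackLogSketch(2);
        for (int i = 0; i < 5; i++) {
            step.rowRead();
        }
        for (String line : step.log) {
            System.out.println(line);
        }
    }
}
```

Logging at interval boundaries keeps the log readable and avoids slowing down the tight row loop.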
Blog post about this step and its usage in different scenarios: http://type-exit.org/adventures-with-open-source-bi/2010/10/the-user-defined-java-class-step