
Data Services via the Thin Kettle JDBC driver

The Thin Kettle JDBC driver provides a means for a Java-based client to query the results of a Kettle transformation remotely using JDBC and SQL.

Available from Pentaho Data Integration Enterprise Edition 5.0 or higher.

Architecture

As with most JDBC drivers, the thin driver has a server component and a client component.
The server is designed to run as a Servlet on the Carte server or the Pentaho Data Integration server.


Where do the data service tables come from?

You can configure a transformation step to serve as a data service in the "Data Service" tab of the transformation settings dialog.

When such a transformation is saved, a new entry is created in either the local metastore (under the .pentaho/metastore folder) or in the Enterprise repository on the DI server (/etc/metastore).

As such, Carte or the DI Server will automatically and immediately pick up newly configured data services the moment your service transformation is saved.
The Carte configuration file also accepts a <repository> element; the referenced repository will be scanned in addition to the local metastore.

Reminder: the Carte users and passwords are stored in the pwd/kettle.pwd file; the default user is "cluster" with password "cluster".  For the DI server you can test with the standard admin or suzy accounts, password "password".

Monitoring

During execution of a query, two transformations will be executed on the server:

  1. A service transformation, designed by a user in Spoon, that provides the service data
  2. An automatically generated transformation to aggregate, sort and filter the data according to the SQL query

These two transformations are visible on Carte or in the Spoon slave server monitor and can be tracked, sniff-tested, paused and stopped just like any other transformation.  However, they cannot be restarted manually since the two transformations are programmatically linked.

Example

For this example we open the "Getting Started Transformation" (see the sample/transformations folder of your PDI distribution) and configure a Data Service called "gst" for the "Number Range" step.

Then we can launch Carte or the Data Integration Server to execute a query against that new virtual database table:

Code Block

SELECT   dealsize, sum(sales) AS total_sales, count(*) AS nr
FROM     gst
GROUP BY dealsize
HAVING   count(*) > 20
ORDER BY sum(sales) DESC

The server parses this query and generates a transformation that converts the service transformation data into the requested format.
The data being injected originates from the service transformation.
So for each executed query you will see two transformations listed on the server.

The JDBC Client

The JDBC driver uses the following class:

org.pentaho.di.core.jdbc.ThinDriver

The URL is in the following format:

jdbc:pdi://hostname:port/kettle?option=value&option=value

For Carte, this is an example:

jdbc:pdi://localhost:8084/kettle?debugtrans=/tmp/jdbc.ktr

For the Data Integration server:

jdbc:pdi://localhost:9080/kettle?webappname=pentaho-di

This example is for the Carte configuration file shown above.
The following standard options are available:

  • webappname : the name of the web app (typically pentaho-di on the DI server). Make sure to specify this in the "Options" section of the Kettle database connection dialog if you want to connect with PDI to a PDI Server (Carte or the DI server);
  • proxyhostname : the proxy server for the HTTP connection(s)
  • proxyport : the port of the proxy server
  • nonproxyhosts : the hosts (comma separated) for which no proxy should be used
  • debugtrans : the optional name of a file in which the generated transformation will be stored for debugging purposes (example: /tmp/debug.ktr)
  • debuglog : set to "true" to have the logging text of the remote SQL transformation written to the general logging channel once execution is finished.

Parameters for the service transformation can be set with the following format: PARAMETER_name=value (i.e. the option name is the parameter name prefixed with "PARAMETER_").
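
Putting the URL, options and parameters together, here is a minimal sketch of a Java client using the thin driver against a local Carte instance. It assumes the "gst" data service from the example above and the default Carte credentials; the PARAMETER_MAX_VALUE parameter is purely hypothetical and only illustrates the "PARAMETER_" prefix:

Code Block

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ThinDriverExample {
  public static void main(String[] args) throws Exception {
    // Load the thin Kettle JDBC driver explicitly (needed for older, pre-JDBC4 clients)
    Class.forName("org.pentaho.di.core.jdbc.ThinDriver");

    // Carte URL: debugtrans stores the generated transformation for debugging,
    // PARAMETER_MAX_VALUE is a hypothetical service transformation parameter.
    String url = "jdbc:pdi://localhost:8084/kettle"
               + "?debugtrans=/tmp/jdbc.ktr&PARAMETER_MAX_VALUE=100";

    try (Connection conn = DriverManager.getConnection(url, "cluster", "cluster");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT dealsize, sum(sales) AS total_sales, count(*) AS nr "
           + "FROM gst GROUP BY dealsize HAVING count(*) > 20 "
           + "ORDER BY sum(sales) DESC")) {
      while (rs.next()) {
        System.out.println(rs.getString("dealsize") + " : "
            + rs.getDouble("total_sales") + " (" + rs.getLong("nr") + ")");
      }
    }
  }
}

To target the Data Integration server instead, use a URL of the form jdbc:pdi://localhost:9080/kettle?webappname=pentaho-di and the corresponding DI server credentials.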

SQL Support

Support for SQL is minimal at the moment.

The following things are supported, please consider everything else unsupported:

  • SELECT:
    • * is expanded to include all fields of the service
    • COUNT(field)
    • COUNT(*)
    • COUNT(DISTINCT field)
    • DISTINCT <fields>
    • IIF( condition, true-value or field, false-value or field)
    • CASE WHEN condition THEN true-value ELSE false-value END
    • Aggregates: SUM, AVG, MIN, MAX
    • Aliases, both with the "AS" keyword and separated by one or more spaces, for example SUM(sales) AS "Total Sales" or SUM(sales) TotalSales
    • Constant expressions are possible, see below in the literals section.
    • Calculations, on the other hand, are not possible yet; perform them in the service transformation for now.
  • FROM
    • Strictly one service name; aliasing is possible
    • You can omit the service name to query from an empty row, or you can query from dual; for example "SELECT 1" and "SELECT 1 FROM dual" are equivalent.
    • You can specify a schema (default is Kettle) but it is currently ignored.  It will be translated to a namespace in the near future.
  • WHERE
    • nested brackets
    • AND, OR, and NOT when followed by a bracketed condition, for example: NOT ( A = 5 OR C = 3 )
    • precedence taken into account
    • Literals (String, Integer)
    • PARAMETER('parameter-name')='value'  (always evaluates to TRUE in the condition)
    • =
    • <
    • >
    • <=, =<
    • >=, =>
    • <>
    • LIKE (standard % and ? wildcards are converted to .* and . regular expressions)
    • REGEX (matches regular expression)
    • IS NULL
    • IS NOT NULL
    • IN ( value, value, value, ... )
    • You can put a condition on the IIF expression or its alias if one is used (please use identical string literals for the expressions)
  • GROUP BY
    • Group on fields, not on the IIF() function
  • HAVING
    • Conditions should be placed on the aggregate construct, not the alias
    • Please use identical strings for the expressions; the algorithm is not yet that smart.  In other words, if you use "COUNT( * )" in the SELECT clause you should use the same "COUNT( * )" expression in the HAVING clause, not "COUNT(*)" or any other variant.
    • You can place having conditions on aggregations that do not appear in the SELECT clause.
  • ORDER BY
    • You can order on any column, whether or not it appears in the result
    • You can order on IIF or CASE WHEN expressions.

Literals: 

  • Strings have single quotes around them; escaping quotes is not yet supported.
  • Dates have square brackets around them and the following formats are supported: [yyyy/MM/dd HH:mm:ss.SSS], [yyyy/MM/dd HH:mm:ss] and [yyyy/MM/dd]
  • Number and BigNumber literals should have no grouping symbol and use . as the decimal symbol (example: 123.45)
  • Integers contain only digits
  • Boolean values can be TRUE or FALSE
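
As a quick illustration of these literal formats, here is a sketch of a query string as it could be sent through the thin driver from a Java client; the "orders" service and its country, order_date and shipped fields are hypothetical and only serve to show the syntax:

Code Block

// Hypothetical service ("orders") and field names; only the literal syntax matters here.
String sql =
      "SELECT country, count(*) AS nr "
    + "FROM   orders "
    + "WHERE  country = 'Belgium' "            // String literal: single quotes
    + "  AND  order_date >= [2012/01/01] "     // Date literal: square brackets
    + "  AND  shipped = TRUE "                 // Boolean literal
    + "GROUP BY country";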

Limitations

Besides the obviously plentiful limitations in the support for the SQL standard, there are a few things worth noting:

  • Grouping is done using a "Memory Group By" step which keeps all the groups in memory. If you expect a large number of groups, watch the memory consumption on the server.  We use the "Memory Group By" step to avoid doing a costly sort on the data.
  • It is not possible to specify the same field twice in the same SELECT clause (for whatever reason you might want to do that).
  • Calculations and functions such as string concatenation are not (yet) supported.

Configuring clients

Clients typically need the following libraries to work:

  • kettle-core
  • commons HTTP client
  • commons codec
  • commons lang
  • commons logging
  • commons VFS (1.0)
  • log4j
  • scannotation

These libraries can be found in your data-integration/lib folder with the appropriate version number (e.g. kettle-core-5.0.0.jar).

SQuirreL SQL


Since SQuirreL already contains most of the needed jar files, configuring it is simply done by adding kettle-core.jar as a new driver jar file, along with Apache Commons VFS 1.0 and scannotation.jar.

BIRT

The following jar files need to be added:

  • kettle-core.jar
  • commons HTTP client
  • commons codec
  • commons lang
  • commons logging
  • commons VFS (1.0)
  • log4j
  • scannotation


Pentaho Report Designer


Simply replace the kettle-*.jar files in the lib/ folder with new files from Kettle v5.0-M1 or higher.

Pentaho Schema Workbench

Replace the current kettle-*.jar files with the ones from Kettle v5 or later.

Pentaho Interactive Reporting

Interactive Reporting runs on Pentaho Metadata, so this advice also applies there.

You need a BI Server that uses the PDI 5.0 jar files, or you can use an older version and update the kettle-core, kettle-db and kettle-engine jar files in the /tomcat/webapps/pentaho/WEB-INF/lib/ folder.


Pentaho Analysis (Mondrian): Analyzer / Saiku / JPivot

See Pentaho Interactive reporting: simply update the kettle-*.jar files in your Pentaho BI Server (tested with 4.1.0 EE and 4.5.0 EE) to get it to work.

Example of patching:

Code Block

matt@kettle:~/pentaho/4.5.0-ee/server/biserver-ee$ rm ./tomcat/webapps/pentaho/WEB-INF/lib/kettle-core-4.3.0-GA.jar
matt@kettle:~/pentaho/4.5.0-ee/server/biserver-ee$ rm ./tomcat/webapps/pentaho/WEB-INF/lib/kettle-engine-4.3.0-GA.jar
matt@kettle:~/pentaho/4.5.0-ee/server/biserver-ee$ rm ./tomcat/webapps/pentaho/WEB-INF/lib/kettle-db-4.3.0-GA.jar
matt@kettle:~/pentaho/4.5.0-ee/server/biserver-ee$ cp /kettle/5.0/lib/kettle-core.jar ./tomcat/webapps/pentaho/WEB-INF/lib/
matt@kettle:~/pentaho/4.5.0-ee/server/biserver-ee$ cp /kettle/5.0/lib/kettle-db.jar ./tomcat/webapps/pentaho/WEB-INF/lib/
matt@kettle:~/pentaho/4.5.0-ee/server/biserver-ee$ cp /kettle/5.0/lib/kettle-engine.jar ./tomcat/webapps/pentaho/WEB-INF/lib/


Fun fact: Mondrian generates SQL like the following for such a report:

Code Block

select "Service"."Category" as "c0", "Service"."Country" as "c1", sum("Service"."sales_amount") as "m0" from "Service" as "Service" group by "Service"."Category", "Service"."Country"

Kettle

You can query a remote service transformation with any Kettle v5 or higher client.  You can query the service through the database explorer and the various database steps (for example the Table Input step).


DataCleaner

TODO: ask project owners to change the current old driver class to the new thin one.

Jaspersoft iReport Designer

Partial success, as I'm getting some XML parsing errors.  However, adding the aforementioned jar files at least allows you to get back the query fields.


To be investigated.

QlikView

http://tiqview.tumblr.com/post/29820190073/stream-data-from-pentaho-kettle-into-qlikview-via-jdbc

Future

The following things are next on the agenda:

...

Any Java-based, JDBC-compliant tool, including third-party reporting systems, can use this driver to query a Kettle
transformation using a SQL string via JDBC.

Just in time blending of data from multiple sources for a complete picture:

  • Connect, combine and transform data from multiple sources
  • Query data directly from any transformation
  • Access architected blends with the full spectrum of Pentaho Analytics
  • Manage governance and security of data for on-going accuracy


This just-in-time, architected blending delivers accurate big data analytics based on the blended data. You can connect to, combine, and even transform data from any of the multiple data stores in your hybrid data ecosystem into these blended views, then query the data directly via that view using the full spectrum of analytics in the Pentaho Analytics platform, including predictive analytics.


Available since Pentaho Data Integration Version 5.0 GA (Enterprise Edition)

Please see the child pages of this page for more information.
