Clustering with Pentaho Data Integration

The basics

The basics of clustering in Pentaho Data Integration are very simple: the user configures one or more steps of a transformation to run clustered:

TODO: Insert screenshot of simple clustering transformation.

In other words, clustering information is data integration metadata like any other.  Nothing is done with it until we execute the transformation.  The most important part of this metadata is the definition of the slave servers...

Slave servers

A slave server is essentially a small embedded web server.  We use this web server to control the slave server.  A slave server is started with the Carte program found in your PDI distribution.

The arguments used to start a slave server with Carte are discussed elsewhere, but the minimum is always an IP address or hostname and an HTTP port to communicate on.  For example:

sh carte.sh localhost 8080
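
On Windows, the Carte.bat script that ships with the distribution takes the same arguments.  To experiment with clustering on a single machine, you can simply start several Carte instances on different ports; the hostnames and port numbers below are arbitrary examples, not requirements:

sh carte.sh localhost 8081
sh carte.sh localhost 8082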

The slave server metadata can be entered in the Spoon GUI by giving your Carte instance a name and entering the same hostname and port.
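
When this definition is saved, it ends up as a small XML fragment in the transformation file.  The sketch below shows approximately what a slave server entry looks like in a .ktr file; the name and port are placeholders, cluster/cluster are Carte's default credentials, and in a real file the password is stored in obfuscated form rather than in clear text:

<slaveserver>
  <name>slave1-8081</name>
  <hostname>localhost</hostname>
  <port>8081</port>
  <username>cluster</username>
  <password>cluster</password>
  <master>N</master>
</slaveserver>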

Cluster schema

A cluster schema is essentially a collection of slave servers.   In each collection you need to pick one slave server that we will call the master slave server, or master.  The master is also just a Carte instance, but it takes care of all sorts of management tasks across the cluster schema. 

In the Spoon GUI you can enter this metadata as well, once you have started a couple of slave servers.
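
A cluster schema is likewise stored as XML in the transformation file.  The fragment below is a sketch of that format with illustrative values, not recommendations: it references one master (a slave server whose master flag is set to Y) and two slaves defined as above, and base_port is the first of the TCP ports the transformations use to pass rows between the slave servers.

<clusterschema>
  <name>local cluster</name>
  <base_port>40000</base_port>
  <sockets_buffer_size>2000</sockets_buffer_size>
  <sockets_flush_interval>5000</sockets_flush_interval>
  <sockets_compressed>Y</sockets_compressed>
  <dynamic>N</dynamic>
  <slaveservers>
    <name>master-8080</name>
    <name>slave1-8081</name>
    <name>slave2-8082</name>
  </slaveservers>
</clusterschema>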

Execution process

What happens when you execute a clustered transformation is this: all the steps that have clustering configured will run on the configured slave servers.  The steps that don't have any clustering metadata attached to them run on the single master slave server.  More specifically, the following happens:

  1. The user's transformation is split up into different parts: 1 part becomes a "master transformation" and N parts become "slave transformations".
  2. The master and slave transformations are sent in XML format to the slave servers (Carte instances) over HTTP.
  3. All the master and slave transformations are initialized across the cluster (server socket ports are opened, as are files).
  4. All the master and slave transformations are started.
  5. The parent application (Spoon, Kitchen or the Transformation job entry) monitors the master and slave transformations until they are finished.
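
The monitoring in the last step happens over the same web interface that Carte exposes.  As an example, assuming Carte's standard status servlet and its default cluster/cluster credentials, you can inspect the state of the transformations on a slave with:

curl -u cluster:cluster "http://localhost:8081/kettle/status/?xml=Y"

Leaving off the ?xml=Y parameter returns the same status as an HTML page you can open in a browser.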



3 Comments

  1. Hi Matt,

    Thanks for the information.

    I have one question. I want to set up PDI as a server so that I can access PDI from other systems. 

    I have a team of 10 members. I want all of them to build transformations/jobs on one repository, which is present on one machine, using their systems as clients.

    Please let me know if I can achieve this with Pentaho. If you have any questions in understanding the problem, please let me know. My email id is atul.cancerian@gmail.com

    Thanks

    Atul

  2. Please use the forum to ask these kinds of questions: http://forums.pentaho.com

  3. I would like to see an example of how to set this up through the Spoon UI, as we need to use master/slave for certain queries that are performed by the ETLs.