Hitachi Vantara Pentaho Community Wiki
Amazon EMR Job Executor


Note: This documentation applies to Pentaho 8.0 and earlier. For Pentaho 8.1 and later, see Amazon EMR Job Executor on the Pentaho Enterprise Edition documentation site.

This job entry executes Hadoop jobs on an Amazon Elastic MapReduce (EMR) account. To use this job entry, you must have an Amazon Web Services (AWS) account configured for EMR, and a pre-built Java JAR that controls the remote job.

Name
  The name of this Amazon EMR Job Executor job entry instance.

EMR Job Flow Name
  The name of the Amazon EMR job flow (series of steps) you are executing.

Existing Job Flow ID
  The ID of an existing job flow to reuse. This field is optional.

AWS Access Key
  Your Amazon Web Services access key.

AWS Secret Key
  Your Amazon Web Services secret key.

S3 Staging Directory
  The Amazon Simple Storage Service (S3) address of the working directory for this Hadoop job. This directory holds the MapReduce JAR, and log files are written here as they are created.

MapReduce JAR
  The Java JAR that contains your Hadoop mapper and reducer classes. The job must be configured and submitted using a static main method in any class in the JAR.

Command line arguments
  Any command line arguments to pass to the static main method in the specified JAR.

Number of Instances
  The number of Amazon Elastic Compute Cloud (EC2) instances to assign to this job.

Master Instance Type
  The Amazon EC2 instance type that acts as the Hadoop "master" in the cluster and handles MapReduce task distribution.

Slave Instance Type
  The Amazon EC2 instance type that acts as one or more Hadoop "slaves" in the cluster. Slaves are assigned tasks by the master. This option is only valid when the number of instances is greater than 1.

Enable Blocking
  Forces the job entry to wait until each step completes before continuing to the next. This is the only way for PDI to be aware of the Hadoop job's status. If left unchecked, the Hadoop job is blindly executed and PDI moves on to the next job entry; error handling/routing will not work unless this option is checked.

Logging Interval
  The number of seconds between log messages.
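The MapReduce JAR field and the command line arguments work together: EMR invokes a static main method in the JAR and passes it the arguments entered in the job entry. The sketch below illustrates that entry-point contract only; the class name and argument layout are illustrative assumptions, not part of PDI or the Hadoop API.

```java
// Illustrative entry point for a MapReduce JAR submitted by this job entry.
// EMR calls the static main method with the command line arguments from the
// "Command line arguments" field (here assumed to be input and output paths).
public class EmrJobDriver {

    // Validate and summarize the arguments. In a real driver, this is the
    // point where an org.apache.hadoop.mapreduce.Job would be configured
    // and submitted (setJarByClass, setMapperClass, setReducerClass, ...).
    public static String describe(String[] args) {
        if (args.length < 2) {
            throw new IllegalArgumentException(
                "usage: EmrJobDriver <input path> <output path>");
        }
        return "input=" + args[0] + " output=" + args[1];
    }

    public static void main(String[] args) {
        System.out.println(describe(args));
    }
}
```

With "Enable Blocking" checked, PDI waits for the submitted job to finish and can route on its success or failure; otherwise it fires the job and moves on without knowing the outcome.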