Hitachi Vantara Pentaho Community Wiki

Pentaho Big Data Plugin

The Pentaho Big Data Plugin Project provides support for an ever-expanding Big Data community within the Pentaho ecosystem. It is a plugin for the Pentaho Kettle engine which can be used within Pentaho Data Integration (Kettle), Pentaho Reporting, and the Pentaho BI Platform.

Pentaho Big Data Plugin Features

This project contains the implementations for connecting to or performing the following:

  • Pentaho MapReduce: visually design MapReduce jobs as Kettle transformations
  • HDFS File Operations: read/write directly from any Kettle step, made possible by the ubiquitous use of Apache VFS throughout Kettle
  • Data Sources
    • JDBC connectivity
      • Apache Hive
    • Native RPC connectivity for reading/writing
      • Apache HBase
      • Cassandra
      • MongoDB
      • CouchDB

Key Links

...

  • CI:

...

...

...

Community and where to find help

The Big Data Forum exists for both users and developers. The community also manages the ##pentaho IRC channel on irc.freenode.net.

Quick Start: Building the project

The Pentaho Big Data Plugin is now a Maven project. Please refer to the project readme for build information.

Debugging

We recommend providing unit tests where possible and debugging your code through them.

Remote Debugging

If you want to see your code executing within Spoon we recommend remote debugging. This approach can be used with Pan, Kitchen, or the BA/DI Server as well. The workflow is as follows:

  1. Download/Checkout Kettle (currently at 4.4.0-SNAPSHOT)
    1. CI Build: http://ci.pentaho.com/job/Kettle-4.4/
    2. SVN Source: svn://source.pentaho.org/svnkettleroot/Kettle/branches/4.4.0
  2. Configure the big data plugin's kettle.dist.dir property via override.properties:
    1. Create override.properties in the root of the big-data-plugin. This file is a local override for any properties defined in build.properties.
    2. Add the property kettle.dist.dir and point it to your Kettle install dir, depending on whether you're using the CI download or building from source:
      1. CI Download: kettle.dist.dir=../data-integration
      2. Building from source: kettle.dist.dir=../Kettle/distrib (Note: You must build Kettle with `ant distrib` before being able to launch it when using the source. This will build Kettle into Kettle/distrib. For more information see PDI Developer Information.)
  3. Build and "install" the plugin into Kettle with ant: ant resolve install-plugin (you can drop the resolve after the first build unless the dependencies change)
  4. Launch Kettle with remote debugging and attach Eclipse to the process
    1. Configure the script you're using to start Spoon (Mac OS X uses Data Integration 64-bit/Contents/Info.plist):
      1. Add -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005 to the JVM arguments
        1. spoon.sh or Spoon.bat: Update line 158 and add the above JVM arguments to the OPT variable
        2. Data Integration 64-bit/Contents/Info.plist: Update the VMOptions property and append the above JVM arguments
  5. Start Spoon
  6. Connect to the JVM with Eclipse's Remote Java Application debug configuration, using the socket attach method and port 5005 (as configured above)
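
The configuration and build steps above could be scripted roughly as follows. This is only a sketch under stated assumptions: the names DEBUG_OPTS, write_override, and install_plugin are hypothetical helpers invented for illustration, and the directory arguments depend on where you checked out Kettle and the big-data-plugin. Only the kettle.dist.dir property, the ant targets, and the JVM arguments come from the steps above.

```shell
#!/bin/sh
# Sketch of the remote-debugging setup described in the steps above.
# Helper and variable names here are hypothetical, not part of the project.

# JVM arguments from the steps above; the debugger attaches on port 5005.
DEBUG_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005"

# Write override.properties into the big-data-plugin root ($1), pointing
# kettle.dist.dir at the Kettle install dir ($2).
# Note: this overwrites any existing override.properties in that directory.
write_override() {
  printf 'kettle.dist.dir=%s\n' "$2" > "$1/override.properties"
}

# Build the plugin and "install" it into Kettle.
# Run from the big-data-plugin root; requires ant on the PATH.
install_plugin() {
  ant resolve install-plugin
}
```

For a CI download, a call might look like `write_override . ../data-integration && install_plugin`, run from the big-data-plugin root. The value of DEBUG_OPTS still has to be added by hand to the OPT variable in spoon.sh or Spoon.bat (or to VMOptions in Info.plist) as described above.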


Contributing Changes

We use the Fork + Pull Model to manage community contributions. Please fork the repository and submit a pull request with your changes.

Here's a sample git workflow to get you started:

  1. Install Git
  2. Set up Git to auto-correct line endings:
    git config --global core.autocrlf input
  3. Create a GitHub account
  4. Fork the project from https://github.com/pentaho/big-data-plugin
  5. Clone your repository:
    git clone git@github.com:USERNAME/big-data-plugin.git
  6. * Hack away *
  7. Stage and commit changes. Please make sure your commit messages include the JIRA case for your changes, in the format "[JIRA-CASE] Short description of fixes.":
    git add . && git commit
  8. Push changes back up to GitHub:
    git push
  9. Submit a pull request from your project page. Please include a brief summary of what you changed and why.
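
To catch a missing JIRA case before committing, a small local check along these lines may help. It is a hypothetical helper, not something the project ships: the function name check_commit_message is invented here, and the pattern assumes case keys of the form PROJECT-123, which may not match every JIRA project.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the first line of a commit message
# starts with a bracketed JIRA case followed by a short description,
# e.g. "[JIRA-123] Short description of fixes."
# Assumption: case keys look like PROJECT-123; adjust the pattern if yours differ.
check_commit_message() {
  printf '%s\n' "$1" | head -n 1 | grep -Eq '^\[[A-Z][A-Z0-9]*-[0-9]+\] .+'
}
```

One possible use is `check_commit_message "$msg" && git commit -m "$msg"`, where `msg` holds your message. The same test could also be placed in a local commit-msg hook if you prefer it to run automatically.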

Git Resources

Here's a short list of resources to help you learn and master Git:

...

...

...

...

Documentation

Kettle Plugin Development

Getting started with the Pentaho Data Integration Java API

Step Documentation

Job Entry Documentation

Hadoop Configuration

Community Plugins

Here's a list of known community plugins that fall into the "big data" category:

Voldemort Lookup
HPCC Systems ECL Plugins