Hitachi Vantara Pentaho Community Wiki

Pentaho Big Data Plugin

The Pentaho Big Data Plugin Project provides support for an ever-expanding Big Data community within the Pentaho ecosystem. It is a plugin for the Pentaho Kettle engine that can be used within Pentaho Data Integration (Kettle), Pentaho Reporting, and the Pentaho BI Platform.

Pentaho Big Data Plugin Features

This project contains the implementations for connecting to or performing the following:

  • Pentaho MapReduce: visually design MapReduce jobs as Kettle transformations
  • HDFS File Operations: read/write directly from any Kettle step, made possible by the ubiquitous use of Apache VFS throughout Kettle
  • Data Sources
      • JDBC connectivity
          • Apache Hive
      • Native RPC connectivity for reading/writing
          • Apache HBase
          • Cassandra
          • MongoDB
          • CouchDB
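Because Kettle resolves file locations through Apache VFS, an HDFS location can be supplied anywhere a file name is accepted once the plugin is installed. A minimal sketch of the URL form — the namenode host, port, and path below are hypothetical placeholders, not real defaults:

```shell
# Hypothetical HDFS VFS URL; the hdfs:// scheme is contributed by the plugin.
# Host, port, and path are placeholders for your own cluster's values.
HDFS_URL="hdfs://namenode.example.com:9000/user/pentaho/input.txt"
echo "$HDFS_URL"
```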

Key Links

  • Git Repository: https://github.com/pentaho/big-data-plugin
  • Link to Kettle plugin development

Community and where to find help

The Big Data Forum exists for both users and developers. The community also manages the ##pentaho IRC channel on irc.freenode.net.

Quick Start: Building the project

The Pentaho Big Data Plugin was historically built with Apache Ant, using Apache Ivy for dependency management; all you needed was Ant 1.8.0 or newer, and the build scripts would download Ivy automatically. The project is now a Maven project. Please refer to the project README for current build information.
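Under the assumption of a standard Maven layout, a typical checkout-and-build sequence might look like the following sketch; the exact Maven goals are an assumption, and the project README remains authoritative:

```shell
# Sketch only: clone the repository named above and build with Maven.
git clone https://github.com/pentaho/big-data-plugin.git
cd big-data-plugin
mvn clean package
```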

Debugging

We recommend providing unit tests where possible and debugging your code through them.

Remote Debugging

If you want to see your code executing within Spoon, we recommend remote debugging. This approach can also be used with Pan, Kitchen, or the BA/DI Server. The workflow is as follows:

1. Download/Checkout Kettle (currently at 4.4.0-SNAPSHOT)
  1. CI Build: http://ci.pentaho.com/job/Kettle-4.4/
  2. SVN Source: svn://source.pentaho.org/svnkettleroot/Kettle/branches/4.4.0
2. Configure the big data plugin's kettle.dist.dir property via override.properties:
  1. Create override.properties in the root of the big-data-plugin project. This file is a local override for any properties defined in build.properties.
  2. Add the kettle.dist.dir property and point it to your Kettle install dir, depending on whether you're using the CI download or building from source:
    1. CI Download: kettle.dist.dir=../data-integration
    2. Building from source: kettle.dist.dir=../Kettle/distrib (Note: you must build Kettle with `ant distrib` before you can launch it from source. This builds Kettle into Kettle/distrib. For more information see PDI Developer Information.)
3. Build and "install" the plugin into Kettle with Ant: ant resolve install-plugin (you can drop the resolve after the first build unless the dependencies change)
4. Configure the script you use to start Spoon by adding -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005 to the JVM arguments:
  1. spoon.sh or Spoon.bat: update the OPT variable (around line 158) to include the above JVM arguments
  2. Mac OS X (Data Integration 64-bit/Contents/Info.plist): update the VMOptions property and append the above JVM arguments
5. Start Spoon
6. Connect to the JVM with Eclipse's Remote Java Application debug configuration, using the socket attach method and port 5005 (as configured above)

Developing with Eclipse

We recommend Apache IvyDE to manage your Ivy dependencies within Eclipse.

1. Import pentaho-big-data-plugin into Eclipse
2. Resolve the project using IvyDE

If IvyDE is not an option, you can manually add the jars from lib/ and libswt/ to your classpath. This project, like all other Pentaho projects, uses the open-source Subfloor Ant build framework. Running the following targets will configure the Eclipse project to reference the required libraries:

Code Block
ant resolve create-dot-classpath

Then import or refresh the project in Eclipse and add the SWT libraries for your architecture (e.g. for Mac OS X x64).
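The override.properties and remote-debugging configuration described above can be sketched in a POSIX shell. The /tmp directory stands in for your actual big-data-plugin checkout, and -Xmx512m stands in for whatever options OPT already carries; both are illustrative assumptions:

```shell
# Sketch: create override.properties pointing kettle.dist.dir at a CI download.
# /tmp/big-data-plugin is a stand-in for your real big-data-plugin checkout.
mkdir -p /tmp/big-data-plugin
printf 'kettle.dist.dir=../data-integration\n' > /tmp/big-data-plugin/override.properties
cat /tmp/big-data-plugin/override.properties

# Sketch: the JVM arguments the Spoon launch script needs for remote debugging.
# "-Xmx512m" is a placeholder for the options the OPT variable already holds.
DEBUG_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005"
OPT="-Xmx512m $DEBUG_OPTS"
echo "$OPT"
```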


Contributing Changes

We use the Fork + Pull Model to manage community contributions. Please fork the repository and submit a pull request with your changes.

Here's a sample Git workflow to get you started:

1. Install Git
2. Set up Git to auto-correct line endings:
  Code Block
  git config --global core.autocrlf input
3. Create a GitHub account
4. Fork the project from https://github.com/pentaho/big-data-plugin
5. Clone your repository:
  Code Block
  git clone git@github.com:USERNAME/big-data-plugin.git
6. Hack away!
7. Stage and commit your changes. Please make sure your commit messages include the JIRA case for your changes, in the format: [JIRA-CASE] Short description of fixes.
  Code Block
  git add . && git commit
8. Push your changes back up to GitHub:
  Code Block
  git push
9. Submit a pull request from your project page. Please include a brief summary of what you changed and why.
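The line-ending setting and commit-message convention above can be exercised in a throwaway repository; the JIRA case, file name, and identity below are hypothetical:

```shell
# Throwaway repo to illustrate the [JIRA-CASE] commit message format.
rm -rf /tmp/bdp-demo && mkdir -p /tmp/bdp-demo && cd /tmp/bdp-demo
git init -q
git config user.email "you@example.com"   # local identity for this demo only
git config user.name "Your Name"
git config core.autocrlf input            # same setting as step 2, repo-local here
echo "fix" > fix.txt
git add . && git commit -q -m "[PDI-1234] Short description of fixes"
git log -1 --format=%s
```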

Git Resources

Here's a short list of resources to help you learn and master Git:

Documentation

Kettle Plugin Development

Getting started with the Pentaho Data Integration Java API

Step Documentation

Job Entry Documentation

Hadoop Configuration

Community Plugins

Here's a list of known community plugins that fall into the "big data" category:

  • Voldemort Lookup
  • HPCC Systems ECL Plugins