
Kettle for Storm


An experimental environment for executing a Kettle transformation as a Storm topology.

Project Info

Kettle for Storm lets Pentaho ETL developers process big data in real time by running their existing visual Pentaho ETL transformations across a cluster of machines with Storm. It decomposes a transformation into a topology, wrapping each step in either a Storm spout or a bolt. The topology is then submitted to the cluster, where it runs continuously, or until all inputs report end of data.
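To make that decomposition concrete, here is a minimal, self-contained sketch of the resulting shape in the plain Storm Java API (Storm 0.9.x, backtype.storm packages). The RowSpout and UppercaseBolt classes below are illustrative stand-ins for the generated wrappers, not the project's actual classes.

import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class KettleStyleTopology {

  // Stand-in for a Kettle input step: emits one tuple per row of input.
  public static class RowSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private int row = 0;

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
      this.collector = collector;
    }

    public void nextTuple() {
      // A real input step would read from its source (file, queue, ...).
      collector.emit(new Values("row-" + row++));
      Utils.sleep(100);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("line"));
    }
  }

  // Stand-in for a downstream Kettle step: transforms each row and re-emits it.
  public static class UppercaseBolt extends BaseRichBolt {
    private OutputCollector collector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
      this.collector = collector;
    }

    public void execute(Tuple input) {
      // Anchor the emit to the input tuple and ack it so Storm can track progress.
      collector.emit(input, new Values(input.getStringByField("line").toUpperCase()));
      collector.ack(input);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("line"));
    }
  }

  public static void main(String[] args) {
    // One spout per input step, one bolt per downstream step, wired along the hops.
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("input-step", new RowSpout(), 1);
    builder.setBolt("transform-step", new UppercaseBolt(), 2).shuffleGrouping("input-step");

    // Runs in-process for demonstration; a cluster deployment would use
    // StormSubmitter.submitTopology(...) instead.
    new LocalCluster().submitTopology("kettle-style", new Config(), builder.createTopology());
  }
}

The engine derives this wiring automatically from the transformation's metadata; the sketch only shows the shape of the result.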

Video: https://www.youtube.com/watch?v=RPoWdIWWPkc

Closing the gap between batch and real time

Pentaho has led the big data ETL space for 3 years by providing ETL developers a visual environment to design, test, and execute ETL that leverages the power of MapReduce. Now, with Kettle for Storm, that same ETL developer is immediately productive with one of the most popular distributed stream processing systems today: Storm. Any existing Pentaho ETL transformation can be executed as a real-time process via Storm, including those used in Pentaho MapReduce. This powerful combination allows an ETL developer to deliver data to business users when they need it most, without the delay of batch processing or the overhead of designing additional transformations.

Process data as it arrives

Pentaho ETL begins processing data as it arrives from the source and immediately produces the valuable data sets your business depends on. Get up-to-the-second insight into your key business metrics by reacting as data arrives and delivering real-time dashboards, reports, or intermediate data sets for use by your existing applications.

Hybrid workflows

Many of our customers have long-running batch Pentaho ETL jobs that run within Hadoop via MapReduce. Kettle for Storm complements these by allowing developers to reuse existing transformations to process data immediately. Both batch and real-time workflows are powered by Pentaho ETL, empowering existing developers to build on years of knowledge to get the most from their data, instantly.

Leverage existing Kettle ETL

Kettle for Storm allows Pentaho ETL developers to reuse their knowledge and beloved Kettle components to process data in new ways. Deliver data when it's needed, all with a familiar tool set.
Looking for additional tools for your Pentaho ETL tool kit? Check out the Kettle Marketplace: http://wiki.pentaho.com/display/EAI/Marketplace

Next steps

Today, Kettle for Storm can process many of your existing transformations, but it wouldn't be in Pentaho Labs if it were complete. We're continuing to build out support for the entire Kettle ecosystem of steps. Stay tuned while we complete the implementation.
Upcoming features:

  1. Spoon integration
  2. Support for aggregations, sorting, filtering, sampling
  3. Support for executing an entire transformation as a component within an
    existing Storm topology

Not yet supported:

  • Steps that do not emit at least one message for every input: sampling, aggregation, sorting, filtering (see the sketch after this list)
  • First-class Spoon support
  • Repository-based transformations
  • Error handling
  • Conditional hops
  • Multiple end steps
  • Sub-transformations
  • Metrics: Kettle timing, throughput, logging
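The first group is awkward to wrap generically because Storm tracks every input tuple until it is acknowledged, and a step that drops, buffers, or aggregates rows emits nothing for some of its inputs. A minimal sketch of how a filter-style bolt handles this in plain Storm (the FilterBolt class and "line" field are illustrative assumptions, not project code):

import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// A filter-style bolt may emit zero tuples for a given input, but it must still
// ack every input so Storm's message tracking does not time the tuple out.
public class FilterBolt extends BaseRichBolt {
  private OutputCollector collector;

  public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
    this.collector = collector;
  }

  public void execute(Tuple input) {
    String line = input.getStringByField("line");
    if (line.startsWith("ERROR")) {
      // Anchor to the input so a downstream failure replays it from the spout.
      collector.emit(input, new Values(line));
    }
    collector.ack(input); // ack unconditionally, even when nothing was emitted
  }

  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("line"));
  }
}

Aggregation and sorting are harder still, since they must hold rows across many inputs before emitting anything, which is presumably why they appear on the roadmap above.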

Try it out!

Instructions and code are available on GitHub. Download the preview build from our CI environment.

HortonWorks Sandbox VM Quick Start: takes an existing plain-vanilla Sandbox VM and adds Storm and Kettle for Storm.

https://github.com/deinspanjer/hw-sandbox-storm-provision
