Hitachi Vantara Pentaho Community Wiki

This is a Pentaho Labs project

Pentaho Labs projects are NOT supported products and, in most cases, are not even on the Pentaho roadmap. The purpose of Labs is to do some preliminary work and get feedback from the Pentaho ecosystem on whether there is sufficient interest to continue development or to put the project on the official roadmap. This fulfills the open source "early and often" mantra.

An experimental environment for executing a Kettle transformation as a Storm topology.

Project Info

  • Status: Early Prototype, proof of concept
  • Roadmap: Not on any roadmap, not committed
  • Availability: Open Source - GitHub - Download the preview
  • Contact: dmoran or use "Add Comment" at bottom of page
  • JIRA: none

Kettle for Storm empowers Pentaho ETL developers to process big data in real time by running their existing visual Pentaho ETL transformations across a cluster of machines with Storm. It decomposes the transformation into a topology and wraps each step in either a Storm Spout or a Bolt. The topology is then submitted to the cluster and runs continuously, or until all inputs report end of data.
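The decomposition described above can be pictured with a small, self-contained sketch. It does not use the Storm or Kettle APIs, and it is not the project's actual implementation. The class `TopologySketch`, the `Hop` type, the `assignRoles` method, and the step names are all hypothetical. The sketch also assumes that a step with no incoming hop (a data source) would be wrapped as a Spout and every other step as a Bolt; this page only says that each step becomes one or the other.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TopologySketch {
    /** Role a transformation step might play in the resulting topology. */
    public enum Role { SPOUT, BOLT }

    /** A hop connects the output of one step to the input of another (hypothetical type). */
    public static final class Hop {
        final String from;
        final String to;

        public Hop(String from, String to) {
            this.from = from;
            this.to = to;
        }
    }

    /**
     * Assumption for illustration: steps that receive no rows from another
     * step are data sources (spouts); all other steps are bolts.
     */
    public static Map<String, Role> assignRoles(List<String> steps, List<Hop> hops) {
        // Collect every step that is the target of at least one hop.
        Set<String> hasInput = new LinkedHashSet<>();
        for (Hop hop : hops) {
            hasInput.add(hop.to);
        }
        // Preserve the declared step order in the result.
        Map<String, Role> roles = new LinkedHashMap<>();
        for (String step : steps) {
            roles.put(step, hasInput.contains(step) ? Role.BOLT : Role.SPOUT);
        }
        return roles;
    }

    public static void main(String[] args) {
        // Hypothetical three-step transformation: source -> transform -> sink.
        List<String> steps = Arrays.asList("Read events", "Add constants", "Write output");
        List<Hop> hops = Arrays.asList(
                new Hop("Read events", "Add constants"),
                new Hop("Add constants", "Write output"));
        assignRoles(steps, hops).forEach(
                (step, role) -> System.out.println(step + " -> " + role));
    }
}
```

In a real topology each role would correspond to a Storm component wired together in the same order as the hops; the sketch only shows how the roles might be derived from the transformation's shape.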

Closing the gap between batch and real time

Pentaho has led the big data ETL space for three years by providing ETL developers a visual environment to design, test, and execute ETL that leverages the power of MapReduce. Now, with Kettle for Storm, that same ETL developer is immediately productive with one of the most popular distributed stream processing systems today: Storm. Existing Pentaho ETL transformations, including those used in Pentaho MapReduce, can be executed as real-time processes via Storm. This combination allows an ETL developer to provide data to business users when they need it most, without the delay of batch processing or the overhead of designing additional transformations.

Process data as it arrives

Pentaho ETL begins processing data as it arrives from the source and immediately produces the valuable data sets your business depends on. Get up-to-the-second insight into your key business metrics by reacting when data arrives and delivering real-time dashboards, reports, or intermediate data sets to be used by your existing applications.

Hybrid workflows

Many of our customers have long-running batch Pentaho ETL jobs that run within Hadoop via MapReduce. Kettle for Storm complements these by allowing developers to reuse existing transformations to process data immediately. Both batch and real-time workflows are powered by Pentaho ETL, empowering existing developers to build upon years of knowledge to learn the most from their data, instantly.

Leverage existing Kettle ETL

Kettle for Storm allows Pentaho ETL developers to reuse their knowledge and beloved Kettle components to process data differently. Deliver data when it's needed, all with a familiar tool set. Looking for additional tools for your Pentaho ETL tool kit? Check out the Kettle Marketplace!

Next steps

Today, Kettle for Storm can process many of your existing transformations, but it wouldn't be in Pentaho Labs if it were complete. We're continuing to build out support for the rest of the Kettle ecosystem, including:

  • Steps that do not emit at least one message for every input:
      • Sampling
      • Aggregation
      • Sorting
      • Filtering
  • First-class Spoon support
  • Repository-based transformations
  • Error handling
  • Conditional hops
  • Multiple end steps
  • Sub-transformations
  • Metrics: Kettle timing, throughput, logging

Try it out!

Instructions and code are available on GitHub.