Hitachi Vantara Pentaho Community Wiki

Status: This page collects all kinds of information around MDM with Kettle and will be extended over time with more details and sample solutions.

Introduction

Master Data Management (MDM) is a set of processes and tools for maintaining a single set of master data across different operational systems. These tools and processes are often used for Customer Data Integration (CDI) and, in particular, Product Data Integration (PDI).

Kettle supports the following MDM concepts and processes. There are many definitions of MDM, so we selected the most common terms. If you have needs that are not covered here, please let us know.

Profiling

Kettle has built-in profiling capabilities as well as more advanced features through the DataCleaner plug-in; see Kettle Data Profiling with DataCleaner.
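To illustrate what a profiling run typically collects, here is a minimal Python sketch (not Kettle code) that computes a few of the usual column statistics — row count, null count, distinct values and value frequencies — for a single column:

```python
from collections import Counter

def profile_column(values):
    """Compute basic profiling statistics for one column of data."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(counts),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "most_common": counts.most_common(3),
    }

stats = profile_column(["DE", "US", "DE", None, "FR", "DE"])
print(stats["distinct_count"])  # 3
print(stats["null_count"])      # 1
```

Real profiling tools such as DataCleaner add pattern analysis, data-type detection and much more on top of these basics.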

Data Quality

Data quality can be achieved through plug-ins from our partners Human Inference and Melissa Data; see Data Quality Integration Home. Depending on the use case, existing steps can also be used.

Data Cleansing, Validation, Harmonization, Standardization, Data Consolidation (Deduplication, Enrichment)

This can be achieved by many steps, e.g.

  • Validation step
  • Fuzzy match step
  • The Calculator step offers many functions to clean, harmonize and standardize data, e.g. First letter of each word of a string in capital, Upper case, Lower case, Return only digits, Remove digits and many more
  • The Calculator step also offers many functions to match similar strings for deduplication, e.g. Levenshtein distance, Metaphone of A (Phonetics), Double metaphone of A, SoundEx, Damerau-Levenshtein distance, Needleman-Wunsch distance, Jaro similitude
  • Different Join and Lookup steps for reference or lookup tables
  • Scripting steps
  • Business Logic steps by the Rule Executor and Rule Accumulator steps (experimental)
  • Data Validator step
  • Filter step
  • and more depending on the use case
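As a rough illustration of what some of these functions do (a Python sketch, not the Calculator step's actual implementation), the following shows word capitalization, digit extraction and the classic Levenshtein edit distance used for fuzzy matching:

```python
import re

def initcap(s):
    # "First letter of each word of a string in capital"
    return " ".join(w.capitalize() for w in s.split())

def only_digits(s):
    # "Return only digits" — strip everything that is not 0-9
    return re.sub(r"\D", "", s)

def levenshtein(a, b):
    # classic dynamic-programming edit distance; the smaller the
    # result, the more similar the two strings are
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

print(initcap("acme CORP"))           # Acme Corp
print(only_digits("+49 (0)30-1234"))  # 490301234
print(levenshtein("meier", "meyer"))  # 1
```

For deduplication, a transformation would typically compare candidate pairs with such a distance function and treat pairs below a chosen threshold as potential duplicates.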

The so-called step error handling can be used to handle special error conditions on many steps and route the failing data rows to a different stream, e.g. for further processing.
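The routing idea behind step error handling can be sketched as follows (a simplified Python analogy, not Kettle's internal mechanism): rows that fail a transformation are diverted, together with an error description, into a separate error stream instead of aborting the whole run:

```python
def process_rows(rows, transform):
    """Apply `transform` to each row; route failing rows to an
    error stream instead of stopping, mimicking the idea behind
    Kettle's step error handling."""
    main_stream, error_stream = [], []
    for row in rows:
        try:
            main_stream.append(transform(row))
        except Exception as exc:
            # failed rows carry the error description for further processing
            error_stream.append({**row, "error": str(exc)})
    return main_stream, error_stream

rows = [{"age": "42"}, {"age": "n/a"}]
ok, failed = process_rows(rows, lambda r: {**r, "age": int(r["age"])})
print(len(ok), len(failed))  # 1 1
```

In Kettle the error stream is just another hop, so the failing rows can be logged, corrected or re-queued by downstream steps.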

Tracing master data changes and supporting changing hierarchies

The concept of Slowly Changing Dimensions is very well suited to tracing master data changes and supporting changing hierarchies. Kettle supports Slowly Changing Dimensions through the Dimension lookup/update step.

Depending on the use case, other concepts can be used as well.
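To make the Slowly Changing Dimensions idea concrete, here is a minimal Python sketch of a Type 2 update (the history-preserving variant the Dimension lookup/update step supports; the field names `valid_from`/`valid_to` are illustrative, not the step's actual column names): the current version of a record is closed and a new version is inserted, so the full change history remains queryable:

```python
from datetime import date

def scd2_update(dimension, key, new_attrs, today):
    """Apply a Type 2 Slowly Changing Dimension update: close the
    current version of the record and insert a new one, preserving
    the full history of master data changes."""
    current = next((r for r in dimension
                    if r["key"] == key and r["valid_to"] is None), None)
    if current is not None:
        if all(current.get(k) == v for k, v in new_attrs.items()):
            return  # nothing changed, keep the current version
        current["valid_to"] = today  # close the old version
    dimension.append({"key": key, **new_attrs,
                      "valid_from": today, "valid_to": None})

dim = []
scd2_update(dim, 100, {"region": "EMEA"}, date(2020, 1, 1))
scd2_update(dim, 100, {"region": "APAC"}, date(2021, 6, 1))
print(len(dim))            # 2
print(dim[0]["valid_to"])  # 2021-06-01
```

A Type 1 update, by contrast, would simply overwrite `region` in place and lose the history.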

Transformations and Mapping

The transformation and mapping of data is one of the core capabilities of Kettle and can be achieved by many steps.
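A simple field mapping — renaming source fields and converting their types, in the spirit of what a mapping or Select values step does — can be sketched like this (a Python analogy with hypothetical field names, not Kettle's API):

```python
def map_fields(row, mapping):
    """Rename and convert fields of a row.
    mapping: {target_field: (source_field, converter)}"""
    return {tgt: conv(row[src]) for tgt, (src, conv) in mapping.items()}

row = {"CUST_NO": "007", "NAME": "smith"}
print(map_fields(row, {
    "customer_id": ("CUST_NO", int),
    "last_name": ("NAME", str.title),
}))  # {'customer_id': 7, 'last_name': 'Smith'}
```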

Scalable

Kettle is scalable thanks to its own cluster node technology using Carte servers as well as its seamless integration with NoSQL, Big Data and Hadoop technologies.

Data Distribution, Collaboration, Workflow and Enterprise Application Integration

There is no predefined out-of-the-box user interface to edit data or processes, but it is possible to create custom user interfaces that run in the existing IT infrastructure or within the Pentaho BA Server.

There is also no predefined data store for master data, because we believe this is better defined individually for each project's needs, and in most cases such a store already exists.

Due to its many built-in integration options, its pluggable architecture and its APIs, Kettle fits very well into existing IT landscapes. For instance, it can be integrated through message queues (JMS), Enterprise Service Buses (JBoss ESB, Mule ESB and others), web services, e-mail, FTP and many more. Kettle can act as a consumer (actively working as a client to manage the data flow) as well as a service provider via the server solutions.
