Example for Data Profiling (DataCleaner) with Kettle

Data profiling with DataCleaner is fully integrated into Pentaho Kettle / PDI, so you can profile your data directly within Spoon.

After the progress information, you get the results for your Number and String Analyzers. For the Number Analyzer, this sample shows null values:

When you click on details of the 8 null rows, you get the following information:

For the String Analyzer, you get the following results:

Look at the diacritic characters and see the details:

Depending on your target database or file character sets, you may need to change these special characters. This can be done within the Kettle transformation.
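
As a rough illustration of what "changing these special characters" can mean, the sketch below strips diacritical marks by decomposing each character and dropping the combining marks. It is written in plain Python purely to show the technique; the function name strip_diacritics is hypothetical, and how you wire equivalent logic into a Kettle step is not covered here.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with combining diacritical marks removed.

    Hypothetical helper for illustration only, not part of Kettle or DataCleaner.
    """
    # NFD splits a character such as "é" into "e" plus a combining accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything except the combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_diacritics("Café Müller"))
```

Characters that have no decomposed form, such as "ß" or "ø", pass through unchanged with this approach, so they would need an explicit replacement table if the target character set cannot hold them.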


Q: I want to avoid exporting all the data of the transformation to DataCleaner. How can I profile a sample set of my data?
A: You may use the Reservoir Sampling step (see Statistics category) within Kettle.
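
For readers unfamiliar with the idea, reservoir sampling draws a fixed-size uniform random sample from a stream of rows without knowing the row count in advance. The following is a minimal sketch of the classic technique (often called Algorithm R) in Python; it is not the Kettle step's actual implementation, and the names reservoir_sample, k and seed are assumptions made for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k rows chosen uniformly at random from rows (illustrative sketch)."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Pick a slot in 0..i (inclusive); the row is kept only if the slot is below k.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir


if __name__ == "__main__":
    sample = reservoir_sample(range(10_000), k=100, seed=42)
    print(len(sample))
```

Only the sampled rows then need to be handed to the profiler, which keeps memory use proportional to the sample size rather than to the full data set.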

Q: Where can I get support?

A: The DataCleaner integration is community supported; both Pentaho and Human Inference invested in it. See the Kettle and DataCleaner forums for community support.
Pentaho Customer Support responds to all questions directly associated with the Pentaho products. Full production support (severity levels one through four) is not provided for the Data Profiling capabilities supplied by DataCleaner, although general questions are welcome. For further information, contact Pentaho Support.
DataCleaner support, including the Pentaho integration, is provided by Human Inference according to the support and maintenance matrix.

Q: Where can I find more information?
A: Further information about DataCleaner can be found here: http://datacleaner.eobjects.org/