Human Inference was founded in 1986 and has been seen as a visionary DQ vendor by Gartner for years. Human Inference delivers Data Profiling, Data Cleansing, Duplicate Detection, CDD and MDM for customer contact data to a large group of customers. In cooperation with Pentaho, Human Inference now delivers a complete Data Quality stack for Pentaho. Human Inference offers these services via its www.easydq.com platform.

Register for the Webinar, Better Data for Better Analytics, led by Human Inference and Pentaho on Thursday, May 10th 2012 at http://pentaho.com/human-inference-webinar

Data Profiling (DataCleaner)

Data Profiling (DataCleaner) is fully integrated within Pentaho Kettle / PDI and you can profile your data directly within Spoon.

  • PDI/Kettle Version 4.x: Download the plug-in from https://pentaho.box.com/s/rslfm6ksa57evaxpwraq and unzip it into your data-integration\plugins\spoon folder.
  • PDI/Kettle Version 5.x and higher: Select Help / Marketplace, choose the appropriate DataCleaner Data Profiling plugin and press the Install button.
  • After starting Spoon:
    • Right-click the step whose data you want to profile and select Profile from the context menu.
    • Execute the Transformation with Launch and wait until DataCleaner starts up.
  • Further information about DataCleaner can be found here: http://datacleaner.eobjects.org/

Data Cleansing

Data Cleansing is integrated as regular PDI steps in Kettle. It supports the following contact cleansing functionality:

Address cleansing: Use the EasyDQ Address cleansing function to validate, standardize and correct your address information for more than 240 countries. Our address service provides immediate ROI by making sure that your mailings and customer registers are correct and up to date. Furthermore, it can enrich your data by providing more details about the addresses than you already had.

Name cleansing is all about making sure that the names you have are correct. With a database of billions of name parts, the EasyDQ Name service is able to check whether names are correct, whether they look suspicious (like Mickey Mouse) and which part of a name is which. Furthermore, we can enrich your name data by suggesting the most plausible gender for a name and providing regional information about the likelihood of a particular name.

Phone number checking, parsing and formatting is provided out of the box. We deliver a Phone service which allows you to make sure that the numbers you have match the countries your contacts live in. You can also enrich your data with more information about the phone numbers, such as line type (mobile, fax, etc.) and correctly formatted country and regional prefix codes.

Email cleansing: determining whether an email address is correctly formatted and whether it really exists are two different things. The first is quite easy, but checking whether an address is real requires knowledge. Our Email service has knowledge about emailing domains and about plausible addressees within those domains. Use our service to get rid of spam email addresses and to correct common misspellings in email addresses (such as hotmall.com instead of hotmail.com).
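
To illustrate the difference between the "easy" format check and a knowledge-based correction, here is a minimal Python sketch. It is not the EasyDQ Email service and does not call it; the regular expression is a deliberate simplification, and the `KNOWN_DOMAINS` list, the function names and the `cutoff` value are assumptions made up for this example. It cannot tell whether a mailbox really exists.

```python
import difflib
import re

# Simplified format check; an assumption for this sketch, not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

# Hypothetical list of well-known mail domains used to spot misspellings.
KNOWN_DOMAINS = ["gmail.com", "hotmail.com", "yahoo.com", "outlook.com"]


def is_well_formed(email):
    """Return True if the address matches the simplified pattern above."""
    return EMAIL_RE.match(email) is not None


def suggest_domain(email, cutoff=0.8):
    """Replace the domain with a close match from KNOWN_DOMAINS, if there is one."""
    local, _, domain = email.rpartition("@")
    domain = domain.lower()
    if domain in KNOWN_DOMAINS:
        return f"{local}@{domain}"
    # get_close_matches returns the most similar known domains above the cutoff.
    matches = difflib.get_close_matches(domain, KNOWN_DOMAINS, n=1, cutoff=cutoff)
    return f"{local}@{matches[0]}" if matches else email
```

With these assumptions, `suggest_domain("jane@hotmall.com")` returns `jane@hotmail.com`, while an address on an unrelated domain is returned unchanged.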

Duplicate Detection

Duplicate Detection is part of our Data Profiling solution (DataCleaner). Duplicate detection (also known as Matching and Deduplication) is all about identifying whether several records describe the same real-life entity. Especially in customer data, it is surprisingly common to see the same customer registered multiple times. Use the Duplicate detection feature to avoid mistakes and to take advantage of opportunities for, e.g., cross-selling and cost reduction.
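
The idea of matching records that describe the same entity can be sketched in a few lines of Python. This is only an illustration using generic string similarity from the standard library; it is not DataCleaner's matching algorithm, and the record layout, the function names and the `threshold` value are assumptions for this example.

```python
import difflib
from itertools import combinations


def normalize(record):
    """Join all field values, lowercase them and collapse repeated whitespace."""
    joined = " ".join(str(value) for value in record.values())
    return " ".join(joined.lower().split())


def find_duplicates(records, threshold=0.85):
    """Return index pairs of records whose normalized text is at least `threshold` similar."""
    pairs = []
    # Compare every record with every other record (quadratic, fine for small samples).
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if score >= threshold:
            pairs.append((i, j))
    return pairs


records = [
    {"name": "John Smith", "city": "Arnhem"},
    {"name": "Jon  Smith", "city": "arnhem"},
    {"name": "Maria Jansen", "city": "Utrecht"},
]
print(find_duplicates(records))
```

The pairwise comparison only suits small data sets; a real matching tool would need a more scalable and more domain-aware approach.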

Additional Resources

For all information about Data Profiling, go to: http://datacleaner.eobjects.org

For all information about Data Cleansing, go to: http://www.easydq.com/pentaho

For help with Data Cleansing in PDI, go to: http://help.easydq.com/pentaho/latest/

For installation instructions for Data Cleansing in Pentaho, go to: http://www.easydq.com/pentaho-installation-instructions

Further information:

  • Complete Data Quality stack for Pentaho
  • License information: Data Profiling (DataCleaner) is Open Source under the LGPL license
  • Data Cleansing: free trial for 500 credits. For pricing, go to: http://www.easydq.com/faq
  • Data Deduplication: free use up to 500,000 values. For pricing, go to: http://www.easydq.com/faq
