Hitachi Vantara Pentaho Community Wiki
Extracting Data from Snappy Compressed Files



How to configure client-side PDI so that files compressed using the Snappy codec can be decompressed using the Hadoop file input or Text file input step.


  • Pentaho Data Integration
  • Snappy compressed source data (either inside or outside of HDFS)

Step-By-Step Instructions

Configure PDI to Access Snappy Native Libraries

To use client-side PDI to decompress files encoded by hadoop-snappy (the Snappy implementation used in Hadoop), you must build and install both the hadoop-snappy JNI interface and the Snappy native libraries for your platform. Instructions for achieving this can be found at:

In particular, follow the instructions under "Build Hadoop Snappy". Follow the "Install Hadoop Snappy in Hadoop" instructions only if both of the following apply:

  1. You want to decompress Snappy-encoded files within a Pentaho MapReduce job (see Using Compression with Pentaho MapReduce for more information), and
  2. Your Hadoop installation does not already have hadoop-snappy installed (recent Hadoop distributions from Cloudera etc. are configured with hadoop-snappy out of the box)

Once you have built hadoop-snappy:

  1. Uncompress the hadoop-snappy-x.y.z-SNAPSHOT.tar.gz archive that the build process creates to a location on your client PDI machine
  2. Copy hadoop-snappy-x.y.z-SNAPSHOT/lib/hadoop-snappy-x.y.z-SNAPSHOT.jar to libext/bigdata in your client PDI installation
  3. Set the java.library.path property to point to the subdirectory of hadoop-snappy-x.y.z-SNAPSHOT/lib/native that corresponds to your platform
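To help with Step 3, the following is a minimal sketch (not part of PDI or hadoop-snappy) that lists the platform subdirectories under lib/native of an unpacked archive and prints a candidate -Djava.library.path option for each one. The class name NativeDirLister and its method are hypothetical, and the sketch assumes only the lib/native layout described above; you still need to pick the entry that matches your platform.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: lists candidate java.library.path settings for an
// unpacked hadoop-snappy-x.y.z-SNAPSHOT directory.
public class NativeDirLister {

    // Returns one "-Djava.library.path=<dir>" option per subdirectory of
    // <snappyHome>/lib/native, sorted by path. Returns an empty list if
    // lib/native does not exist or cannot be read.
    public static List<String> libraryPathOptions(File snappyHome) {
        File nativeRoot = new File(new File(snappyHome, "lib"), "native");
        File[] children = nativeRoot.listFiles(File::isDirectory);
        List<String> options = new ArrayList<>();
        if (children == null) {
            return options;
        }
        Arrays.sort(children);
        for (File dir : children) {
            options.add("-Djava.library.path=" + dir.getAbsolutePath());
        }
        return options;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java NativeDirLister <unpacked hadoop-snappy directory>");
            return;
        }
        for (String option : libraryPathOptions(new File(args[0]))) {
            System.out.println(option);
        }
    }
}
```

Run it with the path to the unpacked hadoop-snappy-x.y.z-SNAPSHOT directory as the only argument.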

Where to set the java.library.path in Step 3 will vary depending on your platform:

  • Under Linux edit "" in your PDI installation directory and add an entry to the LIBPATH variable
  • Under Windows edit "Spoon.bat" and add an entry to the LIBSPATH variable
  • Under Mac OS X edit "Data Integration" and add "-Djava.library.path=<path to the subdirectory in Step 3>" to the string entry under the key "VMOptions"
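Once java.library.path has been set, one way to sanity-check it is a small standalone Java program along the following lines. This is a sketch under stated assumptions, not a PDI utility: the class name NativeLibraryCheck is hypothetical, and the library base names "snappy" and "hadoopsnappy" are assumptions about what the build produces, so compare them with the files actually present in your native subdirectory. It relies on System.mapLibraryName, which may not match every native file extension in use on your platform.

```java
import java.io.File;

// Hypothetical helper: looks for native library files on a library path.
public class NativeLibraryCheck {

    // Returns the first file on libraryPath whose name is the platform-specific
    // file name for baseName (for example "libsnappy.so" on Linux), or null if
    // no path entry contains such a file.
    public static File findLibrary(String libraryPath, String baseName) {
        String fileName = System.mapLibraryName(baseName);
        for (String entry : libraryPath.split(File.pathSeparator)) {
            if (entry.isEmpty()) {
                continue;
            }
            File candidate = new File(entry, fileName);
            if (candidate.isFile()) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String libraryPath = System.getProperty("java.library.path", "");
        // Assumed base names; adjust them to match your build output.
        for (String name : new String[] {"snappy", "hadoopsnappy"}) {
            File found = findLibrary(libraryPath, name);
            System.out.println(name + ": " + (found == null ? "not found" : found.getPath()));
        }
    }
}
```

Start it with the same -Djava.library.path value you configured for PDI, so that it inspects the path PDI will see.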

Verifying that Snappy Decompression is Available to PDI

After following the instructions in the previous section, restart PDI. If hadoop-snappy and the Snappy native libraries have been installed correctly on the PDI client machine, a "Hadoop-snappy" option will be available in the "Compression" drop-down box on the "Content" tab of the Hadoop file input and Text file input steps.
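If the option does not appear, it can help to confirm that the hadoop-snappy jar is actually visible on the classpath. The sketch below is a hedged illustration only: the class name CodecClassCheck is hypothetical, and the default codec class name org.apache.hadoop.io.compress.SnappyCodec is an assumption about what the jar contains, so pass a different name as the first argument if your build differs. The result is meaningful only when the program runs with the same jars that PDI loads from libext/bigdata.

```java
// Hypothetical helper: checks whether a class can be located on the classpath.
public class CodecClassCheck {

    // Returns true if the named class can be found by this class's class loader.
    // The class is located but not initialized, so no native code is loaded.
    public static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, CodecClassCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class name; override it with the first command-line argument.
        String codec = args.length > 0 ? args[0] : "org.apache.hadoop.io.compress.SnappyCodec";
        System.out.println(codec + (isOnClasspath(codec) ? " is on the classpath" : " was NOT found on the classpath"));
    }
}
```

A missing class points to Step 2 (the jar copy), whereas a class that is found while the option is still absent suggests looking at the java.library.path setting instead.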
