Hitachi Vantara Pentaho Community Wiki
Hadoop File Input

PLEASE NOTE: This documentation applies to Pentaho 7.1 and earlier. For Pentaho 8.0 and later, see Hadoop File Input on the Pentaho Enterprise Edition documentation site.


The Hadoop File Input step reads data from a variety of text-file types stored on a Hadoop cluster. The most commonly used formats include comma-separated values (CSV) files generated by spreadsheets and fixed-width flat files.
This step enables you to specify a list of files to read, or a list of directories with wildcards in the form of regular expressions. In addition, you can accept file names from a previous step.


These tables describe all available Hadoop File Input options. 

File Tab



Step Name

Optionally, you can change the name of this step to fit your needs. Every step in a transformation must have a unique name.


Source Environment

Indicates the file system or specific cluster on which the item you want to input can be found. Options are Local, <Static>, S3, or <Hadoop Cluster Name>.

  • Local: Specifies that the item specified in the File/Folder field is in a file system that is local to Spoon.
  • <Static>: Specifies that the item specified in the File/Folder field should use the path name in that field, exactly.  Use this if you already know a file path and you simply want to copy and paste it into the window.
  • S3: Specifies that the item specified in the File/Folder field is on the S3 file system.
  • <Hadoop Cluster Name>: Specifies that the item specified in the File/Folder field is in the cluster indicated.


File/Folder

Specifies the location and/or name of the text file to read. Click Browse to launch the Open window and to navigate to the file or folder.

Wildcard (RegExp)

Specify the regular expression you want to use to select the files in the directory specified in the File/Folder field. For example, you may want to process all files that have a .txt extension.


Indicates whether the file is required.

Include subfolders

Indicates whether to include subdirectories (subfolders).

Show filename(s)...

Displays a list of all files that are loaded based on the current selected file definitions.

Show file content

Displays the raw content of the selected file.

Show content from first data line

Displays the content from the first data line for the selected file.

Selecting files using Regular Expressions

The Hadoop File Input step can search for files by wildcard in the form of a regular expression. Regular expressions are more powerful than the '*' and '?' wildcards. This table gives a few examples of regular expressions.

File Name

Regular Expression

Files selected



Find all files in /dirA/ with names containing user data and ending with .txt



Find all files in /dirB/ with names that start with AAA



Find all files in /dirC/ with names that start with a capital and followed by a digit (A0-Z9)
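
The three selections described above can be tried out with plain java.util.regex. This is a minimal sketch, not the step's own code: the class name, the three pattern strings, and the sample file names are hypothetical illustrations (the first pattern assumes the literal text userdata), and it assumes the expression must match the whole file name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class WildcardDemo {
    // Hypothetical patterns written to fit the three descriptions above.
    static final String CONTAINS_USERDATA_TXT = ".*userdata.*\\.txt";
    static final String STARTS_WITH_AAA = "AAA.*";
    static final String CAPITAL_THEN_DIGIT = "[A-Z][0-9].*";

    // Returns the file names that the regular expression matches in full
    // (assumption: the whole name must match, not just a part of it).
    static List<String> select(String regex, List<String> fileNames) {
        Pattern pattern = Pattern.compile(regex);
        List<String> selected = new ArrayList<>();
        for (String name : fileNames) {
            if (pattern.matcher(name).matches()) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "old_userdata_2016.txt", "AAA_report.csv", "B7.log", "notes.txt");
        System.out.println(select(CONTAINS_USERDATA_TXT, names));
        System.out.println(select(STARTS_WITH_AAA, names));
        System.out.println(select(CAPITAL_THEN_DIGIT, names));
    }
}
```

Note that a plain '*' wildcard has no direct meaning here: in a regular expression, "any run of characters" is written as `.*`, and a literal dot must be escaped as `\.`.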

Accepting file names from a previous step

This option allows even more flexibility in combination with other steps, such as Get File Names. You can build the file name in another step and pass it to this step. This way the file name can come from any source: a text file, a database table, and so on.



Accept file names from previous steps

Enables the option to get file names from previous steps.

Pass through fields from previous step

Enables the option to get field information from previous steps.

Step to read file names from

Step from which to read the file names.

Field in the input to use as file name

This step looks in this field to determine which file names to use.

Open File



Access Key

This option only appears if you select S3 in the Source Environment field in the Hadoop File Input window. User name needed to access the S3 file system.

Secret Key

This option only appears if you select S3 in the Source Environment field in the Hadoop File Input window. Password needed to access the S3 file system. 

Open from Folder

Indicates the path and name of the directory you want to browse.  This directory becomes the active directory.

Up One Level

Displays the parent directory of the active directory shown in the Open from Folder field.


Deletes a folder from the active directory.

Create Folder

Creates a new folder in the active directory.


Displays the active directory, which is the one that is listed in the Open from Folder field.


Applies a filter to the results displayed in the active directory contents.

Content Tab

Options under the Content tab allow you to specify the format of the text files that are being read. This table is a list of the options associated with this tab.



File type

Can be either CSV or Fixed length. Based on this selection, Spoon launches a different helper GUI when you click Get Fields in the Fields tab.


One or more characters that separate the fields in a single line of text. Typically this is a semicolon ( ; ) or a tab.


Some fields can be enclosed by a pair of strings to allow separator characters in fields. The enclosure string is optional. To include the enclosure string itself inside an enclosed field, repeat it. For example, with ' as the enclosure string, the text 'Not the nine o''clock news.' is parsed as Not the nine o'clock news.

Allow breaks in enclosed fields?

Not implemented


Specify an escape character (or characters) if you have these types of characters in your data. If you have a backslash ( \ ) as an escape character, the text 'Not the nine o\'clock news' (with a single quote [ ' ] as the enclosure) gets parsed as Not the nine o'clock news.

Header & number of header lines

Enable if your text file has a header row (the first lines in the file). You can specify the number of header lines.

Footer & number of footer lines

Enable if your text file has a footer row (the last lines in the file). You can specify the number of footer lines.

Wrapped lines and number of wraps

Use if you deal with data lines that have wrapped beyond a specific page limit. Headers and footers are never considered wrapped.

Paged layout and page size and doc header

Use these options as a last resort when dealing with texts meant for printing on a line printer. Use the number of document header lines to skip introductory texts and the number of lines per page to position the data lines.


Enable if your text file is in a Zip or GZip archive. Only the first file in the archive is read.

No empty rows

Do not send empty rows to the next steps.

Include file name in output

Enable if you want the file name to be part of the output.

File name field name

Name of the field that contains the file name.

Rownum in output?

Enable if you want the row number to be part of the output.

Row number field name

Name of the field that contains the row number.


Can be either DOS, UNIX, or mixed. UNIX files have lines that are terminated by line feeds. DOS files have lines separated by carriage returns and line feeds. If you specify mixed, no verification is done.


Specify the text file encoding to use. Leave blank to use the default encoding on your system. To use Unicode, specify UTF-8 or UTF-16. On first use, Spoon searches your system for available encodings.

Be lenient when parsing dates?

Disable if you want strict parsing of date fields. If lenient parsing is enabled, dates like Jan 32nd become Feb 1st.

The date format Locale

This locale is used to parse dates that have been written in full such as "February 2nd, 2006." Parsing this date on a system running in the French (fr_FR) locale would not work because February is called Février in that locale.

Add filenames to result

Adds filenames to result filenames list.
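
The enclosure and escape rules described in this tab can be illustrated with a small line splitter. This is a minimal sketch under stated assumptions, not the step's parser: the class and method names are hypothetical, the separator, enclosure, and escape are single characters, and line breaks inside enclosed fields are not handled.

```java
import java.util.ArrayList;
import java.util.List;

public class EnclosureDemo {
    // Splits one line into fields.
    // - A separator inside an enclosed field does not end the field.
    // - A doubled enclosure inside an enclosed field yields one literal enclosure.
    // - The character after an escape character is taken literally.
    static List<String> split(String line, char separator, char enclosure, char escape) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean enclosed = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == escape && i + 1 < line.length()) {
                current.append(line.charAt(++i));      // escaped character, copied as-is
            } else if (c == enclosure) {
                if (enclosed && i + 1 < line.length() && line.charAt(i + 1) == enclosure) {
                    current.append(enclosure);         // doubled enclosure -> one literal
                    i++;
                } else {
                    enclosed = !enclosed;              // opening or closing enclosure
                }
            } else if (c == separator && !enclosed) {
                fields.add(current.toString());        // field boundary
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        // Doubled enclosure, with a separator inside the second enclosed field.
        System.out.println(split("'Not the nine o''clock news.';'a;b'", ';', '\'', '\\'));
        // Backslash escape in front of the enclosure character.
        System.out.println(split("'Not the nine o\\'clock news';42", ';', '\'', '\\'));
    }
}
```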

Error Handling Tab

Options under the Error Handling tab allow you to specify how the step reacts when errors occur, such as malformed records, bad enclosure strings, a wrong number of fields, or premature line ends. This table describes the options available for error handling.



Ignore errors?

Enable if you want to ignore errors during parsing.

Skip error lines

Enable if you want to skip those lines that contain errors. You can generate an extra file that contains the line numbers on which the errors occur. If lines with errors are not skipped, the fields that have parsing errors are empty (null).

Error count field name

Add a field to the output stream rows. This field contains the number of errors on the line.

Error fields field name

Add a field to the output stream rows; this field contains the field names on which an error occurred.

Error text field name

Add a field to the output stream rows; this field contains the descriptions of the parsing errors that have occurred.

Warnings file directory

When warnings are generated, they are placed in this directory. The name of that file is <warning dir>/filename.<date_time>.<warning extension>

Error files directory

When errors occur, they are placed in this directory. The name of the file is <errorfile_dir>/filename.<date_time>.<errorfile_extension>

Failing line numbers files directory

When a parsing error occurs on a line, the line number is placed in this directory. The name of that file is <errorline dir>/filename.<date_time>.<errorline extension>

Filters Tab

Options under the Filters tab enable you to specify the lines you want to skip in the text file. This table describes the available options for defining filters.



Filter string

The string for which to search.

Filter position

The position where the filter string must be placed in the line. Zero (0) is the first position in the line. If you specify a value below zero (0), the filter string is searched for in the entire string.

Stop on filter

Specify Y here if you want to stop processing the current text file when the filter string is encountered.

Positive match

Turns the filter into a positive filter when turned on: only lines that match this filter are passed. Negative filters take precedence, so lines that match a negative filter are discarded immediately.
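
The position and positive-match rules above can be read as the following check. This is a minimal sketch for a single filter, not the step's implementation: the class and method names are hypothetical, and the Stop on filter option and the interplay of several filters are not modeled.

```java
public class FilterDemo {
    // True when the line matches the filter string.
    // A negative position searches the entire line; otherwise the filter
    // string must start exactly at the given zero-based position.
    static boolean matches(String line, String filter, int position) {
        if (position < 0) {
            return line.contains(filter);
        }
        return line.startsWith(filter, position);
    }

    // Decides whether a line is passed on.
    // Positive match: only matching lines pass.
    // Otherwise (the default): matching lines are skipped.
    static boolean keep(String line, String filter, int position, boolean positiveMatch) {
        return matches(line, filter, position) == positiveMatch;
    }

    public static void main(String[] args) {
        System.out.println(keep("# a comment line", "#", 0, false));  // skipped
        System.out.println(keep("value # trailing", "#", 0, false));  // kept
        System.out.println(keep("value # trailing", "#", -1, false)); // skipped
        System.out.println(keep("DATA;1;2", "DATA", 0, true));        // kept
    }
}
```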

Fields Tab

The options under the Fields tab allow you to specify the information about the name and format of the fields being read from the text file. Available options include:




Name of the field.


Type of the field can be either String, Date or Number.


See Number Formats for a complete description of format symbols.


For Number: Total number of significant figures in a number. For String: total length of string. For Date: length of printed output of the string, for instance, 4 only gives back the year.


For Number: Number of floating point digits. For String, Date, Boolean: unused.


Used to interpret numbers like $10,000.00 or €5.000,00.


A decimal point can be a "." (10,000.00) or "," (5.000,00).


A grouping can be a comma "," (10,000.00) or a dot "." (5.000,00).

Null if

Treat this value as null.


Default value in case the field in the text file was not specified (empty).


Type of trimming to apply to this field (left, right, or both) before processing.


If the corresponding value in this row is empty, repeat the one from the last time it was not empty (Y/N).
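
The effect of the decimal and grouping symbols can be reproduced with java.text.DecimalFormat, the Java class whose pattern syntax the Number formats section describes. This is a minimal sketch: the class and method names are hypothetical, and it only shows how the two symbols change the way the same kind of text is interpreted.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;

public class NumberFieldDemo {
    // Parses text using the given format pattern, decimal symbol and grouping symbol.
    static double parse(String text, String pattern, char decimal, char grouping) {
        DecimalFormatSymbols symbols = new DecimalFormatSymbols(Locale.ROOT);
        symbols.setDecimalSeparator(decimal);
        symbols.setGroupingSeparator(grouping);
        DecimalFormat format = new DecimalFormat(pattern, symbols);
        try {
            return format.parse(text).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse: " + text, e);
        }
    }

    public static void main(String[] args) {
        // The pattern always uses '.' and ',' as placeholders;
        // the symbols decide which characters appear in the data.
        System.out.println(parse("10,000.00", "#,##0.00", '.', ','));
        System.out.println(parse("5.000,00", "#,##0.00", ',', '.'));
    }
}
```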

Number formats

The information about number formats was taken from the Sun Java API documentation, Decimal Formats.












Symbol | Location | Localized? | Meaning
0 | Number | Yes | Digit
# | Number | Yes | Digit, zero shows as absent
. | Number | Yes | Decimal separator or monetary decimal separator
- | Number | Yes | Minus sign
, | Number | Yes | Grouping separator
E | Number | Yes | Separates mantissa and exponent in scientific notation. Need not be quoted in prefix or suffix.
; | Sub pattern boundary | Yes | Separates positive and negative sub patterns
% | Prefix or suffix | Yes | Multiply by 100 and show as percentage
‰ (\u2030) | Prefix or suffix | Yes | Multiply by 1000 and show as per mille
¤ (\u00A4) | Prefix or suffix | No | Currency sign, replaced by currency symbol. If doubled, replaced by international currency symbol. If present in a pattern, the monetary decimal separator is used instead of the decimal separator.
' | Prefix or suffix | No | Used to quote special characters in a prefix or suffix, for example, "'#'#" formats 123 to "#123". To create a single quote itself, use two in a row: "# o''clock".

Scientific Notation

In a pattern, the exponent character immediately followed by one or more digit characters indicates scientific notation, for example "0.###E0" formats the number 1234 as "1.234E3".
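
The pattern symbols can be checked directly against java.text.DecimalFormat. The sketch below is illustrative only: the class and method names are hypothetical, and the symbols are pinned to Locale.US so that the output does not depend on the system locale.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class NumberPatternDemo {
    // Formats a value with the given pattern, using US symbols for a stable result.
    static String format(String pattern, double value) {
        DecimalFormatSymbols symbols = new DecimalFormatSymbols(Locale.US);
        return new DecimalFormat(pattern, symbols).format(value);
    }

    public static void main(String[] args) {
        System.out.println(format("0.###E0", 1234));    // scientific notation: 1.234E3
        System.out.println(format("'#'#", 123));        // quoted '#' in the prefix: #123
        System.out.println(format("#,##0.00", 1234.5)); // grouping and two decimals: 1,234.50
        System.out.println(format("0%", 0.25));         // multiplied by 100: 25%
    }
}
```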

Date Formats

The information about Date formats was taken from the Sun Java API documentation, Date Formats.


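Letter | Date or Time Component | Presentation | Examples
G | Era designator | Text | AD
y | Year | Year | 1996 or 96
M | Month in year | Month | July, Jul, or 07
w | Week in year | Number | 27
W | Week in month | Number | 2
D | Day in year | Number | 189
d | Day in month | Number | 10
F | Day of week in month | Number | 2
E | Day in week | Text | Tuesday or Tue
a | Am/pm marker | Text | PM
H | Hour in day (0-23) | Number | 0
k | Hour in day (1-24) | Number | 24
K | Hour in am/pm (0-11) | Number | 0
h | Hour in am/pm (1-12) | Number | 12
m | Minute in hour | Number | 30
s | Second in minute | Number | 55
S | Millisecond | Number | 978
z | Time zone | General time zone | Pacific Standard Time, PST, or GMT-08:00
Z | Time zone | RFC 822 time zone | -0800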
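
The date pattern letters, together with the lenient-parsing and date format locale options from the Content tab, can be tried out with java.text.SimpleDateFormat. This is a minimal sketch with hypothetical class and method names. It fixes the time zone to UTC so the result does not depend on the system, and it uses "February 2, 2006" rather than "February 2nd, 2006" because the pattern letters have no notation for ordinal suffixes.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

public class DatePatternDemo {
    // Parses text with the given pattern and locale and returns the date as
    // yyyy-MM-dd, or null when the text cannot be parsed.
    static String toIso(String text, String pattern, Locale locale, boolean lenient) {
        TimeZone utc = TimeZone.getTimeZone("UTC");
        SimpleDateFormat in = new SimpleDateFormat(pattern, locale);
        in.setTimeZone(utc);
        in.setLenient(lenient);
        SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd", Locale.US);
        out.setTimeZone(utc);
        try {
            return out.format(in.parse(text));
        } catch (ParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Lenient parsing rolls day 32 of January over into February.
        System.out.println(toIso("Jan 32, 2006", "MMM d, yyyy", Locale.US, true));
        // Strict parsing rejects the same text.
        System.out.println(toIso("Jan 32, 2006", "MMM d, yyyy", Locale.US, false));
        // English month names parse with an English locale, but not with a French one.
        System.out.println(toIso("February 2, 2006", "MMMM d, yyyy", Locale.US, false));
        System.out.println(toIso("February 2, 2006", "MMMM d, yyyy", Locale.FRANCE, false));
    }
}
```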


Metadata Injection Support (7.x and later)

All fields of this step support metadata injection. You can use this step with ETL Metadata Injection to pass metadata to your transformation at runtime.
