Added by Matt Casters, last edited by Matt Casters on Jul 08, 2010  (view change)

Labels:

Enter labels to add to this page:
Wait Image 
Looking for a label? Just start typing.

Description

The Text File Input step is used to read date from a variety of different text-file types. The most commonly used formats include Comma Separated Values (CSV files) generated by spreadsheets and fixed width flat files.

The Text File Input step provides you with the ability to specify a list of files to read, or a list of directories with wild cards in the form of regular expressions. In addition, you can accept filenames from a previous step making filename handling more even more generic.

The following sections describe the available options for configuring the Text file input step.

File options

Please note that when you are reading multiple files that the "Updated" column in the metrics indicates the file number being read. After execution the "Updated" metric as such indicates the number of files read.

The table below provides a detailed descriptions of the features available on the File tab:

Option Description
File or directory This field specifies the location and/or name of the input text file.Note: Press the "add" button to add the file/directory/wildcard combination to the list of selected files (grid) below.
Regular expression Specify the regular expression you want to use to select the files in the directory specified in the previous option.
Selected Files This table contains a list of selected files (or wildcard selections) along with a property specifying if file is required or not. If a file is required and it isn't found, an error is generated. Otherwise, the filename is skipped.
Show filenames(s)... Displays a list of all files that will be loaded based on the current selected file definitions.
Show file content Displays the content of the selected file.
Show content from first data line Displays the content from the first data line only for the selected file.

Selecting Files from which to Read Data

The file tab (shown above, where?) is where you identify the file or files from which you want to read data. To specify a file:

  1. Enter the location of the file in the File or directory field or click Browse to locate the file.
  2. Click Add to add a file to the list of selected files as shown in the example below:

Selecting file using Regular Expressions

You can also have the Text File Input step search for files by specifying a wildcard in the form of a regular expression. Regular expressions are more sophisticated than using '*' and '?' wildcards.

Here are a few examples of regular expressions:

Filename Regular Expression Files selected
/dirA/ .userdata.\.txt Find all files in /dirA/ with names containing userdata and ending with .txt
/dirB/ AAA.* Find all files in /dirB/ with names that start with AAA
/dirC/ [ENG:A-Z][ENG:0-9].* Find all files in /dirC/ with names that start with a capital and followed by a digit (A0-Z9)

Accepting filenames from a previous step

This option allows even more flexibility in combination with other steps such as "Get Filenames". You can construct your filename and pass it to this step. This way the filename can come from any source: text file, database table, etc.

Option Description
Accept filenames from previous steps Enables the option to get filenames from previous steps.
Step to read filenames from Step from which to read the filenames
Field in the input to use as filename Text File Input looks in this step to determine which filenames to use

Content specification

The content tab allows you to specify the format of the text files that are being read. Below is a list of the options associated with this tab:

Option Description
File type Can be either CSV or Fixed length. Based on this selection, Spoon will launch a different helper GUI when you press the "get fields" button in the last "fields" tab.
Separator One or more characters that separate the fields in a single line of text. Typically this is ; or a tab. Special characters (e.g. CHAR HEX01) can be set with the format $[value], e.g. $[01] or $[6F,FF,00,1F].
Enclosure Some fields can be enclosed by a pair of strings to allow separator characters in fields. The enclosure string is optional. If you use repeat an enclosures allow text line 'Not the nine o''clock news.'. With ' the enclosure string, this gets parsed as Not the nine o'clock news. Special characters (e.g. CHAR HEX01) can be set with the format $[value], e.g. $[01] or $[6F,FF,00,1F].
Allow breaks in enclosed fields? Not implemented, yet. Also see PDI-388.
Escape Specify an escape character (or characters) if you have these types of characters in your data. If you have \ as an escape character, the text 'Not the nine o\'clock news' (with ' the enclosure) gets parsed as Not the nine o'clock news. Special characters (e.g. CHAR HEX01) can be set with the format $[value], e.g. $[01] or $[6F,FF,00,1F].
Header & number of header lines Enable if your text file has a header row (first lines in the file); you can specify the number of times the header lines appears.
Footer & number of footer lines Enable if your text file has a footer row (last lines in the file); you can specify the number of times the footer row appears.
Wrapped lines and number of wraps Use if you deal with data lines that have wrapped beyond a specific page limit; note that headers and footers are never considered wrapped
Paged layout and page size and doc header Use these options as a last resort when dealing with texts meant for printing on a line printer; use the number of document header lines to skip introductory texts and the number of lines per page to position the data lines
Compression Enable if your text file is placed in a Zip or GZip archive.Note: At the moment, only the first file in the archive is read.
No empty rows Do not send empty rows to the next steps.
Include filename in output Enable if you want the filename to be part of the output
Filename field name Name of the field that contains the filename
Rownum in output? Enable if you want the row number to be part of the output
Row number field name Name of the field that contains the row number
Rownum by file? Allows the row number to be reset per file
Format Can be either DOS, UNIX or mixed. UNIX files have lines that are terminated by line feeds. DOS files have lines separated by carriage returns and line feeds. If you specify mixed, no verification is done.
Encoding Specify the text file encoding to use; leave blank to use the default encoding on your system. To use Unicode, specify UTF-8 or UTF-16. On first use, Spoon searches your system for available encodings.
Limit Sets the number of lines that is read from the file; 0 means read all lines.
Be lenient when parsing dates? Disable if you want strict parsing of data fields; if case-lenient parsing is enabled, dates like Jan 32nd will become Feb 1st.
The date format Locale This locale is used to parse dates that have been written in full such as "February 2nd, 2006;" parsing this date on a system running in the French (fr_FR) locale would not work because February is called Février in that locale.

Error handling

The error handling tab allows you to specify how the step reacts when errors occur. The table below describes the options available for Error handling:

Option Description
Ignore errors? Enable if you want to ignore errors during parsing
Skip error lines Enable if you want to skip those lines that contain errors. You can generate an extra file that contains the line numbers on which the errors occurred. Lines with errors are not skipped, the fields that have parsing errors, will be empty (null)
Error count field name Add a field to the output stream rows; this field contains the number of errors on the line
Error fields field name Add a field to the output stream rows; this field contains the field names on which an error occurred
Error text field name Add a field to the output stream rows; this field contains the descriptions of the parsing errors that have occurred
Warnings file directory When warnings are generated, they are placed in this directory. The name of that file is <warning dir>/filename.<date_time>.<warning extension>
Error files directory When errors occur, they are placed in this directory. The name of the file is <errorfile_dir>/filename.<date_time>.<errorfile_extension>
Failing line numbers files directory When a parsing error occurs on a line, the line number is placed in this directory. The name of that file is <errorline dir>/filename.<date_time>.<errorline extension>

Filters

The filters tab provides you with the ability to specify the lines you want to skip in the text file. The table below describes the available options for defining filters:

Option Description
Filter string The string for which to search
Filter position The position where the filter string has to be at in the line. Zero (0) is the first position in the line. If you specify a value below zero (0) here, the filter string is searched for in the entire string.
Stop on filter Specify Y here if you want to stop processing the current text file when the filter string is encountered.

Fields

The fields tab allows you to specify the information about the name and format of the fields being read from the text file. Available options include:

Option Description
Name Name of the field
Type Type of the field can be either String, Date or Number
Format See Number Formats for a complete description of format symbols.
Length For Number: Total number of significant figures in a number; For String: total length of string; For Date: length of printed output of the string (e.g. 4 only gives back the year).
Precision For Number: Number of floating point digits; For String, Date, Boolean: unused;
Currency Used to interpret numbers like $10,000.00 or E5.000,00
Decimal A decimal point can be a "." (10;000.00) or "," (5.000,00)
Grouping A grouping can be a dot "," (10;000.00) or "." (5.000,00)
Null if Treat this value as NULL
Default Default value in case the field in the text file was not specified (empty)
Trim type trim this field (left, right, both) before processing
Repeat If the corresponding value in this row is empty, repeat the one from the last time it was not empty (Y/N) (huh?)

Number Formats

The information on Number formats was taken from the Sun Java API documentation, to be found here: http://java.sun.com/j2se/1.4.2/docs/api/java/text/DecimalFormat.html

Symbol Location Localized Meaning
0 Number Yes Digit
# Number Yes Digit, zero shows as absent
. Number Yes Decimal separator or monetary decimal separator
- Number Yes Minus sign
, Number Yes Grouping separator
E Number Yes Separates mantissa and exponent in scientific notation; need not be quoted in prefix or suffix
; Sub pattern boundary Yes Separates positive and negative sub patterns
% Prefix or suffix Yes Multiply by 100 and show as percentage
\u2030 Prefix or suffix Yes Multiply by 1000 and show as per mille
€ (\u00A4) Prefix or suffix No Currency sign, replaced by currency symbol. If doubled, replaced by international currency symbol. If present in a pattern, the monetary decimal separator is used instead of the decimal separator.
' Prefix or suffix No Used to quote special characters in a prefix or suffix, for example, "'#'#" formats 123 to "#123". To create a single quote itself, use two in a row: "# o''clock".

Scientific Notation

In a pattern, the exponent character immediately followed by one or more digit characters indicates scientific notation (for example, "0.###E0" formats the number 1234 as "1.234E3".

Date formats

The information on Date formats was taken from the Sun Java API documentation, located at:
http://java.sun.com/j2se/1.4.2/docs/api/java/text/SimpleDateFormat.html

Letter Date or Time Component Presentation Examples
G Era designator Text AD
y Year Year 1996; 96
M Month in year Month July; Jul; 07
w Week in year Number 27
W Week in month Number 2
D Day in year Number 189
d Day in month Number 10
F Day of week in month Number 2
E Day in week Text Tuesday; Tue
a Am/pm marker Text PM
H Hour in day (0-23) Number 0  
k Hour in day (1-24) Number 24  
K Hour in am/pm (0-11) Number 0  
h Hour in am/pm (1-12) Number 12  
m Minute in hour Number 30  
s Second in minute Number 55  
S Millisecond Number 978  
z Time zone General time zone Pacific Standard Time; PST; GMT-08:00
Z Time zone RFC 822 time zone -0800

Extras

Function/Button Description
Show filenames Displays a list of all the files selected. Note that if the transformation is to be run on a separate server, the result might be incorrect.
Show file content Displays the first lines of the text-file. Make sure that the file-format is correct. When in doubt, try both DOS and UNIX formats.
Show content from first data line Helps you position the data lines in complex text files with multiple header lines and more.
Get fields Allows you to guess the layout of the file. In case of a CSV file, this is performed almost automatically. When you select a file with fixed length fields, you must specify the field boundaries using a wizard.
Preview rows Preview the rows generated by this step.