Hitachi Vantara Pentaho Community Wiki
Child pages
  • Excel Input (XLS, XLSX) including OpenOffice Workbooks (ODS)
Skip to end of metadata
Go to start of metadata

(warning) PLEASE NOTE: This documentation applies to Pentaho 8.1 and earlier. For Pentaho 8.2 and later, see Microsoft Excel Input on the Pentaho Enterprise Edition documentation site.

Description

The Excel Input step provides you with the ability to read data from one or more Excel and OpenOffice files. The following sections describe each of the available features for configuring the Excel Input step.

Note: The default Spread sheet type (engine) is set to Excel 97-2003 XLS. When you are reading other file types like OpenOffice, ODS, Excel 2007 and using special functions like protected worksheets, you need to change the Spread sheet type (engine) in the Content tab accordingly.

Files Tab

The files tab is where you define the location of the Excel files from which you want to read. The table below contains options associated with the Files tab:

Option

Description

Step Name

Name of the step; the name has to be unique in a single transformation.

Spread sheet type (engine)

This field allows you to specify the spreadsheet type. (since version 4.1.0)  Currently the following are supported:
- Excel 97-2003 XLS: this is the default, backward compatible type provided for by the JXL software backend.
- Excel 2007 XLSX (Apache POI): If you select this spread sheet type you can read all known Excel file types.  Functionality provided by the Apache POI project.
- Excel 2007 XLSX (Apache POI Streaming): This spread sheet type allows to read in large Excel files (Experimental in 5.1).
- Open Office ODS: By selecting this type you can read OpenOffice spreadsheet using the ODFDOM engine.

The default Spread sheet type (engine) is set to Excel 97-2003 XLS. When you are reading other file types like OpenOffice, ODS, Excel 2007 and using special functions like protected worksheets, you need to change the Spread sheet type (engine) in the Content tab accordingly.


File or directory

Specifies the location and/or name of the input text file. Note: Click Add to add the file/directory/wildcard combination to the list of selected files (grid) below.

Regular expression

Specify the regular expression you want to use to select the files in the directory specified in the previous option. For example, you want to process all files that have a .txt extension. (For further details see the Text File Input step and the section "Selecting file using Regular Expressions")

Exclude Regular Expression

Excludes all files (in a given location) that meet the criteria specified by this regular expression.

Selected Files

Contains a list of selected files (or wildcard selections) and a property specifying if file is required or not. If a file is required and it is not found, an error is generated;otherwise, the file name is skipped.

Accept filenames from previous steps

Allows you to read in file names from a previous step in the transformation. You must also specify which step you are importing from, and the input field in that step from which you will retrieve the filename data

Show filenames(s)...

Displays a list of all files that will be loaded based on the current selected file definitions

Preview rows

Click Preview to examine the contents of the specified Excel file

Sheets

In this tab you can specify the names of the sheets in the Excel workbook to read.  For each of the sheet names you can specify the row and column to start at.  Note : the row and column numbers are zero (0) based. (start to number at 0).

IMPORTANT: it is also possible to read all sheets in the worksheet.  To arrive there, simply clear the "List of sheets to read" table and type the start row and column in the first row.  This will be used for all sheets.  Again, if you want to read all sheets in the workbook(s), it's important not to specify any sheet name (blank it out). In this case, the field structure of each sheet needs to be the same.

Content

The content tab allows you to configure the following properties:

Option

Description

Header

Enable if the sheets specified contain a header row to skip

No empty rows

Enable if you don't want empty rows in the output of this step

Stop on empty row

Makes the step stop reading the current sheet of a file when a empty line is encountered

Limit

Limits the number of rows to this number (zero (0) means all rows).

Encoding

Specifies the text file encoding to use. Leave blank to use the default encoding on your system. To use Unicode, specify UTF-8 or UTF-16. On first use, Spoon searches your system for available encodings.)

Error handling

The Error handling tab allows you to configure the following properties:

Option

Description

Strict types?

If checked, PDI will report data type errors in the input.

Ignore errors?

Enable if you want to ignore errors during parsing

Skip error lines?

If checked, PDI will skip lines that contain errors. These lines can be dumped to a separate file by specifying a path in the Failing line numbers files directory field below. If this is not checked, lines with errors will appear as NULL values in the output.

Warnings file directory

When warnings are generated, they are placed in this directory. The name of that file is <warning dir>/filename.<date_time>.<warning extension>

Error files directory

When errors occur, they are placed in this directory. The name of that file is <errorfile_dir>/filename.<date_time>.<errorfile_extension>

Failing line numbers files directory

When a parsing error occurs on a line, the line number is placed in this directory. The name of that file is <errorline dir>/filename.<date_time>.<errorline extension>

Fields tab

The fields tab is for specifying the fields that must be read from the Excel files. Use Get fields from header row to fill in the available fields if the sheets have a header row automatically.

The Type column performs type conversions for a given field. For example, if you want to read a date and you have a String value in the Excel file, specify the conversion mask. Note: In the case of Number to Date conversion (for example, 20051028--> October 28th, 2005) specify the conversion mask yyyyMMdd because there will be an implicit Number to String conversion taking place before doing the String to Date conversion.

Option

Description

Name

The name of the field.

Type

The field's data type; String, Date or Number.

Length

The length option depends on the field type. Number: total number of significant figures in a number; String: total length of a string; Date: determines how much of the date string is printed or recorded.

Precision

The precision option depends on the field type, but only Number is supported; it returns the number of floating point digits.

Trim type

Truncates the field (left, right, both) before processing. Useful for fields that have no static length.

Repeat

If set to Y, will repeat this value if the field in the next row is empty.

Format

The format mask (number type). See Text File Input step and the section "Number Formats" for a complete description of format symbols.

Currency

Symbol used to represent currencies.

Decimal

A decimal point; this is either a dot or a comma.

Grouping

A method of separating units of thousands in numbers of four digits or larger. This is either a dot or a comma.

Additional output fields tab

This tab retrieves custom metadata fields to add to the step's output. The purpose of each field is defined in its name, but you can use these fields for whatever you want. Each item defines an output field that will contain the following information. Some of these are missing.

Option

Description

Full filename field

The full file name plus the extension.

Sheetname field

The worksheet name you're using.

Sheet row nr field

The current sheet row number.

Row nr written field

Number of rows written

Short filename field

The field name that contains the filename without path information but with an extension.

Extension field

The field name that contains the extension of the filename.

Path field

The field name that contains the path in operating system format.

Size field

The field name that contains the size of the file, in bytes.

Is hidden field

The field name that contains if the file is hidden or not (boolean).

Uri field

The field name that contains the URI.

Root uri field

The field name that contains only the root part of the URI.

  • No labels

1 Comment

  1. ATTENTION:

    In the wildcard field, the format is always starting with a dot. Example wildcard:   .*.xls.

    IF YOU FORGET THE DOT PRECEDING THE *, IT WILL NOT WORK!

    THIS IS THE CASE FOR ALL WILDCARDS FOR ALL INPUTS!