Hitachi Vantara Pentaho Community Wiki
Streaming XML Input

This step is deprecated. Please use the Get Data From XML or XML Input Stream (StAX) steps.

Description

This step reads XML files and supports value parsing, that is, identifying an element by the value of one of its attributes rather than only by its position. It is based on a SAX parser, which gives better performance with larger files. It is very similar to the XML Input step; the only differences are in the Content and Fields tabs. The following sections describe the properties and settings available for the Streaming XML Input step.
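To illustrate why a SAX parser suits large files, here is a minimal sketch using Python's standard xml.sax module. It is not the step's own implementation; the ElementCounter class, the count_elements function and the sample document are assumptions made for illustration. The handler receives one callback per opening tag and never builds a tree of the whole document.

```python
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Counts start tags by name without building a document tree."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called once per opening tag as the parser streams through the input.
        self.counts[name] = self.counts.get(name, 0) + 1


def count_elements(xml_text):
    """Parse the XML text with SAX and return a {tag name: occurrences} dict."""
    handler = ElementCounter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.counts


# Hypothetical sample document, used only for this illustration.
sample = "<rows><row><a>1</a></row><row><a>2</a></row></rows>"
counts = count_elements(sample)
```

Because only the running counts are kept, the memory needed does not grow with the size of the file.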

File Tab

Option

Description

Step name

Name of the step. This name has to be unique in a single transformation.

File or directory

This field specifies the location and/or name of the input XML file.

Note: Press the "Add" button to add the file/directory/wildcard combination to the list of selected files (grid) below.

Regular expression

Specify the regular expression you want to use to select the files in the directory specified in the previous option.

Selected Files

This table contains the list of selected files (or wildcard selections), along with a property specifying whether each file is required. If a file is required and is not found, an error is generated. If it is not required, the missing file is simply skipped.

Show filename(s)...

Displays a list of all files that will be loaded based on the current selected file definitions.
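The file selection described above can be sketched as follows. This is a hedged illustration, not the step's code: expand_selection is a hypothetical name, each grid row is modelled as a (directory, pattern, required) tuple, and the regular expression is assumed to have to match the whole file name.

```python
import os
import re


def expand_selection(entries):
    """Expand (directory, pattern, required) entries into a list of file paths.

    A required entry that matches nothing raises an error;
    an optional entry that matches nothing is skipped.
    """
    files = []
    for directory, pattern, required in entries:
        matches = []
        if os.path.isdir(directory):
            regex = re.compile(pattern)
            matches = sorted(
                os.path.join(directory, name)
                for name in os.listdir(directory)
                # Assumption: the pattern must match the complete file name.
                if regex.fullmatch(name)
                and os.path.isfile(os.path.join(directory, name))
            )
        if not matches and required:
            raise FileNotFoundError(
                f"No file matches {pattern!r} in {directory!r}"
            )
        files.extend(matches)
    return files
```

Calling expand_selection with the rows of the grid gives the kind of list that "Show filename(s)..." displays.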

Content

Option

Description

Include filename in output & fieldname

Check this option if you want to have the name of the XML file to which the row belongs in the output stream. You can specify the name of the field where the filename will end up in.

Rownum in output & fieldname

Check this option if you want to have a row number (starting at 1) in the output stream. You can specify the name of the field where the integer will end up.

Limit

You can specify the maximum number of rows to read here.

Location

Specify the path, by way of elements, to the repeating part of the XML file. The element column is used to specify the element and position as follows:

  • A: specify an attribute (unchanged from the original XML Input step).
  • Ep: specify an element defined by position (equivalent to E in the original XML Input step).
  • Ea: specify an element defined by an attribute value, which allows value parsing.
    Example:
    Ep=element/1          the first element called "element"
    Ea=element/att:val    the element called "element" that has an attribute called "att" with the value "val"
    
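The locator syntax above can be sketched as a small parser. This is only an illustration of the notation; Locator and parse_locator are hypothetical names, and the step's own parsing rules may differ in details not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Locator:
    kind: str                        # "A", "Ep" or "Ea"
    name: str                        # element or attribute name
    position: Optional[int] = None   # 1-based position (used by "Ep" and "A")
    attribute: Optional[str] = None  # attribute name (used by "Ea")
    value: Optional[str] = None      # attribute value (used by "Ea")


def parse_locator(text):
    """Parse one locator such as 'Ep=element/1' or 'Ea=element/att:val'."""
    kind, sep, rest = text.strip().partition("=")
    if not sep or kind not in ("A", "Ep", "Ea"):
        raise ValueError(f"Unsupported locator: {text!r}")
    name, _, qualifier = rest.partition("/")
    if kind == "Ea":
        # The qualifier has the form attribute:value.
        attribute, _, value = qualifier.partition(":")
        return Locator(kind, name, attribute=attribute, value=value)
    # "Ep" and "A" carry an optional numeric position after the slash.
    return Locator(kind, name, position=int(qualifier) if qualifier else None)
```

For instance, parse_locator("Ea=element/att:val") yields a Locator whose attribute is "att" and whose value is "val".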

Fields

Option

Description

Name

Name of the field

Type

Type of the field can be either String, Date or Number

Format

See Number Formats for a complete description of format symbols.

Length

For Number: Total number of significant figures in a number;
For String: total length of string;
For Date: length of printed output of the string (e.g. 4 only gives back the year).

Precision

For Number: Number of floating point digits;
For String, Date, Boolean: unused;

Currency

Used to interpret numbers like $10,000.00 or €5.000,00

Decimal

A decimal point can be a "." (10,000.00) or "," (5.000,00)

Group

A grouping symbol can be a "," (10,000.00) or "." (5.000,00)

Trim type

Type of trimming to apply to this field (left, right, both) before processing

Null if

Treat this value as NULL

Repeat

Y/N: If the corresponding value in this row is empty, repeat the value from the last row in which it was not empty

Position

The position of the XML element or attribute. Use the following syntax to specify the position of an element:
The first element called "element": E=element/1
The first attribute called "attribute": A=attribute/1
The first attribute called "attribute" in the second "element" tag: E=element/2, A=attribute/1

Note: You can auto-generate all the possible positions in the XML file supplied by using the "Get Fields" button.
Note: Support was added for XML documents where all the information is stored in the repeating (or root) element. The special R= locator lets you grab this information. The "Get Fields" button finds this information if it is present.
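Several of the field settings above (Trim type, Null if, Currency, Decimal, Group and Repeat) can be sketched in a few helper functions. These are assumptions made for illustration, not the step's implementation: the names clean_text, parse_number and repeat_last are hypothetical, and applying the trim before the "Null if" comparison is an assumed order.

```python
def clean_text(text, trim="none", null_if=None):
    """Apply the trim type, then map the 'Null if' value to None."""
    if trim == "left":
        text = text.lstrip()
    elif trim == "right":
        text = text.rstrip()
    elif trim == "both":
        text = text.strip()
    # Assumption: the comparison happens after trimming.
    return None if text == null_if else text


def parse_number(text, decimal=".", group=",", currency=""):
    """Parse a formatted number using explicit decimal, grouping and currency symbols."""
    cleaned = text.strip()
    for symbol in (currency, group):
        if symbol:
            cleaned = cleaned.replace(symbol, "")  # drop currency and grouping symbols
    if decimal:
        cleaned = cleaned.replace(decimal, ".")    # normalise the decimal symbol
    return float(cleaned.strip())


def repeat_last(values):
    """Fill empty values (None or "") with the last non-empty value seen."""
    filled, last = [], None
    for value in values:
        if value in (None, ""):
            value = last
        else:
            last = value
        filled.append(value)
    return filled
```

With these helpers, parse_number("€5.000,00", decimal=",", group=".", currency="€") and parse_number("$10,000.00", currency="$") read the two number styles shown in the table.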

Streaming XML Example

Consider the following XML:
 
Suppose that we are interested in cars. We must then specify the location of the repeating element like this:

Now let's look at the fields. There are several "property" elements, differentiated by their "name" attribute, and we want the fields "brand", "type" and "power" according to the value of that attribute.

For this, we must specify the association between "property" and "name" in the first grid.

Click Get Fields to retrieve the right fields including properties.

Let us now try leaving the new grid empty.

You can see that in this case the step works like the original XML Input step and retrieves fields by their position. It is better to use value parsing here, because you get the right field names and missing elements will not corrupt the results (for example, a missing <property name="power"> </property> in some rows).
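The difference between positional and value-based retrieval can be sketched with Python's standard xml.sax module. The sample document below is invented for this illustration and is not the XML of the original example; the CarHandler class and the values in the sample are assumptions. The second car deliberately has no "type" property.

```python
import xml.sax

# Hypothetical sample; the second car is missing its "type" property.
SAMPLE = """<cars>
  <car>
    <property name="brand">BrandA</property>
    <property name="type">TypeA</property>
    <property name="power">90</property>
  </car>
  <car>
    <property name="brand">BrandB</property>
    <property name="power">75</property>
  </car>
</cars>"""


class CarHandler(xml.sax.ContentHandler):
    """Collects each car both by 'name' attribute and by plain position."""

    def __init__(self):
        super().__init__()
        self.by_name = []      # one dict per car, keyed by the "name" attribute
        self.by_position = []  # one list per car, in document order
        self._key = None
        self._text = None      # None means "not inside a property element"

    def startElement(self, name, attrs):
        if name == "car":
            self.by_name.append({})
            self.by_position.append([])
        elif name == "property":
            self._key = attrs.get("name")
            self._text = []

    def characters(self, content):
        if self._text is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "property":
            value = "".join(self._text).strip()
            self.by_name[-1][self._key] = value
            self.by_position[-1].append(value)
            self._text = None


handler = CarHandler()
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)
```

For the second car, by_position holds the power value in the second slot, where the first car holds its type, while by_name still files it under "power" and simply has no "type" entry.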
