Access Keys:
Skip to content (Access Key - 0)

Description

This step provides the ability to read data from any type of XML file using XPath specifications.

For more information on XPath, see this tutorial: http://www.w3schools.com/XPath/ or the W3C Recommendation: http://www.w3.org/TR/xpath

"Get Data From XML" can read data from 3 kind of sources (files, stream and url) in 2 modes (user can define files and urls at static mode or in a dynamic way). The following sample transformation will show you the differents options with this  step :

samples/transformations/GetXMLData - Differents Options.ktr 

Note for HTTP URL requests: You should be aware that this component will re-query the URL for each lookup request.  Therefore if you are using this component to query a HTTP URL for an XML document, you should consider using the "Loop XPath" and "Prune path" parameters - these will help you avoid re-querying for the XML document.  Also note that this component seems to make a HTTP HEAD request prior to the actual HTTP GET - if your XML document is a dynamic query (ASP/PHP/etc.), you should be aware that this HTTP HEAD request may result in an additional call to your URL.

Please see also the XML Input Stream (StAX) step that uses a completely different approach to solve use cases with very big and complex data stuctures and the need for very fast data loads.

Files Tab

The files tab is where you define the location of the XML files from which you want to read. The table below contains options associated with the Files tab:

Option Description
Step Name Name of the step; the name has to be unique in a single transformation.
XML Source from field
  • XML source is defined in a field : the previous step is giving XML data in a certain field in the input stream.
  • XML source is a filename : the previous step is giving filenames in a certain field in the input stream.  These are read.
  • Read source as URL : the  previous step is giving URLs in a certain field in the input stream.  These are read.
  • Get XML source from a field : specify the field to read XML, filename or URL from.
File or directory Specifies the location and/or name of the input text file. Note: Click Add to add the file/directory/wildcard combination to the list of selected files (grid) below.
Regular expression Specifies the regular expression you want to use to select the files in the directory specified in the previous option.
Selected Files Contains a list of selected files (or wildcard selections) and a property specifying if file is required or not. If a file is required and it is not found, an error is generated;otherwise, the file name is skipped.
Show filename(s)... Displays a list of all files that will be loaded based on the current selected file definitions

Content Tab

Option Description
Settings
  • Loop XPath : For every "Loop XPath" location we find in the XML file(s), we will output one row of data.  This is the main specification we use to flatten the XML file(s).  You can use the "Get XPath nodes" button to search for the possible repeating nodes in the XML document.  Please note that if the XML document is large that this can take a while.
  • Encoding : the XML filename encoding in case none is specified in the XML documents. (yes, those still exist)
  • Namespace aware : check this to make the XML document namespace aware.
  • Ignore comments : Ignore all comments in the XML document while parsing.
  • Validate XML : Validate the XML prior to parsing. Use a token when you want to replace dynamically in a Xpath field value. A token is between @_ and - (@_fieldname-). Please see the Example 1 to see how it works.
  • Use token : a token is not related tro XML parsing but to PDI.
  • Igore empty file : an empty file is not a valid XML document.  Check this if you want to ignore those altogether.
  • Do not raise an error if no file: Don't raise a stink if no files are found.
  • Limit : Limits the number of rows to this number (zero (0) means all rows).
  • Prune path to handle large files: almost the same value as the "Loop XPath" property with some exceptions, see Get Data from XML - Handling Large Files for more details. Note that you can use this parameter to avoid multiple HTTP URL requests.
Additional fields
  • Include filename in output? : Allows you to specify a field name to include the file name (String) in the output of this step.
  • Rownum in output? : Allows you to specify a field name to include the row number (Integer) in the output of this step.
Add to result filename
  • Add files to result filename : Adds the XML filenames read to the result of this transformation.  A unique list is being kept in memory that can be used in the next job entry in a job, for example in another transformation.

Fields

Option Description
Name The name of the output field
XPath The path to the element node or attribute to read
Element The element type to read: Node or Attribute
Type
The data type to convert to
Format The format or conversion mask to use in the data type conversion
Length The length of the output data type
Precision The precision of the output data type
Currency The currency symbol to use during data type conversion
Decimal The numeric decimal symbol to use during data type conversion
Group The numeric grouping symbol to use during data type conversion
Trim type The type of trimming to use during data type conversion
Repeat Repeat the column value of the previous row if the column value is empty (null)

Example 1 : basic example

This sample is also available in the distribution :

samples/transformations/GetXMLData - Basic reading flat XML.ktr
Data read:
<Level1>
 <Level2>
   <Props>
    <ObjectID>AAAAA</ObjectID>
    <SAPIDENT>31-8200</SAPIDENT>
    <Quantity>1</Quantity>
    <Merkmalname>TX_B</Merkmalname>
    <Merkmalswert> 600</Merkmalswert>
   </Props>
   <Props>
    <ObjectID>AAAAA</ObjectID>
    <SAPIDENT>31-8200</SAPIDENT>
    <Quantity>3</Quantity>
    <Merkmalname>TX_B</Merkmalname>
    <Merkmalswert> 900</Merkmalswert>
   </Props>
 </Level2></Level1>
Options set:
  • Loop XPath :  Level1/Level2/Props
  • Encoding : UTF-8 (non supplied on purpose)
  • Field XPaths:
    • ObjectID
    • SAPIDENT
    • Quantity
    • Merkmalname
    • Merkmalswert
Screenshots:
File

Content

Example 2 :reading from parent nodes

This sample is also available in the distribution :

samples/transformations/GetXMLData - Read parent children rows.ktr 
 Data read:
<database>
	<item id="111" clientName="Sven Boden">
		<detail child_id="1">
		    <name>Bisoprolol</name>
		    <amount>2</amount>
		</detail>
		<detail child_id="2">
		    <name>Aspirine</name>
		    <amount>20</amount>
		</detail>
	</item>
	<item id="112" clientName="Gusia Jakobczyk">
		<detail child_id="1">
		    <name>Libitor</name>
		    <amount>1</amount>
		</detail>
	</item>
</database>
Options set:

In this case, we want to read data from looping node "database/item/detail".  However, we also want information from parent node "database/item", more to the point, the attributes there, for example the id.

  • Loop XPath: database/item/detail
  • Encoding : UTF-8 (non supplied on purpose)
  • Field XPaths:
    • ../@id : reads IDs 111, 112
    • @child_id : reads IDs 1, 2 and 1
    • name
    • amount

Note : @ is a shortcut to specify that you want an attribute.

Example 3 :how to use tokens

Let's consider the sample transformation :

samples/transformations/GetXMLData - Differents Options.ktr 

In this sample, the first transformation
uses tokens to parse the following XML file structure :

<Session_header Code="xmltest">
   <InfoSession>
      <InfoSessionData User="Username1">
         <Data>Data1</Data>
      </InfoSessionData>
   </InfoSession>
   <InfoSession>
      <InfoSessionData User="Username2">
         <Data>Data2</Data>
      </InfoSessionData>
   </InfoSession>
   <InfoSession>
      <InfoSessionData User="Username3">
         <Data>Data3</Data>
      </InfoSessionData>
   </InfoSession>
   <Session>
      <SessionData>
         <Parameter>
            <User>Username1</User>
            <Password>password123</Password>
         </Parameter>
         <OutputData>
            <Step>
               <Error_Message>Please have transaction ID and related information ready. Timestamp : 20070731 01:50:06 Session : 0201</Error_Message>
            </Step>
         </OutputData>
      </SessionData>
      <SessionData>
         <Parameter>
            <User>Username2</User>
            <Password>password345</Password>
         </Parameter>
         <OutputData>
            <Step>
               <Error_Message>error occurred Timestamp : 20070731 01:53:25</Error_Message>
            </Step>
         </OutputData>
      </SessionData>
      <SessionData>
         <Parameter>
            <User>Username3</User>
            <Password>password567</Password>
         </Parameter>
         <OutputData>
            <Step>
               <Error_Message>Transaction Id :361163328</Error_Message>
               <Status>Processed according to contract/plan provisions</Status>
            </Step>
         </OutputData>
      </SessionData>
   </Session>
</Session_header>

We want to extract all SessionData, so the Loop Xpath will be Session_header/Session/SessionData. We want to get the followings fields : Code, User, Password, Status, Error_Message, Data. Thanks to Xpath, it's easy the return the five first fields. In fact, the last field (Data) is out of the Loop Xpath (InfoSession) and depends of the User. So first you have to go in InfoSession (../../InfoSession) and return Data for each User (<Data></Data>). 

As we have previously the value for each User (see User field), we need to return Data for each User. The syntax is : ../../InfoSession/InfoSessionData[@User=the current user value]/Data. We have to find a way to notify to PDI using the value of a field in a Xpath expression : this is what tokens are used for. To use in our case the User value in an expression; we will write @_User- and the expression for the field Data will became ../../InfoSession/InfoSessionData[@User=@_User-]/Data. To finish do not forget to check "Use tokens" option.

Note: The tokenized field have to be defined previously and at this time, you can not use more than one token per field.


  1. Feb 22, 2011

    Don Grodecki says:

    Get Data from XML can read a stream with a field that contains XML. It would be ...

    Get Data from XML can read a stream with a field that contains XML. It would be useful if it could output such a stream as well, or if some other step could do this. All that would be needed is to handle the situation where the xpath for a field resolves to an XML element rather than a text node or attribute. In this situation output the resultant XML snippet in the field. Use "string" as a field type, or to be extra fancy add an "XML" field type.

This documentation is maintained by the Pentaho community, and members are encouraged to create new pages in the appropriate spaces, or edit existing pages that need to be corrected or updated.

Please do not leave comments on Wiki pages asking for help. They will be deleted. Use the forums instead.

Adaptavist Theme Builder (4.2.0) Powered by Atlassian Confluence 3.3.3, the Enterprise Wiki