Added by Mark Hall, last edited by Mark Hall on Dec 04, 2011  (view change)

Labels:

input input Delete
cassandra cassandra Delete
hadoop_input hadoop_input Delete
Enter labels to add to this page:
Wait Image 
Looking for a label? Just start typing.

Contents

1. Introduction

Cassandra Input is an input step for PDI >= 4.2 that allows data to be read from an Cassandra column family (table) as part of an ETL transformation. This step and its documentation are currently under development. It is part of the CE Hadoop plugin for PDI and you can experiment with it by downloading a snapshot build from our continuous integration server:

http://ci.pentaho.com/view/Data%20Integration/job/pdi-hadoop-plugin/

  • unzip pdi-hadoop-plugin-TRUNK-SNAPSHOT.zip in the "plugins" directory of your PDI installation
  • restart Spoon

The step has been developed against, and tested with, Cassandra version 1.0.0.

2. Using the Cassandra Input step

This section describes basic usage of the step - in particular how to connect to Cassandra and execute a CQL (Cassandra Query Language) query to retrieve rows from a column family (table).

2.1 Configuring Cassandra connection details

The screenshot below shows the GUI dialog for the Cassandra Input step.

The UI allows the user to provide connection details and a query in CQL (Cassandra Query Language). The Cassandra host and Cassandra port fields hold basic connection details. Below this, the Username and Password field can be used to provide authentication details if the source keyspace and/or column family (table) require it. The Keyspace field is used to provide the name of the keyspace (database) to be accessed in the query. The Use query compression check box tells the step whether or not to compress the text of the CQL query before sending it to the server.

2.2 CQL SELECT query

The large text box at the bottom of the dialog allows the user to enter a CQL SELECT statement to be executed. Only a single SELECT query is accepted by the step. 

SELECT [FIRST N] [REVERSED] <SELECT EXPR> FROM <COLUMN FAMILY> [USING <CONSISTENCY>] [WHERE <CLAUSE>] [LIMIT N];

Select queries may name columns explicitly (in a comma separated list) or use the * wildcard. If wildcard is used then only those columns (if any) defined in the meta data for the column family in question are returned. If columns are selected explicitly, then the name of each column must be enclosed in single quotation marks. Since Cassandra is a sparse column oriented database (like HBase), it is possible for rows to contain varying numbers of columns which might, or might not, be defined in the meta data for the column family. The Cassandra Input step can emit columns that are not defined in the meta data for the column family in question if they are explicitly named in the SELECT clause. Cassandra Input uses type information present in the meta data for a column family. This, at a minimum, includes a default type (column validator) for the column family. If there is explicit meta data for individual columns available, then this is used for type information, otherwise the default validator is used.

IMPORTANT: Cassandra Input does not support the CQL range notation (e.g. name1..nameN) for specifying columns in a SELECT query.

2.2.1 LIMIT

If omitted, Cassandra assumes a default limit of 10,000 rows to be returned by the query. If the query is expected to return more than 10,000 rows an explicit LIMIT clause must be added to the query.

2.2.2 FIRST N and REVERSED

FIRST N returns the first N (as determined by the column sorting strategy used for the column family in question) column values from each row. If the column family in question is sparse then this may result in a different N (or less) column values appearing from one row to the next. Since PDI deals with a constant number of fields between steps in a transformation, Cassandra rows that do not contain particular columns are output as rows with null field values for non-existent columns. Cassandra's default for FIRST (if omitted from the query) is 10,000 columns - if a query is expected to return more than 10,000 columns then an explicit FIRST must be added to the query.

The REVERSED option causes the sort order of the columns returned by Cassandra for each row to be reversed. This may affect which values result from a FIRST N option, but does not affect the order of the columns output by Cassandra Input.

2.2.3 WHERE clause

A WHERE clause may be used to filter rows that appear in the results.

SELECT ... WHERE KEY = keyname AND name1 = value1
SELECT ... WHERE KEY >= startkey and KEY =< endkey AND name1 = value1
SELECT ... WHERE KEY IN ('<key>', '<key>', '<key>', ...)

The WHERE clause provides for filtering the rows that appear in results. The clause can filter on a key name, or range of keys, and in the case of indexed columns, on column values. Key filters are specified using the KEY keyword, a relational operator, (one of =, >, >=, <, and <=), and a term value. When terms appear on both sides of a relational operator it is assumed the filter applies to an indexed column. With column index filters, the term on the left of the operator is the name, the term on the right is the value to filter on. When filtering on indexed columns, there must be at least one equality operator present. Note that using inequality operators result in ranges that are inclusive of the terms (i.e. > is the same as >=, and < is the same as <=).

2.3 Viewing column family meta data

The Show schema button at the lower right-hand side of the UI pops up a dialog that shows meta data for the column family named in the CQL SELECT query.