Friday, January 14, 2011

Incremental Load using SQL Server "timestamp" Data Type

Incremental Load (extracting only new or changed rows from the database) requires a table column that identifies when a row has been updated. This is usually a datetime column like "LastUpdate". The script extracts the max timestamp from the existing QVD and uses it to create a select predicate like:

WHERE LastUpdate >= '01-20-2010 13:55:01'

If the database is Microsoft SQL Server, you may have another option for identifying updated rows. A table column of type "timestamp" is incremented whenever a row is changed. The name "timestamp" in somewhat confusing, as this column does not contain time values. Here is the description from the 2005 Reference:

"timestamp is a data type that exposes automatically generated, unique binary numbers within a database. timestamp is generally used as a mechanism for version-stamping table rows. The storage size is 8 bytes. The timestamp data type is just an incrementing number and does not preserve a date or a time. To record a date or time, use a datetime data type."

In Sql  Server  2008, the type "rowversion" was introduced as an alias for "timestamp". Rowversion is preferred for 2008 and forward.

If your table does not already have a timestamp column, you can add one with an ALTER DDL statement, For example, to add a timestamp column named "Revision" to a table named "Orders":

alter table Orders ADD [Revision] [timestamp] NOT NULL;

The Revision column will be returned by default as a hexadecimal string, so it's easiest to convert to an integer before storing in QV. 

    OrderID, etc, 
    cast(Revision as bigint) as Revision
FROM Orders ...

In subsequent incremental loads, add a predicate to select only rows greater than the last revision. 

// Select the max value from the qvd
LOAD max(Revision) as Revision FROM myqvd.qvd (qvd);
LET vMaxRevision = PEEK('Revision');
DROP TABLE tempmax:


  OrderID, etc,
  cast(Revision as bigint) as Revision
FROM Orders
WHERE Revision > $(vMaxRevision);

I find the timestamp value, when available, to be easier to use than a datetime column. It's just a numeric, so no literal formatting is required. 

Because it's a precise and unique value, you avoid the "always one row" problem. When selecting from a datetime, you usually have to specify ">=" because a datetime is not a unique value. This means that a select will return at least one row, even if there were no real updates. 

With timestamp, you can specfy ">" which will return zero rows if there were no updates. Knowing that you had no updates can be useful in controlling the execution of further processing steps.