Monday, May 19, 2008

When less data means more RAM

I attended Niklas Boman's excellent Performance Tuning talk at Qonnections in Miami. One of his tuning recommendations was to reduce the number of rows and columns when possible. This will probably always have a positive impact on chart calculation time, but if done incorrectly, reducing the quantity of data can have an adverse impact on RAM usage.

Consider a QVD file with one million rows. The QVD was loaded from a database and contains two fields:

aNum – unique integers, 1M unique values.
aDate – dates distributed equally throughout 2000-2003, 1,460 unique values.


QV stores each of these values as integers, occupying 4 bytes of RAM each. Nice and compact.

Which of the following statements will create a QVW that uses more RAM? Statement A, which loads 1000K rows or Statement B, which loads only 750K rows?

Statement A:// Load all 1,000,000 rows
LOAD * FROM qvdData.qvd (qvd);


Statement B:
// LOAD only 2001+ which should be 750,000 rows
LOAD * FROM qvdData.qvd (qvd)
WHERE year(aDate) > 2000;

Pat yourself on the back if you answered “B”. B will use more RAM! More RAM for less data? Why? Because “B” causes an unoptimized load which results in QV converting the Integer representations of the data to String representation.

QV can load QVDs in one of two modes – Optimized or Unoptimized (more in the Ref Guide). In an optimized load, the RAM image from the QVD is loaded directly into memory. An optimized load is indicated in the Loading message in the progress window. (Note to development: would be nice if the optimized message appeared in the log as well).

In unoptimized mode, the QVD image is “unwrapped” and the data processed discretly. This causes the internal formatting to be lost and the data is stored internally as Mixed. So each “aNum” that previously occupied 4 bytes, now takes 9 bytes. “aDate” now averages 18.96 bytes each.

It's the WHERE clause that forces the unoptimized load. Generally, adding fields or anything that causes a field value to be examined will force an unoptimized load. Examples of unoptimzed loads:

LOAD *, year(aDate) as Year FROM qvdData.qvd (qvd) ;
LOAD *, rowno() as rowid FROM qvdData.qvd (qvd)


Even a WHERE clause that does not reference any field will be unoptimized:
LOAD * FROM qvdData.qvd (qvd)
WHERE 1=1;


How can you tell how much RAM a field is using? “Document Settings, General, Memory Statistics” button will generate a .mem text file that contains a storage size for the “Symbols” (values) of each field. You can view the .mem file directly or load it into a QVW for processing. The 8.5 beta provides a “Qlikview Optimizer.qvw” for just this purpose. I've uploaded this file to the “Share Qlikviews” section of QlikCommunity if you don't have 8.5.

WorkaroundsI've found that I can usually “fix” the field by setting the desired format in the Document Properties and checking the “Survive Reload” box. You can also apply formats in the load script, but I find this tedious if I have more than a few fields. Here are some alternative workarounds.

To create additional fields, use a RIGHT JOIN after the optimized load.
Instead of:
LOAD *, year(aDate) as Year FROM qvdData.qvd (qvd);

Use:
tdata:
LOAD * FROM qvdData.qvd (qvd);
RIGHT JOIN LOAD DISTINCT *, year(aDate) as Year
RESIDENT tdata;


For a subset selection, version 8 allows an optimized load using where exists() if the exists clause refers to only a single field. This means you'll have to generate the desired values before the load using the same field name. Something like this:

//Generate table of the dates we want 2001-2004

LET vStartDate=num(MakeDate(2001,1,1)-1);
LET vEndDate=num(MakeDate(2004,12,31));
DateMaster:
LOAD date($(vStartDate) + IterNo()) as aDate
AUTOGENERATE 1
WHILE $(vStartDate) + IterNo() <= $(vEndDate);

// Optimized load of the subset dates
tdata:
LOAD * FROM qvdData.qvd (qvd) WHERE exists(aDate);
DROP TABLE DateMaster; // No longer needed

In some cases, the above example will give you an additional optimization. Something I call the “sequential integer optimization” which I'll discuss on another day.

Worrying about RAM is not always necessary and many times is not worth the effort, especially if it makes your script harder to follow. However, for large datasets, particularly in the 32bit environment, you may be forced to optimize RAM usage. Using the mem files allows you to identify the most productive candidates for tuning.

The QV Reference Guide points out that an optimized load will run faster than an unoptimized load. I think it would be useful to have brief discussion of the impact on RAM usage as well.

3 comments:

Jay said...

Great post, Rob! That trick with EXISTS() will come in handy.

Anonymous said...

Very informative.
I had a different problem now after generating memory statistics and am actually confused.

We have a massive massive data set. Generating the report - I saw sum of bytes comes up to 7GB (whole model - all tables, records, variables everything)

However, QV instance in memory is consuming 17 GB. Where is this additional memory going to.

I cannot post my office email id here - but I will reply to your post in qliktech forums in reply to your post on topic
"Incremental load performance"

thanks,
AD

Rob Wunderlich said...

AD,

I haven't had experience with using mem stats on something quite that large. One thing that stands out in your comment is the 7GB vs 17GB. Is is possible that the mem files or value shown in the optimizer are truncating a digit?