Thursday, May 29, 2008

Memory sizes for data types

An earlier post of mine When less data means more RAM discussed the ways in which storage ("Symbol" space) needed for field values can increase depending on how a field is loaded or manipulated. This generated some followup questions on the QlikCommunity forum about the optimal storage sizes for fields of various data types.

What's presented below is information gleaned from the documentation, QT Support and experimentation. The numbers come from the document memory statistics file. I hope someone from QT will help me correct any errors.

QV fields have both an internal and external representation. There is a video "Datatype Handling in Qlikview" available on QlikAcademy that explores this subject.This post is concerned with the internal storage of fields.

Numbers

I've found that the storage size appears to be related to the number of total digits. Storage size in bytes, for various digit ranges:

1-10 digits, size=4
11 or more digits, size=13

The above sizes assume that the internal storage format is numeric, which is usually the case if loading from a database. Numbers loaded as text such as from a text file or inline, may be stored as strings which will occupy different sizes.


Dates, Times and Timestamps

Different Database systems provide various degrees of precision in timestamps and I assume the ODBC driver is also involved with the exact value provided to QV during the load. QV times are the fractional part of a day, using up to 9 digits to the right of the decimal point.

- Best size for a Date, 4 bytes.
- Best size for a full Time, 13 bytes.
- Best size for a full Timestamp, 13 bytes.

These sizes can increase when the field is manipulated. Want to get the date portion of a timestamp? Don't use

date(aTimestamp)

date() is a formatting function, it doesn't "extract" the underlying date portion. In many cases, it actually increases storage size because the result may be a string. Instead, use

floor(aTimestamp)

this will produce a 4 byte integer result.

A common technique for reducing the memory footprint of timestamps is to separate the timestamp into two fields, integer date and fractional time. You can further reduce the number of unique time values by eliminating the hundredths of seconds, or even eliminating the seconds if your application is ok with minute precision.


Strings

Thanks to QT support for providing this detail on Strings.

"The representation is that each symbol has a pointer (4/8 bytes on 32/64-bit platform) + the actual symbol space. This space is the number of bytes (UTF-8 representation) + 2 (1 is a flag byte and 1 is a terminating 0) + 0, 4 or 8 bytes that store the numeric representation of the field."


So on the 32bit version, a non-numeric string occupies 6 bytes more than the length of the string itself. A numeric string occupies 10 more bytes. For example:

"a" uses 7 bytes
"1" uses 11 bytes


The only way to reduce the string footprint is to reduce the number of unique values. This can be done by breaking the string into component parts if that makes sense in the application. For example, the first 3 characters of a 10 character product code may be a product class. Breaking the field into ProductClass and ProductNumber fields may reduce the number of unique values.

If the strings are keys that don't need to be displayed, the autonumber() or autonumberhash128() functions can be used to transform the values to 4 byte integers. With these functions you can also get the "sequential integer optimization" which reduces the symbols space to zero.

I've found that concatenating fields in autonumber like
autonumber(f1 & f2)
can sometimes produce false duplicates. Better to instead use autonumberhash128 like
autonumberhash128(f1, f2)
This seems to always produce correct results.


Sequential Integer Optimization

For each field, QV maintains both a Symbol table -- the unique values of a field -- and a State array that tracks which values are selected. If the symbol values are consecutive integers, a very clever optimization takes place. The Symbol space is eliminated and the State array is used to represent both selection state and value. This is a very beneficial effect of using the autonumber functions.

The values need not begin at zero for the optimization to take place, they only need to be consecutive. A set of 5000 consecutive dates will occupy no Symbol space. Take one date out of the middle and the storage reverts to the standard 4 bytes for each date.

It's not always necessary to be concerned about memory usage. But when it is, I hope this information proves useful.












4 comments:

Jay said...

I imagine that the sequential integer optimization would improve speed as well. What do you think?

I have a few datasets that will benefit from having a few missing dates filled in to take advantage of this. Thanks Rob!

Rob Wunderlich said...

I'm not sure what the direct impact on speed may be.

Indirectly, using less RAM may decrease the incidence of page faults -- having to retrieve the memory contents from the page file -- which is always a relative performance killer.

Thanks for the feedback.

atlas245 said...
This comment has been removed by a blog administrator.
Kristian Wedberg said...

Just pointing out that in my experience autonumber() with string concatenation works well _if_ you insert a delimiter between values, like so:

autonumber(f1 & ';' & f2)