CompactRow¶
CompactRow is a row-wise serialization format provided by Velox as an alternative to UnsafeRow format. CompactRow is more space efficient then UnsafeRow and results in fewer bytes shuffled which has a cascading effect on CPU usage (for compression and checksumming) and memory (for buffering).
A row is a contiguous buffer that starts with null flags, followed by individual fields.
nulls | field1 | field 2 | …
Nulls section uses one bit per field to indicate which fields are null. If there are 10 fields, there will be 2 bytes of null flags (16 bits total, 10 bits used, 6 bits unused).
Fixed-width fields (integers, boolean, floating point numbers) take up a fixed number of bytes regardless of whether they are null or not. A row with 10 bigint fields takes up 2 + 10 * 8 = 82 bytes. 2 bytes for null flags + 8 bytes per field.
The sizes of fixed-width fields are:
Type |
Number of bytes used for serialization |
---|---|
BOOLEAN |
1 |
TINYINT |
1 |
SMALLINT |
2 |
INTEGER |
4 |
BIGINT |
8 |
HUGEINT |
16 |
REAL |
4 |
DOUBLE |
8 |
TIMESTAMP |
8 |
UNKNOWN |
0 |
Timestamps are serialized with microsecond precision to align with Spark’s handling of timestamps.
Strings (VARCHAR and VARBINARY) use 4 bytes for size plus the length of the string. Empty string uses 4 bytes. 1-character string uses 5 bytes. 20-character ASCII string uses 24 bytes. Null strings do not take up space (other than one bit in the nulls section).
Arrays of fixed-width values or strings, e.g. arrays of integers, use 4 bytes for the size of the array, a few bytes for nulls flags indicating null-ness of the elements (1 bit per element) plus the space taken by the elements themselves.
For example, an array of 5 integers [1, 2, 3, 4, 5] uses 4 bytes for size, 1 byte for 5 null flags and 5 * 4 bytes for 5 values. A total of 25 bytes.
Description |
Size |
Nulls |
Elem 1 |
Elem 2 |
Elem 3 |
Elem 4 |
Elem 5 |
---|---|---|---|---|---|---|---|
# of bytes |
4 |
1 |
4 |
4 |
4 |
4 |
4 |
Value |
5 |
00000000 |
1 |
2 |
3 |
4 |
5 |
An array of 4 strings [null, “Abc”, null, “Mountains and rivers”] uses 36 bytes:
Description |
Size |
Nulls |
Size s2 |
s2 |
Size s4 |
s4 |
---|---|---|---|---|---|---|
# of bytes |
4 |
1 |
4 |
3 |
4 |
20 |
Value |
4 |
10100000 |
1 |
Abc |
20 |
Mountains and rivers |
Serialization of an array of complex type elements, e.g. an array of arrays, maps or structs, includes a few additional fields: the total serialized size plus offset of each element in the serialized buffer.
4 bytes - array size.
N bytes - null flags, 1 bit per element.
4 bytes - Total serialized size of the array excluding first 2 fields (size and nulls).
4 bytes per element - Offsets of the elements in the serialized buffer relative to the position right after the total serialized size.
Elements.
For example, an array of integers [[1, 2, 3], [4, 5], [6]] uses N bytes:
4 bytes - size - 3
1 byte - nulls - 00000000
4 bytes - total serialized size - 55
4 bytes - offset of the 1st element - 12
4 bytes - offset of the 2nd element - 29
4 bytes - offset of the 3rd element - 42
—– Start of the 1st element: [1, 2, 3]
4 bytes - size - 3
1 byte - nulls - 00000000
4 bytes - element 1 - 1
4 bytes - element 2 - 2
4 bytes - element 3 - 3
—– Start of the 2nd element: [4, 5]
4 bytes - size - 2
1 byte - nulls - 00000000
4 bytes - element 1 - 4
4 bytes - element 2 - 5
—– Start of the 2nd element: [6]
4 bytes - size - 1
1 byte - nulls - 00000000
4 bytes - element 1 - 6
A map is serialized as the keys array followed by the values array.
A struct is serialized the same as the top-level row.
Compared to UnsafeRow, on average CompactRow serialization is about twice shorter. Some examples are:
Type |
UnsafeRow |
CompactRow |
---|---|---|
INTEGER |
8 |
4 |
BIGINT |
8 |
8 |
REAL |
8 |
4 |
DOUBLE |
8 |
8 |
VARCHAR: “” (empty) |
8 |
4 |
VARCHAR: “Abc” |
16 |
7 |