SIMD Usage in Velox¶
SIMD uses special registers in CPU to operate on multiple primitive data simultaneously. In some basic cases compiler is able to translate a tight loop into SIMD instructions for us, but often we need to call the SIMD intrinsics explicitly.
There are several places in Velox where we use SIMD explicitly to get better performance. We use xsimd as a zero-cost abstraction over the intrinsics, to address the portability issue.
Architectures¶
In Velox we support 2 families of CPU architectures regarding SIMD: X86 and ARM. In X86 there are 3 generations of SIMD technologies: SSE, AVX, and AVX512. For ARM there are NEON and SVE. Each architecture has its own size of registers, we summarize the details below:
Architecture |
Register Size (bits) |
Used in Velox |
CPU Family |
---|---|---|---|
SSE |
128 |
Yes |
x86 |
AVX |
256 |
Yes |
x86 |
AVX512 |
512 |
No |
x86 |
NEON |
128 |
Yes |
ARM |
SVE |
128 - 2048 |
No |
ARM |
xsimd Basics¶
The data structure in xsimd
to represent a SIMD register is batch<T, A>
.
T
stands for the element data type and A
stands for the architecture.
For example batch<int32_t, avx2>
represents a SIMD vector containing 32 bits
signed integers on AVX2. This type has only 1 field called data
, which is
the underlying SIMD register (e.g. for AVX it can be __m256
, __m256d
, or
__m256i
). This ensures the data structure can be optimized directly as the
register without any overhead during runtime.
When you compare 2 SIMD vectors (e.g. x == y
), there are 2 kinds of result
type, depending on the architecture. For AVX512, the comparison result is kept
as a bit mask (1 bit per element, up to 64 bits) in a normal integer. For all
other architectures, the result is kept in another SIMD register with the same
number of lanes as the operands, and each element in the result is set to
either -1 (true
) or 0 (false
). In xsimd
these 2 types are unified
to one type: batch_bool<T, A>
. T
is the element type of comparison
operands, and A
is the architecture.
xsimd
provides some functions and operators to abstract the intrinsics on
different architectures, including basic arithmetics, comparisons, bitwise
operations, mathematical functions, loading or storing from memory.
SIMD Utilities¶
There are some intrinsics that are not yet abstracted by xsimd
. We added
the ones commonly used in Velox in common/base/SimdUtil.h
.
HalfBatch¶
In xsimd
the vector size is decided uniquely by the architecture A
. In
some cases we need a different size of vector though, for example in gather, if
the data type is 64 bits and index type is 32 bits, the vector for indices needs
to be the half size of the vector for data. To accommodate such needs, we
define a type HalfBatch<T, A>
to get the corresponding vector type.
In some cases when the default vector size is 128 bits, there is no
corresponding SIMD vector of 64 bits to be used as HalfBatch
. In such cases
we define and use a type Batch64<T>
, with some methods and operators same as
batch<T, A>
, so that we can use them interchangeably.
Gather¶
Gather is an operation to load a vector from non-contiguous memory. In the
simplest form, given a base
address and a list of indices
(saved in a
SIMD vector), gather returns another vector containing all elements at
base + indices[0]
base + indices[1]
...
base + indices[n]
A variance of gather called maskGather
takes an extra vector src
and a
batch_bool
mask, only loads the data from corresponding memory address if
mask[i]
is set, otherwise uses the element in src[i]
. In other words,
the function returns dst
where
if mask[i]
dst[i] = load(base + indices[i])
else
dst[i] = src[i]
Bit Masks¶
As mentioned above, batch_bool
is used to represent the result of a
comparison, and the underlying data can be either a bit mask or a SIMD vector.
To allow us manipulate this result, we provide some utilities to convert between
batch_bool
and bit mask (toBitMask
and fromBitMask
). Once you
convert it to bit mask, you can use the normal bit manipulating operations on
it. We also provide utilities like leadingMask
and allSetBitMask
to
make it easier and faster to manipulate bits.
Filter¶
Another important function we have in SimdUtil.h
is filter
. It takes a
SIMD vector data
and a bitMask
, then for each i
where bitMask[i]
is set, we move the corresponding data[i]
to front and return the result.
This behaves very similar to std::partition
. In other words, the function
returns dst
where
j = 0
for i in 0 to n
if bitMask[i]
dst[j++] = data[i]
for i in 0 to n
if not bitMask[i]
dst[j++] = data[i]
BMI Utilities¶
In addition to SIMD abstraction and utilities, we also have some functions that
depend on BMI2 intrinsics. We define the portable version of them in
common/base/BitUtil.h
. These functions include extractBits
and
rotateLeft
. They are relatively simple and standalone comparing to SIMD,
and you can refer the documentation in the file for their usage.
Use Cases¶
Hash Table¶
In BigintValuesUsingHashTable::testValues
we use SIMD to check whether
multiple values are in the hash table at same time. In the hash table we use a
special empty marker to indicate the value is missing. The process is
following:
If all values are out of range, we can return all false.
If empty marker has been inserted into the hash table, fall back to check the values one by one.
Hash all valid values using SIMD multiplication and modulo, and then get the corresponding states in hash table using
maskGather
.If the state is empty marker, the value is missing; if the state is equal to value, the value is found. Otherwise we have an hash collision and need to look at next positions in hash table. If no collision is happening, we can return the result right away.
For each value that has collision, we use SIMD to advance multiple positions at once, until we find either value match or empty mark.
Filtering¶
A typical use case for filtering values using SIMD is in processFixedFilter
from dwio/dwrf/common/DecoderUtil.h
. This function evaluates the filter on
a batch of values, and stores the passed row numbers from this batch to
filterHits
, and the passed values to rawValues
.
The filtering on values is done using Filter::testValues
, the result is
stored in a bit mask. We then pass the bit mask to simd::filter
to store
indices and values. Finally we increase numValues
with the popcount of bit
mask.
Note when the data type is 16 bits long, we need to do the process in 2 batches
(loadIndices(0)
and loadIndices(1)
), because the indices are 32 bits
long and one SIMD vector is not large enough to contain all the indices needed.