=================== SIMD Usage in Velox =================== SIMD uses special registers in CPU to operate on multiple primitive data simultaneously. In some basic cases compiler is able to translate a tight loop into SIMD instructions for us, but often we need to call the SIMD intrinsics explicitly. There are several places in Velox where we use SIMD explicitly to get better performance. We use `xsimd`_ as a zero-cost abstraction over the intrinsics, to address the portability issue. .. _xsimd: https://github.com/xtensor-stack/xsimd Architectures ------------- In Velox we support 2 families of CPU architectures regarding SIMD: X86 and ARM. In X86 there are 3 generations of SIMD technologies: SSE, AVX, and AVX512. For ARM there are NEON and SVE. Each architecture has its own size of registers, we summarize the details below: ============ ==================== ============= ========== Architecture Register Size (bits) Used in Velox CPU Family ============ ==================== ============= ========== SSE 128 Yes x86 AVX 256 Yes x86 AVX512 512 No x86 NEON 128 Yes ARM SVE 128 - 2048 No ARM ============ ==================== ============= ========== xsimd Basics ------------ The data structure in ``xsimd`` to represent a SIMD register is ``batch``. ``T`` stands for the element data type and ``A`` stands for the architecture. For example ``batch`` represents a SIMD vector containing 32 bits signed integers on AVX2. This type has only 1 field called ``data``, which is the underlying SIMD register (e.g. for AVX it can be ``__m256``, ``__m256d``, or ``__m256i``). This ensures the data structure can be optimized directly as the register without any overhead during runtime. When you compare 2 SIMD vectors (e.g. ``x == y``), there are 2 kinds of result type, depending on the architecture. For AVX512, the comparison result is kept as a bit mask (1 bit per element, up to 64 bits) in a normal integer. For all other architectures, the result is kept in another SIMD register with the same number of lanes as the operands, and each element in the result is set to either -1 (``true``) or 0 (``false``). In ``xsimd`` these 2 types are unified to one type: ``batch_bool``. ``T`` is the element type of comparison operands, and ``A`` is the architecture. ``xsimd`` provides some functions and operators to abstract the intrinsics on different architectures, including basic arithmetics, comparisons, bitwise operations, mathematical functions, loading or storing from memory. SIMD Utilities -------------- There are some intrinsics that are not yet abstracted by ``xsimd``. We added the ones commonly used in Velox in ``common/base/SimdUtil.h``. HalfBatch ~~~~~~~~~ In ``xsimd`` the vector size is decided uniquely by the architecture ``A``. In some cases we need a different size of vector though, for example in gather, if the data type is 64 bits and index type is 32 bits, the vector for indices needs to be the half size of the vector for data. To accommodate such needs, we define a type ``HalfBatch`` to get the corresponding vector type. In some cases when the default vector size is 128 bits, there is no corresponding SIMD vector of 64 bits to be used as ``HalfBatch``. In such cases we define and use a type ``Batch64``, with some methods and operators same as ``batch``, so that we can use them interchangeably. Gather ~~~~~~ Gather is an operation to load a vector from non-contiguous memory. In the simplest form, given a ``base`` address and a list of ``indices`` (saved in a SIMD vector), gather returns another vector containing all elements at :: base + indices[0] base + indices[1] ... base + indices[n] A variance of gather called ``maskGather`` takes an extra vector ``src`` and a ``batch_bool`` mask, only loads the data from corresponding memory address if ``mask[i]`` is set, otherwise uses the element in ``src[i]``. In other words, the function returns ``dst`` where :: if mask[i] dst[i] = load(base + indices[i]) else dst[i] = src[i] Bit Masks ~~~~~~~~~ As mentioned above, ``batch_bool`` is used to represent the result of a comparison, and the underlying data can be either a bit mask or a SIMD vector. To allow us manipulate this result, we provide some utilities to convert between ``batch_bool`` and bit mask (``toBitMask`` and ``fromBitMask``). Once you convert it to bit mask, you can use the normal bit manipulating operations on it. We also provide utilities like ``leadingMask`` and ``allSetBitMask`` to make it easier and faster to manipulate bits. Filter ~~~~~~ Another important function we have in ``SimdUtil.h`` is ``filter``. It takes a SIMD vector ``data`` and a ``bitMask``, then for each ``i`` where ``bitMask[i]`` is set, we move the corresponding ``data[i]`` to front and return the result. This behaves very similar to ``std::partition``. In other words, the function returns ``dst`` where :: j = 0 for i in 0 to n if bitMask[i] dst[j++] = data[i] for i in 0 to n if not bitMask[i] dst[j++] = data[i] BMI Utilities ------------- In addition to SIMD abstraction and utilities, we also have some functions that depend on BMI2 intrinsics. We define the portable version of them in ``common/base/BitUtil.h``. These functions include ``extractBits`` and ``rotateLeft``. They are relatively simple and standalone comparing to SIMD, and you can refer the documentation in the file for their usage. Use Cases --------- Hash Table ~~~~~~~~~~ In ``BigintValuesUsingHashTable::testValues`` we use SIMD to check whether multiple values are in the hash table at same time. In the hash table we use a special empty marker to indicate the value is missing. The process is following: 1. If all values are out of range, we can return all false. 2. If empty marker has been inserted into the hash table, fall back to check the values one by one. 3. Hash all valid values using SIMD multiplication and modulo, and then get the corresponding states in hash table using ``maskGather``. 4. If the state is empty marker, the value is missing; if the state is equal to value, the value is found. Otherwise we have an hash collision and need to look at next positions in hash table. If no collision is happening, we can return the result right away. 5. For each value that has collision, we use SIMD to advance multiple positions at once, until we find either value match or empty mark. Filtering ~~~~~~~~~ A typical use case for filtering values using SIMD is in ``processFixedFilter`` from ``dwio/dwrf/common/DecoderUtil.h``. This function evaluates the filter on a batch of values, and stores the passed row numbers from this batch to ``filterHits``, and the passed values to ``rawValues``. The filtering on values is done using ``Filter::testValues``, the result is stored in a bit mask. We then pass the bit mask to ``simd::filter`` to store indices and values. Finally we increase ``numValues`` with the popcount of bit mask. Note when the data type is 16 bits long, we need to do the process in 2 batches (``loadIndices(0)`` and ``loadIndices(1)``), because the indices are 32 bits long and one SIME vector is not large enough to contain all the indices needed.