=============================
How to add a scalar function?
=============================

Simple Functions
----------------

This document describes the main concepts, features, and examples of the simple
function API in Velox. For more real-world API usage examples, check
**velox/example/SimpleFunctions.cpp**.

A simple scalar function, e.g. a :doc:`mathematical function `, can be added by
wrapping a C++ function in a templated class. For example, a ceil function can
be implemented as:

.. code-block:: c++

    template <typename TExec>
    struct CeilFunction {
      VELOX_DEFINE_FUNCTION_TYPES(TExec);

      template <typename T>
      FOLLY_ALWAYS_INLINE void call(T& result, const T& a) {
        result = std::ceil(a);
      }
    };

All simple function classes need to be templated and must provide a "call"
method (or one of the variations described below). The top-level template
parameter provides the type system adapter, which allows developers to use
non-primitive types such as strings, arrays, maps, and structs (check below for
examples). Although the top-level template parameter is not used for functions
operating on primitive types, such as the one in the example above, it still
needs to be specified.

The call method itself can also be templated or overloaded to allow the
function to be called on different input types, e.g. float and double. Note
that template instantiation will only happen during function registration,
described in the "Registration" section below.

Do not use legacy VELOX_UDF_BEGIN and VELOX_UDF_END macros.

The "call" function (or one of its variations) may return (a) void, indicating
that the function never returns null values, or (b) boolean, indicating whether
the result of the computation is null. The meaning of the returned boolean is
"result was set", i.e. true means a non-null result was populated and false
means the result is null. If "ceil(0)" were to return a null, the function
could be re-written as follows:

.. code-block:: c++

    template <typename TExec>
    struct NullableCeilFunction {
      VELOX_DEFINE_FUNCTION_TYPES(TExec);

      template <typename T>
      FOLLY_ALWAYS_INLINE bool call(T& result, const T& a) {
        result = std::ceil(a);
        return a != 0; // Return NULL if input is zero.
      }
    };

The argument list must start with an output parameter “result” followed by the
function arguments. The “result” argument must be a reference. Function
arguments must be const references. The C++ types of the function arguments and
the result argument must match :doc:`Velox types`.

==========  ==============================  =============================
Velox Type  C++ Argument Type               C++ Result Type
==========  ==============================  =============================
BOOLEAN     arg_type<bool>                  out_type<bool>
TINYINT     arg_type<int8_t>                out_type<int8_t>
SMALLINT    arg_type<int16_t>               out_type<int16_t>
INTEGER     arg_type<int32_t>               out_type<int32_t>
BIGINT      arg_type<int64_t>               out_type<int64_t>
REAL        arg_type<float>                 out_type<float>
DOUBLE      arg_type<double>                out_type<double>
TIMESTAMP   arg_type<Timestamp>             out_type<Timestamp>
DATE        arg_type<Date>                  out_type<Date>
VARCHAR     arg_type<Varchar>               out_type<Varchar>
VARBINARY   arg_type<Varbinary>             out_type<Varbinary>
ARRAY       arg_type<Array<E>>              out_type<Array<E>>
MAP         arg_type<Map<K, V>>             out_type<Map<K, V>>
ROW         arg_type<Row<T1, T2, T3,...>>   out_type<Row<T1, T2, T3,...>>
==========  ==============================  =============================

The arg_type and out_type templates are defined by the
VELOX_DEFINE_FUNCTION_TYPES(TExec) macro in the struct definition. For
primitive types, arg_type<T> is the same as out_type<T> and the same as T. This
holds for boolean, integers, floating point types and timestamp. For DATE,
arg_type<Date> is the same as out_type<Date> and is defined as int32_t.

A signature of a function that takes an integer and a double and returns a
double would look like this:

.. code-block:: c++

    void call(arg_type<double>& result, const arg_type<int32_t>& a, const arg_type<double>& b)

This is equivalent to:

.. code-block:: c++

    void call(double& result, const int32_t& a, const double& b)

For strings, arg_type<Varchar> is defined as StringView, while
out_type<Varchar> is defined as StringWriter.

arg_type and out_type for Varchar, Array, Map and Row provide interfaces
similar to std::string, std::vector, std::unordered_map and std::tuple.
The underlying implementations are optimized to read and write from and to the
columnar representation without extra copying. More explanation and the APIs of
the arg_type and out_type for string and complex types can be found in
:doc:`view-and-writer-types`.

Note: Do not pay too much attention to complex type mappings at the moment.
They are included here for completeness.

Null Behavior
^^^^^^^^^^^^^

Most functions have default null behavior, i.e. a null value in any of the
arguments produces a null result. The expression evaluation engine
automatically produces nulls for such inputs, eliding a call to the actual
function. If a given function has a different behavior for null inputs, it must
define a “callNullable” function instead of a “call” function. Here is an
artificial example of a ceil function that returns 0 for null input:

.. code-block:: c++

    template <typename TExec>
    struct CeilFunction {
      template <typename T>
      FOLLY_ALWAYS_INLINE void callNullable(T& result, const T* a) {
        // Return 0 if input is null.
        if (a) {
          result = std::ceil(*a);
        } else {
          result = 0;
        }
      }
    };

Notice that the callNullable function takes its arguments as raw pointers
rather than references so that null values can be passed in. callNullable() can
also return void to indicate that the function does not produce null values.

Null-Free Fast Path
*******************

A "callNullFree" function may be implemented in place of or alongside "call"
and/or "callNullable" functions. When only the "callNullFree" function is
implemented, evaluation of the function will be skipped and null will
automatically be produced if any of the input arguments are null (like default
null behavior) or if any of the input arguments are of a complex type and
contain null anywhere in their value, e.g. an array that has a null element.
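
The skip-on-any-null rule for a function that implements only "callNullFree"
can be modeled outside of Velox. The following is a minimal sketch under stated
assumptions, not the engine's implementation: ``ToyArrayMin``,
``NullableArray`` and ``evalNullFree`` are hypothetical names, nulls are
modeled with std::optional, and the input is copied into a plain vector purely
for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Toy stand-in (hypothetical) for a function that implements only
// callNullFree: it receives plain values and never checks for nulls.
struct ToyArrayMin {
  bool callNullFree(int32_t& out, const std::vector<int32_t>& array) const {
    if (array.empty()) {
      return false; // No minimum exists; report a null result.
    }
    out = *std::min_element(array.begin(), array.end());
    return true;
  }
};

// A nullable array of nullable elements, modeled with std::optional.
using NullableArray = std::optional<std::vector<std::optional<int32_t>>>;

// Hypothetical per-row driver: produces null without calling the function
// if the array itself is null or contains a null element.
template <typename F>
std::optional<int32_t> evalNullFree(const F& fn, const NullableArray& input) {
  if (!input.has_value()) {
    return std::nullopt; // Top-level null.
  }
  // Copying is for illustration only; a real engine would avoid it.
  std::vector<int32_t> values;
  values.reserve(input->size());
  for (const auto& element : *input) {
    if (!element.has_value()) {
      return std::nullopt; // Nested null: skip the function entirely.
    }
    values.push_back(*element);
  }
  int32_t out = 0;
  if (!fn.callNullFree(out, values)) {
    return std::nullopt; // The function itself reported a null result.
  }
  return out;
}
```

Calling ``evalNullFree(ToyArrayMin{}, ...)`` on an array with no nulls reaches
``callNullFree``, while a null array or an array with a null element yields
std::nullopt without the function ever being invoked.
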
If "callNullFree" is implemented alongside "call" and/or "callNullable", an
O(N * D) check is applied to the batch to see if any of the input arguments may
be or contain null, where N is the number of input arguments and D is the depth
of nesting in complex types. Only if it can definitively be determined that
there are no nulls will "callNullFree" be invoked. In this case, "callNullFree"
can act as a fast path by avoiding any per-row null checks.

Here is an example of an array_min function that returns the minimum value in
an array:

.. code-block:: c++

    template <typename TExec>
    struct ArrayMinFunction {
      VELOX_DEFINE_FUNCTION_TYPES(TExec);

      template <typename TInput>
      FOLLY_ALWAYS_INLINE bool callNullFree(
          TInput& out,
          const null_free_arg_type<Array<TInput>>& array) {
        // Start from the largest value TInput can hold.
        out = std::numeric_limits<TInput>::max();
        for (auto i = 0; i < array.size(); i++) {
          if (array[i] < out) {
            out = array[i];
          }
        }
        return true;
      }
    };

Notice that we can access the elements of "array" without checking their
nullity in "callNullFree". Also notice that we wrap the input type in the
null_free_arg_type<...> template instead of the arg_type<...> template. This is
required because, in "callNullFree" functions, complex-type inputs use a
different type that does not wrap values in an std::optional-like interface
upon access.

Determinism
^^^^^^^^^^^

By default simple functions are assumed to be deterministic, i.e. given the
same inputs they always produce the same results. If this is not the case, the
function must define a static constexpr bool is_deterministic member:

.. code-block:: c++

    static constexpr bool is_deterministic = false;

An example of such a function is rand():

.. code-block:: c++

    template <typename TExec>
    struct RandFunction {
      static constexpr bool is_deterministic = false;

      FOLLY_ALWAYS_INLINE bool call(double& result) {
        result = folly::Random::randDouble01();
        return true;
      }
    };

All-ASCII Fast Path
^^^^^^^^^^^^^^^^^^^

Functions that process string inputs must work correctly for UTF-8 inputs.
However, these functions often can be implemented more efficiently if the input
is known to contain only ASCII characters. Such functions can provide a “call”
method to process UTF-8 strings and a “callAscii” method to process ASCII-only
strings. The engine will check the input strings and invoke the “callAscii”
method if the input is all ASCII or “call” if the input may contain multi-byte
characters.

In addition, most functions that take string inputs and produce a string output
have so-called default ASCII behavior, i.e. all-ASCII input guarantees
all-ASCII output. If that’s the case, the function can indicate so by defining
the is_default_ascii_behavior member variable and initializing it to true. The
engine will automatically mark the result strings as all-ASCII. When these
strings are passed as input to some other function, the engine won’t need to
scan the strings to determine whether they are ASCII or not.

Here is an example of a trim function:

.. code-block:: c++

    template <typename TExec>
    struct TrimFunction {
      VELOX_DEFINE_FUNCTION_TYPES(TExec);

      // ASCII input always produces ASCII result.
      static constexpr bool is_default_ascii_behavior = true;

      // Properly handles multi-byte characters.
      FOLLY_ALWAYS_INLINE bool call(
          out_type<Varchar>& result,
          const arg_type<Varchar>& input) {
        stringImpl::trimUnicodeWhiteSpace(result, input);
        return true;
      }

      // Assumes input is all ASCII.
      FOLLY_ALWAYS_INLINE bool callAscii(
          out_type<Varchar>& result,
          const arg_type<Varchar>& input) {
        stringImpl::trimAsciiWhiteSpace(result, input);
        return true;
      }
    };

Zero-copy String Result
^^^^^^^^^^^^^^^^^^^^^^^

Functions like :func:`substr` and :func:`trim` can produce zero-copy results by
referencing input strings. To do that they must define a
reuse_strings_from_arg member variable and initialize it to the index of the
argument whose strings are being re-used in the result. This will allow the
engine to add a reference to input string buffers to the result vector and
ensure that these buffers will not go away prematurely.
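
The idea of a result that references the input instead of copying it can be
illustrated outside of Velox with std::string_view. This is only a sketch of
the concept: ``trimSpacesNoCopy`` is a hypothetical helper that handles ASCII
spaces only, and it does not use the Velox string types or the
reuse_strings_from_arg mechanism.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper: trims ASCII spaces and returns a view into the input
// buffer instead of allocating a new string. The caller must keep the input
// buffer alive for as long as the returned view is used.
inline std::string_view trimSpacesNoCopy(std::string_view input) {
  const auto first = input.find_first_not_of(' ');
  if (first == std::string_view::npos) {
    return std::string_view(); // Empty or all spaces: empty result.
  }
  const auto last = input.find_last_not_of(' ');
  return input.substr(first, last - first + 1);
}
```

The returned view points into the caller's buffer, so no characters are copied.
The lifetime requirement noted in the comment is the same concern that the
engine addresses by holding a reference to the input string buffers.
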
The output types can be scalar strings (varchar and varbinary), but also
complex types containing strings, such as arrays, maps, and rows. The setNoCopy
method of the out_type template can be used to set the result to a string in
the input argument without copying. The setEmpty method can be used to set the
result to an empty string.

.. code-block:: c++

    // Results refer to strings in the first argument.
    static constexpr int32_t reuse_strings_from_arg = 0;

Here is an example of a zero-copy function:

.. code-block:: c++

    template <typename TExec>
    struct TrimFunction {
      VELOX_DEFINE_FUNCTION_TYPES(TExec);

      // Results refer to strings in the first argument.
      static constexpr int32_t reuse_strings_from_arg = 0;

      FOLLY_ALWAYS_INLINE void call(
          out_type<Varchar>& result,
          const arg_type<Varchar>& input) {
        if (input.size() == 0) {
          result.setEmpty();
          return;
        }
        result.setNoCopy(stringImpl::trimUnicodeWhiteSpace(input));
      }
    };

Access to Session Properties and Constant Inputs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Some functions require access to session properties such as the session’s time
zone. Some examples are the :func:`day`, :func:`hour`, and :func:`minute`
Presto functions. Other functions could benefit from pre-processing some of the
constant inputs, e.g. compiling regular expression patterns or parsing date and
time units.

To get access to session properties and constant inputs, the function must
define an initialize method which receives the input types, a constant
reference to QueryConfig, and a constant pointer for each of the input
arguments. Constant inputs will have their values specified. Inputs which are
not constant will be passed as nullptr.

The signature of the initialize method is similar to that of the callNullable
method, with the additional leading parameters shown in the examples below: the
input types and const core::QueryConfig&. The engine calls the initialize
method once per query and thread of execution.

Here is an example of an hour function extracting the time zone from the
session properties and using it when processing inputs.

.. code-block:: c++

    template <typename TExec>
    struct HourFunction {
      VELOX_DEFINE_FUNCTION_TYPES(TExec);

      const date::time_zone* timeZone_ = nullptr;

      FOLLY_ALWAYS_INLINE void initialize(
          const std::vector<TypePtr>& inputTypes,
          const core::QueryConfig& config,
          const arg_type<Timestamp>* /*timestamp*/) {
        timeZone_ = getTimeZoneFromConfig(config);
      }

      FOLLY_ALWAYS_INLINE bool call(
          int64_t& result,
          const arg_type<Timestamp>& timestamp) {
        int64_t seconds = getSeconds(timestamp, timeZone_);
        std::tm dateTime;
        gmtime_r((const time_t*)&seconds, &dateTime);
        result = dateTime.tm_hour;
        return true;
      }
    };

Here is another example of the :func:`date_trunc` function parsing the constant
unit argument during initialize and re-using the parsed value when processing
individual rows.

.. code-block:: c++

    template <typename TExec>
    struct DateTruncFunction {
      VELOX_DEFINE_FUNCTION_TYPES(TExec);

      const date::time_zone* timeZone_ = nullptr;
      std::optional<DateTimeUnit> unit_;

      FOLLY_ALWAYS_INLINE void initialize(
          const std::vector<TypePtr>& inputTypes,
          const core::QueryConfig& config,
          const arg_type<Varchar>* unitString,
          const arg_type<Timestamp>* /*timestamp*/) {
        timeZone_ = getTimeZoneFromConfig(config);
        // Parse the unit once if it is a constant input.
        if (unitString != nullptr) {
          unit_ = fromDateTimeUnitString(*unitString);
        }
      }

      FOLLY_ALWAYS_INLINE bool call(
          out_type<Timestamp>& result,
          const arg_type<Varchar>& unitString,
          const arg_type<Timestamp>& timestamp) {
        // Fall back to per-row parsing when the unit is not constant.
        const auto unit = unit_.has_value()
            ? unit_.value()
            : fromDateTimeUnitString(unitString);
        ......
      }
    };

If the :func:`initialize` method throws, the exception will be captured and
reported as output for every single active row. If there are no active rows,
the exception will not be raised.

Registration
^^^^^^^^^^^^

Use the registerFunction template to register simple functions.

.. code-block:: c++ template