========== Anti joins ========== Anti joins can be used to efficiently implement queries with NOT IN and NOT EXISTS clauses. The semantics of these queries are different in the presence of NULLs in the outer query or subquery. NOT IN semantics are implemented by the null aware anti join. NOT EXISTS semantics are implemented by the regular anti join. Velox provides regular anti join via ``JoinType::kAnti`` and null-aware anti join via ``JoinType::kNullAwareAnti``. This article explains the differences in semantics between NOT IN and NOT EXISTS queries and discusses the implementation of these in the null aware and regular anti joins. NOTE: Presto optimizer doesn’t plan NOT IN and NOT EXISTS queries using anti joins. To compensate, query plan translation in Prestissimo detects a NOT IN pattern and `converts `_ it into Velox anti join plan node. NOTE: `Substrait `_ currently defines only one type of anti join and doesn’t specify whether the semantics are NOT IN or NOT EXISTS. This topic is in active `discussion `_. It is easiest to understand the semantic differences using examples. Consider table t: ==== ===== id value ==== ===== NULL 0 1 1 2 2 ==== ===== and table u: ==== ===== id value ==== ===== NULL 0 2 2 3 3 ==== ===== When experimenting with queries, it is convenient to use WITH clause combined with UNNEST clause to create temporary datasets. Here is how to create and use the datasets above: .. code-block:: sql WITH t as (SELECT * FROM unnest(array[null, 1, 2], array[0, 1, 2]) as _t(id, value)), u as (SELECT * FROM unnest(array[null, 2, 3], array[0, 2, 3]) as _t(id, value)) > SELECT * FROM unnest(array[null, 1, 2], array[0, 1, 2]) as _t(id, value); id | value ------+------- NULL | 0 1 | 1 2 | 2 (3 rows) > SELECT * FROM unnest(array[null, 2, 3], array[0, 2, 3]) as _t(id, value); id | value ------+------- NULL | 0 2 | 2 3 | 3 (3 rows) > WITH t as (SELECT * FROM unnest(array[null, 1, 2], array[0, 1, 2]) as _t(id, value)), u as (SELECT * FROM unnest(array[null, 2, 3], array[0, 1, 2]) as _t(id, value)) SELECT * FROM t FULL JOIN u ON t.id = u.id; id | value | id | value ------+-------+------+------- 1 | 1 | NULL | NULL 2 | 2 | 2 | 1 NULL | NULL | 3 | 2 NULL | 0 | NULL | NULL NULL | NULL | NULL | 0 (5 rows) NOT IN Semantics --------------------------- We need to consider 3 cases of NOT IN queries: * Subquery returns a row with a NULL. * Subquery returns no rows. * Subquery returns one or more rows and no row has a NULL. NOT IN with NULLs in the subquery ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Consider NOT IN query: .. code-block:: sql SELECT * FROM t WHERE t.id NOT IN (SELECT * FROM u) This query returns an empty result. .. code-block:: sql > WITH t as (SELECT * FROM unnest(array[null, 1, 2], array[0, 1, 2]) as _t(id, value)), u as (SELECT * FROM unnest(array[null, 2, 3], array[0, 2, 3]) as _t(id, value)) SELECT * from t WHERE t.id NOT IN (SELECT id FROM u); id | value ----+------- (0 rows) This is because the IN LIST contains 3 values: NULL, 2, 3. In SQL, NULL is considered an unknown value. In this case the IN LIST contains unknown value and we cannot definitively say whether any given value is in the list or not. Therefore NOT IN predicate returns a NULL (unknown), hence, the query returns no result. You can use the following queries to confirm the semantics of the NOT IN predicate. .. code-block:: sql > SELECT 1 not in (null, 2, 3); _col0 ------- NULL (1 row) > SELECT null not in (null, 2, 3); _col0 ------- NULL (1 row) NOT IN without NULLs in the subquery ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Now, consider NOT IN query where subquery doesn’t return NULLs (by removing NULLs from the u table or adding u.id IS NOT NULL predicate to the subquery). .. code-block:: sql SELECT * FROM t WHERE t.id NOT IN ( SELECT * FROM u WHERE u.id IS NOT NULL ) This query returns a single row with id 1. .. code-block:: sql > WITH t as (SELECT * FROM unnest(array[null, 1, 2], array[0, 1, 2]) as _t(id, value)), u as (SELECT * FROM unnest(array[2, 3], array[1, 2]) as _t(id, value)) SELECT * from t WHERE t.id NOT IN (SELECT id FROM u); id | value ----+------- 1 | 1 (1 row) In this case the IN LIST contains 2 values: 2 and 3. NULL NOT IN (2, 3) returns NULL because we cannot tell definitively whether an unknown value is part of a set or not, hence, not included in the results. 1 NOT IN (2, 3) returns true, hence, included in the results. 2 NOT IN (2, 3) returns false, hence, not included in the results. NOT IN with empty subquery ~~~~~~~~~~~~~~~~~~~~~~~~~~ Now, consider a NOT IN query with a subquery that returns empty results (by removing all rows from the u table or adding an always false predicate to the subquery). .. code-block:: sql SELECT * FROM t WHERE t.id NOT IN ( SELECT * FROM u WHERE u.id < 0 ) This query returns all rows from t, including the row with NULL id. .. code-block:: sql > WITH t as (SELECT * FROM unnest(array[null, 1, 2], array[0, 1, 2]) as _t(id, value)), u as (SELECT * FROM unnest(array[], array[]) as _t(id, value)) SELECT * from t WHERE t.id NOT IN (SELECT id FROM u); id | value ------+------- 1 | 1 2 | 2 NULL | 0 (3 rows) Here, the IN LIST is empty. Hence, all values, including unknown value (NULL), can be determined to be not part of that set. NOT EXISTS Semantics ------------------------------- Similar to NOT IN queries, we consider 3 cases: * Subquery returns a row with a NULL. * Subquery returns no rows. * Subquery returns one or more rows and no row has a NULL. NOT EXISTS with NULLs in the subquery ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Consider NOT EXISTS query: .. code-block:: sql SELECT * FROM t WHERE NOT EXISTS (SELECT id FROM u WHERE u.id = t.id) This query returns 2 rows with IDs NULL and 1. .. code-block:: sql > WITH t as (SELECT * FROM unnest(array[null, 1, 2], array[0, 1, 2]) as _t(id, value)), u as (SELECT * FROM unnest(array[null, 2, 3], array[0, 1, 2]) as _t(id, value)) SELECT * from t WHERE NOT EXISTS (SELECT * FROM u WHERE u.id = t.id); id | value ------+------- NULL | 0 1 | 1 (2 rows) Here, we have a correlated subquery, e.g. subquery that includes columns from the outer query. This subquery returns different results for different outer query rows. For the row with id NULL, the sub-query is .. code-block:: sql SELECT * FROM u WHERE u.id = NULL u.id = NULL predicate always returns NULL, hence, the subquery returns an empty result, hence, NOT EXISTS clause evaluates to true. For the row with id 1, the sub-query is .. code-block:: sql SELECT * FROM u WHERE u.id = 1 u.id = 1 evaluates to NULL when u.id is null and false when u.id is 2 or 3. Hence, the subquery results are empty, hence, NOT EXISTS clause evaluate to true. For the row with id 2, the sub-query is .. code-block:: sql SELECT * FROM u WHERE u.id = 2 u.id = 2 predicate evaluates to true for the row where u.id is 2, hence, the subquery results are not empty, hence, the NOT EXISTS clause evaluates to false. NOT EXISTS without NULLs in the subquery ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Now, consider NOT EXISTS query with no nulls in the subquery: .. code-block:: sql SELECT * FROM t WHERE NOT EXISTS ( SELECT id FROM u WHERE u.id = t.id AND u.id IS NOT NULL ) This query returns 2 rows with IDs NULL and 1. In fact, the presence of NULLs in the subquery doesn’t affect the results of the NOT EXISTS clause. This is because u.id = t.id predicate evaluates to NULL when u.id is NULL, hence, rows with NULLs are excluded from the subquery. Unlike the NOT IN query, NOT EXISTS query is not sensitive for NULLs in the subquery. .. code-block:: sql > WITH t as (SELECT * FROM unnest(array[null, 1, 2], array[0, 1, 2]) as _t(id, value)), u as (SELECT * FROM unnest(array[2, 3], array[1, 2]) as _t(id, value)) SELECT * from t WHERE NOT EXISTS (SELECT * FROM u WHERE u.id = t.id); id | value ------+------- 1 | 1 NULL | 0 (2 rows) NOT EXISTS with empty subquery ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Now, consider a NOT EXISTS query with a subquery that returns empty results. .. code-block:: sql SELECT * FROM t WHERE NOT EXISTS ( SELECT id FROM u WHERE u.id = t.id AND u.id < 0 ) This query returns all rows from t because subquery always returns an empty result set. When subquery is empty, the results of NOT IN and NOT EXISTS queries are the same. .. code-block:: sql > WITH t as (SELECT * FROM unnest(array[null, 1, 2], array[0, 1, 2]) as _t(id, value)), u as (SELECT * FROM unnest(array[], array[]) as _t(id, value)) SELECT * from t WHERE NOT EXISTS (SELECT * FROM u WHERE u.id = t.id); id | value ------+------- 2 | 2 1 | 1 NULL | 0 (3 rows) Implementation -------------- NOT IN and NOT EXISTS queries can be implemented efficiently using anti joins. NOT IN queries are implemented using NULL AWARE ANTI JOIN. NOT EXISTS queries are implemented using regular ANTI JOIN. NULL AWARE ANTI JOIN ~~~~~~~~~~~~~~~~~~~~ NULL AWARE ANTI JOIN is used to implement NOT IN queries. .. code-block:: sql SELECT * FROM t WHERE t.id NOT IN (SELECT id FROM u) The rows from table t are placed on the left side of the join. The rows from the subquery are placed on the right side of the join. The subquery rows are loaded into a hash table keyed on “id”. If a NULL is encountered when building the hash table, the join finishes early with no results. If the hash table is empty (i.e. subquery returns no results), the join returns all the rows from the left side including rows with NULL join key. If the hash table is not empty and has no NULLs, the rows from the left side with no NULLs in the join key are processed in streaming fashion. For each row, the join looks up a match in the hash table and returns the row only if there is no match. Rows from the left side with NULL in the join key are not returned. This algorithm extends trivially to multiple join keys and NOT IN queries that look like this: .. code-block:: sql SELECT * FROM t WHERE (t.id1, t.id2) NOT IN (SELECT id1, id2 FROM u) To summarize, NULL AWARE ANTI JOIN semantics include * Return empty results when the right side contains nulls in the join keys. * Return left-side rows with NULLs in the join key only when the right side is empty. In a distributed setup, evaluating the above conditions requires that every node knows whether the combined right side is empty or not and whether it contains a row with a null in the join key. This information is available if the query broadcasts the right side or uses replicate-nulls-and-any partitioning strategy. NOTE: Replicate-null-and-any partitioning strategy replicates all rows with nulls in the partition-by keys to all destinations and also replicates one arbitrary chosen row with no nulls in the partition-by keys. ANTI JOIN ~~~~~~~~~ Regular ANTI JOIN is used to implement NOT EXISTS queries. .. code-block:: sql SELECT * FROM t WHERE NOT EXISTS (SELECT * FROM u WHERE u.id = t.id) First, we rewrite the subquery to return the equi-join clause u.id = t.id. The rows from table t are placed on the left side of the join. The rows from the modified subquery are placed on the right side of the join. The subquery rows are loaded into a hash table keyed on “id”. Subquery rows with NULL join keys are skipped. If the hash table is empty (i.e. subquery returns no results or all results have NULLs in join keys), the join returns all the rows from the left side including rows with NULL join key. This logic is the same between regular ANTI JOIN and NULL AWARE ANTI JOIN. If the hash table is not empty, the rows from the left side are processed in streaming fashion. All rows with NULL in the join key are included in the results unconditionally. For each row with non-NULL join key, the join looks up a match in the hash table and returns the row only if there is no match. This algorithm extends trivially to multiple join keys and NOT EXISTS queries that look like this: .. code-block:: sql SELECT * FROM t WHERE NOT EXISTS ( SELECT * FROM u WHERE u.id1 = t.id1 AND u.id2 = t.id2 ) The differences between regular and null aware anti join can be summarized as * Regular join doesn’t automatically return empty results when the right side has NULLs in the join keys. * Regular join unconditionally returns left side rows with NULLs in the join keys. ANTI JOINs with Extra Filter ---------------------------- NOT IN and NOT EXISTS queries may contain non-equality conditions that use columns from the outer query in the subqueries. For example, .. code-block:: sql SELECT * FROM t WHERE t.id NOT IN (SELECT id FROM u WHERE u.value > t.value) or .. code-block:: sql SELECT * FROM t WHERE NOT EXISTS ( SELECT * FROM u WHERE u.id = t.id AND u.value > t.value ) In this case, whether the subquery contains NULL in the join key or not depends on values of the outer row and therefore can be different for different outer rows. Hence, a row with a null in the join key on the right side, doesn’t automatically make the null aware anti join return empty results. This can be seen in an example. .. code-block:: sql > WITH t as (SELECT * FROM unnest(array[null, 1, 2], array[0, 1, 2]) as _t(id, value)), u as (SELECT * FROM unnest(array[null, 2, 3], array[0, 1, 2]) as _t(id, value)) SELECT * from t WHERE t.id NOT IN (SELECT id FROM u WHERE u.value > t.value); id | value ----+------- 1 | 1 2 | 2 (2 rows) In this query, subquery for row with NULL id is .. code-block:: sql SELECT id FROM u WHERE u.value > 0 This subquery returns rows with ids 2 and 3. Row with NULL id hash value equal to 0 and doesn’t pass u.value > 0 predicate. NULL NOT IN (2, 3) returns NULL, hence, NULL row from the left side is not included in the query result. Subquery for row with id = 1 is .. code-block:: sql SELECT id FROM u WHERE u.value > 1 This subquery returns two rows with ids 2 and 3. 1 NOT IN (2, 3) returns true, hence, row with id 1 is included in the query results. Subquery for row with id = 2 is .. code-block:: sql SELECT id FROM u WHERE u.value > 2 This subquery returns a single row with id 3. 2 NOT IN (3) returns true, hence, row with id 2 is included in the query results. Let’s consider a different example where results include the NULL row. .. code-block:: sql > WITH t as (SELECT * FROM unnest(array[null, 1, 2], array[0, 1, 2]) as _t(id, value)), u as (SELECT * FROM unnest(array[null, 2, 3], array[0, 1, 2]) as _t(id, value)) SELECT * from t WHERE t.id NOT IN (SELECT id FROM u WHERE u.value * t.value > 0); id | value ------+------- NULL | 0 1 | 1 (2 rows) This query returns the row with NULL. The subquery for that row is: .. code-block:: sql SELECT id FROM u WHERE u.value * 0 > 0 The predicate evaluates to false for all rows in u, hence, the IN LIST is empty, hence NULL NOT IN evaluates to true. These queries are implemented using anti joins with extra filters. In the examples above, the implementations use null aware anti joins with extra filters u.value > t.value and u.value * t.value > 0. The presence of extra filters changes the implementation of anti join. NULL AWARE ANTI JOIN with Filter ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In the presence of an extra filter, null aware anti join cannot finish early upon encountering a null in the join key on the right side. The join must finish building the hash table and include all rows, even the ones with nulls in the join key. When evaluating left-side rows, the join needs to first collect all matches from the build side and combine these with all the right-side rows with nulls in the join key, then evaluate the filter on the matches. If the filter comes out empty, the row is included in the results. Otherwise, the row is not included. A more detailed description of this logic goes like this. #. Collect the matches. #. If the left side row doesn’t have a null in the join key, include all matches from the right side. #. If the left side row has a null in the join key, include all rows from the right side. #. For all left side rows, include all rows from the right side with nulls in the join keys. #. Evaluate the filter on the matches collected in the previous step. #. Include the left-side row in the results only If the filter comes out empty. Step 1.2 requires evaluating the filter on the cross join of left-side rows with nulls in the join key with all the right-side rows. This implies that in a distributed setup the right side must be replicated (broadcasted) to all the nodes evaluating the join, while the left side can be distributed among nodes using any convenient strategy. Step 1.3 requires that right-side rows with nulls in the join keys are replicated (broadcasted) to all the nodes evaluating the join. This is achieved using replicate-nulls-and-any partitioning strategy. ANTI JOIN with Filter ~~~~~~~~~~~~~~~~~~~~~ In the presence of an extra filter, regular anti join can still unconditionally return left side rows with nulls in the join key. The subquery with an extra filter still returns an empty result for these rows. For the left-side row with no nulls in the join key, the join needs to collect the matches from the right side. If there are no matches, the row is included in the results. If there are matches, the extra filter needs to be evaluated. If the filter comes out empty, the row is included in the results.