PyVelox Plan Builder API

class pyvelox.plan_builder.PlanNode
__init__(*args, **kwargs)
__str__(self: pyvelox.plan_builder.PlanNode) -> str

Returns a short and recursive description of the plan.

id(self: pyvelox.plan_builder.PlanNode) -> str

Returns the id of the current plan node.

name(self: pyvelox.plan_builder.PlanNode) -> str

Returns the name of the current plan node.

serialize(self: pyvelox.plan_builder.PlanNode) -> str

Returns a serialized string containing the plan specification.

to_string(self: pyvelox.plan_builder.PlanNode) -> str

Returns a detailed and recursive description of the plan.
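The accessors above can be combined into a small inspection helper. The following is a minimal sketch, assuming `node` is a PlanNode (for example, one returned by PlanBuilder.get_plan_node()); the helper name `describe_plan` is hypothetical and not part of pyvelox.

```python
def describe_plan(node):
    """Collect the descriptions exposed by a PlanNode-like object.

    `describe_plan` is a hypothetical helper, not part of pyvelox. It only
    calls the accessors documented for PlanNode.
    """
    return {
        "id": node.id(),
        "name": node.name(),
        "summary": str(node),           # short, recursive description
        "details": node.to_string(),    # detailed, recursive description
        "serialized": node.serialize(), # serialized plan specification
    }
```

The short form is usually enough for logging, while the detailed and serialized forms are more useful when comparing or persisting plans.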

class pyvelox.plan_builder.PlanBuilder
__init__(self: pyvelox.plan_builder.PlanBuilder) -> None
aggregate(self: pyvelox.plan_builder.PlanBuilder, grouping_keys: List[str] = [], aggregations: List[str] = []) -> pyvelox.plan_builder.PlanBuilder

Adds a single-stage aggregation.

Args:

grouping_keys: List of columns to group by.

aggregations: List of aggregate expressions.

filter(self: pyvelox.plan_builder.PlanBuilder, filter: str = '') -> pyvelox.plan_builder.PlanBuilder

Adds a filter node. The filter expression is specified as a SQL expression.
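As an illustration of chaining, the sketch below appends a filter and an aggregation to an existing builder. It assumes the builder already has an input node whose output contains the hypothetical columns `region` and `amount`; the function name and the expression strings are illustrative only.

```python
def add_filtered_aggregation(builder):
    """Append a filter and a single-stage aggregation to `builder`.

    Assumption: `builder` is a PlanBuilder whose current node produces the
    hypothetical columns `region` and `amount`.
    """
    return builder.filter("amount > 0").aggregate(
        grouping_keys=["region"],
        aggregations=["sum(amount)", "count(1)"],
    )
```

With pyvelox installed, this would be called on a builder that already has a source node (a table scan, for instance).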

get_plan_node(self: pyvelox.plan_builder.PlanBuilder) -> pyvelox.plan_builder.PlanNode | None

Returns the current plan node.

hash_join(self: pyvelox.plan_builder.PlanBuilder, left_keys: List[str], right_keys: List[str], build_plan_node: pyvelox.plan_builder.PlanNode, output: List[str] = [], filter: str = '', join_type: pyvelox.plan_builder.JoinType = <JoinType.INNER: 0>) -> pyvelox.plan_builder.PlanBuilder

Adds a hash join node. Uses the build_plan_node subtree to build the hash table, and the current subtree as the probe side.

Args:

left_keys: List of keys from the left table (probe).

right_keys: List of keys from the right table (build).

build_plan_node: The plan node defining the subplan to join with.

output: List of columns to be projected out of the join.

filter: Optional join filter expression.

join_type: Join type (inner, left, right, full, etc).
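A hash join can be sketched as follows. This is only an illustration: it assumes `probe_builder` produces TPC-H `orders` columns, `build_builder` produces TPC-H `customer` columns, and that `build_builder` was obtained from `probe_builder.new_builder()` so both sides share one plan node id generator. The function name is hypothetical.

```python
def join_orders_to_customers(probe_builder, build_builder):
    """Hash-join the probe side against the plan held by `build_builder`.

    Assumptions: the probe side exposes `o_custkey` and `o_orderkey`, the
    build side exposes `c_custkey` and `c_name`. The join type is left at
    its default (inner).
    """
    build_node = build_builder.get_plan_node()
    return probe_builder.hash_join(
        left_keys=["o_custkey"],
        right_keys=["c_custkey"],
        build_plan_node=build_node,
        output=["o_orderkey", "c_name"],
    )
```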

index_lookup_join(self: pyvelox.plan_builder.PlanBuilder, left_keys: List[str], right_keys: List[str], index_plan_node: pyvelox.plan_builder.PlanNode, output: List[str] = [], join_type: pyvelox.plan_builder.JoinType = <JoinType.INNER: 0>) -> pyvelox.plan_builder.PlanBuilder

Adds an index lookup join node. It requires the index_plan_node subtree to be composed of a single table scan on a connector with indexed access support.

Args:

left_keys: List of keys from the left table.

right_keys: List of keys from the right table.

index_plan_node: The subtree containing the lookup table scan.

output: List of columns to be projected out of the join.

join_type: Join type (inner, left, right, full, etc).

limit(self: pyvelox.plan_builder.PlanBuilder, count: int, offset: int = 0, is_partial: bool = False) -> pyvelox.plan_builder.PlanBuilder

Limits the number of input rows produced as output.

Args:

count: The maximum number of rows to produce.

offset: How many rows from the beginning of the input to skip.

is_partial: Whether this limit restricts partial results, and hence can be applied once per driver, or is applied to the final query output.
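The interaction of count and offset can be modeled in plain Python. This is a model of the documented semantics only (skip `offset` rows, then emit at most `count`), not how Velox implements the operator; `limit_rows` is a hypothetical name.

```python
def limit_rows(rows, count, offset=0):
    """Pure-Python model: skip `offset` rows, then keep at most `count`."""
    return rows[offset:offset + count]


# Skip the first two rows, then keep at most three.
print(limit_rows(list(range(10)), count=3, offset=2))  # [2, 3, 4]
```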

merge_join(self: pyvelox.plan_builder.PlanBuilder, left_keys: List[str], right_keys: List[str], right_plan_node: pyvelox.plan_builder.PlanNode, output: List[str] = [], filter: str = '', join_type: pyvelox.plan_builder.JoinType = <JoinType.INNER: 0>) -> pyvelox.plan_builder.PlanBuilder

Adds a merge join node. Merge join requires that the left and right sides are sorted on the join keys.

Args:

left_keys: List of keys from the left table.

right_keys: List of keys from the right table.

right_plan_node: The plan node defining the subplan to join with.

output: List of columns to be projected out of the join.

filter: Optional join filter expression.

join_type: Join type (inner, left, right, full, etc).
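Because both inputs must be sorted on the join keys, one way to satisfy that requirement is to sort each side with order_by first. The sketch below assumes TPC-H style column names and that both builders share a plan node id generator; the function name is hypothetical, and inputs that are already sorted would not need the extra sort.

```python
def merge_join_after_sort(left_builder, right_builder):
    """Sort both sides on the join key, then add a merge join.

    Assumptions: the left side exposes `o_custkey` and `o_orderkey`, the
    right side exposes `c_custkey` and `c_name`.
    """
    right_node = right_builder.order_by(["c_custkey"]).get_plan_node()
    return left_builder.order_by(["o_custkey"]).merge_join(
        left_keys=["o_custkey"],
        right_keys=["c_custkey"],
        right_plan_node=right_node,
        output=["o_orderkey", "c_name"],
    )
```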

new_builder(self: pyvelox.plan_builder.PlanBuilder) -> pyvelox.plan_builder.PlanBuilder

Returns a new builder sharing the same plan node id generator, so that both builders can be safely used to build different parts of the same plan.

order_by(self: pyvelox.plan_builder.PlanBuilder, keys: List[str], is_partial: bool = False) -> pyvelox.plan_builder.PlanBuilder

Sorts the input based on the values of sorting keys.

Args:

keys: List of columns to order by. Each string is a column name, optionally followed by the sort direction (“col” or “col DESC”).

is_partial: Whether this node sorts partial query results (and hence can run in parallel in multiple drivers) or produces the final sort.
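A common use of sorting is a top-N query: sort in descending order, then keep the first N rows. The sketch below assumes `builder` already has an input node containing the column named by `key`; `top_n` is a hypothetical helper.

```python
def top_n(builder, key, n):
    """Sort by `key` in descending order and keep at most `n` rows.

    Assumption: `builder` is a PlanBuilder whose current node produces a
    column named `key`.
    """
    return builder.order_by([f"{key} DESC"]).limit(n)
```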

project(self: pyvelox.plan_builder.PlanBuilder, projections: List[str] = []) -> pyvelox.plan_builder.PlanBuilder

Adds a projection node, calculating the expressions specified in projections. Expressions are specified as SQL expressions.
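A projection might look like the sketch below. It assumes the builder's input exposes the TPC-H `lineitem` columns used here, and that the expression parser accepts an `AS` alias; neither is confirmed by this page, and the function name is hypothetical.

```python
def add_revenue_projection(builder):
    """Project a pass-through column and one computed column.

    Assumptions: the input has `l_orderkey`, `l_extendedprice` and
    `l_discount`, and `expr AS name` is accepted for aliasing.
    """
    return builder.project(
        [
            "l_orderkey",
            "l_extendedprice * (1 - l_discount) AS revenue",
        ]
    )
```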

sorted_merge(self: pyvelox.plan_builder.PlanBuilder, keys: List[str], sources: List[pyvelox.plan_builder.PlanNode | None]) -> pyvelox.plan_builder.PlanBuilder

Takes N sorted source subtrees and merges them into a sorted output. Assumes that all sources are sorted on keys.

Args:

keys: The sorting keys.

sources: The list of sources to merge.
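The merge semantics can be modeled with the standard library. The sketch below is a pure-Python analogy using `heapq.merge`, not Velox code: each source is a list of rows already sorted on the key, and the result is a single sorted list. `merge_sorted_sources` is a hypothetical name.

```python
import heapq


def merge_sorted_sources(sources, key):
    """Pure-Python model of merging N row lists that are each sorted on `key`."""
    return list(heapq.merge(*sources, key=lambda row: row[key]))


merged = merge_sorted_sources(
    [[{"k": 1}, {"k": 4}], [{"k": 2}, {"k": 3}]],
    key="k",
)
print([row["k"] for row in merged])  # [1, 2, 3, 4]
```

As with the plan node, the output is only sorted if every input already is; the merge does not re-sort its sources.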

table_scan(self: pyvelox.plan_builder.PlanBuilder, output_schema: pyvelox.type.Type = <pyvelox.type.Type object at 0x7f5674e8de70>, aliases: dict = {}, subfields: dict = {}, filters: List[str] = [], remaining_filter: str = '', row_index: str = '', connector_id: str = 'hive', input_files: Optional[List[facebook::velox::py::PyFile]] = None) -> pyvelox.plan_builder.PlanBuilder

Adds a table scan node to the plan.

Args:

output_schema: A RowType containing the schema to be projected out of the scan.

aliases: An optional map of aliases to apply, from the desired output name to the name as defined in the file. If there are aliases, the output should be specified using the aliased names.

subfields: Used to project individual items from columns instead of reading entire containers. It maps from the column name to a list of items to be projected out.

filters: A list of SQL filters to be applied to the data as it is decoded/read.

remaining_filter: SQL expression for the additional conjunct. May include multiple columns and SQL functions. The remaining_filter is AND’ed with the other filters.

row_index: If defined, creates an output column with this name producing $row_ids. This name needs to be part of the output as BIGINT.

connector_id: ID of the connector to use for this scan.

input_files: If defined, these are used as the input files, so that no splits need to be added later.
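The difference between filters and remaining_filter can be sketched as follows. The example assumes `output_schema` is a RowType that includes the TPC-H `lineitem` columns referenced in the expressions, and that `input_files` (if given) is a list of file objects accepted by the connector; how to construct either is outside the scope of this page. The function name is hypothetical.

```python
def scan_with_pushdown(builder, output_schema, input_files=None):
    """Add a table scan with a per-column filter and a multi-column filter.

    Assumptions: `output_schema` contains `l_quantity`, `l_shipdate` and
    `l_commitdate`. The single-column predicate goes in `filters`; the
    predicate spanning two columns goes in `remaining_filter`.
    """
    return builder.table_scan(
        output_schema=output_schema,
        filters=["l_quantity > 10"],
        remaining_filter="l_shipdate < l_commitdate",
        input_files=input_files,
    )
```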

table_write(self: pyvelox.plan_builder.PlanBuilder, output_file: Optional[facebook::velox::py::PyFile] = None, output_path: Optional[facebook::velox::py::PyFile] = None, connector_id: str = 'hive', output_schema: Optional[pyvelox.type.Type] = None) -> pyvelox.plan_builder.PlanBuilder

Adds a table write node to the plan.

Args:

output_file: Name of the file to be written.

output_path: The output path where output files will be written. Specify this parameter instead of output_file if the task is supposed to write files in parallel using multiple drivers. The actual file names in this path will be automatically generated and returned as the TableWriter output. Takes precedence over output_file.

connector_id: ID of the connector to use for this write.

output_schema: An optional RowType containing the schema to be written to the file. By default, writes the schema produced by the upstream operator.

tpch_gen(self: pyvelox.plan_builder.PlanBuilder, table_name: str, columns: List[str] = [], scale_factor: float = 1, num_parts: int = 1, connector_id: str = 'tpch') -> pyvelox.plan_builder.PlanBuilder

Generates TPC-H data on the fly using dbgen. Note that generating data on the fly is not efficient, so for performance evaluation one should generate data using this node, write it to storage files (Parquet, ORC, or similar), and then benchmark a query plan that reads those files.

Args:

table_name: The TPC-H table name to generate data for.

columns: The columns from table_name to generate data for. If empty (the default), generate data for all columns.

scale_factor: TPC-H scale factor to use; controls the amount of data generated.

num_parts: How many splits to generate. This controls the parallelism and the number of output files to be generated.

connector_id: ID of the connector to use for this scan.
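The generate-then-write step of the workflow described above can be sketched by chaining tpch_gen into table_write. The example assumes `output_path` is a file object of the type table_write expects (how it is constructed is not covered here); the function name and the chosen columns are illustrative.

```python
def build_tpch_export(builder, output_path, scale_factor=1, num_parts=1):
    """Generate TPC-H `lineitem` rows and write them under `output_path`.

    Assumption: `output_path` is whatever file object `table_write`
    accepts for its `output_path` parameter.
    """
    return builder.tpch_gen(
        table_name="lineitem",
        columns=["l_orderkey", "l_quantity"],
        scale_factor=scale_factor,
        num_parts=num_parts,
    ).table_write(output_path=output_path)
```

A separate plan using table_scan over the written files would then be the one to benchmark.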

values(self: pyvelox.plan_builder.PlanBuilder, values: List[pyvelox.vector.Vector] = []) -> pyvelox.plan_builder.PlanBuilder

Adds the specified vectors to the operator tree as input. All input vectors need to be RowVectors.