PyVelox Plan Builder API
- class pyvelox.plan_builder.PlanNode¶
- __init__(*args, **kwargs)¶
- __str__(self: pyvelox.plan_builder.PlanNode) str ¶
Returns a short and recursive description of the plan.
- id(self: pyvelox.plan_builder.PlanNode) str ¶
Returns the id of the current plan node.
- name(self: pyvelox.plan_builder.PlanNode) str ¶
Returns the name of the current plan node.
- serialize(self: pyvelox.plan_builder.PlanNode) str ¶
Returns a serialized string containing the plan specification.
- to_string(self: pyvelox.plan_builder.PlanNode) str ¶
Returns a detailed and recursive description of the plan.
- class pyvelox.plan_builder.PlanBuilder¶
- __init__(self: pyvelox.plan_builder.PlanBuilder) None ¶
- aggregate(self: pyvelox.plan_builder.PlanBuilder, grouping_keys: List[str] = [], aggregations: List[str] = []) pyvelox.plan_builder.PlanBuilder ¶
Adds a single stage aggregation.
- Args:
- grouping_keys: List of columns to group by.
- aggregations: List of aggregate expressions.
- filter(self: pyvelox.plan_builder.PlanBuilder, filter: str = '') pyvelox.plan_builder.PlanBuilder ¶
Adds a filter node. The filter expression is specified as a SQL expression.
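Since each of these methods returns a PlanBuilder, stages can be chained. The sketch below is a hypothetical helper (`add_filtered_aggregation` is not part of pyvelox) that appends a filter and an aggregation to a builder that is assumed to already hold a source node. The column names `c0` and `c1` and the expression strings are assumptions chosen for illustration.

```python
def add_filtered_aggregation(builder):
    """Append a filter and a single-stage aggregation to `builder`.

    `builder` is expected to be a pyvelox.plan_builder.PlanBuilder that
    already has a source (for example, a table scan). The columns c0 and c1
    are hypothetical.
    """
    return (
        builder
        # filter() takes a SQL expression, as documented above.
        .filter("c1 > 0")
        # Group by c0 and compute one aggregate expression per group.
        .aggregate(grouping_keys=["c0"], aggregations=["sum(c1)"])
    )
```

With pyvelox installed, the helper would be called as `add_filtered_aggregation(builder)` after a source such as `table_scan` has been added to `builder`.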
- get_plan_node(self: pyvelox.plan_builder.PlanBuilder) pyvelox.plan_builder.PlanNode | None ¶
Returns the current plan node.
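The signature above allows `get_plan_node` to return None, so callers may want to check for that before inspecting the node. A minimal sketch, assuming only the PlanNode methods documented on this page (`describe_plan` itself is a hypothetical helper):

```python
def describe_plan(builder):
    """Summarize the builder's current plan node, or return None if unset."""
    node = builder.get_plan_node()
    if node is None:
        # No node has been added to the builder yet.
        return None
    return {
        "id": node.id(),              # id of the current plan node
        "name": node.name(),          # name of the current plan node
        "details": node.to_string(),  # detailed, recursive description
    }
```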
- hash_join(self: pyvelox.plan_builder.PlanBuilder, left_keys: List[str], right_keys: List[str], build_plan_node: pyvelox.plan_builder.PlanNode, output: List[str] = [], filter: str = '', join_type: pyvelox.plan_builder.JoinType = <JoinType.INNER: 0>) pyvelox.plan_builder.PlanBuilder ¶
Adds a hash join node. Uses the build_plan_node subtree to build the hash table, and the current subtree as the probe side.
- Args:
- left_keys: List of keys from the left table (probe).
- right_keys: List of keys from the right table (build).
- build_plan_node: The plan node that defines the subplan to join with.
- output: List of columns to be projected out of the join.
- filter: Optional join filter expression.
- join_type: Join type (inner, left, right, full, etc).
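To illustrate the build and probe roles, here is a plain-Python sketch of an inner hash join over lists of dicts. It does not use pyvelox and makes no claim about how Velox implements the operator; the function and the sample column names are hypothetical.

```python
from collections import defaultdict


def hash_inner_join(probe_rows, build_rows, left_key, right_key):
    """Inner-join two lists of dicts using a hash table on the build side."""
    # Build phase: index every build-side row by its join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[right_key]].append(row)

    # Probe phase: look up each probe-side row and emit one output per match.
    joined = []
    for row in probe_rows:
        for match in table.get(row[left_key], []):
            joined.append({**row, **match})
    return joined


probe = [{"o_id": 1, "cust": 10}, {"o_id": 2, "cust": 20}, {"o_id": 3, "cust": 30}]
build = [{"c_id": 10, "name": "a"}, {"c_id": 20, "name": "b"}]
joined = hash_inner_join(probe, build, "cust", "c_id")
```

The probe row with `cust` 30 has no match on the build side, so an inner join drops it.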
- index_lookup_join(self: pyvelox.plan_builder.PlanBuilder, left_keys: List[str], right_keys: List[str], index_plan_node: pyvelox.plan_builder.PlanNode, output: List[str] = [], join_type: pyvelox.plan_builder.JoinType = <JoinType.INNER: 0>) pyvelox.plan_builder.PlanBuilder ¶
Adds an index lookup join node. It requires the index_plan_node subtree to be composed of a single table scan on a connector with indexed access support.
- Args:
- left_keys: List of keys from the left table.
- right_keys: List of keys from the right table.
- index_plan_node: The subtree containing the lookup table scan.
- output: List of columns to be projected out of the join.
- join_type: Join type (inner, left, right, full, etc).
- limit(self: pyvelox.plan_builder.PlanBuilder, count: int, offset: int = 0, is_partial: bool = False) pyvelox.plan_builder.PlanBuilder ¶
Limits how many rows from the input are produced as output.
- Args:
- count: How many rows to produce, at most.
- offset: How many rows from the beginning of the input to skip.
- is_partial: Whether this limit restricts partial results, and hence
can be applied once per driver, or is applied to the query output.
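The interaction between count and offset can be sketched in plain Python with `itertools.islice`. This only illustrates the row-selection semantics described above; it does not use pyvelox, and `apply_limit` is a hypothetical name.

```python
from itertools import islice


def apply_limit(rows, count, offset=0):
    """Skip `offset` rows, then return at most `count` of the remaining rows."""
    return list(islice(rows, offset, offset + count))


first_page = apply_limit(range(10), count=3)
second_page = apply_limit(range(10), count=3, offset=3)
```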
- merge_join(self: pyvelox.plan_builder.PlanBuilder, left_keys: List[str], right_keys: List[str], right_plan_node: pyvelox.plan_builder.PlanNode, output: List[str] = [], filter: str = '', join_type: pyvelox.plan_builder.JoinType = <JoinType.INNER: 0>) pyvelox.plan_builder.PlanBuilder ¶
Adds a merge join node. Merge join requires that the left and right sides are sorted on the join keys.
- Args:
- left_keys: List of keys from the left table.
- right_keys: List of keys from the right table.
- right_plan_node: The plan node that defines the subplan to join with.
- output: List of columns to be projected out of the join.
- filter: Optional join filter expression.
- join_type: Join type (inner, left, right, full, etc).
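The sortedness requirement follows from how a merge join advances through both inputs in lockstep. The plain-Python sketch below shows an inner merge join over two lists of dicts that are already sorted on their join keys. It is an illustration only, not Velox's implementation, and all names are hypothetical.

```python
def merge_inner_join(left, right, left_key, right_key):
    """Inner-join two key-sorted lists of dicts in a single forward pass."""
    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][left_key], right[j][right_key]
        if lk < rk:
            i += 1  # left key is too small; advance the left side
        elif lk > rk:
            j += 1  # right key is too small; advance the right side
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][right_key] == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][left_key] == lk:
                for match in right[j:j_end]:
                    joined.append({**left[i], **match})
                i += 1
            j = j_end
    return joined


left_rows = [{"a": 1}, {"a": 2}, {"a": 2}, {"a": 4}]
right_rows = [{"b": 2, "v": "x"}, {"b": 2, "v": "y"}, {"b": 3, "v": "z"}, {"b": 4, "v": "w"}]
joined = merge_inner_join(left_rows, right_rows, "a", "b")
```

If either input were unsorted, the forward-only cursors could skip matching rows, which is why the operator requires sorted inputs.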
- new_builder(self: pyvelox.plan_builder.PlanBuilder) pyvelox.plan_builder.PlanBuilder ¶
Returns a new builder that shares the same plan node id generator, so that both builders can safely be used to build different parts of the same plan.
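A typical use of `new_builder` is building the second input of a join. The sketch below is a hypothetical helper that takes a probe-side builder and a caller-supplied callback that populates the build side. It assumes the builder returned by `new_builder` starts without any nodes, and the key and output column names (`c0`, `d0`, `d1`) are illustrative only.

```python
def join_with_shared_ids(probe_builder, add_build_side):
    """Hash-join `probe_builder` against a subplan built on a sibling builder.

    `add_build_side` is a callable that receives the new builder and adds the
    build-side nodes (for example, a table scan) to it.
    """
    # The sibling builder shares the plan node id generator with the probe
    # side, so node ids stay unique across both subtrees.
    build_builder = probe_builder.new_builder()
    add_build_side(build_builder)

    return probe_builder.hash_join(
        left_keys=["c0"],    # hypothetical probe-side key
        right_keys=["d0"],   # hypothetical build-side key
        build_plan_node=build_builder.get_plan_node(),
        output=["c0", "d1"],
    )
```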
- order_by(self: pyvelox.plan_builder.PlanBuilder, keys: List[str], is_partial: bool = False) pyvelox.plan_builder.PlanBuilder ¶
Sorts the input based on the values of sorting keys.
- Args:
- keys: List of columns to order by. The strings can be column names
and optionally contain the sort orientation (“col” or “col DESC”).
- is_partial: If this node is sorting partial query results (and hence
can run in parallel in multiple drivers), or final.
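The key format can be illustrated with a plain-Python sort over a list of dicts. This sketch assumes that a key without an orientation sorts ascending and ignores null handling; it does not use pyvelox, and `sort_rows` is a hypothetical name.

```python
def sort_rows(rows, keys):
    """Sort dict rows by keys written as "col" or "col DESC"."""
    result = list(rows)
    # Apply keys from least to most significant. Python's sort is stable,
    # so earlier passes act as tie-breakers for later ones.
    for key in reversed(keys):
        parts = key.split()
        column = parts[0]
        descending = len(parts) > 1 and parts[1].upper() == "DESC"
        result.sort(key=lambda row: row[column], reverse=descending)
    return result


rows = [{"a": 1, "b": 2}, {"a": 2, "b": 1}, {"a": 1, "b": 3}]
ordered = sort_rows(rows, ["a", "b DESC"])
```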
- project(self: pyvelox.plan_builder.PlanBuilder, projections: List[str] = []) pyvelox.plan_builder.PlanBuilder ¶
Adds a projection node that calculates the expressions specified in projections. Expressions are specified as SQL expressions.
- sorted_merge(self: pyvelox.plan_builder.PlanBuilder, keys: List[str], sources: List[pyvelox.plan_builder.PlanNode | None]) pyvelox.plan_builder.PlanBuilder ¶
Takes N sorted source subtrees and merges them into a sorted output. Assumes that all sources are sorted on keys.
- Args:
- keys: The sorting keys.
- sources: The list of sources to merge.
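The merge of N sorted inputs can be sketched in plain Python with `heapq.merge`, which likewise assumes that every input is already sorted on the key. This illustrates the semantics only and does not use pyvelox; `merge_sorted_sources` is a hypothetical name.

```python
import heapq


def merge_sorted_sources(sources, key):
    """Merge several lists of dicts, each sorted on `key`, into one sorted list."""
    return list(heapq.merge(*sources, key=lambda row: row[key]))


source_a = [{"k": 1}, {"k": 4}]
source_b = [{"k": 2}, {"k": 3}]
source_c = [{"k": 0}, {"k": 5}]
merged = merge_sorted_sources([source_a, source_b, source_c], "k")
```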
- table_scan(self: pyvelox.plan_builder.PlanBuilder, output_schema: pyvelox.type.Type = <pyvelox.type.Type object at 0x7f5674e8de70>, aliases: dict = {}, subfields: dict = {}, filters: List[str] = [], remaining_filter: str = '', row_index: str = '', connector_id: str = 'hive', input_files: Optional[List[facebook::velox::py::PyFile]] = None) pyvelox.plan_builder.PlanBuilder ¶
Adds a table scan node to the plan.
- Args:
- output_schema: A RowType containing the schema to be projected out
of the scan.
- aliases: An optional map of aliases to apply, from the desired
output name to the name as defined in the file. If there are aliases, output should be specified based on the aliased name.
- subfields: Used to project individual items from columns instead
of reading entire containers. It maps from the column name to a list of items to be projected out.
- filters: A list of SQL filters to be applied to the data as it is
decoded/read.
- remaining_filter: SQL expression for the additional conjunct. May
include multiple columns and SQL functions. The remaining_filter is AND’ed with the other filters.
- row_index: If defined, creates an output column with this name
producing $row_ids. This name needs to be part of the output as BIGINT.
- connector_id: ID of the connector to use for this scan.
- input_files: If defined, uses these as the input files so that no splits
will need to be added later.
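The sketch below shows how these arguments might be combined in one call. `add_scan` is a hypothetical helper; the schema (a RowType) and the list of input files are supplied by the caller, and the column names and filter expressions are assumptions for illustration. Here the schema would name the aliased column `id`, which is stored in the file as `user_id`.

```python
def add_scan(builder, schema, files):
    """Add a table scan with an alias, a pushed-down filter, and a remaining filter.

    `schema` is the RowType to project out of the scan and `files` is the
    list of input files; both come from the caller.
    """
    return builder.table_scan(
        output_schema=schema,
        # Maps the desired output name to the name as defined in the file.
        aliases={"id": "user_id"},
        # Applied to the data as it is decoded/read.
        filters=["score > 100"],
        # AND'ed with the filters above; may span several columns.
        remaining_filter="score + bonus > 0",
        input_files=files,
    )
```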
- table_write(self: pyvelox.plan_builder.PlanBuilder, output_file: Optional[facebook::velox::py::PyFile] = None, output_path: Optional[facebook::velox::py::PyFile] = None, connector_id: str = 'hive', output_schema: Optional[pyvelox.type.Type] = None) pyvelox.plan_builder.PlanBuilder ¶
Adds a table write node to the plan.
- Args:
- output_file: Name of the file to be written.
- output_path: The output path where output files will be written.
Specify this parameter instead of output_file if the task is supposed to write files in parallel using multiple drivers. The actual file names in this path will be automatically generated and returned as the TableWriter output. Takes precedence over output_file.
- connector_id: ID of the connector to use for this write.
- output_schema: An optional RowType containing the schema to be
written to the file. By default, writes the schema produced by the operator upstream.
- tpch_gen(self: pyvelox.plan_builder.PlanBuilder, table_name: str, columns: List[str] = [], scale_factor: float = 1, num_parts: int = 1, connector_id: str = 'tpch') pyvelox.plan_builder.PlanBuilder ¶
Generates TPC-H data on the fly using dbgen. Generating data on the fly is not very efficient, so for performance evaluation, generate data using this node, write it to output storage files (Parquet, ORC, or similar), and then benchmark a query plan that reads those files.
- Args:
- table_name: The TPC-H table name to generate data for.
- columns: The columns from table_name to generate data for. If
empty (the default), generate data for all columns.
- scale_factor: TPC-H scale factor to use - controls the amount of
data generated.
- num_parts: How many splits to generate. This controls the parallelism
and the number of output files to be generated.
- connector_id: ID of the connector to use for this scan.
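The generate-then-write workflow recommended above could be expressed by chaining `tpch_gen` into `table_write`. `add_tpch_export` is a hypothetical helper; `output_path` is supplied by the caller in whatever form `table_write` expects (the signature on this page shows a PyFile). The table and column names follow the TPC-H schema, and the choice of four parts is arbitrary.

```python
def add_tpch_export(builder, output_path, num_parts=4):
    """Generate part of the TPC-H lineitem table and write it under `output_path`."""
    return (
        builder
        .tpch_gen(
            table_name="lineitem",
            columns=["l_orderkey", "l_quantity"],
            scale_factor=1,
            # Controls the parallelism and the number of output files.
            num_parts=num_parts,
        )
        # output_path is used rather than output_file so that multiple
        # drivers can write files in parallel.
        .table_write(output_path=output_path)
    )
```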
- values(self: pyvelox.plan_builder.PlanBuilder, values: List[pyvelox.vector.Vector] = []) pyvelox.plan_builder.PlanBuilder ¶
Adds the specified vectors to the operator tree as input. All input vectors need to be RowVectors.