aitemplate.compiler.transform

apply_padding

Applies padding to gemms based on alignment requirements.

Functions:

apply_padding(sorted_graph[, workdir])

Applies padding to gemms to use SM80 kernels.

aitemplate.compiler.transform.apply_padding.apply_padding(sorted_graph: List[Tensor], workdir: Optional[str] = None) List[Tensor][source]

Applies padding to gemms to use SM80 kernels. SM80 kernels require min_alignment == 2.
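
For orientation, a minimal usage sketch (shapes and names are illustrative, and the standalone pass invocation is an assumption; in a full build, compile_model runs this pass as part of optimize_graph):

    # Hedged sketch: a fp16 gemm whose K dim (63) violates min_alignment == 2,
    # so apply_padding should insert padding to reach a usable alignment.
    from aitemplate.compiler import ops
    from aitemplate.compiler.transform.apply_padding import apply_padding
    from aitemplate.compiler.transform.toposort import toposort
    from aitemplate.frontend import Tensor

    X = Tensor(shape=[16, 63], dtype="float16", name="X", is_input=True)
    W = Tensor(shape=[32, 63], dtype="float16", name="W", is_input=True)
    Y = ops.gemm_rcr()(X, W)
    Y._attrs["name"] = "Y"
    Y._attrs["is_output"] = True

    sorted_graph = toposort(Y)                  # tensors in topological order
    sorted_graph = apply_padding(sorted_graph)  # returns the padded graph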

bind_constants

Bind all user-provided constants to the graph.

Functions:

bind_constants(graph, constants)

Bind all user-provided constants to the graph. Internally, the constants are represented as ConstantTensors. These can be folded, and are packaged into the final *.so.

aitemplate.compiler.transform.bind_constants.bind_constants(graph: List[Tensor], constants: Dict[str, TorchTensor]) None[source]

Bind all user-provided constants to the graph. Internally, the constants are represented as ConstantTensors. These can be folded, and are packaged into the final *.so.

Parameters:
  • graph (List[Tensor]) – Input graph

  • constants (Dict[str, TorchTensor]) – Constants to bind
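
A short usage sketch (the constant name "w" is illustrative and must match a ConstantTensor declared in the graph):

    import torch

    from aitemplate.compiler.transform.bind_constants import bind_constants

    # Hedged sketch: attach host data to the graph's "w" constant in place.
    bind_constants(sorted_graph, {"w": torch.randn(32, 63, dtype=torch.float16)})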

constant_folding

Classes:

IntVarTensor(int_var[, name, src_ops, ...])

A special tensor which represents an IntImm / IntVar.

Workspace(shared_size, unique_size)

Functions:

constant_folding(sorted_graph, workdir, ...)

Fold and propagate constants.

class aitemplate.compiler.transform.constant_folding.IntVarTensor(int_var: IntVar, name: Optional[str] = None, src_ops: Optional[Set[Node]] = None, dst_ops: Optional[Set[Node]] = None, dtype: str = 'float16', is_input: bool = False, is_output: bool = False, value: Optional[Any] = None, is_view_of: Optional[Any] = None)[source]

A special tensor which represents an IntImm / IntVar. This Tensor can be used as an input to some Operators (e.g. reshape, layernorm). An IntVarTensor is used here instead of an IntVar to keep references to src_ops and dst_ops.

Methods:

pseudo_code([with_shape])

Returns a string containing pseudo code of this object.

pseudo_code(with_shape=True) str[source]

Returns a string containing pseudo code of this object.

Parameters:

with_shape (bool) – Marks whether to include shape info in the returned pseudo code.

Returns:

Pseudo code.

Return type:

str

class aitemplate.compiler.transform.constant_folding.Workspace(shared_size: int, unique_size: int)[source]
aitemplate.compiler.transform.constant_folding.constant_folding(sorted_graph: List[Tensor], workdir: str, model_name: str) Tuple[List[Tensor], List[Tuple[str, str]], List[Tensor]][source]

Fold and propagate constants.

This pass looks for ops that have inputs which can be determined at compile time. It evaluates them, then puts the new constants back into the graph with bound data. The old ops are eliminated.

This pass actually compiles and runs an AIT runtime. If there are any problems (e.g. due to buggy ops), the constant folding is aborted and the graph is returned unchanged. All generated code is stored in workdir/constant_folding.
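
A call sketch showing the three returned values (the workdir and model name are illustrative, and the names given to the tuple elements are my reading of the return annotation above):

    from aitemplate.compiler.transform.constant_folding import constant_folding

    # Hedged sketch: returns the rewritten graph plus metadata about what was
    # folded; generated sources land in ./tmp/constant_folding.
    sorted_graph, folded_pairs, new_constants = constant_folding(
        sorted_graph, workdir="./tmp", model_name="model"
    )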

fuse_conv_elementwise

Fuse conv + elementwise ops.

Functions:

fuse_conv_elementwise(sorted_graph, _)

Fuse conv + elementwise ops.

get_conv2d_bias_elementwise_patterns()

We create the pattern of fusion here.

aitemplate.compiler.transform.fuse_conv_elementwise.fuse_conv_elementwise(sorted_graph: List[Tensor], _: str) List[Tensor][source]

Fuse conv + elementwise ops. The second argument is unused; it’s only here to make the type of this function the same as the others called in optimize_graph.

aitemplate.compiler.transform.fuse_conv_elementwise.get_conv2d_bias_elementwise_patterns()[source]

We create the pattern of fusion here. The format should be in the form of (pattern, replacement):

pattern: a list of chained operators that we want to match.

replacement: the op that replaces the pattern.

fuse_group_ops

Horizontal fusion pass to group ops together.

Functions:

fuse_group_ops(sorted_graph[, workdir])

Horizontal fusion of grouped gemm and layernorm ops

toposort(nodes)

Generate sorted nodes by topological order.

aitemplate.compiler.transform.fuse_group_ops.fuse_group_ops(sorted_graph: List[Tensor], workdir: Optional[str] = None) List[Tensor][source]

Horizontal fusion of grouped gemm and layernorm ops

Parameters:
  • sorted_graph (List[Tensor]) – Input graph

  • workdir (str, optional) – working dir, by default None

Returns:

New graph after fusion

Return type:

List[Tensor]

aitemplate.compiler.transform.fuse_group_ops.toposort(nodes: Union[Tensor, List[Tensor]]) List[Tensor][source]

Generate sorted nodes by topological order. This is the foundation of all graph passes.

Parameters:

nodes (Union[Tensor, List[Tensor]]) – The output of the model

Returns:

Sorted graph

Return type:

List[Tensor]

fuse_mm_elementwise

Fuse GEMM with elementwise operations

Functions:

fuse_mm_elementwise(sorted_graph[, workdir])

Fuse GEMMs with elementwise operations.

aitemplate.compiler.transform.fuse_mm_elementwise.fuse_mm_elementwise(sorted_graph: List[Tensor], workdir: Optional[str] = None) List[Tensor][source]

Fuse GEMMs with elementwise operations.

Parameters:
  • sorted_graph (List[Tensor]) – Input graph

  • workdir (str, optional) – working dir, by default None

Returns:

Fused graph

Return type:

List[Tensor]

fuse_ops

Perform operator fusions.

Classes:

FusedElementwiseInfo(partitioned_ops, ...)

fused_elementwise(elementwise_ops, inputs, ...)

fused_elementwise operator is used internally.

group_norm(num_groups, num_channels)

Standalone group norm op.

group_norm_swish(num_groups, num_channels)

Standalone group norm op.

Functions:

fuse_elementwise(sorted_graph[, workdir])

Given a sorted graph, returns a sorted graph with fused_elementwise ops on fusable elementwise ops.

process_singleton_elementwise(sorted_graph)

A dummy pass which enables codegen for any elementwise op without fusing it with neighbors

toposort(nodes)

Generate sorted nodes by topological order.

class aitemplate.compiler.transform.fuse_ops.FusedElementwiseInfo(partitioned_ops: List[aitemplate.compiler.base.Operator], inputs: List[aitemplate.compiler.base.Tensor], outputs: List[aitemplate.compiler.base.Tensor], external_inputs: List[aitemplate.compiler.base.Tensor], external_outputs: List[aitemplate.compiler.base.Tensor])[source]

aitemplate.compiler.transform.fuse_ops.fuse_elementwise(sorted_graph: List[Tensor], workdir: Optional[str] = None) List[Tensor][source]

Given a sorted graph, returns a sorted graph with fused_elementwise ops on fusable elementwise ops.
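
For illustration, a minimal sketch of a fusable chain (import paths follow this page; the FuncEnum location is an assumption, and in a full build compile_model runs this pass among the others):

    # Hedged sketch: two chained pointwise ops that fuse_elementwise can
    # collapse into a single fused_elementwise op.
    from aitemplate.compiler import ops
    from aitemplate.compiler.ops.common.epilogue import FuncEnum
    from aitemplate.compiler.transform.fuse_ops import fuse_elementwise
    from aitemplate.compiler.transform.toposort import toposort
    from aitemplate.frontend import Tensor

    X = Tensor(shape=[64, 128], dtype="float16", name="X", is_input=True)
    Y = ops.elementwise(FuncEnum.RELU)(ops.elementwise(FuncEnum.ADD)(X, X))
    Y._attrs["name"] = "Y"
    Y._attrs["is_output"] = True

    sorted_graph = fuse_elementwise(toposort(Y))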

class aitemplate.compiler.transform.fuse_ops.fused_elementwise(elementwise_ops: List[elementwise], inputs: Iterable[Operator], outputs: Iterable[Operator])[source]

fused_elementwise operator is used internally. It’s the actual operator which does the C++ codegen.

Methods:

gen_function()

Generates function source code string.

gen_function() str[source]

Generates function source code string.

Returns:

a string which contains C++ function implementation source code.

Return type:

str

Raises:

NotImplementedError

class aitemplate.compiler.transform.fuse_ops.group_norm(num_groups: int, num_channels: int)[source]

Standalone group norm op. The grouped dim must be the last dim of the input tensor.

Methods:

gen_function()

Generates function source code string.

gen_profiler([workdir, ...])

Generates the profiler.

get_input_shapes(x, gamma, beta)

Return a list of shapes for x, gamma and beta, where gamma_shape and beta_shape may be None if gamma and beta are None, respectively.

profile([workdir, devices, ...])

Selects the fastest kernel configurations.

gen_function() str[source]

Generates function source code string.

Returns:

a string which contains C++ function implementation source code.

Return type:

str

Raises:

NotImplementedError

gen_profiler(workdir: Optional[str] = None, dynamic_profiling_strategy=DynamicProfileStrategy.HINTS) None[source]

Generates the profiler. The profiler files are standalone executables for profiling.

Parameters:
  • workdir (str, optional) – Base dir to keep profiling source codes, by default “./”

  • dynamic_profiling_strategy (DynamicProfileStrategy, optional) – A dynamic profiling strategy, used to filter generated profiles at compile time. See also: profile()

static get_input_shapes(x, gamma, beta) List[List[Union[IntVar, IntImm]]][source]

Return a list of shapes for x, gamma and beta, where gamma_shape and beta_shape may be None if gamma and beta are None, respectively.

profile(workdir='./', devices=None, dynamic_profiling_strategy=DynamicProfileStrategy.MAX)[source]

Selects the fastest kernel configurations.

Parameters:
  • workdir (str, optional) – Base dir to keep profiling source codes, by default “./”

  • devices (list, optional) – Devices used for profiling, by default device 0 will be used.

  • dynamic_profiling_strategy (DynamicProfileStrategy, optional) – A dynamic profiling strategy. By default MAX is used, i.e. to profile a dynamic range, an upper bound will be used.

class aitemplate.compiler.transform.fuse_ops.group_norm_swish(num_groups: int, num_channels: int)[source]

Standalone group norm op. The grouped dim must be the last dim of the input tensor.

aitemplate.compiler.transform.fuse_ops.process_singleton_elementwise(sorted_graph: List[Tensor], workdir: Optional[str] = None) List[Tensor][source]

A dummy pass which enables codegen for any elementwise op without fusing it with neighbors

aitemplate.compiler.transform.fuse_ops.toposort(nodes: Union[Tensor, List[Tensor]]) List[Tensor][source]

Generate sorted nodes by topological order. This is the foundation of all graph passes.

Parameters:

nodes (Union[Tensor, List[Tensor]]) – The output of the model

Returns:

Sorted graph

Return type:

List[Tensor]

fuse_parallel_gemms

Fuse parallel gemms into bmm op.

Functions:

fuse_parallel_gemms(sorted_graph[, workdir])

Fuse parallel gemms into a single gemm op.

fuse_single_source_parallel_gemms(sorted_graph)

This pass fuses parallel gemm ops that share a single source input (see the full entry below).

toposort(nodes)

Generate sorted nodes by topological order.

aitemplate.compiler.transform.fuse_parallel_gemms.fuse_parallel_gemms(sorted_graph: List[Tensor], workdir: Optional[str] = None) List[Tensor][source]

Fuse parallel gemms into a single gemm op. Currently, we only support the following patterns:

  • parallel gemm + concat

  • split->parallel gemm->concat

Parameters:
  • sorted_graph (List[Tensor]) – Input graph

  • workdir (str, optional) – working dir, by default None

Returns:

Fused graph

Return type:

List[Tensor]

aitemplate.compiler.transform.fuse_parallel_gemms.fuse_single_source_parallel_gemms(sorted_graph: List[Tensor], workdir: Optional[str] = None) List[Tensor][source]

This pass fuses patterns like:

    # x: [m, k], w_i: [n_i, k], b_i: [n_i]
    y1 = gemm_rcr_bias()(x, w1, b1)
    y2 = gemm_rcr_bias()(x, w2, b2)
    ...

into:

    # x: [m, k], w: [sum(n_i), k], b: [sum(n_i)]
    w = concatenate()([w1, w2], dim=0)
    b = concatenate()([b1, b2], dim=0)
    y = gemm_rcr_bias()(x, w, b)
    y1, y2 = split()(y, n)

For w and b, we rely on constant folding to preprocess them. y1 and y2 would be written directly from y’s op. It is required that all the gemm ops have the same layouts.

On graph pass ordering, we need to make sure this pass runs before any other pass that modifies gemm and concat input/output TensorAccessors.

Parameters:

sorted_graph (List[Tensor]) – a sorted list of tensors

Returns:

the transformed graph with all ops sorted

Return type:

List[Tensor]

aitemplate.compiler.transform.fuse_parallel_gemms.toposort(nodes: Union[Tensor, List[Tensor]]) List[Tensor][source]

Generate sorted nodes by topological order. This is the foundation of all graph passes.

Parameters:

nodes (Union[Tensor, List[Tensor]]) – The output of the model

Returns:

Sorted graph

Return type:

List[Tensor]

fuse_permute_bmm

fuse_split

Perform transformations on ops which support strided inputs / outputs.

Classes:

StableSet([s])

class aitemplate.compiler.transform.fuse_split.StableSet(s: Optional[Iterable[Any]] = None)[source]

Methods:

add(value)

Add an element.

clear()

This is slow (creates N new iterators!) but effective.

discard(value)

Remove an element.

remove(value)

Remove an element.

add(value) None[source]

Add an element.

clear()[source]

This is slow (creates N new iterators!) but effective.

discard(value) None[source]

Remove an element. Do not raise an exception if absent.

remove(value) None[source]

Remove an element. If not a member, raise a KeyError.

mark_param_tensor

Mark tensors which are parameters.

Functions:

mark_param_tensor(sorted_graph)

Mark constant tensors: those that have no ops and are not explicitly marked as inputs.

mark_special_views(sorted_graph)

Associate each tensor with an external tensor if certain view conditions hold (see the full entry below).

aitemplate.compiler.transform.mark_param_tensor.mark_param_tensor(sorted_graph: List[Tensor])[source]

Mark constant tensors: those that have no ops and are not explicitly marked as inputs.

Parameters:

sorted_graph (List[Tensor]) – The graph to mutate.

aitemplate.compiler.transform.mark_param_tensor.mark_special_views(sorted_graph: List[Tensor])[source]

Associate each tensor with an external tensor if any of the following conditions are true:

  1. The tensor is a view-of-a-view of an external tensor.

  2. The tensor is a view of an input, constant or output tensor (i.e. an external tensor).

Parameters:

sorted_graph (List[Tensor]) – The graph to mutate.

memory_planning

Graph pass for memory planning.

Functions:

multistream_max_mem_parallel_ops()

Maximum number of parallel operators used in memory planning for simple multi-stream mode.

multistream_mode()

Multi-stream mode.

simple_multistream_memory_planning(sorted_graph)

A specialized case for simple multi-stream execution.

split_simple_multistream_parallel_ops(...)

Make sure that no more than max_parallel_ops operators are run in parallel.

aitemplate.compiler.transform.memory_planning.multistream_max_mem_parallel_ops() int[source]

Maximum number of parallel operators used in memory planning for simple multi-stream mode. Larger values imply a higher degree of possible parallelism, but also larger memory allocations.

This option is independent from AIT_MULTISTREAM_EXTRA_STREAMS.

For example, say, there are 100 ops that can be run in parallel.

Example 1: AIT_MULTISTREAM_EXTRA_STREAMS=4 and AIT_MULTISTREAM_MAX_MEM_PARALLEL_OPS=100. In this case 5 streams will be used (1 base and 4 extra), every stream gets 20 operators and no inter-stream barriers are used. Memory planning is done for 100 parallel ops.

Example 2: AIT_MULTISTREAM_EXTRA_STREAMS=4 and AIT_MULTISTREAM_MAX_MEM_PARALLEL_OPS=5. In this case 5 streams will be used (1 base and 4 extra), there will be 20 waves separated by inter-stream barriers, every stream gets 1 operator for every wave. Memory planning is done for 20 waves of 5 parallel ops each.
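
These knobs are environment variables; a configuration sketch reproducing Example 1 (set before compilation; only the two variable names quoted above are shown):

    import os

    # Hedged sketch: 1 base + 4 extra streams, with memory planned for up to
    # 100 parallel ops (Example 1 above).
    os.environ["AIT_MULTISTREAM_EXTRA_STREAMS"] = "4"
    os.environ["AIT_MULTISTREAM_MAX_MEM_PARALLEL_OPS"] = "100"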

aitemplate.compiler.transform.memory_planning.multistream_mode() int[source]

Multi-stream mode. 0 - no multistream. 1 - simple multistream. Default: 0.

aitemplate.compiler.transform.memory_planning.simple_multistream_memory_planning(sorted_graph: List[Tensor])[source]

A specialized case for simple multi-stream execution. It uses more GPU memory than greedy_by_size_memory_planner (how much more depends on the input graph), but still significantly less than naive_memory_planning.

aitemplate.compiler.transform.memory_planning.split_simple_multistream_parallel_ops(ops_by_order, max_parallel_ops: int)[source]

Make sure that no more than max_parallel_ops operators are run in parallel.

Say, on the first step op1, op2 and op3 can be executed in parallel. On the second one, it is op4 and op5. On the third one it is op6, op7, op8, op9. Then, ops_by_order is something like

    {
        1: [op1, op2, op3],
        2: [op4, op5],
        3: [op6, op7, op8, op9],
    }

Given max_parallel_ops=2, the output will be:

[[op1, op2], [op3], [op4, op5], [op6, op7], [op8, op9]]
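
The chunking rule is simple enough to restate; a standalone re-implementation sketch (not the library's code) that reproduces the example above:

    from typing import Dict, List

    def split_parallel_ops(
        ops_by_order: Dict[int, List[str]], max_parallel_ops: int
    ) -> List[List[str]]:
        # Walk the waves in execution order and chunk each wave into groups
        # of at most max_parallel_ops operators.
        result = []
        for order in sorted(ops_by_order):
            wave = ops_by_order[order]
            for i in range(0, len(wave), max_parallel_ops):
                result.append(wave[i : i + max_parallel_ops])
        return result

    ops_by_order = {1: ["op1", "op2", "op3"], 2: ["op4", "op5"], 3: ["op6", "op7", "op8", "op9"]}
    assert split_parallel_ops(ops_by_order, max_parallel_ops=2) == [
        ["op1", "op2"], ["op3"], ["op4", "op5"], ["op6", "op7"], ["op8", "op9"]
    ]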

Parameters:
  • ops_by_order (Dict[int, List[Operator]]) – A dictionary, its keys represent the execution order and its values represent operators that are executed in parallel.

  • max_parallel_ops (int) – Number of operators that are allowed to be run in parallel

Returns:

Transformed sequence of operators to execute.

Return type:

List[List[Operator]]

name_graph

Graph pass to assign names to a sorted graph.

Classes:

IntVar(values[, name, symbolic_value])

An IntVar represents a dynamic dimension.

IntVarTensor(int_var[, name, src_ops, ...])

A special tensor which represents an IntImm / IntVar.

JaggedIntVar(total_length, batch_dim, ...)

JaggedIntVar is a specific case of IntVar that encodes one or more jagged dimensions within itself.

Functions:

dedup_symbolic_name(sorted_graph)

Rename all shape variable that are identical to the same name.

name_graph(sorted_graph)

Provide each tensor and operator with a unique valid C variable name

class aitemplate.compiler.transform.name_graph.IntVar(values: List[int], name: Optional[str] = None, symbolic_value: Optional[Basic] = None)[source]

An IntVar represents a dynamic dimension. IntVar and IntImm (see below) are used together to represent a Tensor’s shape.

IntVar supports basic arithmetic operations, and returns the most conservative IntVar w.r.t. range of _attrs[“values”].
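
A small sketch of the conservative-range behavior (the bounds follow from the ranges; the exact type of the arithmetic result is an assumption):

    from aitemplate.compiler.base import IntVar

    # Hedged sketch: arithmetic on dynamic dims yields conservative ranges.
    a = IntVar(values=[1, 128], name="a")
    b = IntVar(values=[16, 64], name="b")
    c = a + b

    assert c.lower_bound() == 17   # 1 + 16
    assert c.upper_bound() == 192  # 128 + 64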

Methods:

lower_bound()

Returns lower bound of this dynamic dim.

pseudo_code([with_shape])

Returns a string containing pseudo code of this object.

symbolic_value()

Returns the symbolic value of this dynamic dim.

upper_bound()

Returns upper bound of this dynamic dim.

lower_bound() int[source]

Returns lower bound of this dynamic dim.

pseudo_code(with_shape=False) str[source]

Returns a string containing pseudo code of this object.

Parameters:

with_shape (bool) – Marks whether to include shape info in the returned pseudo code.

Returns:

Pseudo code.

Return type:

str

symbolic_value()[source]

Returns the symbolic value of this dynamic dim.

upper_bound() int[source]

Returns upper bound of this dynamic dim.

class aitemplate.compiler.transform.name_graph.IntVarTensor(int_var: IntVar, name: Optional[str] = None, src_ops: Optional[Set[Node]] = None, dst_ops: Optional[Set[Node]] = None, dtype: str = 'float16', is_input: bool = False, is_output: bool = False, value: Optional[Any] = None, is_view_of: Optional[Any] = None)[source]

A special tensor which represents an IntImm / IntVar. This Tensor can be used as an input to some Operators (e.g. reshape, layernorm). An IntVarTensor is used here instead of an IntVar to keep references to src_ops and dst_ops.

Methods:

pseudo_code([with_shape])

Returns a string containing pseudo code of this object.

pseudo_code(with_shape=True) str[source]

Returns a string containing pseudo code of this object.

Parameters:

with_shape (bool) – Marks whether to include shape info in the returned pseudo code.

Returns:

Pseudo code.

Return type:

str

class aitemplate.compiler.transform.name_graph.JaggedIntVar(total_length: IntVar, batch_dim: IntVar, jagged_dims: List[JaggedDim])[source]

JaggedIntVar is a specific case of IntVar that encodes one or more jagged dimensions within itself. JaggedIntVar is used as the first dimension in jagged Tensors’ shape (this is, basically, what makes a Tensor jagged). E.g., a JaggedIntVar with a single JaggedDim represents a single dynamic dimension encoding a batch of variable sequence length. For the batch size of B, in some sources this is indicated as sum_B(N_B): the sum of individual sequence lengths: N_1, N_2, …, N_B of B sequences. This sum is represented as a single dynamic dimension: total_length, with B being defined by the batch_dim.

Because JaggedIntVar is an IntVar, it can be treated so by the AIT ops that are unaware of the jagged Tensor semantics. But the ops that are aware can interpret the JaggedIntVar as the first dimension of the jagged Tensor by specifically processing the underlying batch_dim and jagged_dims.

If there is more than one JaggedDim in a JaggedIntVar, those jagged dimensions are nested within the single dynamic dimension. E.g., if there are two JaggedDims, the JaggedIntVar represents a batch of B (batch_dim) variable-length sequences, each in turn consisting of variable-length sequences. In principle, the nesting can be arbitrarily deep, but in practice it’s usually just a single JaggedDim.

JaggedIntVar should not be created directly. Please use the make_jagged op for creating a jagged Tensor from a normal Tensor, the offsets, and the metadata (like batch_dim and jagged_dims). The make_jagged op creates the corresponding JaggedIntVar under the hood.
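
A rough construction sketch via make_jagged, as the paragraph above prescribes (argument names, the offsets layout, and the JaggedDim bounds are assumptions, not verified API):

    from aitemplate.compiler import ops
    from aitemplate.compiler.base import IntVar, JaggedDim
    from aitemplate.frontend import Tensor

    # Hedged sketch: total_length is the flattened sum of sequence lengths;
    # make_jagged replaces it with a JaggedIntVar in the result's shape.
    total_length = IntVar(values=[0, 2048], name="total_length")
    batch_dim = IntVar(values=[1, 128], name="batch_size")
    source = Tensor(shape=[total_length, 64], name="source", is_input=True)
    offsets = Tensor(
        shape=[IntVar(values=[2, 129])], name="offsets", dtype="int32", is_input=True
    )

    jagged = ops.make_jagged(
        batch_dim=batch_dim,
        jagged_dims=[JaggedDim(min_value=0, max_value=512)],
    )(source, [offsets])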

Methods:

batch_dim()

The batch_dim of the JaggedIntVar.

get_max_dense_shape()

Returns a list of IntVars representing the maximum dense shape (rectangular volume) that the JaggedIntVar can correspond to.

jagged_dims()

The jagged_dims of the JaggedIntVar.

offsets_struct_type()

The type of the offsets struct variable used in runtime.

offsets_type()

The type of the offsets of the JaggedIntVar's jagged_dims.

offsets_var_name()

The name of the offsets struct variable in runtime.

total_length()

The total_length dimension the JaggedIntVar is based on.

batch_dim() IntVar[source]

The batch_dim of the JaggedIntVar.

get_max_dense_shape() List[IntVar][source]

Returns a list of IntVars representing the maximum dense shape (rectangular volume) that the JaggedIntVar can correspond to. The result has the batch_dim as the first item and the IntImm with the max_value of each JaggedDim that follows.

jagged_dims() List[JaggedDim][source]

The jagged_dims of the JaggedIntVar.

offsets_struct_type() str[source]

The type of the offsets struct variable used in runtime.

offsets_type() str[source]

The type of the offsets of the JaggedIntVar’s jagged_dims.

offsets_var_name() str[source]

The name of the offsets struct variable in runtime.

total_length() IntVar[source]

The total_length dimension the JaggedIntVar is based on.

aitemplate.compiler.transform.name_graph.dedup_symbolic_name(sorted_graph: List[Tensor]) None[source]

Rename all shape variables that are identical to the same name.

Parameters:

sorted_graph (List[Tensor]) – Input graph to be simplified

aitemplate.compiler.transform.name_graph.name_graph(sorted_graph: List[Tensor]) None[source]

Provide each tensor and operator with a unique, valid C variable name.

Parameters:
  • sorted_graph (List[Tensor]) – Input graph to be named

  • reset_counters (bool) – If True, reset counters which are used to name tensors and functions. (Default: False)

optimize_graph

Applies graph transformations.

Functions:

dedup_make_jagged_ops(sorted_graph[, workdir])

Deduplicate make_jagged ops in the graph.

fuse_bmm_permute(sorted_graph, _)

Fuse bmm + permute021 ops.

fuse_duplicate_fused_elementwise(...)

This pass finds all duplicate fused elementwise ops and fuses them once more.

fuse_elementwise(sorted_graph[, workdir])

Given a sorted graph, returns a sorted graph with fused_elementwise ops on fusable elementwise ops.

fuse_expand_bmm(sorted_graph[, workdir])

Transform expand + bmm into a single bmm op.

fuse_mm_reshape_permute(sorted_graph[, workdir])

Fuse GEMM/BMM + reshape + permute into a single op

fuse_permute_bmm_and_gemm(sorted_graph[, ...])

Fuse [permute021 + bmm] and [permute(0, 1) + gemm].

fuse_single_source_parallel_gemms(sorted_graph)

This pass fuses parallel gemm ops that share a single source input (see the full entry below).

merge_view_ops(sorted_graph[, workdir])

Merge consecutive view ops.

move_view_op_before_concat(sorted_graph[, ...])

This transformation turns "cat + view_op + cat" into "view_op + cat + cat".

optimize_graph(sorted_graph, workdir[, optimize])

Applies graph optimizations, including

process_singleton_elementwise(sorted_graph)

A dummy pass which enables codegen for any elementwise op without fusing it with neighbors

remove_elementwise_no_ops(sorted_graph[, ...])

Remove elementwise no-ops (multiply/divide by 1, add/subtract 0).

split_large_concat_ops(sorted_graph, _)

Our concatenate CUDA kernel takes an input meta argument whose size is proportional to the number of inputs.

split_large_slice_scatter_ops(sorted_graph, _)

Our slice_scatter CUDA kernel takes an input meta argument whose size is proportional to the number of inputs.

split_large_split_ops(sorted_graph, _)

Our split CUDA kernel takes an output meta argument whose size is proportional to the number of outputs.

transform_permute_to_reshape(sorted_graph[, ...])

Convert permute to reshape wherever applicable.

aitemplate.compiler.transform.optimize_graph.dedup_make_jagged_ops(sorted_graph: List[Tensor], workdir: Optional[str] = None) List[Tensor][source]

Deduplicate make_jagged ops in the graph.

The rationale is to eliminate redundant offset validation as well as make the implicit jagged Tensors (sources) in the graph explicit, by replacing their total_length dimension with the corresponding JaggedIntVar.

The pass is performed in the following steps:

  1. Collect the metadata of the existing make_jagged ops.

  2. Remove make_jagged ops from the graph where possible.

  3. Apply new make_jagged ops to the (bundled) source inputs.

  4. Replace total_length dimensions with new JaggedIntVars.

See the docstrings of the individual steps’ helper functions above for more details.

aitemplate.compiler.transform.optimize_graph.fuse_bmm_permute(sorted_graph: List[Tensor], _: str) List[Tensor][source]

Fuse bmm + permute021 ops. The second argument is unused; it’s only here to make the type of this function the same as the others called in optimize_graph.

aitemplate.compiler.transform.optimize_graph.fuse_duplicate_fused_elementwise(sorted_graph: List[Tensor], _workdir: str) List[Tensor][source]

This pass finds all duplicate fused elementwise ops and fuses them once more. It assumes all elementwise fusion passes are complete.

We do the fusion by taking all the duplicate fused elementwise ops and effectively deleting all but one. We make sure to transfer the outputs and output_accessors of the duplicate ops to the remaining op. That means, the newly fused op will have multiple outputs.

Parameters:
  • sorted_graph (List[Tensor]) – Input graph

  • _workdir (str) – Required by optimize_graph.py

Returns:

sorted_graph – Modified input graph with duplicate fused elementwise ops fused together.

Return type:

List[Tensor]

aitemplate.compiler.transform.optimize_graph.fuse_elementwise(sorted_graph: List[Tensor], workdir: Optional[str] = None) List[Tensor][source]

Given a sorted graph, returns a sorted graph with fused_elementwise ops on fusable elementwise ops.

aitemplate.compiler.transform.optimize_graph.fuse_expand_bmm(sorted_graph: List[Tensor], workdir: Optional[str] = None) List[Tensor][source]

Transform expand + bmm into a single bmm op.

Parameters:
  • sorted_graph (List[Tensor]) – Input graph

  • workdir (str, optional) – workdir, by default None

Returns:

Optimized graph

Return type:

List[Tensor]

aitemplate.compiler.transform.optimize_graph.fuse_mm_reshape_permute(sorted_graph: List[Tensor], workdir: Optional[str] = None) List[Tensor][source]

Fuse GEMM/BMM + reshape + permute into a single op

Parameters:
  • sorted_graph (List[Tensor]) – input graph

  • workdir (str, optional) – current workdir for dumping debug info. Defaults to None.

Returns:

optimized graph

Return type:

List[Tensor]

aitemplate.compiler.transform.optimize_graph.fuse_permute_bmm_and_gemm(sorted_graph: List[Tensor], workdir: Optional[str] = None) List[Tensor][source]

Fuse [permute021 + bmm] and [permute(0, 1) + gemm].

Note that for the latter fusion, we require that this pass takes place before any gemm + elementwise fusions.

Parameters:
  • sorted_graph (List[Tensor]) – Input graph

  • workdir (str, optional) – working dir, by default None

Returns:

Fused graph

Return type:

List[Tensor]

aitemplate.compiler.transform.optimize_graph.fuse_single_source_parallel_gemms(sorted_graph: List[Tensor], workdir: Optional[str] = None) List[Tensor][source]

This pass fuses patterns like:

    # x: [m, k], w_i: [n_i, k], b_i: [n_i]
    y1 = gemm_rcr_bias()(x, w1, b1)
    y2 = gemm_rcr_bias()(x, w2, b2)
    ...

into:

    # x: [m, k], w: [sum(n_i), k], b: [sum(n_i)]
    w = concatenate()([w1, w2], dim=0)
    b = concatenate()([b1, b2], dim=0)
    y = gemm_rcr_bias()(x, w, b)
    y1, y2 = split()(y, n)

For w and b, we rely on constant folding to preprocess them. y1 and y2 would be written directly from y’s op. It is required that all the gemm ops have the same layouts.

On graph pass ordering, we need to make sure this pass runs before any other pass that modifies gemm and concat input/output TensorAccessors.

Parameters:

sorted_graph (List[Tensor]) – a sorted list of tensors

Returns:

the transformed graph with all ops sorted

Return type:

List[Tensor]

aitemplate.compiler.transform.optimize_graph.merge_view_ops(sorted_graph: List[Tensor], workdir: Optional[str] = None) List[Tensor][source]

Merge consecutive view ops.

aitemplate.compiler.transform.optimize_graph.move_view_op_before_concat(sorted_graph: List[Tensor], wordir: Optional[str] = None) List[Tensor][source]

This transformation turns “cat + view_op + cat” into “view_op + cat + cat”. The yielded pattern may be optimized further by the transform_memory_ops pass. Note that this pass must be invoked before transform_strided_op_and_view_op and transform_strided_ops.

aitemplate.compiler.transform.optimize_graph.optimize_graph(sorted_graph: List[Tensor], workdir: str, optimize=True) List[Tensor][source]

Applies graph optimizations, including

  • fuse permute and bmm

  • fuse permute and gemm

  • transform odd alignment

  • fuse conv and elementwise

  • fuse gemm and elementwise

  • fuse elementwise ops

  • fuse parallel gemms

  • fuse group ops

  • transform special ops

  • transform strided ops

  • fuse bmm and permute

  • transform memory ops

  • apply padding

Parameters:
  • sorted_graph (List[Tensor]) – Input graph

  • workdir (str) – working directory

Returns:

Fused graph

Return type:

List[Tensor]
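
A call sketch (the workdir is illustrative):

    from aitemplate.compiler.transform.optimize_graph import optimize_graph

    # Hedged sketch: runs the passes listed above in order; judging by the
    # signature, optimize=False appears to skip the optional optimizations.
    sorted_graph = optimize_graph(sorted_graph, workdir="./tmp")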

aitemplate.compiler.transform.optimize_graph.process_singleton_elementwise(sorted_graph: List[Tensor], workdir: Optional[str] = None) List[Tensor][source]

A dummy pass which enables codegen for any elementwise op without fusing it with neighbors

aitemplate.compiler.transform.optimize_graph.remove_elementwise_no_ops(sorted_graph: List[Tensor], workdir: Optional[str] = None) List[Tensor][source]

Removes elementwise no-ops (multiply/divide by 1, add/subtract 0) from the graph.

aitemplate.compiler.transform.optimize_graph.split_large_concat_ops(sorted_graph: List[Tensor], _: str) List[Tensor][source]

Our concatenate CUDA kernel takes an input meta argument whose size is proportional to the number of inputs. In extreme cases, the total size of the params of a concatenate kernel may exceed the limit imposed by the CUDA compiler. In such cases, we split the concatenate op into separate ones, each of which takes the original output and inputs with correct input_masks values.

aitemplate.compiler.transform.optimize_graph.split_large_slice_scatter_ops(sorted_graph: List[Tensor], _: str) List[Tensor][source]

Our slice_scatter CUDA kernel takes an input meta argument whose size is proportional to the number of inputs. In extreme cases, the total size of the kernel function params may exceed the limit imposed by the CUDA compiler. In such cases, we split the slice_scatter op into separate ones, each of which takes the original output and inputs with correct input_masks values.

aitemplate.compiler.transform.optimize_graph.split_large_split_ops(sorted_graph: List[Tensor], _: str) List[Tensor][source]

Our split CUDA kernel takes an output meta argument whose size is proportional to the number of outputs. In extreme cases, the total size of the params of a split kernel may exceed the limit imposed by the CUDA compiler. In such cases, we split the split op into separate ones.

aitemplate.compiler.transform.optimize_graph.transform_permute_to_reshape(sorted_graph: List[Tensor], workdir: Optional[str] = None) List[Tensor][source]

Convert permute to reshape wherever applicable.

When a permute op only moves around dimensions of size 1 while the order of the non-singleton dimensions is preserved, it is effectively a reshape op, i.e. the underlying memory layout does not change.

If a permute op has a non-empty input tensor accessor, its original shape should be used to determine whether it can be converted to reshape. In this case the shape of the actual input tensor might not match the rank of the permutation (but the original shape does) - see the second example below.

Examples

[256x5x1x32] -> [256x5x32x1] (with 0132) is a reshape

[256x5x32] -> [256x5x1x32] (with 0132) is a reshape

[256x1x5x1x32] -> [256x5x32x1x1] (with 02431) is a reshape

[256x5x1x32] -> [256x32x5x1] (with 0312) is not a reshape
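
The rule behind these examples is compact; a standalone sketch (not the library's code) covering the rank-matching cases:

    # A permute is a reshape iff the relative order of the non-singleton
    # dims is unchanged by the permutation.
    def permute_is_reshape(shape, permutation):
        non_singleton = [d for d in range(len(shape)) if shape[d] != 1]
        permuted = [d for d in permutation if shape[d] != 1]
        return permuted == non_singleton

    assert permute_is_reshape([256, 5, 1, 32], [0, 1, 3, 2])
    assert permute_is_reshape([256, 1, 5, 1, 32], [0, 2, 4, 3, 1])
    assert not permute_is_reshape([256, 5, 1, 32], [0, 3, 1, 2])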

Parameters:
  • sorted_graph (List[Tensor]) – input graph

  • workdir (str, optional) – current workdir for dumping debug info. Defaults to None.

Returns:

optimized graph

Return type:

List[Tensor]

profile

Graph pass to invoke profiling.

Classes:

GemmProfilerPostprocessingDelegate()

Collects profiler results as profiler executables complete; after all profilers finish, updates the profiler results cache and the gemm nodes’ attrs.

ProfilerRunner(devices, postprocessing_delegate)

A parallel runner that executes profilers on multiple GPUs in parallel. It uses a process pool to avoid process creation overhead. The size of the pool equals the number of provided GPUs, so ideally each process executes a profiler on its dedicated GPU.

Functions:

force_profiler_cache()

Force the profiler to use the cached results.

profile(sorted_graph[, workdir, devices, ...])

Profiles kernels.

class aitemplate.compiler.transform.profile.GemmProfilerPostprocessingDelegate[source]

Collects profiler results as profiler executables complete; after all profilers finish, updates the profiler results cache and the gemm nodes’ attrs.

Methods:

add_instance(instance)

As a profiler executable completes, collect the result

postprocess_results()

When all profiler executables complete, find the best instance for each op.

add_instance(instance: ProfileResult)[source]

As a profiler executable completes, collect the result

postprocess_results()[source]

When all profiler executables complete, find the best instance (minimum runtime per op name, profiler executable, and exec_key, i.e. gemm shape mnk, across multiple split_k values). The best instance is cached and written into the corresponding gemm nodes in the graph.

class aitemplate.compiler.transform.profile.ProfilerRunner(devices: List[str], postprocessing_delegate, timeout: int = 500)[source]

A parallel runner that executes profilers on multiple GPUs in parallel. It uses a process pool to avoid process creation overhead. The size of the pool equals the number of provided GPUs, so ideally each process should execute a profiler on its dedicated GPU. This property hasn’t been properly verified yet; however, the results are empirically better compared to the previous runner.

Methods:

join()

Wait for subprocess completion or timeout; postprocess the profiler results with the delegate(s).

push(cmds, process_result_callback)

Schedule the profiler for execution in a separate process; call the callback after subprocess completion.

join()[source]

Wait for subprocess completion or timeout; postprocess the profiler results with the delegate(s).

push(cmds: List[str], process_result_callback: Callable)[source]

Schedule the profiler for execution in a separate process; call the callback after subprocess completion.

Parameters:
  • cmds (List[str]) – argv for the launched profiler

  • process_result_callback (Callable) – Called after subprocess completion in the main process (but possibly not the main thread). Currently used to aggregate profiler results, so the callable takes result and postprocessing_delegate parameters. It is also used to propagate the profiler launch context to the aggregation point, namely the split_k value for the gemm profilers.
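
A usage sketch tying the runner, callback, and delegate together (the profiler command path is illustrative):

    # Hedged sketch: run compiled profiler binaries on two GPUs in parallel.
    from aitemplate.compiler.transform.profile import (
        GemmProfilerPostprocessingDelegate,
        ProfilerRunner,
    )

    delegate = GemmProfilerPostprocessingDelegate()
    runner = ProfilerRunner(devices=["0", "1"], postprocessing_delegate=delegate)

    def on_done(result, postprocessing_delegate):
        # Aggregate each ProfileResult as its subprocess finishes.
        postprocessing_delegate.add_instance(result)

    runner.push(cmds=["./tmp/profiler/gemm_rcr/gemm_rcr"], process_result_callback=on_done)
    runner.join()  # waits for completion, then postprocesses via the delegate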

aitemplate.compiler.transform.profile.force_profiler_cache() bool[source]

Force the profiler to use the cached results. The profiler will throw a runtime exception if it cannot find cached results. This environment variable may be useful to capture any cache misses due to cache version updates or other relevant code changes.

aitemplate.compiler.transform.profile.profile(sorted_graph: List[Tensor], workdir='./tmp', devices=None, dynamic_profiling_strategy=DynamicProfileStrategy.MAX, timeout=500)[source]

Profiles kernels.

Parameters:
  • sorted_graph (List[Tensor]) – A sorted graph which contains all functions for profiling.

  • workdir (str, optional) – The base dir to generate profiling source codes. By default “./tmp”

  • devices (list, optional) – A list of device ids which can be used for profiling. By default device 0 will be used.

  • dynamic_profiling_strategy (DynamicProfileStrategy, optional) – A dynamic profiling strategy, used to filter generated profiles at compile time. See also: profile(). By default MAX is used, i.e. to profile a dynamic range, an upper bound will be used.
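
A typical invocation sketch (device ids are illustrative):

    from aitemplate.compiler.transform.profile import profile

    # Hedged sketch: profile all kernels in the graph on two GPUs.
    profile(sorted_graph, workdir="./tmp", devices=[0, 1])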

refine_graph

Graph pass to dedup operators with same signatures.

Functions:

get_sorted_ops(tensors)

Produces the exact execution sequence of operators.

refine_graph(sorted_graph)

Graph pass to dedup operators with same signatures.

aitemplate.compiler.transform.refine_graph.get_sorted_ops(tensors) List[Any][source]

Produces the exact execution sequence of operators. This matches backend/codegen.py, ModelContainerGenerator.append_all_tensors()

aitemplate.compiler.transform.refine_graph.refine_graph(sorted_graph: List[Tensor])[source]

Graph pass to dedup operators with same signatures.

Parameters:

sorted_graph (List[Tensor]) – Input graph

remove_no_ops

Remove no-ops from the graph.

This is a bit different from remove_unused_ops. That pass is based on the graph structure: it removes ops that are not connected to the src_ops of any tensor. This pass, on the other hand, removes things which are logically no-ops, like expands with no expanded dims.

The reason it’s not combined with remove_unused_ops is that many of the passes in this file will want to call sanitize_sorted_graph, but sanitize_sorted_graph calls remove_unused_ops.

Also, even if the passes in this file avoided sanitize_sorted_graph, many other unrelated passes use sanitize_sorted_graph. We don’t need to call the passes in this file more than once.

Classes:

ExpandDimensionType(value)

An enumeration.

JaggedIntVar(total_length, batch_dim, ...)

JaggedIntVar is a specific case of IntVar that encodes one or more jagged dimensions within itself.

Functions:

remove_no_ops(sorted_graph)

Remove no-ops from the graph.

class aitemplate.compiler.transform.remove_no_ops.ExpandDimensionType(value)[source]

An enumeration.

class aitemplate.compiler.transform.remove_no_ops.JaggedIntVar(total_length: IntVar, batch_dim: IntVar, jagged_dims: List[JaggedDim])[source]

JaggedIntVar is a specific case of IntVar that encodes one or more jagged dimensions within itself. JaggedIntVar is used as the first dimension in jagged Tensors’ shape (this is, basically, what makes a Tensor jagged). E.g., a JaggedIntVar with a single JaggedDim represents a single dynamic dimension encoding a batch of variable sequence length. For the batch size of B, in some sources this is indicated as sum_B(N_B): the sum of individual sequence lengths: N_1, N_2, …, N_B of B sequences. This sum is represented as a single dynamic dimension: total_length, with B being defined by the batch_dim.

Because JaggedIntVar is an IntVar, it can be treated so by the AIT ops that are unaware of the jagged Tensor semantics. But the ops that are aware can interpret the JaggedIntVar as the first dimension of the jagged Tensor by specifically processing the underlying batch_dim and jagged_dims.

If there is more than one JaggedDim in a JaggedIntVar, those jagged dimensions are nested within the single dynamic dimension. E.g., if there are two JaggedDims, the JaggedIntVar represents a batch of B (batch_dim) variable-length sequences, each in turn consisting of variable-length sequences. In principle, the nesting can be arbitrarily deep, but in practice it’s usually just a single JaggedDim.

JaggedIntVar should not be created directly. Please use the make_jagged op for creating a jagged Tensor from a normal Tensor, the offsets, and the metadata (like batch_dim and jagged_dims). The make_jagged op creates the corresponding JaggedIntVar under the hood.

Methods:

batch_dim()

The batch_dim of the JaggedIntVar.

get_max_dense_shape()

Returns a list of IntVars representing the maximum dense shape (rectangular volume) that the JaggedIntVar can correspond to.

jagged_dims()

The jagged_dims of the JaggedIntVar.

offsets_struct_type()

The type of the offsets struct variable used in runtime.

offsets_type()

The type of the offsets of the JaggedIntVar's jagged_dims.

offsets_var_name()

The name of the offsets struct variable in runtime.

total_length()

The total_length dimension the JaggedIntVar is based on.

batch_dim() IntVar[source]

The batch_dim of the JaggedIntVar.

get_max_dense_shape() List[IntVar][source]

Returns a list of IntVars representing the maximum dense shape (rectangular volume) that the JaggedIntVar can correspond to. The result has the batch_dim as the first item and the IntImm with the max_value of each JaggedDim that follows.

jagged_dims() List[JaggedDim][source]

The jagged_dims of the JaggedIntVar.

offsets_struct_type() str[source]

The type of the offsets struct variable used in runtime.

offsets_type() str[source]

The type of the offsets of the JaggedIntVar’s jagged_dims.

offsets_var_name() str[source]

The name of the offsets struct variable in runtime.

total_length() IntVar[source]

The total_length dimension the JaggedIntVar is based on.

aitemplate.compiler.transform.remove_no_ops.remove_no_ops(sorted_graph: List[Tensor]) List[Tensor][source]

Remove no-ops from the graph.

Parameters:

sorted_graph (List[Tensor]) – Input graph

Returns:

Graph after removing no-ops

Return type:

List[Tensor]

remove_unused_ops

Remove useless operators from a sorted_graph.

Functions:

remove_unused_ops(sorted_graph)

Remove ops which are not src operators of tensors in the input sorted_graph.

aitemplate.compiler.transform.remove_unused_ops.remove_unused_ops(sorted_graph: List[Tensor]) None[source]

Remove ops which are not src operators of tensors in the input sorted_graph.

toposort

Graph pass for topological sort.

Functions:

toposort(nodes)

Generate sorted nodes by topological order.

aitemplate.compiler.transform.toposort.toposort(nodes: Union[Tensor, List[Tensor]]) List[Tensor][source]

Generate sorted nodes by topological order. This is the foundation of all graph passes.

Parameters:

nodes (Union[Tensor, List[Tensor]]) – The output of the model

Returns:

Sorted graph

Return type:

List[Tensor]
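
A short usage sketch (tensor construction as elsewhere on this page):

    from aitemplate.compiler import ops
    from aitemplate.compiler.transform.toposort import toposort
    from aitemplate.frontend import Tensor

    # Hedged sketch: toposort walks back from the outputs and returns every
    # reachable tensor, producers before consumers.
    X = Tensor(shape=[8, 16], dtype="float16", name="X", is_input=True)
    Y = ops.reshape()(X, [16, 8])
    Y._attrs["is_output"] = True

    sorted_graph = toposort(Y)  # e.g. [X, Y]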

transform_memory_ops

Perform memory operator related transformations.

Classes:

dynamic_slice()

Cut the source tensor into slices specified by a list of start indices and a list of end indices.

Functions:

toposort(nodes)

Generate sorted nodes by topological order.

transform_memory_ops(sorted_graph[, workdir])

Eliminates unnecessary cat / split ops.

class aitemplate.compiler.transform.transform_memory_ops.dynamic_slice[source]

Cut the source tensor into slices specified by a list of start indices and a list of end indices.

Parameters:
  • x (Tensor) – input tensor

  • start_indices (List[int]) – similar to PyTorch and numpy, indices can be negative

  • end_indices (List[int]) – end_index is not included. Similar to PyTorch and numpy, indices can be negative.

Returns:

the list of sliced tensors.

Return type:

List[Tensor]
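
A call sketch (per the parameters above; negative indices count from the end):

    from aitemplate.compiler import ops
    from aitemplate.frontend import Tensor

    # Hedged sketch: keep rows 0..16 and drop the last column.
    X = Tensor(shape=[32, 64], dtype="float16", name="X", is_input=True)
    Ys = ops.dynamic_slice()(X, start_indices=[0, 0], end_indices=[16, -1])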

Methods:

gen_function()

Generates function source code string.

normalize_start_end_indices(dim_val, start, end)

Return normalized start and end indices that fall into the well-formed range 0 <= start <= end <= dim_val.

gen_function() str[source]

Generates function source code string.

Returns:

a string which contains C++ function implementation source code.

Return type:

str

Raises:

NotImplementedError

static normalize_start_end_indices(dim_val: int, start: int, end: int) List[int][source]

Return normalized start and end indices that fall into the well-formed range 0 <= start <= end <= dim_val.
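
The contract can be restated as a small helper; a behavioral sketch (not the library's code; whether out-of-range inputs are clamped or rejected is an assumption):

    def normalize_start_end_indices(dim_val: int, start: int, end: int) -> list:
        # Map negative indices to offsets from the end, then clamp so that
        # 0 <= start <= end <= dim_val.
        if start < 0:
            start += dim_val
        if end < 0:
            end += dim_val
        start = min(max(start, 0), dim_val)
        end = min(max(end, start), dim_val)
        return [start, end]

    assert normalize_start_end_indices(10, -3, 10) == [7, 10]
    assert normalize_start_end_indices(10, 2, -1) == [2, 9]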

aitemplate.compiler.transform.transform_memory_ops.toposort(nodes: Union[Tensor, List[Tensor]]) List[Tensor][source]

Generate sorted nodes by topological order. This is the foundation of all graph passes.

Parameters:

nodes (Union[Tensor, List[Tensor]]) – The output of the model

Returns:

Sorted graph

Return type:

List[Tensor]

aitemplate.compiler.transform.transform_memory_ops.transform_memory_ops(sorted_graph: List[Tensor], workdir: Optional[str] = None) List[Tensor][source]

Eliminates unnecessary cat / split ops.

transform_odd_alignment

Add permute for gemm/bmm if alignment is odd.

Functions:

transform_odd_alignment(sorted_graph[, workdir])

Transform odd alignments to even alignments for bmm operators

aitemplate.compiler.transform.transform_odd_alignment.transform_odd_alignment(sorted_graph: List[Tensor], workdir: Optional[str] = None) List[Tensor][source]

Transform odd alignments to even alignments for bmm operators

Parameters:
  • sorted_graph (List[Tensor]) – Input graph

  • workdir (str, optional) – workdir, by default None

Returns:

Optimized graph

Return type:

List[Tensor]

transform_special_ops

Perform graph transformations specifically for gemm -> gemm_special. Check each transform function’s summary for the specific pattern to be transformed.

Functions:

transform_special_ops(sorted_graph[, workdir])

Transform generic gemm/conv ops to special ops.

aitemplate.compiler.transform.transform_special_ops.transform_special_ops(sorted_graph: List[Tensor], workdir: Optional[str] = None) List[Tensor][source]

Transform generic gemm/conv ops to special ops.

Parameters:
  • sorted_graph (List[Tensor]) – Input graph

  • workdir (str, optional) – workdir, by default None

Returns:

Transformed graph

Return type:

List[Tensor]

transform_strided_op_and_view_op

Perform transformations to fuse view ops with strided ops by using TensorAccessor.

Classes:

StableSet([s])

class aitemplate.compiler.transform.transform_strided_op_and_view_op.StableSet(s: Optional[Iterable[Any]] = None)[source]

Methods:

add(value)

Add an element.

clear()

This is slow (creates N new iterators!) but effective.

discard(value)

Remove an element.

remove(value)

Remove an element.

add(value) None[source]

Add an element.

clear()[source]

This is slow (creates N new iterators!) but effective.

discard(value) None[source]

Remove an element. Do not raise an exception if absent.

remove(value) None[source]

Remove an element. If not a member, raise a KeyError.

transform_strided_ops

Perform transformations on ops which support strided inputs / outputs.

Functions:

detect_target(**kwargs)

Detect GPU target based on nvidia-smi and rocminfo

transform_strided_ops(sorted_graph[, workdir])

Add strided inputs / outputs to ops to avoid unnecessary data movement.

aitemplate.compiler.transform.transform_strided_ops.detect_target(**kwargs)[source]

Detect GPU target based on nvidia-smi and rocminfo

Returns:

CUDA or ROCM target

Return type:

Target

aitemplate.compiler.transform.transform_strided_ops.transform_strided_ops(sorted_graph: List[Tensor], workdir: Optional[str] = None) List[Tensor][source]

Add strided inputs / outputs to ops to avoid unnecessary data movement.

transform_strided_slice

Perform transformations on slice and strided ops.