aitemplate.compiler.ops

AIT operators.

Classes:

`EpilogueOp`(value)	Epilogue enum.
`FuncEnum`(value)	Elementwise func enum.
`JaggedDim`(min_value, max_value)	A class representing a single jagged dimension encoded within a JaggedIntVar.
`JaggedIntVar`(total_length, batch_dim, ...)	JaggedIntVar is a specific case of IntVar that encodes one or more jagged dimensions within itself.
`argmax`([dim])	Returns the indices of the maximum value of all elements across a dimension in the input tensor.
`avg_pool2d`(kernel_size, stride, pad)	Applies a 2D average pooling over an input signal composed of several input planes.
`batch_gather`()	Gathers values of the input tensor specified by indicies.
`batch_layernorm_sigmoid_mul`([normalized_shape])	batch_layernorm_sigmoid_mul op.
`batched_dense_vec_jagged_2d_mul`()	Compute a dense tensor containing batched matrix multiplication of a batched dense vector and a batched jagged matrix.
`batched_nms`([iou_threshold, keep_n])	Performs non-maximum suppression (NMS) on the boxes according to their intersection-over-union (IoU) in a batched fashion.
`bert_embeddings`()	Construct the embeddings from word, position and token_type embeddings.
`bmm_ccc`()	Batch GEMM specialization for A[ColMajor], B[ColMajor], C[ColMajor].
`bmm_ccc_add`()	Batch GEMM specialization for A[ColMajor], B[ColMajor], C[ColMajor] with Add.
`bmm_ccr`()	Batch GEMM specialization for A[ColMajor], B[ColMajor], C[RowMajor].
`bmm_ccr_add`()	Batch GEMM specialization for A[ColMajor], B[ColMajor], C[RowMajor] with Add.
`bmm_crc`()	Batch GEMM specialization for A[ColMajor], B[RowMajor], C[ColMajor].
`bmm_crc_add`()	Batch GEMM specialization for A[ColMajor], B[RowMajor], C[ColMajor] with Add.
`bmm_crr`()	Batch GEMM specialization for A[ColMajor], B[RowMajor], C[RowMajor].
`bmm_crr_add`()	Batch GEMM specialization for A[ColMajor], B[RowMajor], C[RowMajor] with Add.
`bmm_rcc`()	Batch GEMM specialization for A[RowMajor], B[ColMajor], C[ColMajor].
`bmm_rcc_add`()	Batch GEMM specialization for A[RowMajor], B[ColMajor], C[ColMajor] with Add.
`bmm_rcr`()	Batch GEMM specialization for A[RowMajor], B[ColMajor], C[RowMajor].
`bmm_rcr_add`()	Batch GEMM specialization for A[RowMajor], B[ColMajor], C[RowMajor] with Add.
`bmm_rcr_n1`()
`bmm_rcr_permute`(shape[, layout])	Batch GEMM specialization for A[RowMajor], B[ColMajor], C[RowMajor] with permutation on output to given layout.
`bmm_rcr_softmax`()	BatchGemm with softmax, A: row_major, B: column_major, C: row_major, A: [b, m, k], B: [b, n, k], C: [b, m, n]
`bmm_rrc`()	Batch GEMM specialization for A[RowMajor], B[RowMajor], C[ColMajor]
`bmm_rrc_add`()	Batch GEMM specialization for A[RowMajor], B[RowMajor], C[ColMajor] with Add.
`bmm_rrr`()	Batch GEMM specialization for A[RowMajor], B[RowMajor], C[RowMajor]
`bmm_rrr_add`()	Batch GEMM specialization for A[RowMajor], B[RowMajor], C[RowMajor] with Add.
`bmm_rrr_k1_tanh`()
`bmm_rrr_permute`(shape[, layout])	Batch GEMM specialization for A[RowMajor], B[RowMajor], C[RowMajor] with permutation on output to given layout.
`bmm_softmax_bmm`([scale])	BMM_RCR + Softmax + BMM_RRR Specialization This fusion is commonly used in Attention family
`bmm_softmax_bmm_permute`(shape[, scale, ...])	BMM_RCR + Softmax + BMM_RRR + Permute Specialization This fusion is commonly used in Attention family
`cast`()	Returns the cast of input tensor to specified type.
`chunk`()	Attempts to split a tensor into the specified number of chunks
`clamp`()	Clamps all elements in input into the range [min_value, max_value].
`classic_b2b_bmm`(causal_type, ...[, ...])
`concatenate`([fast_cat])	Concatenates the given sequence of seq tensors in the given dimension.
`concatenate_tanh`()	The fusion of concatenate and tanh.
`conv2d`(stride, pad[, dilate, group])	Applies a 2D convolution on input with size (N, H, W, C_in), and produces output with size (N, H_out, W_out, C_out) where N is batch size, H, W are the height and width of the image in pixels, and C is the number of channels.
`conv2d_bias`(stride, pad[, dilate, group])	Conv2d with bias.
`conv2d_bias_add`(stride, pad[, dilate, group])	Conv2d_bias_add.
`conv2d_bias_add_hardswish`(stride, pad[, ...])	Conv2d_bias_add_hardswish.
`conv2d_bias_add_relu`(stride, pad[, dilate, ...])	Conv2d_bias_add_relu.
`conv2d_bias_few_channels`(stride, pad[, ...])	conv2d_bias_few_channels.
`conv2d_bias_hardswish`(stride, pad[, dilate, ...])	Conv2d with bias + hardswish.
`conv2d_bias_hardswish_few_channels`(stride, pad)	conv2d_bias_hardswish_few_channels.
`conv2d_bias_relu`(stride, pad[, dilate, group])	Conv2d with bias + relu.
`conv2d_bias_relu_few_channels`(stride, pad[, ...])	conv2d_bias_relu_few_channels.
`conv2d_bias_sigmoid`(stride, pad[, dilate, group])	Conv2d with bias + sigmoid.
`conv2d_depthwise`(stride, pad[, dilate, group])	Base class of conv2d with groups.
`conv2d_depthwise_bias`(stride, pad[, dilate, ...])	Base class of conv2d with groups.
`conv3d`(stride, pad[, dilate, group])
`conv3d_bias`(stride, pad[, dilate, group])
`depthwise_conv3d`(stride, pad[, dilate, ...])
`dual_bmm_rrr_div`()	Batch GEMM specialization: BMM_RRR(A, B0) / BMM_RRR(A, B1)
`dual_gemm_rcr_fast_gelu`()	GEMM Specialization: FAST_GELU(GEMM_RCR(A, B)) * GEMM_RCR(A, B1)
`dual_gemm_rcr_silu`()	GEMM Specialization: SILU(GEMM_RCR(A, B)) * GEMM_RCR(A, B1)
`dynamic_slice`()	Cut the source tensor into slices specified by a list of start indices and a list of end indices.
`efficient_nms`([preNmsTop, nmsMaxOut, ...])	Performs non-maximum suppression (NMS) on the boxes according to their intersection-over-union (IoU).
`elementwise`(func_enum)	elementwise operator definition.
`eq`()
`expand`()	Expands a tensor's singleton dimensions.
`flash_attention`(batch_size, dropout, ...)	FlashAttention provides an implementation for fused multi-head attention module:
`flatten`([start_dim, end_dim])	Flattens input by reshaping it into a one-dimensional tensor.
`fmha_style_b2b_bmm`(causal_type, ...[, ...])	See comments at the head of this file.
`full`()	Creates a tensor of a given shape and dtype filled with the specified fill_value (float scalar).
`fused_elementwise`(elementwise_ops, inputs, ...)	fused_elementwise operator is used internally.
`gather`()	gather implementation
`ge`()
`gemm_rcr`()	GEMM Specialization for A[RowMajor], B[ColMajor], C[RowMajor]
`gemm_rcr_bias`()	GEMM Specialization: GEMM_RCR(A, B) + Bias A[RowMajor], B[ColMajor], Bias[RowMajor], C[RowMajor]
`gemm_rcr_bias_add`()	GEMM Specialization: GEMM_RCR(A, B) + Bias + D0
`gemm_rcr_bias_add_add`()	GEMM Specialization: RELU(GEMM_RCR(A, B) + Bias + D0 + D1)
`gemm_rcr_bias_add_add_relu`()	GEMM Specialization: RELU(GEMM_RCR(A, B) + Bias + D0 + D1)
`gemm_rcr_bias_add_relu`()	GEMM Specialization: RELU(GEMM_RCR(A, B) + Bias + D0)
`gemm_rcr_bias_fast_gelu`()	GEMM Specialization: FastGELU(GEMM_RCR(A, B) + Bias)
`gemm_rcr_bias_gelu`()	GEMM Specialization: GELU(GEMM_RCR(A, B) + Bias)
`gemm_rcr_bias_hardswish`()	GEMM Specialization: HardSwish(GEMM_RCR(A, B) + Bias)
`gemm_rcr_bias_mul`()	GEMM Specialization: (GEMM_RCR(A, B) + Bias) * D0
`gemm_rcr_bias_mul_add`()	GEMM Specialization: (GEMM_RCR(A, B) + Bias) * D0 + D1
`gemm_rcr_bias_mul_tanh`()	GEMM Specialization: TANH((GEMM_RCR(A, B) + Bias) * D0)
`gemm_rcr_bias_permute`(shape[, layout])
`gemm_rcr_bias_relu`()	GEMM Specialization: ReLU(GEMM_RCR(A, B) + Bias)
`gemm_rcr_bias_sigmoid`()	GEMM Specialization: Sigmoid(GEMM_RCR(A, B) + Bias)
`gemm_rcr_bias_sigmoid_mul`()	GEMM Specialization: Sigmoid(GEMM_RCR(A, B) + Bias) * D0
`gemm_rcr_bias_sigmoid_mul_tanh`()	GEMM Specialization: Tanh(Sigmoid(GEMM_RCR(A, B) + Bias) * D0)
`gemm_rcr_bias_softmax`()	gemm_rcr_bias_softmax operator.
`gemm_rcr_bias_swish`()	GEMM Specialization: SiLU(GEMM_RCR(A, B) + Bias)
`gemm_rcr_bias_tanh`()	GEMM Specialization: Tanh(GEMM_RCR(A, B) + Bias)
`gemm_rcr_fast_gelu`()	GEMM Specialization: FastGELU(GEMM_RCR(A, B))
`gemm_rcr_permute`(shape[, layout])
`gemm_rcr_permute_elup1`(args, *kwargs)
`gemm_rcr_softmax`()	gemm_rcr_softmax operator.
`gemm_rrr`()	GEMM Specialization for A[RowMajor], B[RowMajor], C[RowMajor]
`gemm_rrr_bias`()	GEMM Specialization: GEMM_RRR(A, B) + Bias A[RowMajor], B[RowMajor], Bias[RowMajor], C[RowMajor]
`gemm_rrr_bias_permute`(shape[, layout])
`gemm_rrr_permute`(shape[, layout])
`gemm_rrr_small_nk`()	Special gemm kernel for small K and N (K <= 8, N <= 8) A: [M, K] B: [K, N] C: [M, N]
`getitem`()	Retrieve a single element from a list of tuple at a certain index.
`group_gemm_rcr`()	Grouped GEMM Specialization: GEMM_RCR(A, B)
`group_gemm_rcr_bias`()	Grouped GEMM Specialization: GEMM_RCR(A, B) + Bias
`group_gemm_rcr_bias_relu`()	Grouped GEMM Specialization: ReLU(GEMM_RCR(A, B) + Bias)
`group_gemm_rcr_bias_sigmoid`()	Grouped GEMM Specialization: Sigmoid(GEMM_RCR(A, B) + Bias)
`group_layernorm`([normalized_shape])	group_layernorm. For each group, we expect each input to have shapes: Input shape: [M0, M1, ..., Mp, N1, N2, ..., ND] Normalized_shape: [N1, N2, ..., ND] Gamma/Beta, if not None, have the same shape as normalized_shape. Every input in the groups must have the same [M0, M1, ..., Mp] dims.
`group_layernorm_sigmoid_mul`([normalized_shape])	group_layernorm_sigmoid_mul. For each group, we expect each input to have shapes: Input shape: [M0, M1, ..., Mp, N1, N2, ..., ND] Normalized_shape: [N1, N2, ..., ND] Gamma/Beta, if not None, have the same shape as normalized_shape. Every input in the groups must have the same [M0, M1, ..., Mp] dims.
`group_norm`(num_groups, num_channels)	Standalone group norm op.
`group_norm_swish`(num_groups, num_channels)	Standalone group norm op.
`grouped_classic_b2b_bmm`(causal_type, ...[, ...])
`grouped_fmha_style_b2b_bmm`(causal_type, ...)	See comments at the head of this file.
`gt`()
`identity`()	Returns the input tensor.
`index_select`([dim])	Returns a new tensor which indexes the input tensor along dimension dim using the entries in index which is a LongTensor.
`int_elementwise`(func_enum)	int elementwise operator definition.
`jagged_lengths_to_offsets`()	Given a 1D Tensor of lengths of the sequences in a jagged Tensor, returns the corresponding 1D Tensor of offsets.
`jagged_lengths_to_presences`()	Given a 1D Tensor of lengths of the sequences in a jagged Tensor, returns a 2D Tensor of presences indicating where the data exists and where not.
`jagged_to_padded_dense`([padding_value])	Returns a dense Tensor "expanded" from the input jagged Tensor.
`layernorm`([normalized_shape])	Standalone layernorm op.
`layernorm_sigmoid_mul`(layer_norm, sigmoid, mul)	Fused layernorm_sigmoid_mul op Input shape: [M0, M1, ..., Mp, N1, N2, ..., ND] Normalized_shape: [N1, N2, ..., ND] Gamma/Beta, if not None, have the same shape as normalized_shape.
`le`()
`list_construct`()	Construct a list of tensors.
`lt`()
`make_jagged`(batch_dim, jagged_dims[, ...])	Creates jagged Tensors from normal Tensors, offsets, and metadata.
`masked_select`()	Returns a 1D tensor containing elements of the input tensor selected by the boolean mask, similar to torch.masked_select.
`max_pool2d`(kernel_size, stride, pad)	Applies a 2D max pooling over an input signal composed of several input planes.
`mem_eff_attention`(causal[, dropout, ...])	mem_eff_attention provides an implementation for fused multi-head attention module:
`multi_level_roi_align`(num_rois, pooled_size, ...)	Performs Multiple level Region of Interest (RoI) Align operator with average pooling, as described in Mask R-CNN.
`ndhwc3to8`()	Pad the 3-channel input data to 8-channel.
`ne`()
`nhwc3to4`()
`nhwc3to8`()
`nms`([preNmsTop, nmsMaxOut, iouThreshold, ...])	Performs non-maximum suppression (NMS) on the boxes according to their intersection-over-union (IoU).
`pad_last_dim`(ndim, out_dim)	Pad the last dimension of the input data to the specified length.
`padded_dense_to_jagged`(total_length)	Returns a jagged Tensor "extracted" from the input dense Tensor, given the offsets list.
`perm021fc_ccr`()	GEMM Specialization: A.permute(0, 2, 1) @ B
`perm021fc_ccr_bias`()	GEMM Specialization: (A.permute(0, 2, 1) @ B + Bias)
`perm021fc_ccr_bias_permute`([layout])	GEMM Specialization: (A.permute(0, 2, 1) @ B + Bias).permute(0, 2, 1)
`perm021fc_crc`()	GEMM Specialization: (A.permute(0, 2, 1) @ B)
`perm021fc_crc_bias`()	GEMM Specialization: (A.permute(0, 2, 1) @ B + Bias)
`perm102_bmm_rcr`()	Batch GEMM specialization: C[m, b, n](row) = bmm(A[m, b, k](row), B[b, n, k](col))
`perm102_bmm_rcr_bias`()	Batch GEMM specialization: C[m, b, n](row) = bmm(A[m, b, k](row), B[b, n, k](col)) + bias[b, n]
`perm102_bmm_rrr`()	Batch GEMM specialization: C[m, b, n](row) = bmm(A[m, b, k](row), B[b, k, n](row))
`perm102_bmm_rrr_bias`()	Batch GEMM specialization: C[m, b, n](row) = bmm(A[m, b, k](row), B[b, k, n](row)) + bias[b, n]
`permute`()	Returns a tensor with its dimensions permuted.
`permute021`()	Permutes the input tensor from (B1, B2, ..., Bn, N, M) to (B1, B2, ..., Bn, M, N).
`permute0213`()	Permutes the input 4d tensor from (B, N, M, K) to (B, M, N, K).
`permute102`()	Permutes the input 3d tensor from (B, N, M) to (N, B, M).
`permute210`()	Permutes the input 3d tensor from (B, N, M) to (M, N, B).
`reduce_max`(dim[, keepdim, dtype])	Implements the reduce_max op.
`reduce_mean`(dim[, keepdim, dtype])	Implements the reduce_mean op.
`reduce_min`(dim[, keepdim, dtype])	Implements the reduce_min op.
`reduce_sum`(dim[, keepdim, dtype])	Implements the reduce_sum op.
`reshape`()	Returns a tensor with the same data and number of elements as input, but with the specified shape.
`roi_align`(num_rois, pooled_size, ...)	Performs Region of Interest (RoI) Align operator with average pooling, as described in Mask R-CNN.
`size`()	Returns the size of the input tensor.
`slice_reshape_scatter`(scatter_dim[, ...])	represent slice + concat + reshape + concat pattern with slice + concat
`slice_scatter`(scatter_dim)	This op represents a special fusion case where the inputs of a concatenate op all come from slice ops.
`softmax`()	Applies the Softmax function to a 2D input Tensor rescaling them so that the elements of the n-dimensional output Tensor lie in the range [0,1] and sum to 1.
`split`()	Splits the tensor into chunks on the specified dimension.
`squeeze`(dim)	Examines the specified dimension and gets rid of it if it is of size 1.
`topk`(k)	Returns the k largest elements of the given input tensor along its last dimension.
`transpose`()	Returns a tensor with its two dimensions transposed.
`transposed_conv2d`(stride, pad[, dilate, group])	Transposed conv2d.
`transposed_conv2d_bias`(stride, pad[, ...])	Transposed conv2d with bias.
`transposed_conv2d_bias_relu`(stride, pad[, ...])	Transposed conv2d with bias + relu.
`tuple_construct`()	Construct a tuple of tensors.
`unsqueeze`(dim)	Adds a dimension of size 1 at a specified location.
`upsampling2d`(scale_factor, mode)	Applies a 2D bilinear upsampling to an input signal composed of several input channels.
`upsampling2d_add`(scale_factor, mode)	Fused op for bilinear_upsampling + add.
`var`(dim, unbiased[, keepdim, dtype])	Calculates the variance of all elements in the input tensor.
`vector_norm`([ord_kind, dim, keepdim, dtype])	Vector_norm op implementation that simulates pytorch's linalg.vector_norm.
`where`()	Return a tensor of elements selected from either input or other, depending on condition.

Functions:

`convert_shape_to_IntVar`(shape)	Helper function to convert a list of mixed int/IntVar/IntImm into a list with only IntVar/IntImm.
`gen_int_var_min_max`(values[, name, ...])	A helper function to generate IntImm or IntVar depending on the length of values.
`is_symbolic`(sym_val)	Check whether sym_val is a sympy class.
`normalize_dtype`(dtype)	Returns a normalized dtype str.
`simplify_intvar_values`(sym_val)	Given a symbolic value, resolve the symbol's value range.

class aitemplate.compiler.ops.EpilogueOp(value)[source]: Epilogue enum.

class aitemplate.compiler.ops.FuncEnum(value)[source]: Elementwise func enum.

class aitemplate.compiler.ops.JaggedDim(min_value: IntVar, max_value: IntVar)[source]

A class representing a single jagged dimension encoded within a JaggedIntVar. Each instance contains the min and max value for the variable-length jagged dimension. It is also associated with the rank-1 offsets Tensor representing the layout of the jagged dimension within the JaggedIntVar. The offsets are associated with the JaggedDim instances after creation, while creating a jagged tensor with the make_jagged op.

See the docstring of the JaggedIntVar class for details.

Methods:

`max_value`()	The maximum possible value of the JaggedDim.
`min_value`()	The minimum possible value of the JaggedDim.
`offsets`()	The rank-1 offsets Tensor associated with the JaggedDim
`pseudo_code`([with_shape])	Returns a string containing pseudo code of this object.

max_value() → IntVar[source]: The maximum possible value of the JaggedDim.

min_value() → IntVar[source]: The minimum possible value of the JaggedDim.

offsets() → Optional[Tensor][source]: The rank-1 offsets Tensor associated with the JaggedDim

pseudo_code(with_shape=False) → str[source]

Returns a string containing pseudo code of this object.

Parameters:: with_shape (bool) – Marks whether to include shape info in the returned pseudo code.
Returns:: Pseudo code.
Return type:: str

class aitemplate.compiler.ops.JaggedIntVar(total_length: IntVar, batch_dim: IntVar, jagged_dims: List[JaggedDim])[source]

JaggedIntVar is a specific case of IntVar that encodes one or more jagged dimensions within itself. JaggedIntVar is used as the first dimension in jagged Tensors’ shape (this is, basically, what makes a Tensor jagged). E.g., a JaggedIntVar with a single JaggedDim represents a single dynamic dimension encoding a batch of variable sequence length. For the batch size of B, in some sources this is indicated as sum_B(N_B): the sum of individual sequence lengths: N_1, N_2, …, N_B of B sequences. This sum is represented as a single dynamic dimension: total_length, with B being defined by the batch_dim.

Because JaggedIntVar is an IntVar, it can be treated so by the AIT ops that are unaware of the jagged Tensor semantics. But the ops that are aware can interpret the JaggedIntVar as the first dimension of the jagged Tensor by specifically processing the underlying batch_dim and jagged_dims.

If there is more than one JaggedDim in a JaggedIntVar, those jagged dimensions are nested within the single dynamic dimension. E.g., if there are two JaggedDims, the JaggedIntVar represents a batch of B (batch_dim) variable-length sequences, each in turn consisting of variable-length sequences. In principle, the nesting can be arbitrarily deep, but in practice it’s usually just a single JaggedDim.

JaggedIntVar should not be created directly. Please use the make_jagged op for creating a jagged Tensor from a normal Tensor, the offsets, and the metadata (like batch_dim and jagged_dims). The make_jagged op creates the corresponding JaggedIntVar under the hood.

Methods:

`batch_dim`()	The batch_dim of the JaggedIntVar.
`get_max_dense_shape`()	Returns a list of IntVars representing the maximum dense shape (rectangular volume) that the JaggedIntVar can correspond to.
`jagged_dims`()	The jagged_dims of the JaggedIntVar.
`offsets_struct_type`()	The type of the offsets struct variable used in runtime.
`offsets_type`()	The type of the offsets of the JaggedIntVar's jagged_dims.
`offsets_var_name`()	The name of the offsets struct variable in runtime.
`total_length`()	The total_length dimension the JaggedIntVar is based on.

batch_dim() → IntVar[source]: The batch_dim of the JaggedIntVar.

get_max_dense_shape() → List[IntVar][source]: Returns a list of IntVars representing the maximum dense shape (rectangular volume) that the JaggedIntVar can correspond to. The result has the batch_dim as the first item and the IntImm with the max_value of each JaggedDim that follows.

jagged_dims() → List[JaggedDim][source]: The jagged_dims of the JaggedIntVar.

offsets_struct_type() → str[source]: The type of the offsets struct variable used in runtime.

offsets_type() → str[source]: The type of the offsets of the JaggedIntVar’s jagged_dims.

offsets_var_name() → str[source]: The name of the offsets struct variable in runtime.

total_length() → IntVar[source]: The total_length dimension the JaggedIntVar is based on.

class aitemplate.compiler.ops.argmax(dim=0)[source]

Returns the indices of the maximum value of all elements across a dimension in the input tensor. If there are multiple maximal values then the indices of the first maximal value are returned.

Parameters:

input (Tensor) – the source tensor
dim (int) – optional, the dimension to reduce. Default: 0

Returns:

a long tensor that contains the indices of the maximum values

Return type:

Tensor

Methods:

`gen_function`()	call backend function
`gen_profiler`([workdir, ...])	Generates source files for profiling purpose.
`profile`([workdir, devices, ...])	Get the Argmax Op workspace :param workdir: Base dir to keep profiling source codes, by default "./" :type workdir: str, optional :param devices: Devices used for profiling, by default device 0 will be used. :type devices: list, optional :param dynamic_profiling_strategy: A dynamic profiling strategy. By default MAX is used, i.e. to profile a dynamic range, an upper bound will be used. :type dynamic_profiling_strategy: DynamicProfileStrategy, optional.

gen_function() → str[source]: call backend function

gen_profiler(workdir: Optional[str] = None, dynamic_profiling_strategy=None) → None[source]

Generates source files for profiling purpose.

Parameters:

workdir (str, optional) – The directory to generate source files.
dynamic_profiling_strategy (DynamicProfileStrategy, optional) – A dynamic profiling strategy, used to filter generated profiles at compile time. See also: profile()

profile(workdir='./', devices=None, dynamic_profiling_strategy=None)[source]

Get the Argmax Op workspace :param workdir: Base dir to keep profiling source codes, by default “./” :type workdir: str, optional :param devices: Devices used for profiling, by default device 0 will be used. :type devices: list, optional :param dynamic_profiling_strategy: A dynamic profiling strategy. By default MAX is used, i.e. to profile

a dynamic range, an upper bound will be used.

class aitemplate.compiler.ops.avg_pool2d(kernel_size, stride, pad)[source]

Applies a 2D average pooling over an input signal composed of several input planes.

In the simplest case, the output value of the layer with input size \((N, H, W, C)\), output \((N, H_{out}, W_{out}, C)\) and kernel_size \((kH, kW)\) can be precisely described as:

\[out(N_i, C_j, h, w) = \frac{1}{kH * kW} \sum_{m=0}^{kH-1} \sum_{n=0}^{kW-1} input(N_i, C_j, stride[0] \times h + m, stride[1] \times w + n)\]

If pad is non-zero, then the input is implicitly zero-padded on both sides for pad number of points.

.attr.:kernel_size: the size of the window
.attr.:stride: the stride of the window
.attr.:pad: implicit zero padding to be added on both sides

Parameters:: input (Tensor [N, H, W, C]) – the input tensor.
Returns:: Tensor [N, H_out, W_out, C].

class aitemplate.compiler.ops.batch_gather[source]

Gathers values of the input tensor specified by indicies. Dim 0 of indicies correspond to the indices of input elements in dim 0.

Parameters:

input (Tensor) – the source tensor
indices (Tensor) – the indices of elements to gather

Returns:

the destination tensor

Return type:

Tensor

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

class aitemplate.compiler.ops.batch_layernorm_sigmoid_mul(normalized_shape: Optional[List[IntImm]] = None)[source]: batch_layernorm_sigmoid_mul op. This op expects the normalized_shape to be 1D.

class aitemplate.compiler.ops.batched_dense_vec_jagged_2d_mul[source]

Compute a dense tensor containing batched matrix multiplication of a batched dense vector and a batched jagged matrix.

Parameters:

vectors (Tensor) – batched dense vector of shape [B, H, N].
matrices (Tensor) – batched jagged matrix of shape [sum_B(N_B), H, D].

Returns:

dense tensor containing the batched vector / jagged matrix multiplication result of shape [B, H, D].

Return type:

output (Tensor)

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

class aitemplate.compiler.ops.batched_nms(iou_threshold=0.5, keep_n=-1)[source]

Performs non-maximum suppression (NMS) on the boxes according to their intersection-over-union (IoU) in a batched fashion.

NMS iteratively removes lower scoring boxes which have an IoU greater than iou_threshold with another (higher scoring) box.

Note: if multiple boxes have the exact same score and satisfy the IoU criterion with respect to a reference box, the selected box is not guaranteed to be the same for different backends.

iouThreshold identifies the intersection-over-union (IoU) threshold which is used to discards all overlapping boxes with IoU > iouThreshold. By default 0.5.

keep_n identifies the number of boxes to return, by default -1 to return all.

Parameters:: boxes (Tensor[N, 4])) – are expected to be in (x1, y1, x2, y2) format with 0 <= x1 < x2 and 0 <= y1 < y2), and have been sorted in decreasing order of scores.
Returns:: “keep” (Tensor[N]) in which each element indicates if the corresponding box is removed (element=0) or not (element=1).
Return type:: Tensor

Methods:

gen_function()

call backend function

gen_function() → str[source]: call backend function

class aitemplate.compiler.ops.bert_embeddings[source]

Construct the embeddings from word, position and token_type embeddings.

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

class aitemplate.compiler.ops.bmm_ccc[source]

Batch GEMM specialization for A[ColMajor], B[ColMajor], C[ColMajor].

This operator is equivalent to following PyTorch code:

X_pt = torch.randn(B, K, M).cuda().half()
W_pt = torch.randn(B, N, K).cuda().half()

XT = torch.transpose(X_pt, 2, 1)
YT = torch.bmm(XT, W_pt.transpose(2, 1))
Y_pt = torch.transpose(YT, 2, 1)

class aitemplate.compiler.ops.bmm_ccc_add[source]

Batch GEMM specialization for A[ColMajor], B[ColMajor], C[ColMajor] with Add. C can be the same size as the output or be broadcast as bias.

This operator is equivalent to following PyTorch code:

X_pt = torch.randn(B, K, M).cuda().half()
W_pt = torch.randn(B, N, K).cuda().half()
D_pt = torch.randn(B, N, M).cuda().half()

XT = torch.transpose(X_pt, 2, 1)
WT = torch.transpose(W_pt, 2, 1)
YT = torch.bmm(XT, WT)
Y_pt = YT.transpose(2, 1) + D_pt

__call__(a: Tensor, b: Tensor, c: Tensor) -> Tensor:

aTensor: Tensor in shape (B, K, M)
bTensor: Tensor in shape (B, N, K)
cTensor: Tensor in shape (B, N, M)

Tensor: Tensor in shape (B, N, M)

class aitemplate.compiler.ops.bmm_ccr[source]

Batch GEMM specialization for A[ColMajor], B[ColMajor], C[RowMajor].

This operator is equivalent to following PyTorch code:

X_pt = torch.randn(B, K, M).cuda().half()
W_pt = torch.randn(B, N, K).cuda().half()

XT = torch.transpose(X_pt, 2, 1)
Y_pt = torch.bmm(XT, W_pt.transpose(2, 1))

class aitemplate.compiler.ops.bmm_ccr_add[source]

Batch GEMM specialization for A[ColMajor], B[ColMajor], C[RowMajor] with Add. C can be the same size as the output or be broadcast as bias.

This operator is equivalent to following PyTorch code:

X_pt = torch.randn(B, K, M).cuda().half()
W_pt = torch.randn(B, N, K).cuda().half()
D_pt = torch.randn(B, M, N).cuda().half()

XT = torch.transpose(X_pt, 2, 1)
WT = torch.transpose(W_pt, 2, 1)
Y_pt = torch.bmm(XT, WT)
Y_pt = Y_pt + D_pt

__call__(a: Tensor, b: Tensor, c: Tensor) -> Tensor:

aTensor: Tensor in shape (B, K, M)
bTensor: Tensor in shape (B, N, K)
cTensor: Tensor in shape (B, M, N)

Tensor: Tensor in shape (B, M, N)

class aitemplate.compiler.ops.bmm_crc[source]

Batch GEMM specialization for A[ColMajor], B[RowMajor], C[ColMajor].

This operator is equivalent to the following PyTorch code:

X_pt = torch.randn(B, K, M).cuda().half()
W_pt = torch.randn(B, K, N).cuda().half()

XT = torch.transpose(X_pt, 2, 1)
YT = torch.bmm(XT, W_pt)
Y_pt = torch.transpose(YT, 2, 1)

class aitemplate.compiler.ops.bmm_crc_add[source]

Batch GEMM specialization for A[ColMajor], B[RowMajor], C[ColMajor] with Add. C can be the same size as the output or be broadcast as bias.

This operator is equivalent to the following PyTorch code:

X_pt = torch.randn(B, K, M).cuda().half()
W_pt = torch.randn(B, K, N).cuda().half()
D_pt = torch.randn(B, N, M).cuda().half()

XT = torch.transpose(X_pt, 2, 1)
YT = torch.bmm(XT, W_pt)
Y_pt = YT.transpose(2, 1) + D_pt

__call__(a: Tensor, b: Tensor, c: Tensor) -> Tensor:

aTensor: Tensor in shape (B, K, M)
bTensor: Tensor in shape (B, K, N)
cTensor: Tensor in shape (B, N, M)

Tensor: Tensor in shape (B, N, M)

class aitemplate.compiler.ops.bmm_crr[source]

Batch GEMM specialization for A[ColMajor], B[RowMajor], C[RowMajor].

This operator is equivalent to the following PyTorch code:

X_pt = torch.randn(B, K, M).cuda().half()
W_pt = torch.randn(B, K, N).cuda().half()

XT = torch.transpose(X_pt, 2, 1)
Y_pt = torch.bmm(XT, W_pt)

class aitemplate.compiler.ops.bmm_crr_add[source]

Batch GEMM specialization for A[ColMajor], B[RowMajor], C[RowMajor] with Add. C can be the same size as the output or be broadcast as bias.

This operator is equivalent to the following PyTorch code:

X_pt = torch.randn(B, K, M).cuda().half()
W_pt = torch.randn(B, K, N).cuda().half()
D_pt = torch.randn(B, M, N).cuda().half()

XT = torch.transpose(X_pt, 2, 1)
Y_pt = torch.bmm(XT, W_pt)
Y_pt = Y_pt + D_pt

__call__(a: Tensor, b: Tensor, c: Tensor) -> Tensor:

aTensor: Tensor in shape (B, K, M)
bTensor: Tensor in shape (B, K, N)
cTensor: Tensor in shape (B, M, N)

Tensor: Tensor in shape (B, M, N)

class aitemplate.compiler.ops.bmm_rcc[source]

Batch GEMM specialization for A[RowMajor], B[ColMajor], C[ColMajor].

This operator is equivalent to the following PyTorch code:

X_pt = torch.randn(B, K, M).cuda().half()
W_pt = torch.randn(B, N, K).cuda().half()

XT = torch.transpose(X_pt, 2, 1)
YT = torch.bmm(XT, W_pt)
Y_pt = torch.transpose(YT, 2, 1)

class aitemplate.compiler.ops.bmm_rcc_add[source]

Batch GEMM specialization for A[RowMajor], B[ColMajor], C[ColMajor] with Add. C can be the same size as the output or be broadcast as bias.

This operator is equivalent to following PyTorch code:

X_pt = torch.randn(B, M, K).cuda().half()
W_pt = torch.randn(B, N, K).cuda().half()
D_pt = torch.randn(B, N, M).cuda().half()

WT = torch.transpose(W_pt, 2, 1)
YT = torch.bmm(X_pt, WT)
Y_pt = YT.transpose(2, 1) + D_pt

__call__(a: Tensor, b: Tensor, c: Tensor) -> Tensor:

aTensor: Tensor in shape (B, M, K)
bTensor: Tensor in shape (B, N, K)
cTensor: Tensor in shape (B, N, M)

Tensor: Tensor in shape (B, N, M)

class aitemplate.compiler.ops.bmm_rcr[source]

Batch GEMM specialization for A[RowMajor], B[ColMajor], C[RowMajor].

This operator is equivalent to the following PyTorch code:

X_pt = torch.randn(B, K, M).cuda().half()
W_pt = torch.randn(B, N, K).cuda().half()

XT = torch.transpose(X_pt, 2, 1)
Y_pt = torch.bmm(XT, W_pt)

class aitemplate.compiler.ops.bmm_rcr_add[source]

Batch GEMM specialization for A[RowMajor], B[ColMajor], C[RowMajor] with Add. C can be the same size as the output or be broadcast as bias.

This operator is equivalent to following PyTorch code:

X_pt = torch.randn(B, M, K).cuda().half()
W_pt = torch.randn(B, N, K).cuda().half()
D_pt = torch.randn(B, M, N).cuda().half()

WT = torch.transpose(W_pt, 2, 1)
Y_pt = torch.bmm(X_pt, WT)
Y_pt = Y_pt + D_pt

__call__(a: Tensor, b: Tensor, c: Tensor) -> Tensor:

aTensor: Tensor in shape (B, M, K)
bTensor: Tensor in shape (B, N, K)
cTensor: Tensor in shape (B, M, N)

Tensor: Tensor in shape (B, M, N)

class aitemplate.compiler.ops.bmm_rcr_n1[source]

Methods:

`gen_profiler`([workdir, ...])	This kernel doesn't require profiling.
`is_valid_shape`(a, b)	Check input a/b is valid for bmm_rcr_n1. Requirements: 1) matching dimension of a/b (where a is row major, b is column major) 2) dim N of b needs to be 1 3) dim K of b needs to be multiple of 8.

gen_profiler(workdir: Optional[str] = None, dynamic_profiling_strategy=None) → None[source]: This kernel doesn’t require profiling.

static is_valid_shape(a: Tensor, b: Tensor)[source]

Check input a/b is valid for bmm_rcr_n1. Requirements:

matching dimension of a/b (where a is row major, b is column major)

dim N of b needs to be 1

dim K of b needs to be multiple of 8

class aitemplate.compiler.ops.bmm_rcr_permute(shape: Tuple[int], layout='0213')[source]

Batch GEMM specialization for A[RowMajor], B[ColMajor], C[RowMajor] with permutation on output to given layout.

Currently only supports reshape to 4D tensor, then do 0213 permute

This operator is equivalent to following PyTorch code:

class aitemplate.compiler.ops.bmm_rcr_softmax[source]: BatchGemm with softmax, A: row_major, B: column_major, C: row_major, A: [b, m, k], B: [b, n, k], C: [b, m, n]

class aitemplate.compiler.ops.bmm_rrc[source]

Batch GEMM specialization for A[RowMajor], B[RowMajor], C[ColMajor]

This operator is equivalent to the following PyTorch code:

X_pt = torch.randn(B, M, K).cuda().half()
W_pt = torch.randn(B, K, N).cuda().half()

YT = torch.bmm(X_pt, W_pt)
Y_pt = torch.transpose(YT, 2, 1)

class aitemplate.compiler.ops.bmm_rrc_add[source]

Batch GEMM specialization for A[RowMajor], B[RowMajor], C[ColMajor] with Add. C can be the same size as the output or be broadcast as bias.

This operator is equivalent to the following PyTorch code:

X_pt = torch.randn(B, M, K).cuda().half()
W_pt = torch.randn(B, K, N).cuda().half()
D_pt = torch.randn(B, N, M).cuda().half()
YT = torch.bmm(X_pt, W_pt)
Y_pt = YT.transpose(2, 1) + D_pt

__call__(a: Tensor, b: Tensor, c: Tensor) -> Tensor:

aTensor: Tensor with shape (B, M, K)
bTensor: Tensor with shape (B, K, N)
cTensor: Tensor with shape (B, N, M)

Tensor: Tensor with shape (B, N, M)

class aitemplate.compiler.ops.bmm_rrr[source]

Batch GEMM specialization for A[RowMajor], B[RowMajor], C[RowMajor]

This operator is equivalent to the following PyTorch code:

X_pt = torch.randn(B, M, K).cuda().half()
W_pt = torch.randn(B, K, N).cuda().half()

Y_pt = torch.bmm(X_pt, W_pt)

class aitemplate.compiler.ops.bmm_rrr_add[source]

Batch GEMM specialization for A[RowMajor], B[RowMajor], C[RowMajor] with Add. C can be the same size as the output or be broadcast as bias.

This operator is equivalent to the following PyTorch code:

X_pt = torch.randn(B, M, K).cuda().half()
W_pt = torch.randn(B, K, N).cuda().half()
D_pt = torch.randn(B, M, N).cuda().half()

Y_pt = torch.bmm(X_pt, W_pt) + D_pt

__call__(a: Tensor, b: Tensor, c: Tensor) -> Tensor:

aTensor: Tensor with shape (B, M, K)
bTensor: Tensor with shape (B, K, N)
cTensor: Tensor with shape (B, M, N)

Tensor: Tensor with shape (B, M, N)

class aitemplate.compiler.ops.bmm_rrr_k1_tanh[source]

Methods:

gen_profiler([workdir, ...])

This kernel does not require profiling.

gen_profiler(workdir: Optional[str] = None, dynamic_profiling_strategy=None) → None[source]: This kernel does not require profiling.

class aitemplate.compiler.ops.bmm_rrr_permute(shape: Tuple[int], layout='0213')[source]

Batch GEMM specialization for A[RowMajor], B[RowMajor], C[RowMajor] with permutation on output to given layout.

Currently only supports reshape to 4D tensor, then do 0213 permute

This operator is equivalent to following PyTorch code:

class aitemplate.compiler.ops.bmm_softmax_bmm(scale=1.0)[source]

BMM_RCR + Softmax + BMM_RRR Specialization This fusion is commonly used in Attention family

This op is equivalent to the following PyTorch code:

Q = torch.randn(B, M, K).cuda().half()
K = torch.randn(B, N, K).cuda().half()
V = torch.randn(B, N, O).cuda().half()

attn = torch.bmm(Q, K.transpose(1, 2)) * scale
attn = torch.softmax(attn, dim=-1)
score = torch.bmm(attn, V)

Limitations: 1. Output dim O should be smaller than 256. 2. CUDA backend codegen is not implemented in this release.

class aitemplate.compiler.ops.bmm_softmax_bmm_permute(shape: Tuple[int], scale=1.0, causal=False, layout='0213')[source]

BMM_RCR + Softmax + BMM_RRR + Permute Specialization This fusion is commonly used in Attention family

This op is equivalent to the following PyTorch code:

Q = torch.randn(B, M, K).cuda().half()
K = torch.randn(B, N, K).cuda().half()
V = torch.randn(B, N, O).cuda().half()

attn = torch.bmm(Q, K.transpose(1, 2)) * scale
attn = torch.softmax(attn, dim=-1)
score = torch.bmm(attn, V)
score_reshape = score.reshape(B // num_heads, num_heads, M, O)
score_permute = torch.permute(score_reshape, [0, 2, 1, 3])

Limitations: 1. Output dim O should be smaller than 256. 2. CUDA backend codegen is not implemented in this release.

class aitemplate.compiler.ops.cast[source]

Returns the cast of input tensor to specified type. Only the conversion between any pair of float16, bfloat16, and float32 dtypes is supported.

Parameters:

x (Tensor) – the source tensor
dtype (str) – the target type for the cast operator

Returns:

a tensor with the type converted to the specified dtype.

Return type:

Tensor

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

class aitemplate.compiler.ops.chunk[source]

Attempts to split a tensor into the specified number of chunks

Parameters:

input (Tensor) – the tensor to split
chunks (int) – number of chunks to return. Must be >= 1
dim (int) – optional, axes along which to split the tensor, by default 0

Returns :: List[Tensor]: If the tensor size along the given dimesion dim is divisible by chunks, all returned chunks will be the same size. If the tensor size along the given dimension dim is not divisible by chunks, all returned chunks will be the same size, except the last one. If such division is not possible, this function may return less than the specified number of chunks.

class aitemplate.compiler.ops.clamp[source]: Clamps all elements in input into the range [min_value, max_value]. Returns y = min(max(x, min_value), max_value). If min is None, there is no lower bound. Or, if max is None there is no upper bound. If min is greater than max torch.clamp(…, min, max) sets all elements in input to the value of max.

class aitemplate.compiler.ops.classic_b2b_bmm(causal_type: CausalType, epilogue_math_name: str, alpha0: float, alpha1: float, alpha1_divide_by_seq_len: bool = False)[source]

Methods:

gen_function()

call backend functions

gen_function() → str[source]: call backend functions

class aitemplate.compiler.ops.concatenate(fast_cat=True)[source]

Concatenates the given sequence of seq tensors in the given dimension. All tensors must either have the same shape (except in the concatenating dimension) or be empty. It is the inverse operation for split and chunk.

Parameters:

inputs (List[Tensor]) – the sequence of input tensors to concatenate
dim (int) – the dimension to concatenate. Optional, 0 by default

Returns:

the output tensor

Return type:

Tensor

Methods:

`check_rank`(inputs, dim)	check if the rank is valid
`gen_function`()	Generates function source code string.
`get_first_non_empty_input_if_any`(inputs)	Return the first non-empty input and its index from the list.
`get_original_index`(idx)	Return the original index of the input at idx in the current "inputs" list.
`get_tensor_index`(tensor)	Return the index for the input tensor in the "inputs" list.
`remove_input_at`(indices)	This function removes the inputs in indices from the "inputs" attribute and sets input_masks[indices] to be False.

static check_rank(inputs: List[Tensor], dim) → bool[source]: check if the rank is valid

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

static get_first_non_empty_input_if_any(inputs: List[Tensor]) → Tuple[Tensor, int][source]: Return the first non-empty input and its index from the list. If all inputs are empty, return the first input.

get_original_index(idx: int) → int[source]

Return the original index of the input at idx in the current “inputs” list.

Parameters:: idx (int) – the index of an input based on the current “inputs”
Returns:: the index of this input in the “original_inputs”
Return type:: int

get_tensor_index(tensor: Tensor) → int[source]

Return the index for the input tensor in the “inputs” list.

Parameters:: tensor (Tensor) – the input tensor for looking up the index
Returns:: the index of this input in the “nputs” list
Return type:: int

remove_input_at(indices: Union[int, Sequence[int]]) → None[source]

This function removes the inputs in indices from the “inputs” attribute and sets input_masks[indices] to be False. Note that the indices are based on the current “inputs”.

Parameters:: indices (Union[int, Sequence[int]]) – the index of an input or indices of multiple inputs based on the current “inputs”
Return type:: None

class aitemplate.compiler.ops.concatenate_tanh[source]: The fusion of concatenate and tanh.

class aitemplate.compiler.ops.conv2d(stride, pad, dilate=1, group=1)[source]

Applies a 2D convolution on input with size (N, H, W, C_in), and produces output with size (N, H_out, W_out, C_out) where N is batch size, H, W are the height and width of the image in pixels, and C is the number of channels.

In the simplest case, the output value of the layer with input size \((N, H, W, C_{\text{in}})\) and output \((N, H_{\text{out}}, W_{\text{out}}, C_{\text{out}})\) can be precisely described as:

\[\text{out}(N_i, C_{\text{out}_j}) = \text{bias}(C_{\text{out}_j}) + \sum_{k = 0}^{C_{\text{in}} - 1} \text{weight}(C_{\text{out}_j}, k) \star \text{input}(N_i, k)\]

where \(\star\) is the valid 2D cross-correlation operator.

stride controls the stride for the cross-correlation.
pad controls the amount of implicit zero padding on both sides for dilation * (kernel_size - 1) - padding number of points.
dilate controls the spacing between the kernel points; also known as the à trous algorithm. It is harder to describe, but the link here has a nice visualization of what dilation does.
group controls the number of blocked connections from input channels to output channels.

Parameters:

input – input tensor of shape \((N , H , W, \text{in\_channels})\)
weight – filters of shape \((\text{out\_channels} , K_h, K_w, \frac{\text{in\_channels}}{\text{groups}})\)

This operator uses “channels_last” data format. Below is an example and its equivalence in PyTorch:

X = Tensor(shape=[N, H, W, C_in], dtype="float16", name="images", is_input=True)
W = Tensor(shape=[C_out, K_h, K_w, C_in], dtype="float16", name="weight", is_input=True)
OP = aitemplate.compiler.ops.conv2d(stride=1, pad=1, dilate=1)
Y = OP(X, W)

X_pt = NHWC2NCHW(X_ait)
W_pt = NHWC2NCHW(W_ait)

Y_pt = torch.nn.functional.conv2d(X_pt, W_pt)
Y = NCHW2NHWC(Y_pt)

Methods:

`gen_function`()	Generates function source code string.
`gen_profiler`([workdir, ...])	Profiler generator.
`profile`([workdir, devices, ...])	Selects the fastest kernel configurations.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

gen_profiler(workdir: Optional[str] = None, dynamic_profiling_strategy=DynamicProfileStrategy.HINTS) → None[source]

Profiler generator.

Parameters:

workdir (str, optional, by default None) –
dynamic_profiling_strategy (DynamicProfileStrategy, optional) – A dynamic profiling strategy, used to filter generated profiles at compile time. See also: profile()

profile(workdir='./', devices=None, dynamic_profiling_strategy=DynamicProfileStrategy.HINTS)[source]

Selects the fastest kernel configurations.

Parameters:

workdir (str, optional) – The directory which contains source files, by default “./”
devices (list, optional) – A list of device ids which can be used for profiling.
dynamic_profiling_strategy (DynamicProfileStrategy, optional) – Profiling strategy used when there are dynamic dims. By default, MAX is used, i.e. to profile a dynamic range, an upper bound will be used.

class aitemplate.compiler.ops.conv2d_bias(stride, pad, dilate=1, group=1)[source]

Conv2d with bias.

Applies a 2D convolution on input in shape (N, H, W, C_in), adds a bias in shape (C_out) produces output in shape (N, H_out, W_out, C_out). N is batch size, H, W are the height and width of the input images in pixels, and C is the number of channels.

Parameters:

input – input tensor of shape \((N , H , W, \text{in\_channels})\)
weight – filters of shape \((\text{out\_channels} , K_h, K_w, \frac{\text{in\_channels}}{\text{groups}})\)
bias – optional bias tensor of shape \((\text{out\_channels})\). Default: None

This operator uses “channels_last” data format. Below is an example and its equivalence in PyTorch:

X = Tensor(shape=[N, H, W, C_in], dtype="float16", name="images", is_input=True)
W = Tensor(shape=[C_out, K_h, K_w, C_in], dtype="float16", name="weight", is_input=True)
B = Tensor(shape=[C_out], dtype="float16", name="bias", is_input=True)
OP = aitemplate.compiler.ops.conv2d_bias(stride=1, pad=1, dilate=1)
Y = OP(X, W, B)

X_pt = NHWC2NCHW(X_ait)
W_pt = NHWC2NCHW(W_ait)
B_pt = NHWC2NCHW(B_ait)

Y_pt = torch.nn.functional.conv2d(X_pt, W_pt, bias=B)
Y = NCHW2NHWC(Y_pt)

class aitemplate.compiler.ops.conv2d_bias_add(stride, pad, dilate=1, group=1)[source]

Conv2d_bias_add.

Applies a 2D convolution on input in shape (N, H, W, C_in), adds a bias in shape (C_out), adds the residual in shape (N, H_out, W_out, C_out), produces output in shape (N, H_out, W_out, C_out). N is batch size, H, W are the height and width of the input images in pixels, and C is the number of channels.

Parameters:

input – input tensor of shape \((N , H , W, \text{in\_channels})\)
weight – filters of shape \((\text{out\_channels} , K_h, K_w, \frac{\text{in\_channels}}{\text{groups}})\)
bias – optional bias tensor of shape \((\text{out\_channels})\)
residual – residual to add after conv2d_bias

This operator uses “channels_last” data format. Below is an example and its equivalence in PyTorch:

X = Tensor(shape=[N, H, W, C_in], dtype="float16", name="images", is_input=True)
W = Tensor(shape=[C_out, K_h, K_w, C_in], dtype="float16", name="weight", is_input=True)
B = Tensor(shape=[C_out], dtype="float16", name="bias", is_input=True)
R = Tensor(shape=[N, H_out, W_out, C_out], dtype="float16", name="residual", is_input=True)
OP = aitemplate.compiler.ops.conv2d_bias_add(stride=1, pad=1, dilate=1)
Y = OP(X, W, B, R)

X_pt = NHWC2NCHW(X_ait)
W_pt = NHWC2NCHW(W_ait)
B_pt = NHWC2NCHW(B_ait)
R_pt = NHWC2NCHW(R_ait)

Y_pt = torch.nn.functional.conv2d(X_pt, W_pt, bias=B_pt)
Z_pt = Y_pt + R_pt
Result_pt = Z_pt
Result = NCHW2NHWC(Result_pt)

class aitemplate.compiler.ops.conv2d_bias_add_hardswish(stride, pad, dilate=1, group=1)[source]

Conv2d_bias_add_hardswish.

Applies a 2D convolution on input in shape (N, H, W, C_in), adds a bias in shape (C_out), adds the residual in shape (N, H_out, W_out, C_out), performs hardswish operation and produces output in shape (N, H_out, W_out, C_out). N is batch size, H, W are the height and width of the input images in pixels, and C is the number of channels.

Parameters:

input – input tensor of shape \((N , H , W, \text{in\_channels})\)
weight – filters of shape \((\text{out\_channels} , K_h, K_w, \frac{\text{in\_channels}}{\text{groups}})\)
bias – optional bias tensor of shape \((\text{out\_channels})\)
residual – residual to add after conv2d_bias

This operator uses “channels_last” data format. Below is an example and its equivalence in PyTorch:

X = Tensor(shape=[N, H, W, C_in], dtype="float16", name="images", is_input=True)
W = Tensor(shape=[C_out, K_h, K_w, C_in], dtype="float16", name="weight", is_input=True)
B = Tensor(shape=[C_out], dtype="float16", name="bias", is_input=True)
R = Tensor(shape=[N, H_out, W_out, C_out], dtype="float16", name="residual", is_input=True)
OP = aitemplate.compiler.ops.conv2d_bias_add_hardswish(stride=1, pad=1, dilate=1)
Y = OP(X, W, B, R)

X_pt = NHWC2NCHW(X_ait)
W_pt = NHWC2NCHW(W_ait)
B_pt = NHWC2NCHW(B_ait)
R_pt = NHWC2NCHW(R_ait)

Y_pt = torch.nn.functional.conv2d(X_pt, W_pt, bias=B_pt)
Z_pt = Y_pt + R_pt
Result_pt = torch.nn.functional.hardswish(Z_pt)
Result = NCHW2NHWC(Result_pt)

class aitemplate.compiler.ops.conv2d_bias_add_relu(stride, pad, dilate=1, group=1)[source]

Conv2d_bias_add_relu.

Applies a 2D convolution on input in shape (N, H, W, C_in), adds a bias in shape (C_out), adds the residual in shape (N, H_out, W_out, C_out), performs relu operation and produces output in shape (N, H_out, W_out, C_out). N is batch size, H, W are the height and width of the input images in pixels, and C is the number of channels.

Parameters:

input – input tensor of shape \((N , H , W, \text{in\_channels})\)
weight – filters of shape \((\text{out\_channels} , K_h, K_w, \frac{\text{in\_channels}}{\text{groups}})\)
bias – optional bias tensor of shape \((\text{out\_channels})\)
residual – residual to add after conv2d_bias

This operator uses “channels_last” data format. Below is an example and its equivalence in PyTorch:

X = Tensor(shape=[N, H, W, C_in], dtype="float16", name="images", is_input=True)
W = Tensor(shape=[C_out, K_h, K_w, C_in], dtype="float16", name="weight", is_input=True)
B = Tensor(shape=[C_out], dtype="float16", name="bias", is_input=True)
R = Tensor(shape=[N, H_out, W_out, C_out], dtype="float16", name="residual", is_input=True)
OP = aitemplate.compiler.ops.conv2d_bias_add_relu(stride=1, pad=1, dilate=1)
Y = OP(X, W, B, R)

X_pt = NHWC2NCHW(X_ait)
W_pt = NHWC2NCHW(W_ait)
B_pt = NHWC2NCHW(B_ait)
R_pt = NHWC2NCHW(R_ait)

Y_pt = torch.nn.functional.conv2d(X_pt, W_pt, bias=B_pt)
Z_pt = Y_pt + R_pt
Result_pt = torch.nn.functional.relu(Z_pt)
Result = NCHW2NHWC(Result_pt)

class aitemplate.compiler.ops.conv2d_bias_few_channels(stride, pad, dilate=1, auto_padding=True)[source]

conv2d_bias_few_channels.

This operator equals to conv2d_bias but has improved performance for in_channels < 8.

class aitemplate.compiler.ops.conv2d_bias_hardswish(stride, pad, dilate=1, group=1)[source]

Conv2d with bias + hardswish.

Applies a 2D convolution on input in shape (N, H, W, C_in), adds a bias in shape (C_out), performs hardswish and produces output in shape (N, H_out, W_out, C_out). N is batch size, H, W are the height and width of the input images in pixels, and C is the number of channels.

Parameters:

input – input tensor of shape \((N , H , W, \text{in\_channels})\)
weight – filters of shape \((\text{out\_channels} , K_h, K_w, \frac{\text{in\_channels}}{\text{groups}})\)
bias – optional bias tensor of shape \((\text{out\_channels})\)

This operator uses “channels_last” data format. Below is an example and its equivalence in PyTorch:

X = Tensor(shape=[N, H, W, C_in], dtype="float16", name="images", is_input=True)
W = Tensor(shape=[C_out, K_h, K_w, C_in], dtype="float16", name="weight", is_input=True)
B = Tensor(shape=[C_out], dtype="float16", name="weight", is_input=True)
OP = aitemplate.compiler.ops.conv2d_bias_hardswish(stride=1, pad=1, dilate=1)
Result_ait = OP(X, W, B)

X_pt = NHWC2NCHW(X_ait)
W_pt = NHWC2NCHW(W_ait)
B_pt = NHWC2NCHW(B_ait)

Y = torch.nn.functional.conv2d(X_pt, W_pt, bias=B_pt)
Result_pt = torch.nn.functional.hardswish(Y)
Result_ait = NCHW2NHWC(Result_pt)

class aitemplate.compiler.ops.conv2d_bias_hardswish_few_channels(stride, pad, dilate=1, auto_padding=True)[source]

conv2d_bias_hardswish_few_channels.

This operator equals to conv2d_bias_hardswish but has imporved performance for in_channels < 8.

class aitemplate.compiler.ops.conv2d_bias_relu(stride, pad, dilate=1, group=1)[source]

Conv2d with bias + relu.

Applies a 2D convolution on input in shape (N, H, W, C_in), adds a bias in shape (C_out), performs relu and produces output in shape (N, H_out, W_out, C_out). N is batch size, H, W are the height and width of the input images in pixels, and C is the number of channels.

Parameters:

input – input tensor of shape \((N , H , W, \text{in\_channels})\)
weight – filters of shape \((\text{out\_channels} , K_h, K_w, \frac{\text{in\_channels}}{\text{groups}})\)
bias – optional bias tensor of shape \((\text{out\_channels})\)

This operator uses “channels_last” data format. Below is an example and its equivalence in PyTorch:

X = Tensor(shape=[N, H, W, C_in], dtype="float16", name="images", is_input=True)
W = Tensor(shape=[C_out, K_h, K_w, C_in], dtype="float16", name="weight", is_input=True)
B = Tensor(shape=[C_out], dtype="float16", name="weight", is_input=True)
OP = aitemplate.compiler.ops.conv2d_bias_relu(stride=1, pad=1, dilate=1)
Result_ait = OP(X, W, B)

X_pt = NHWC2NCHW(X_ait)
W_pt = NHWC2NCHW(W_ait)
B_pt = NHWC2NCHW(B_ait)

Y = torch.nn.functional.conv2d(X_pt, W_pt, bias=B_pt)
Result_pt = torch.nn.functional.relu(Y)
Result_ait = NCHW2NHWC(Result_pt)

class aitemplate.compiler.ops.conv2d_bias_relu_few_channels(stride, pad, dilate=1, auto_padding=True)[source]

conv2d_bias_relu_few_channels.

This operator equals to conv2d_bias_relu but has imporved performance for in_channels < 8.

class aitemplate.compiler.ops.conv2d_bias_sigmoid(stride, pad, dilate=1, group=1)[source]

Conv2d with bias + sigmoid.

Applies a 2D convolution on input in shape (N, H, W, C_in), adds a bias in shape (C_out), performs sigmoid and produces output in shape (N, H_out, W_out, C_out). N is batch size, H, W are the height and width of the input images in pixels, and C is the number of channels.

Parameters:

input – input tensor of shape \((N , H , W, \text{in\_channels})\)
weight – filters of shape \((\text{out\_channels} , K_h, K_w, \frac{\text{in\_channels}}{\text{groups}})\)
bias – optional bias tensor of shape \((\text{out\_channels})\)

This operator uses “channels_last” data format. Below is an example and its equivalence in PyTorch:

X = Tensor(shape=[N, H, W, C_in], dtype="float16", name="images", is_input=True)
W = Tensor(shape=[C_out, K_h, K_w, C_in], dtype="float16", name="weight", is_input=True)
B = Tensor(shape=[C_out], dtype="float16", name="weight", is_input=True)
OP = aitemplate.compiler.ops.conv2d_bias_sigmoid(stride=1, pad=1, dilate=1)
Result_ait = OP(X, W, B)

X_pt = NHWC2NCHW(X_ait)
W_pt = NHWC2NCHW(W_ait)
B_pt = NHWC2NCHW(B_ait)

Y = torch.nn.functional.conv2d(X_pt, W_pt, bias=B_pt)
Result_pt = torch.sigmoid(Y)
Result_ait = NCHW2NHWC(Result_pt)

class aitemplate.compiler.ops.conv2d_depthwise(stride, pad, dilate=1, group=1)[source]: Base class of conv2d with groups.

class aitemplate.compiler.ops.conv2d_depthwise_bias(stride, pad, dilate=1, group=1)[source]: Base class of conv2d with groups.

class aitemplate.compiler.ops.conv3d(stride, pad, dilate=1, group=1)[source]

Methods:

`gen_function`()	Generates function source code string.
`gen_profiler`([workdir, ...])	Profiler generator.
`profile`([workdir, devices, ...])	Selects the fastest kernel configurations.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

gen_profiler(workdir: Optional[str] = None, dynamic_profiling_strategy=DynamicProfileStrategy.HINTS) → None[source]

Profiler generator.

Parameters:

workdir (str, optional, by default None) –
dynamic_profiling_strategy (DynamicProfileStrategy, optional) – A dynamic profiling strategy, used to filter generated profiles at compile time. See also: profile()

profile(workdir='./', devices=None, dynamic_profiling_strategy=DynamicProfileStrategy.HINTS)[source]

Selects the fastest kernel configurations.

Parameters:

workdir (str, optional) – The directory which contains source files, by default “./”
devices (list, optional) – A list of device ids which can be used for profiling.
dynamic_profiling_strategy (DynamicProfileStrategy, optional) – Profiling strategy used when there are dynamic dims. By default, MAX is used, i.e. to profile a dynamic range, an upper bound will be used.

class aitemplate.compiler.ops.conv3d_bias(stride, pad, dilate=1, group=1)[source]

aitemplate.compiler.ops.convert_shape_to_IntVar(shape)[source]: Helper function to convert a list of mixed int/IntVar/IntImm into a list with only IntVar/IntImm.

class aitemplate.compiler.ops.depthwise_conv3d(stride, pad, dilate=1, group=1, bias=False)[source]

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

class aitemplate.compiler.ops.dual_bmm_rrr_div[source]

Batch GEMM specialization: BMM_RRR(A, B0) / BMM_RRR(A, B1)

This operator is equivalent to the following pytorch code:

If the last dim of B1 is 1 (while the last dim of B0 isn’t), B1 is broadcasted to the same shape as B0 before computing the right gemm A @ B1.

class aitemplate.compiler.ops.dual_gemm_rcr_fast_gelu[source]

GEMM Specialization: FAST_GELU(GEMM_RCR(A, B)) * GEMM_RCR(A, B1)

This operator is equivalent to the following pytorch code:

class aitemplate.compiler.ops.dual_gemm_rcr_silu[source]

GEMM Specialization: SILU(GEMM_RCR(A, B)) * GEMM_RCR(A, B1)

This operator is equivalent to the following pytorch code:

class aitemplate.compiler.ops.dynamic_slice[source]

Cut the source tensor into slices specified by a list of start indices and a list of end indices.

Parameters:

x (Tensor) – input tensor
start_indices (List[int]) – similar to PyTorch and numpy, indices can be negative
end_indices (List[int]) – end_index is not included. Similar to PyTorch and numpy, indices can be negative.

Returns:

the list of sliced tensors.

Return type:

List[Tensor]

Methods:

`gen_function`()	Generates function source code string.
`normalize_start_end_indices`(dim_val, start, end)	return normalized start and end indices which fall into a well-formed range like below: 0 <= start <= end <= dim_val

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

static normalize_start_end_indices(dim_val: int, start: int, end: int) → List[int][source]: return normalized start and end indices which fall into a well-formed range like below: 0 <= start <= end <= dim_val

class aitemplate.compiler.ops.efficient_nms(preNmsTop=2000, nmsMaxOut=200, iouThreshold=0.5, minBoxSize=0)[source]

Performs non-maximum suppression (NMS) on the boxes according to their intersection-over-union (IoU).

NMS iteratively removes lower scoring boxes which have an IoU greater than iou_threshold with another (higher scoring) box.

Note: if multiple boxes have the exact same score and satisfy the IoU criterion with respect to a reference box, the selected box is not guaranteed to be the same for different backends.

preNmsTop identifies the maximum number of boxes to take.

nmsMaxOut identifies the maximum number of boxes to reserve after the operation.

iouThreshold identifies the intersection-over-union (IoU) threshold which is used to discards all overlapping boxes with IoU > iouThreshold.

minBoxSize identifies the minimum box size, if a box has size less than this value, it will be removed before the non-maximum suppression.

Parameters:

boxes (Tensor[N, 4])) – boxes to perform NMS on. They are expected to be in (x1, y1, x2, y2) format with 0 <= x1 < x2 and 0 <= y1 < y2.
scores (Tensor[N]) – scores for each one of the boxes

Returns:

int64 tensor with the indices of the elements that have been kept by NMS, sorted in decreasing order of scores

Return type:

Tensor

Methods:

`gen_function`()	call backend functions
`gen_profiler`([workdir, ...])	Generates source files for profiling purpose.
`profile`([workdir, devices, ...])	Profile to compute the NMS Op workspace size.

gen_function() → str[source]: call backend functions

gen_profiler(workdir: Optional[str] = None, dynamic_profiling_strategy=None) → None[source]

Generates source files for profiling purpose.

Parameters:

workdir (str, optional) – The directory to generate source files.
dynamic_profiling_strategy (DynamicProfileStrategy, optional) – A dynamic profiling strategy, used to filter generated profiles at compile time. See also: profile()

profile(workdir='./', devices=None, dynamic_profiling_strategy=None)[source]: Profile to compute the NMS Op workspace size.

class aitemplate.compiler.ops.elementwise(func_enum: FuncEnum)[source]

elementwise operator definition.

Methods:

replace_input_tensor(old_tensor, new_tensor)

Replaces old_tensors in self._attrs["inputs"] with new_tensor.

replace_input_tensor(old_tensor, new_tensor) → None[source]

Replaces old_tensors in self._attrs[“inputs”] with new_tensor.

Parameters:

old_tensor (Tensor) – The old tensor to be replaced.
new_tensor (Tensor) – The new tensor.

Return type:

None.

class aitemplate.compiler.ops.eq[source]

class aitemplate.compiler.ops.expand[source]

Expands a tensor’s singleton dimensions.

Expanded dimensions in the input tensor must be `IntImm`s with value() == 1, or `IntVar`s with upper_bound() == lower_bound() == 1. The output shape may be dynamic.

The other dimensions in the input must match the input shape exactly, or be set to -1, in which case the output shape is unchanged for that dimension.

Tensor can be also expanded to a larger number of dimensions, and the new ones will be appended at the front. For the new dimensions, the size cannot be set to -1.

Parameters:

input (Tensor) – the source tensor
shape (List[Union[IntImm, IntVar, int]]) – target shape ( dimensions with size -1 will be kept, excess dimensions are added at the front )
index_type (str) – Native type used for indices, may be “int64” (default) or “int32”. Pick “int32” only if the total number of elements is lower than 2^31
optimize_fixed_dims (bool) – if True, and if the conditions are given, allow to apply optimizatins assuming mostly fixed shapes.

Returns:

the destination tensor

Return type:

Tensor

Example:

x = Tensor([2, 3], name="input_0", is_input=True)
y = Tensor([2, 3], name="input_1", is_input=True)
x_expand = ops.expand()(x, [IntImm(1), -1, -1])
y_expand = ops.expand()(y, [IntVar([1, 1]), -1, -1])
z = ops.elementwise(FuncEnum.MUL)(x_expand, y_expand)

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

class aitemplate.compiler.ops.flash_attention(batch_size, dropout, max_seq_len, causal)[source]

FlashAttention provides an implementation for fused multi-head attention module:

\[\text{Attention}(Q, K, V) = \text{softmax}(\frac{QK}{\sqrt(d)}) * V\]

\[\text{MultiHead}(Q, K, V) = \text{Concat}(head_1,\dots,head_h)W^O\]

where \(head_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\).

Methods:

gen_function()

call backend functions

gen_function() → str[source]: call backend functions

class aitemplate.compiler.ops.flatten(start_dim=0, end_dim=-1)[source]

Flattens input by reshaping it into a one-dimensional tensor. If start_dim or end_dim are passed, only dimensions starting with start_dim and ending with end_dim are flattened. The order of elements in input is unchanged.

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

class aitemplate.compiler.ops.fmha_style_b2b_bmm(causal_type: CausalType, epilogue_math_name: str, alpha0: float, alpha1: float, alpha1_divide_by_seq_len: bool = False)[source]

See comments at the head of this file.

Methods:

gen_function()

call backend functions

gen_function() → str[source]: call backend functions

class aitemplate.compiler.ops.full[source]

Creates a tensor of a given shape and dtype filled with the specified fill_value (float scalar).

Parameters:

shape (int or IntVar or List[IntVar]) – the shape of the output Tensor.
fill_value (int or float) – the value to fill the output Tensor with.
dtype (str) – the dtype of the output Tensor.

Returns:

a tensor of shape and dtype filled with fill_value.

Return type:

Tensor

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

class aitemplate.compiler.ops.fused_elementwise(elementwise_ops: List[elementwise], inputs: Iterable[Operator], outputs: Iterable[Operator])[source]

fused_elementwise operator is used internally. It’s the actual operator which does ++ codegen.

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

class aitemplate.compiler.ops.gather[source]

gather implementation

Parameters:: Operator ([type]) – [description]

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

class aitemplate.compiler.ops.ge[source]

class aitemplate.compiler.ops.gemm_rcr[source]

GEMM Specialization for A[RowMajor], B[ColMajor], C[RowMajor]

This operator is equivalent to the following pytorch code:

class aitemplate.compiler.ops.gemm_rcr_bias[source]

GEMM Specialization: GEMM_RCR(A, B) + Bias A[RowMajor], B[ColMajor], Bias[RowMajor], C[RowMajor]

This operator is equivalent to the following pytorch code:

class aitemplate.compiler.ops.gemm_rcr_bias_add[source]

GEMM Specialization: GEMM_RCR(A, B) + Bias + D0

This operator is equivalent to the following pytorch code:

A = torch.randn(M, K).cuda().half()
B = torch.randn(N, K).cuda().half()
Bias = torch.randn(N).cuda().half()
D0 = torch.randn(M, N).cuda().half()

linear = torch.nn.functional.linear(A, B, bias=Bias)
y = linear + D0

class aitemplate.compiler.ops.gemm_rcr_bias_add_add[source]

GEMM Specialization: RELU(GEMM_RCR(A, B) + Bias + D0 + D1)

This operator is equivalent to the following pytorch code:

A = torch.randn(M, K).cuda().half()
B = torch.randn(N, K).cuda().half()
Bias = torch.randn(N).cuda().half()
D0 = torch.randn(M, N).cuda().half()
D1 = torch.randn(M, N).cuda().half()

linear = torch.nn.functional.linear(A, B, bias=Bias)
y = linear + D0 + D1

class aitemplate.compiler.ops.gemm_rcr_bias_add_add_relu[source]

GEMM Specialization: RELU(GEMM_RCR(A, B) + Bias + D0 + D1)

This operator is equivalent to the following pytorch code:

A = torch.randn(M, K).cuda().half()
B = torch.randn(N, K).cuda().half()
Bias = torch.randn(N).cuda().half()
D0 = torch.randn(M, N).cuda().half()
D1 = torch.randn(M, N).cuda().half()

linear = torch.nn.functional.linear(A, B, bias=Bias)
y = torch.nn.ReLU(linear + D0 + D1)

class aitemplate.compiler.ops.gemm_rcr_bias_add_relu[source]

GEMM Specialization: RELU(GEMM_RCR(A, B) + Bias + D0)

This operator is equivalent to the following pytorch code:

A = torch.randn(M, K).cuda().half()
B = torch.randn(N, K).cuda().half()
Bias = torch.randn(N).cuda().half()
D0 = torch.randn(M, N).cuda().half()

linear = torch.nn.functional.linear(A, B, bias=Bias)
y = torch.nn.ReLU(linear + D0)

class aitemplate.compiler.ops.gemm_rcr_bias_fast_gelu[source]

GEMM Specialization: FastGELU(GEMM_RCR(A, B) + Bias)

This operator is equivalent to the following pytorch code:

class aitemplate.compiler.ops.gemm_rcr_bias_gelu[source]

GEMM Specialization: GELU(GEMM_RCR(A, B) + Bias)

This operator is equivalent to the following pytorch code:

class aitemplate.compiler.ops.gemm_rcr_bias_hardswish[source]

GEMM Specialization: HardSwish(GEMM_RCR(A, B) + Bias)

This operator is equivalent to the following pytorch code:

class aitemplate.compiler.ops.gemm_rcr_bias_mul[source]

GEMM Specialization: (GEMM_RCR(A, B) + Bias) * D0

This operator is equivalent to the following pytorch code:

A = torch.randn(M, K).cuda().half()
B = torch.randn(N, K).cuda().half()
Bias = torch.randn(N).cuda().half()
D0 = torch.randn(M, N).cuda().half()

linear = torch.nn.functional.linear(A, B, bias=Bias)
y = linear * D0

class aitemplate.compiler.ops.gemm_rcr_bias_mul_add[source]

GEMM Specialization: (GEMM_RCR(A, B) + Bias) * D0 + D1

This operator is equivalent to the following pytorch code:

A = torch.randn(M, K).cuda().half()
B = torch.randn(N, K).cuda().half()
Bias = torch.randn(N).cuda().half()
D0 = torch.randn(M, N).cuda().half()
D1 = torch.randn(M, N).cuda().half()

linear = torch.nn.functional.linear(A, B, bias=Bias)
y = linear * D0 + D1

class aitemplate.compiler.ops.gemm_rcr_bias_mul_tanh[source]

GEMM Specialization: TANH((GEMM_RCR(A, B) + Bias) * D0)

This operator is equivalent to the following pytorch code:

A = torch.randn(M, K).cuda().half()
B = torch.randn(N, K).cuda().half()
Bias = torch.randn(N).cuda().half()
D0 = torch.randn(M, N).cuda().half()

linear = torch.nn.functional.linear(A, B, bias=Bias)
y = torch.tanh(linear * D0)

class aitemplate.compiler.ops.gemm_rcr_bias_permute(shape: Tuple[int], layout='20314')[source]

class aitemplate.compiler.ops.gemm_rcr_bias_relu[source]

GEMM Specialization: ReLU(GEMM_RCR(A, B) + Bias)

This operator is equivalent to the following pytorch code:

class aitemplate.compiler.ops.gemm_rcr_bias_sigmoid[source]

GEMM Specialization: Sigmoid(GEMM_RCR(A, B) + Bias)

This operator is equivalent to the following pytorch code:

class aitemplate.compiler.ops.gemm_rcr_bias_sigmoid_mul[source]

GEMM Specialization: Sigmoid(GEMM_RCR(A, B) + Bias) * D0

This operator is equivalent to the following pytorch code:

A = torch.randn(M, K).cuda().half()
B = torch.randn(N, K).cuda().half()
Bias = torch.randn(N).cuda().half()
D0 = torch.randn(M, N).cuda().half()

linear = torch.nn.functional.linear(A, B, bias=Bias)
y = torch.sigmoid(linear) * D0

class aitemplate.compiler.ops.gemm_rcr_bias_sigmoid_mul_tanh[source]

GEMM Specialization: Tanh(Sigmoid(GEMM_RCR(A, B) + Bias) * D0)

This operator is equivalent to the following pytorch code:

A = torch.randn(M, K).cuda().half()
B = torch.randn(N, K).cuda().half()
Bias = torch.randn(N).cuda().half()
D0 = torch.randn(M, N).cuda().half()

linear = torch.nn.functional.linear(A, B, bias=Bias)
y = torch.tanh(torch.sigmoid(linear) * D0)

class aitemplate.compiler.ops.gemm_rcr_bias_softmax[source]: gemm_rcr_bias_softmax operator.

class aitemplate.compiler.ops.gemm_rcr_bias_swish[source]

GEMM Specialization: SiLU(GEMM_RCR(A, B) + Bias)

This operator is equivalent to the following pytorch code:

class aitemplate.compiler.ops.gemm_rcr_bias_tanh[source]

GEMM Specialization: Tanh(GEMM_RCR(A, B) + Bias)

This operator is equivalent to the following pytorch code:

class aitemplate.compiler.ops.gemm_rcr_fast_gelu[source]

GEMM Specialization: FastGELU(GEMM_RCR(A, B))

This operator is equivalent to the following pytorch code:

class aitemplate.compiler.ops.gemm_rcr_permute(shape: Tuple[int], layout='20314')[source]

class aitemplate.compiler.ops.gemm_rcr_permute_elup1(*args, **kwargs)[source]

class aitemplate.compiler.ops.gemm_rcr_softmax[source]: gemm_rcr_softmax operator.

class aitemplate.compiler.ops.gemm_rrr[source]

GEMM Specialization for A[RowMajor], B[RowMajor], C[RowMajor]

This operator is equivalent to the following pytorch code:

class aitemplate.compiler.ops.gemm_rrr_bias[source]

GEMM Specialization: GEMM_RRR(A, B) + Bias A[RowMajor], B[RowMajor], Bias[RowMajor], C[RowMajor]

This operator is equivalent to the following pytorch code:

class aitemplate.compiler.ops.gemm_rrr_bias_permute(shape: Tuple[int], layout='20314')[source]

class aitemplate.compiler.ops.gemm_rrr_permute(shape: Tuple[int], layout='20314')[source]

class aitemplate.compiler.ops.gemm_rrr_small_nk[source]

Special gemm kernel for small K and N (K <= 8, N <= 8) A: [M, K] B: [K, N] C: [M, N]

Methods:

gen_profiler([workdir, ...])

This kernel does not require profiling

gen_profiler(workdir: Optional[str] = None, dynamic_profiling_strategy=None) → None[source]: This kernel does not require profiling

aitemplate.compiler.ops.gen_int_var_min_max(values: List[int], name: Optional[str] = None, symbolic_value: Optional[Basic] = None)[source]: A helper function to generate IntImm or IntVar depending on the length of values. Only keeps [min, max] pairs if there are more than 2 values.

class aitemplate.compiler.ops.getitem[source]: Retrieve a single element from a list of tuple at a certain index.

class aitemplate.compiler.ops.group_gemm_rcr[source]

Grouped GEMM Specialization: GEMM_RCR(A, B)

This operator is equivalent to the following pytorch code:

Methods:

`gen_function`()	Generate function for the op
`gen_profiler`([workdir, ...])	Generate profiler for the op

gen_function() → str[source]

Generate function for the op

Returns:: C++ source code of the function
Return type:: str

gen_profiler(workdir: Optional[str] = None, dynamic_profiling_strategy=None) → None[source]

Generate profiler for the op

Parameters:

workdir (str, optional) – [description], by default None
dynamic_profiling_strategy (DynamicProfileStrategy, optional) – A dynamic profiling strategy, used to filter generated profiles at compile time. See also: profile()

class aitemplate.compiler.ops.group_gemm_rcr_bias[source]

Grouped GEMM Specialization: GEMM_RCR(A, B) + Bias

This operator is equivalent to the following pytorch code:

class aitemplate.compiler.ops.group_gemm_rcr_bias_relu[source]

Grouped GEMM Specialization: ReLU(GEMM_RCR(A, B) + Bias)

This operator is equivalent to the following pytorch code:

class aitemplate.compiler.ops.group_gemm_rcr_bias_sigmoid[source]

Grouped GEMM Specialization: Sigmoid(GEMM_RCR(A, B) + Bias)

This operator is equivalent to the following pytorch code:

class aitemplate.compiler.ops.group_layernorm(normalized_shape: Optional[List[List[IntImm]]] = None)[source]

group_layernorm. For each group, we expect each input to have shapes:

Input shape: [M0, M1, …, Mp, N1, N2, …, ND] Normalized_shape: [N1, N2, …, ND] Gamma/Beta, if not None, have the same shape as normalized_shape.

Every input in the groups must have the same [M0, M1, …, Mp] dims.

class aitemplate.compiler.ops.group_layernorm_sigmoid_mul(normalized_shape: Optional[List[List[IntImm]]] = None)[source]

group_layernorm_sigmoid_mul. For each group, we expect each input to have shapes:

Input shape: [M0, M1, …, Mp, N1, N2, …, ND] Normalized_shape: [N1, N2, …, ND] Gamma/Beta, if not None, have the same shape as normalized_shape.

Every input in the groups must have the same [M0, M1, …, Mp] dims.

class aitemplate.compiler.ops.group_norm(num_groups: int, num_channels: int)[source]

Standalone group norm op. The grouped dim must be the last dim of the input tensor.

Methods:

`gen_function`()	Generates function source code string.
`gen_profiler`([workdir, ...])	Generator profiler.
`get_input_shapes`(x, gamma, beta)	Return a list of shapes for x, gamma and beta, where gamma_shape and beta_shape may be None if gamma and beta are None, respectively.
`profile`([workdir, devices, ...])	Selects the fastest kernel configurations.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

gen_profiler(workdir: Optional[str] = None, dynamic_profiling_strategy=DynamicProfileStrategy.HINTS) → None[source]

Generator profiler. The profiler files are standalone executable for profiling.

Parameters:

workdir (str, optional) – Base dir to keep profiling source codes, by default “./”
dynamic_profiling_strategy (DynamicProfileStrategy, optional) – A dynamic profiling strategy, used to filter generated profiles at compile time. See also: profile()

static get_input_shapes(x, gamma, beta) → List[List[Union[IntVar, IntImm]]][source]: Return a list of shapes for x, gamma and beta, where gamma_shape and beta_shape may be None if gamma and beta are None, respectively.

profile(workdir='./', devices=None, dynamic_profiling_strategy=DynamicProfileStrategy.MAX)[source]

Selects the fastest kernel configurations.

Parameters:

workdir (str, optional) – Base dir to keep profiling source codes, by default “./”
devices (list, optional) – Devices used for profiling, by default device 0 will be used.
dynamic_profiling_strategy (DynamicProfileStrategy, optional) – A dynamic profiling strategy. By default MAX is used, i.e. to profile a dynamic range, an upper bound will be used.

class aitemplate.compiler.ops.group_norm_swish(num_groups: int, num_channels: int)[source]: Standalone group norm op. The grouped dim must be the last dim of the input tensor.

class aitemplate.compiler.ops.grouped_classic_b2b_bmm(causal_type: CausalType, epilogue_math_name: str, alpha0: float, alpha1: float, alpha1_divide_by_seq_len: bool = False)[source]

Methods:

gen_function()

call backend functions

gen_function() → str[source]: call backend functions

class aitemplate.compiler.ops.grouped_fmha_style_b2b_bmm(causal_type: CausalType, epilogue_math_name: str, alpha0: float, alpha1: float, alpha1_divide_by_seq_len: bool = False)[source]: See comments at the head of this file.

class aitemplate.compiler.ops.gt[source]

class aitemplate.compiler.ops.identity[source]

Returns the input tensor. This could be useful for only name changes etc.

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

class aitemplate.compiler.ops.index_select(dim=0)[source]

Returns a new tensor which indexes the input tensor along dimension dim using the entries in index which is a LongTensor.

The returned tensor has the same number of dimensions as the original tensor (input). The dimth dimension has the same size as the length of index; other dimensions have the same size as in the original tensor.

Parameters:

input (Tensor) –
dim (int) –
index (IntTensor or LongTensor) –

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

class aitemplate.compiler.ops.int_elementwise(func_enum: FuncEnum)[source]

int elementwise operator definition.

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

aitemplate.compiler.ops.is_symbolic(sym_val: Any) → bool[source]: Check whether sym_val is a sympy class.

class aitemplate.compiler.ops.jagged_lengths_to_offsets[source]

Given a 1D Tensor of lengths of the sequences in a jagged Tensor, returns the corresponding 1D Tensor of offsets. The latter is the inclusive sum of the lengths prepended by a zero.

Parameters:: lengths (Tensor) – 1D Tensor of sequence lengths, [B]-shaped.
Returns:: 1D Tensor of sequence offsets, [B+1]-shaped.
Return type:: offsets (Tensor)

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

class aitemplate.compiler.ops.jagged_lengths_to_presences[source]

Given a 1D Tensor of lengths of the sequences in a jagged Tensor, returns a 2D Tensor of presences indicating where the data exists and where not. The dtype of presences Tensor is configurable.

Parameters:

lengths (Tensor) – 1D Tensor of sequence lengths, [B]-shaped.
max_seq_len (int) – Maximum possible sequence length.

Returns:

2D Tensor of presences, [B, max_seq_len]-shaped.: presences[i, j] = (dtype)(j < lenghts[i])

Return type:

presences (Tensor)

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

class aitemplate.compiler.ops.jagged_to_padded_dense(padding_value: float = 0)[source]

Returns a dense Tensor “expanded” from the input jagged Tensor. For each of the jagged dimensions (JaggedDims) in the jagged Tensor’s first dimension (JaggedIntVar), a separate static dimension (IntImm) equal to the max_value of the jagged dimension is created in the output dense Tensor’s shape.

The values in the output dense Tensor that don’t have corresponding values in the input jagged Tensor are set to the padding_value.

Parameters:

x (Tensor) – input jagged Tensor.
padding_value (float) – the padding value for the output dense Tensor’s elements that don’t have counterparts in the input jagged Tensor.

Returns:

a dense Tensor expanded from the input jagged Tensor x.

Return type:

y (Tensor)

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

class aitemplate.compiler.ops.layernorm(normalized_shape: Optional[List[IntImm]] = None)[source]

Standalone layernorm op. Applies Layer Normalization over a mini-batch of inputs as described in the paper Layer Normalization. The mean and standard-deviation are calculated over the last D dimensions, where D is the dimension of normalized_shape. Input shape: [M0, M1, …, Mp, N1, N2, …, ND] Normalized_shape: [N1, N2, …, ND] Gamma/Beta, if not None, have the same shape as normalized_shape.

Methods:

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

gen_profiler(workdir: Optional[str] = None, dynamic_profiling_strategy=DynamicProfileStrategy.HINTS) → None[source]

Generator profiler. The profiler files are standalone executable for profiling.

Parameters:

workdir (str, optional) – Base dir to keep profiling source codes, by default “./”
dynamic_profiling_strategy (DynamicProfileStrategy, optional) – A dynamic profiling strategy, used to filter generated profiles at compile time. See also: profile()

static get_input_shapes(x, gamma, beta) → List[List[Union[IntVar, IntImm]]][source]: Return a list of shapes for x, gamma and beta, where gamma_shape and beta_shape may be None if gamma and beta are None, respectively.

profile(workdir='./', devices=None, dynamic_profiling_strategy=DynamicProfileStrategy.MAX)[source]

Selects the fastest kernel configurations.

Parameters:

workdir (str, optional) – Base dir to keep profiling source codes, by default “./”
devices (list, optional) – Devices used for profiling, by default device 0 will be used.
dynamic_profiling_strategy (DynamicProfileStrategy, optional) – A dynamic profiling strategy. By default MAX is used, i.e. to profile a dynamic range, an upper bound will be used.

class aitemplate.compiler.ops.layernorm_sigmoid_mul(layer_norm: Operator, sigmoid: Operator, mul: Operator)[source]

Fused layernorm_sigmoid_mul op Input shape: [M0, M1, …, Mp, N1, N2, …, ND] Normalized_shape: [N1, N2, …, ND] Gamma/Beta, if not None, have the same shape as normalized_shape.

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

class aitemplate.compiler.ops.le[source]

class aitemplate.compiler.ops.list_construct[source]: Construct a list of tensors.

class aitemplate.compiler.ops.lt[source]

class aitemplate.compiler.ops.make_jagged(batch_dim: IntVar, jagged_dims: List[JaggedDim], check_sequence_lengths: bool = True)[source]

Creates jagged Tensors from normal Tensors, offsets, and metadata.

Jagged Tensors are normal Tensors with the first dynamic dimensions represented with a JaggedIntVar instance (as opposed to a vanilla IntVar). The purpose of this op is to take a normal AIT Tensor “source” that contains the jagged Tensor’s data and return a jagged Tensor with the same data as source (with the is_view_of attribute set to source) and the first dimension set to a JaggedIntVar. The jagged Tensor resulting from this op can then be treated as jagged by other ops aware of the jagged Tensor semantics (e.g., elementwise). Importantly, the source Tensor is not sufficient for that, as it doesn’t carry the necessary jagged Tensor metadata (which the jagged Tensor does, in the first JaggedIntVar dimension of its shape).

Important: this op is the only right way to create a jagged Tensor. The reason is that the offsets Tensors passed to this op get registered in the graph and, as a result, can’t be optimized out. This wouldn’t be the case if the jagged Tensor would be “constructed” manually.

See the docstring of the JaggedIntVar class for more details on the jagged Tensor semantics and representation.

In the backend, the purpose of the make_jagged op is to setup the unified offsets representation for the jagged Tensor and to check the contents of the rank-1 offsets Tensors for consistency.

__init__ Args:

batch_dimIntVar: The batch dimension of the jagged Tensor. Importantly, this is different from the first dimension of the soruce Tensor, as it logically represents the number of variable- length sequences encoded by the JaggedIntVar. I.e., the batch_dim is B in the sum_B(N_B) representation of the JaggedIntVar.
jagged_dimsList[JaggedDim]: The list of jagged dimensions encoded in the JaggedIntVar of the resulting jagged Tensor. See the JaggedDim and JaggedIntVar class docstrings for the details.

__call__ Args:

sourceUnion[Tensor, List[Tensor]]: One or more source Tensors of the jagged Tensor(s) created by this op. The jagged Tensor is a view of the source Tensor. The main difference is that the resulting jagged Tensor’s first dimension is set to a JaggedIntVar, constructed from the batch_dim, jagged_dims, and the offsets_list. The same JaggedIntVar instance is set as the first dimension of every resulting jagged Tensor: one for each source Tensor in the source.
offsets_listList[Tensor]: The list of rank-1 offsets Tensors describing the variable-length layout of each of the jagged_dims. There must be exactly as many offsets Tensors in the offsets_list as there are JaggedDims in the jagged_dims list. Each offsets Tensor is associated with the corresponding JaggedDim before constructing a JaggedIntVar from them for the resulting jagged Tensor.

Returns:

Union[Tensor, List[Tensor]]: The resulting jagged Tensor or a list thereof, depending on whether the source argument is a Tensor or a List[Tensor].

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

class aitemplate.compiler.ops.masked_select[source]

Returns a 1D tensor containing elements of the input tensor selected by the boolean mask, similar to torch.masked_select.

Parameters:

input (Tensor) – the source tensor.
mask (Tensor, boolean) – the shapes of the mask tensor and the input tensor do not need to match, but they must be broadcastable.

Returns:

1D tensor of length equal to the total number of elements in broadcast shape: deduced from input and mask. The result is contained in the first num_nonmasked elements of output. The rest of the output tensor is not meaningful.
num_nonmasked: number of the non-masked elements from the input, i.e. the length of the: significant part of output.

Return type:

output

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

class aitemplate.compiler.ops.max_pool2d(kernel_size, stride, pad)[source]

Applies a 2D max pooling over an input signal composed of several input planes.

In the simplest case, the output value of the layer with input size \((N, C, H, W)\), output \((N, C, H_{out}, W_{out})\) and kernel_size \((kH, kW)\) can be precisely described as:

\[\begin{split}\begin{aligned} out(N_i, C_j, h, w) ={} & \max_{m=0, \ldots, kH-1} \max_{n=0, \ldots, kW-1} \\ & \text{input}(N_i, C_j, \text{stride[0]} \times h + m, \text{stride[1]} \times w + n) \end{aligned}\end{split}\]

If pad is non-zero, then the input is implicitly padded with negative infinity on both sides.

.attr.:kernel_size: the size of the window
.attr.:stride: the stride of the window
.attr.:pad: implicit zero padding to be added on both sides

Parameters:: input (Tensor [N, H, W, C]) – the input tensor.
Returns:: Tensor [N, H_out, W_out, C].

class aitemplate.compiler.ops.mem_eff_attention(causal, dropout=0, variable_seq_length_kv=False, variable_seq_length_q=False, use_grouped_fmha=False)[source]

mem_eff_attention provides an implementation for fused multi-head attention module:

\[\text{Attention}(Q, K, V) = \text{softmax}(\frac{QK}{\sqrt(d)}) * V\]

\[\text{MultiHead}(Q, K, V) = \text{Concat}(head_1,\dots,head_h)W^O\]

where \(head_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\).

Methods:

gen_function()

call backend functions

gen_function() → str[source]: call backend functions

class aitemplate.compiler.ops.multi_level_roi_align(num_rois, pooled_size, sampling_ratio, spatial_scale, position_sensitive, continuous_coordinate, im_shape)[source]

Performs Multiple level Region of Interest (RoI) Align operator with average pooling, as described in Mask R-CNN.

num_rois identifies the number of RoIs in the input.

pooled_size identifies the size of the pooling section, i.e., the size of the output (in bins or pixels) after the pooling is performed, as (height, width).

sampling_ratio is the number of sampling points in the interpolation grid used to compute the output value of each pooled output bin. If > 0, then exactly sampling_ratio x sampling_ratio sampling points per bin are used. If <= 0, then an adaptive number of grid points are used (computed as ceil(roi_width / output_width), and likewise for height).

spatial_scale is a scaling factor that maps the box coordinates to the input coordinates. For example, if your boxes are defined on the scale of a 224x224 image and your input is a 112x112 feature map (resulting from a 0.5x scaling of the original image), you’ll want to set this to 0.5.

position_sensitive, a bool value.

continuous_coordinate, a bool value.

im_shape, original image shape.

Parameters:

p1 (Tensor[N, H//4, W//4, C]) – the feature map, i.e. a batch with N elements. Each element contains C feature maps of dimensions (H//4) x (W//4).
p2 (Tensor[N, H//8, W//8, C]) – the feature map, i.e. a batch with N elements. Each element contains C feature maps of dimensions (H//8) x (W//8).
p3 (Tensor[N, H//16, W//16, C]) – the feature map, i.e. a batch with N elements. Each element contains C feature maps of dimensions (H//16) x (W//16).
p4 (Tensor[N, H//32, W//32, C]) – the feature map, i.e. a batch with N elements. Each element contains C feature maps of dimensions (H//32) x (W//32).
rois (Tensor[roi_batch, 5]) – the list of RoIs and each ROI contains the index of the corresponding element in the batch, i.e. a number in [0, N - 1], and the box coordinates in (x1, y1, x2, y2) format where the regions will be taken from. The coordinate must satisfy 0 <= x1 < x2 and 0 <= y1 < y2.

Returns:

the fixed-size feature maps, i.e., the pooled RoIs.

Return type:

Tensor[num_rois * N, pooled_size, pooled_size, C]

class aitemplate.compiler.ops.ndhwc3to8[source]

Pad the 3-channel input data to 8-channel.

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

class aitemplate.compiler.ops.ne[source]

class aitemplate.compiler.ops.nhwc3to4[source]

class aitemplate.compiler.ops.nhwc3to8[source]

class aitemplate.compiler.ops.nms(preNmsTop=2000, nmsMaxOut=200, iouThreshold=0.5, minBoxSize=0)[source]

Performs non-maximum suppression (NMS) on the boxes according to their intersection-over-union (IoU).

NMS iteratively removes lower scoring boxes which have an IoU greater than iou_threshold with another (higher scoring) box.

Note: if multiple boxes have the exact same score and satisfy the IoU criterion with respect to a reference box, the selected box is not guaranteed to be the same for different backends.

preNmsTop identifies the maximum number of boxes to take.

nmsMaxOut identifies the maximum number of boxes to reserve after the operation.

iouThreshold identifies the intersection-over-union (IoU) threshold which is used to discards all overlapping boxes with IoU > iouThreshold.

minBoxSize identifies the minimum box size, if a box has size less than this value, it will be removed before the non-maximum suppression.

Parameters:

boxes (Tensor[N, 4])) – boxes to perform NMS on. They are expected to be in (x1, y1, x2, y2) format with 0 <= x1 < x2 and 0 <= y1 < y2.
scores (Tensor[N]) – scores for each one of the boxes

Returns:

int64 tensor with the indices of the elements that have been kept by NMS, sorted in decreasing order of scores

Return type:

Tensor

Methods:

`gen_function`()	call backend function
`gen_profiler`([workdir, ...])	Generates source files for profiling purpose.
`profile`([workdir, devices, ...])	Profile to compute the NMS Op workspace size.

gen_function() → str[source]: call backend function

gen_profiler(workdir: Optional[str] = None, dynamic_profiling_strategy=None) → None[source]

Generates source files for profiling purpose.

Parameters:

workdir (str, optional) – The directory to generate source files.
dynamic_profiling_strategy (DynamicProfileStrategy, optional) – A dynamic profiling strategy, used to filter generated profiles at compile time. See also: profile()

profile(workdir='./', devices=None, dynamic_profiling_strategy=None)[source]: Profile to compute the NMS Op workspace size.

aitemplate.compiler.ops.normalize_dtype(dtype: str) → str[source]

Returns a normalized dtype str.

Parameters:: dtype (str) – A data type string.
Returns:: normalized dtype str.
Return type:: str

class aitemplate.compiler.ops.pad_last_dim(ndim: int, out_dim: int)[source]

Pad the last dimension of the input data to the specified length.

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

class aitemplate.compiler.ops.padded_dense_to_jagged(total_length: IntVar)[source]

Returns a jagged Tensor “extracted” from the input dense Tensor, given the offsets list. The resulting jagged Tensor contains the subset of values of the input dense Tensor specified by the rank-1 offset Tensors in the offsets_list.

Parameters:

x (Tensor) – input dense tensor.
offsets_list (List[Tensor]) – the list of offsets of the resulting jagged Tensor.
total_length (IntVar) – the total length dimension of the resulting jagged Tensor.

Returns:

a jagged Tensor extracted from the input dense Tensor x.

Return type:

y (Tensor)

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

class aitemplate.compiler.ops.perm021fc_ccr[source]

GEMM Specialization: A.permute(0, 2, 1) @ B

This op is equivalent to the following PyTorch code:

class aitemplate.compiler.ops.perm021fc_ccr_bias[source]

GEMM Specialization: (A.permute(0, 2, 1) @ B + Bias)

This op is equivalent to the following PyTorch code:

class aitemplate.compiler.ops.perm021fc_ccr_bias_permute(layout='021')[source]

GEMM Specialization: (A.permute(0, 2, 1) @ B + Bias).permute(0, 2, 1)

Note: This fusion may be slower than the non-fused version due to NVCC is not able to optimize the fused version.

This op is equivalent to the following PyTorch code:

class aitemplate.compiler.ops.perm021fc_crc[source]

GEMM Specialization: (A.permute(0, 2, 1) @ B)

This one is used when n/m gives you better alignment than m/k. Note: This op’s output is a ColMajor

This op is equivalent to the following PyTorch code:

X_pt = torch.randn(B, K, M).cuda().half()
W_pt = torch.randn(K, N).cuda().half()

XT = X_pt.permute(0, 2, 1)
XT = torch.reshape(XT, (-1, K))
WT = W_pt.transpose(0, 1).contiguous()
Y_pt = torch.nn.functional.linear(XT, WT)
Y_pt = torch.reshape(Y_pt, (B, M, N)).contiguous()

class aitemplate.compiler.ops.perm021fc_crc_bias[source]

GEMM Specialization: (A.permute(0, 2, 1) @ B + Bias)

This one is used when n/m gives you better alignment than m/k.

This op is equivalent to the following PyTorch code:

X_pt = torch.randn(B, K, M).cuda().half()
W_pt = torch.randn(K, N).cuda().half()
B_pt = torch.randn(N).cuda().half()

XT = X_pt.permute(0, 2, 1)
XT = torch.reshape(XT, (-1, K))
WT = W_pt.transpose(0, 1).contiguous()
Y_pt = torch.nn.functional.linear(XT, WT, bias=B_pt)
Y_pt = torch.reshape(Y_pt, (B, M, N)).contiguous()

class aitemplate.compiler.ops.perm102_bmm_rcr[source]

Batch GEMM specialization: C[m, b, n](row) = bmm(A[m, b, k](row), B[b, n, k](col))

The op is equivalent to the following PyTorch code:

class aitemplate.compiler.ops.perm102_bmm_rcr_bias[source]

Batch GEMM specialization: C[m, b, n](row) = bmm(A[m, b, k](row), B[b, n, k](col)) + bias[b, n]

The op is equivalent to the following PyTorch code:

class aitemplate.compiler.ops.perm102_bmm_rrr[source]

Batch GEMM specialization: C[m, b, n](row) = bmm(A[m, b, k](row), B[b, k, n](row))

The op is equivalent to the following PyTorch code:

class aitemplate.compiler.ops.perm102_bmm_rrr_bias[source]

Batch GEMM specialization: C[m, b, n](row) = bmm(A[m, b, k](row), B[b, k, n](row)) + bias[b, n]

The op is equivalent to the following PyTorch code:

class aitemplate.compiler.ops.permute[source]

Returns a tensor with its dimensions permuted. This returned tensor is not a view. Dim in dims can be negative.

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

class aitemplate.compiler.ops.permute021[source]

Permutes the input tensor from (B1, B2, …, Bn, N, M) to (B1, B2, …, Bn, M, N).

Parameters:: input (Tensor[B1, B2, ..., Bn, N, M]) – the source tensor with 3 dimensions
Returns:: the destination tensor
Return type:: output (Tensor[B1, B2, …, Bn, M, N])

Example

X = Tensor(shape=[2, 384, 262], name="X", is_input=True)
Y = ops.permute021()(X)
y_shape = [d._attrs["values"][0] for d in Y.shape()]
print(y_shape)

Outs:
[2, 262, 384]

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

class aitemplate.compiler.ops.permute0213[source]

Permutes the input 4d tensor from (B, N, M, K) to (B, M, N, K).

Parameters:: input (Tensor[B, N, M, K]) – the source tensor with 3 dimensions
Returns:: the destination tensor
Return type:: output (Tensor[B, M, N, K])

Example

X = Tensor(shape=[2, 384, 262, 10], name="X", is_input=True)
Y = ops.permute0213()(X)
y_shape = [d._attrs["values"][0] for d in Y.shape()]
print(y_shape)

Outs:
[2, 262, 384, 10]

Methods:

gen_function()

Generate function body.

gen_function() → str[source]: Generate function body.

class aitemplate.compiler.ops.permute102[source]

Permutes the input 3d tensor from (B, N, M) to (N, B, M).

Parameters:: input (Tensor[B, N, M]) – the source tensor with 3 dimensions
Returns:: the destination tensor
Return type:: output (Tensor[N, B, M])

Example

X = Tensor(shape=[2, 384, 262], name="X", is_input=True)
Y = ops.permute102()(X)
y_shape = [d._attrs["values"][0] for d in Y.shape()]
print(y_shape)

Outs:
[384, 2, 262]

Methods:

gen_function()

Generate function body.

gen_function() → str[source]: Generate function body.

class aitemplate.compiler.ops.permute210[source]

Permutes the input 3d tensor from (B, N, M) to (M, N, B).

Parameters:: input (Tensor[B, N, M]) – the source tensor with 3 dimensions
Returns:: the destination tensor
Return type:: output (Tensor[M, N, B])

Example

X = Tensor(shape=[2, 384, 262], name="X", is_input=True)
Y = ops.permute210()(X)
y_shape = [d._attrs["values"][0] for d in Y.shape()]
print(y_shape)

Outs:
[262, 384, 2]

Methods:

gen_function()

Generate function body

gen_function() → str[source]

Generate function body

Returns:: The function body string
Return type:: str

class aitemplate.compiler.ops.reduce_max(dim, keepdim=False, dtype=None)[source]

Implements the reduce_max op.

.attr.:dim : int or tuple of python:ints the dimension or dimensions to reduce
.attr.:keepdim : bool, optional keep the reduced dimensions if True, default is False
.attr.:dtype : str, optional the type of the return tensor. If it is not None, the input tensor is cast to dtype before reduction.

Parameters:: input (Tensor) – the input tensor.
Returns:: Tensor that contains the max of all elements in the input tensor.

class aitemplate.compiler.ops.reduce_mean(dim, keepdim=False, dtype=None)[source]

Implements the reduce_mean op.

.attr.:dim : int or tuple of python:ints the dimension or dimensions to reduce
.attr.:keepdim : bool, optional keep the reduced dimensions if True, default is False
.attr.:dtype : str, optional the type of the return tensor. If it is not None, the input tensor is cast to dtype before reduction.

Parameters:: input (Tensor) – the input tensor.
Returns:: Tensor that contains the mean value of all elements in the input tensor.

class aitemplate.compiler.ops.reduce_min(dim, keepdim=False, dtype=None)[source]

Implements the reduce_min op.

.attr.:dim : int or tuple of python:ints the dimension or dimensions to reduce
.attr.:keepdim : bool, optional keep the reduced dimensions if True, default is False
.attr.:dtype : str, optional the type of the return tensor. If it is not None, the input tensor is cast to dtype before reduction.

Parameters:: input (Tensor) – the input tensor.
Returns:: Tensor that contains the min of all elements in the input tensor.

class aitemplate.compiler.ops.reduce_sum(dim, keepdim=False, dtype=None)[source]

Implements the reduce_sum op.

.attr.:dim : int or tuple of python:ints the dimension or dimensions to reduce
.attr.:keepdim : bool, optional keep the reduced dimensions if True, default is False
.attr.:dtype : str, optional the type of the return tensor. If it is not None, the input tensor is cast to dtype before reduction.

Parameters:: input (Tensor) – the input tensor.
Returns:: Tensor that contains the sum of all elements in the input tensor.

class aitemplate.compiler.ops.reshape[source]

Returns a tensor with the same data and number of elements as input, but with the specified shape. Inputs must be contiguous.

A single dimension may be -1, in which case it’s inferred from the remaining dimensions and the number of elements in input.

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

class aitemplate.compiler.ops.roi_align(num_rois, pooled_size, sampling_ratio, spatial_scale, position_sensitive, continuous_coordinate)[source]

Performs Region of Interest (RoI) Align operator with average pooling, as described in Mask R-CNN.

num_rois identifies the number of RoIs in the input.

pooled_size identifies the size of the pooling section, i.e., the size of the output (in bins or pixels) after the pooling is performed, as (height, width).

sampling_ratio is the number of sampling points in the interpolation grid used to compute the output value of each pooled output bin. If > 0, then exactly sampling_ratio x sampling_ratio sampling points per bin are used. If <= 0, then an adaptive number of grid points are used (computed as ceil(roi_width / output_width), and likewise for height).

spatial_scale is a scaling factor that maps the box coordinates to the input coordinates. For example, if your boxes are defined on the scale of a 224x224 image and your input is a 112x112 feature map (resulting from a 0.5x scaling of the original image), you’ll want to set this to 0.5.

position_sensitive, a bool value.

continuous_coordinate. a bool value.

Parameters:

x (Tensor[N, H, W, C]) – the feature map, i.e. a batch with N elements. Each element contains C feature maps of dimensions H x W.
rois (Tensor[roi_batch, 5]) – the list of RoIs and each ROI contains the index of the corresponding element in the batch, i.e. a number in [0, N - 1], and the box coordinates in (x1, y1, x2, y2) format where the regions will be taken from. The coordinate must satisfy 0 <= x1 < x2 and 0 <= y1 < y2.

Returns:

the fixed-size feature maps, i.e., the pooled RoIs.

Return type:

Tensor[roi_batch, pooled_size, pooled_size, C]

aitemplate.compiler.ops.simplify_intvar_values(sym_val: Basic)[source]

Given a symbolic value, resolve the symbol’s value range.

Example: ‘symbol_A’ has value range of [10, 20] simplify_intvar_values(symbol_A * 3 + 4) returns [34, 64]

class aitemplate.compiler.ops.size[source]

Returns the size of the input tensor. If dim is not specified, the returned value is the same as tensor.shape(). If dim is specified, returns an int holding the size of that dimension.

This op doesn’t generate any code.

Methods:

gen_function()

call backend function

gen_function() → str[source]: call backend function

class aitemplate.compiler.ops.slice_reshape_scatter(scatter_dim: int, element_func: Optional[str] = None)[source]

represent slice + concat + reshape + concat pattern with slice + concat

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

class aitemplate.compiler.ops.slice_scatter(scatter_dim: int)[source]

This op represents a special fusion case where the inputs of a concatenate op all come from slice ops. In such a case, we can remove the concatenate op by placing each slice’s output into the correct location in the original concatenate’s output.

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

class aitemplate.compiler.ops.softmax[source]

Applies the Softmax function to a 2D input Tensor rescaling them so that the elements of the n-dimensional output Tensor lie in the range [0,1] and sum to 1.

Softmax is defined as:

\[\text{Softmax}(x_{i}) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}\]

Parameters:

input (Tensor [N, M]) –
dim (int) – optional, a dimension along which Softmax will be computed (so every slice
Default (along dim will sum to 1).) – None, in this case the input tensor will be treated as
tensor. (a 1-D) –

Returns:

a Tensor of the same dimension and shape as the input with values in the range [0, 1].

Return type:

Tensor

Methods:

gen_function() → str[source]

Generate function body.

Returns:: The rendered template of generated function body.
Return type:: str

gen_profiler(workdir: Optional[str] = None, dynamic_profiling_strategy=DynamicProfileStrategy.HINTS) → None[source]

Generator profiler. The profiler files are standalone executable for profiling.

Parameters:

workdir (str, optional) – Base dir to keep profiling source codes, by default “./”
dynamic_profiling_strategy (DynamicProfileStrategy, optional) – A dynamic profiling strategy, used to filter generated profiles at compile time. See also: profile()

profile(workdir='./', devices=None, dynamic_profiling_strategy=DynamicProfileStrategy.MAX)[source]

Selects the fastest kernel configurations.

Parameters:

workdir (str, optional) – Base dir to keep profiling source codes, by default “./”
devices (list, optional) – Devices used for profiling, by default device 0 will be used.
dynamic_profiling_strategy (DynamicProfileStrategy, optional) – A dynamic profiling strategy. By default MAX is used, i.e. to profile a dynamic range, an upper bound will be used.

class aitemplate.compiler.ops.split[source]

Splits the tensor into chunks on the specified dimension.

Parameters:

x (Tensor) – tensor to split.
split_sizes (List[int]) – list of sizes for each chunk
dim (int) – dimension along which to split the tensor

Returns:

the list of output tensors

Return type:

List[Tensor]

Example

>>> X = Tensor(shape=[2, 1], name="X", is_input=True)
>>> Y = ops.split()(X, 2, dim=0)
[Tensor(shape=[IntImm(1), IntImm(1)]), Tensor(shape=[IntImm(1), IntImm(1)])]

Methods:

`gen_function`()	Generates function source code string.
`remove_output_at`(indices)	This function removes the outputs in indices from the "outputs" attribute and sets output_masks[indices] to be False.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

remove_output_at(indices: Union[int, Sequence[int]]) → None[source]

This function removes the outputs in indices from the “outputs” attribute and sets output_masks[indices] to be False. Note that the indices are based on the current “outputs”.

Parameters:: indices (Union[int, Sequence[int]]) – the index of an output or indices of multiple outputs based on the current “outputs”
Return type:: None

class aitemplate.compiler.ops.squeeze(dim: Optional[int])[source]

Examines the specified dimension and gets rid of it if it is of size 1.

>>> x = Tensor(shape=[IntImm(3), IntImm(2), IntImm(1)])
>>> squeeze(2)(x)
Tensor(shape=[IntImm(3), IntImm(2)])

>>> x = Tensor(shape=[IntImm(3), IntImm(2), IntImm(1)])
>>> squeeze(1)(x)
Tensor(shape=[IntImm(3), IntImm(2), IntImm(1)])

>>> x = Tensor(shape=[IntImm(4), IntImm(1), IntImm(3)])
>>> squeeze(-2)(x)
Tensor(shape=[IntImm(4), IntImm(3)])

>>> x = Tensor(shape=[IntImm(1), IntImm(1), IntImm(4)])
>>> squeeze(None)(x)
Tensor(shape=[IntImm(4)])

There are some additional assumptions for dynamic dims. Since our shape inference system cannot handle outputs with variable outputs, we assume that if a dynamic dim is squeezed, it contains no ones:

>>> x = Tensor(shape=[IntVar([3, 2]), IntImm(2)])
>>> y = Tensor(shape=[IntVar([1, 2]), IntImm(2)])
>>> squeeze(0)(x) # OK
Tensor(shape=[IntVar([3, 2]), IntImm(2)])
>>> squeeze(1)(y) # error!

dim (Optional[int]) : the dimension to get rid of. If None, get rid of all dimensions of size 1.

Parameters:: x (Tensor) – the source tensor to squeeze.
Returns:: the squeezed tensor.
Return type:: Tensor

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –

class aitemplate.compiler.ops.topk(k)[source]

Returns the k largest elements of the given input tensor along its last dimension.

k the k in “top-k”.

Parameters:: x (Tensor) – the input tensor
Returns:: the output tensor with last dimension being k.
Return type:: Tensor

Example

X = Tensor(shape=[2, 800], name="X", is_input=True)
value, indice = ops.topk(k=300)(X)
y_shape = [d._attrs["values"][0] for d in indice.shape()]
print(y_shape)

Outs:
[2, 300]

Methods:

`gen_function`()	call backend function
`gen_profiler`([workdir, ...])	Profile TopK to get workspace :param workdir: [description], by default None :type workdir: str, optional :param dynamic_profiling_strategy: A dynamic profiling strategy, used to filter generated profiles at compile time. See also: `profile()` :type dynamic_profiling_strategy: DynamicProfileStrategy, optional.
`profile`([workdir, devices, ...])	Get the TopK Op workspace

gen_function() → str[source]: call backend function

gen_profiler(workdir: Optional[str] = None, dynamic_profiling_strategy=None) → None[source]

Profile TopK to get workspace :param workdir: [description], by default None :type workdir: str, optional :param dynamic_profiling_strategy: A dynamic profiling strategy, used to filter generated profiles at compile time.

See also: profile()

profile(workdir='./', devices=None, dynamic_profiling_strategy=None)[source]

Get the TopK Op workspace

Parameters:

workdir (str, optional) – Base dir to keep profiling source codes, by default “./”
devices (list, optional) – Devices used for profiling, by default device 0 will be used.
dynamic_profiling_strategy (DynamicProfileStrategy, optional) – A dynamic profiling strategy. By default MAX is used, i.e. to profile a dynamic range, an upper bound will be used.

class aitemplate.compiler.ops.transpose[source]: Returns a tensor with its two dimensions transposed. This returned tensor is not a view. Dims can be negative.

class aitemplate.compiler.ops.transposed_conv2d(stride, pad, dilate=1, group=1)[source]

Transposed conv2d.

Applies a 2D transposed convolution on input in shape (N, H, W, C_in) and produces output in shape (N, H_out, W_out, C_out). N is batch size, H, W are the height and width of the input images in pixels, and C is the number of channels.

This module can be seen as the gradient of Conv2d with respect to its input. It is also known as a fractionally-strided convolution or a deconvolution (although it is not an actual deconvolution operation as it does not compute a true inverse of convolution). For more information, see the visualizations here and the Deconvolutional Networks paper.

stride controls the stride for the cross-correlation.
pad controls the amount of implicit zero padding on both sides for dilation * (kernel_size - 1) - padding number of points.
dilate controls the spacing between the kernel points; also known as the à trous algorithm. It is harder to describe, but the link here has a nice visualization of what dilation does.
group controls the number of blocked connections from input channels to output channels.

Parameters:

input – input tensor of shape \((N , H , W, \text{in\_channels})\)
weight – filters of shape \((\text{out\_channels} , K_h, K_w, \frac{\text{in\_channels}}{\text{groups}})\)

This operator uses “channels_last” data format. Below is an example and its equivalence in PyTorch:

X = Tensor(shape=[N, H, W, C_in], dtype="float16", name="images", is_input=True)
W = Tensor(shape=[C_out, K_h, K_w, C_in], dtype="float16", name="weight", is_input=True)
OP = aitemplate.compiler.ops.transposed_conv2d(stride=1, pad=1, dilate=1)
Y = OP(X, W)

X_pt = NHWC2NCHW(X_ait)
W_pt = NHWC2NCHW(W_ait)
Y_pt = torch.nn.functional.conv_transpose2d(X_pt, W_pt)
Y = NCHW2NHWC(Y_pt)

class aitemplate.compiler.ops.transposed_conv2d_bias(stride, pad, dilate=1, group=1)[source]

Transposed conv2d with bias.

Applies a 2D transposed convolution on input in shape (N, H, W, C_in), adds a bias in shape (C_out) and produces output in shape (N, H_out, W_out, C_out). N is batch size, H, W are the height and width of the input images in pixels, and C is the number of channels.

Parameters:

input – input tensor of shape \((N , H , W, \text{in\_channels})\)
weight – filters of shape \((\text{out\_channels} , K_h, K_w, \frac{\text{in\_channels}}{\text{groups}})\)
bias –
optional bias tensor of shape \((\text{out\_channels})\)

This operator uses “channels_last” data format. Below is an example and its equivalence in PyTorch:

X = Tensor(shape=[N, H, W, C_in], dtype="float16", name="images", is_input=True)
W = Tensor(shape=[C_out, K_h, K_w, C_in], dtype="float16", name="weight", is_input=True)
B = Tensor(shape=[C_out], dtype="float16", name="bias", is_input=True)
OP = aitemplate.compiler.ops.transposed_conv2d_bias(stride=1, pad=1, dilate=1)
Y = OP(X, W, B)

X_pt = NHWC2NCHW(X_ait)
W_pt = MHWC2NCHW(W_ait)
B_pt = NHWC2NCHW(B_ait)
Y_pt = torch.nn.functional.conv_transpose2d(X_pt, W_pt, bias=B_Pt)
Y = nchw2nhwc(Y_pt)

class aitemplate.compiler.ops.transposed_conv2d_bias_relu(stride, pad, dilate=1, group=1)[source]

Transposed conv2d with bias + relu.

Applies a 2D transposed convolution on input in shape (N, H, W, C_in), adds a bias in shape (C_out), performs relu and produces output in shape (N, H_out, W_out, C_out). N is batch size, H, W are the height and width of the input images in pixels, and C is the number of channels.

Parameters:

input – input tensor of shape \((N , H , W, \text{in\_channels})\)
weight – filters of shape \((\text{out\_channels} , K_h, K_w, \frac{\text{in\_channels}}{\text{groups}})\)
bias –
optional bias tensor of shape \((\text{out\_channels})\)

This operator uses “channels_last” data format. Below is an example and its equivalence in PyTorch:

X = Tensor(shape=[N, H, W, C_in], dtype="float16", name="images", is_input=True)
W = Tensor(shape=[C_out, K_h, K_w, C_in], dtype="float16", name="weight", is_input=True)
B = Tensor(shape=[C_out], dtype="float16", name="bias", is_input=True)
OP = aitemplate.compiler.ops.transposed_conv2d_bias_relu(stride=1, pad=1, dilate=1)
Y = OP(X, W, B)

X_pt = NHWC2NCHW(X_ait)
W_pt = MHWC2NCHW(W_ait)
B_pt = NHWC2NCHW(B_ait)
Y_pt = torch.nn.functional.conv_transpose2d(X_pt, W_pt, bias=B_pt)
Result_pt = torch.nn.functional.relu(Y_pt)
Result = NCHW2NHWC(Result_pt)

class aitemplate.compiler.ops.tuple_construct[source]: Construct a tuple of tensors.

class aitemplate.compiler.ops.unsqueeze(dim: int)[source]

Adds a dimension of size 1 at a specified location. >>> x = Tensor(shape=[IntImm(4), IntImm(3)]) >>> unsqueeze(0)(x) Tensor(shape=[IntImm(1), IntImm(4), IntImm(3)]) >>> unsqueeze(-1)(x) Tensor(shape=[IntImm(4), IntImm(3), IntImm(1)])

Parameters:: dim (int) – Where to add the dimension, must be in range [-input_ndim - 1, input_dim + 1)

class aitemplate.compiler.ops.upsampling2d(scale_factor, mode)[source]

Applies a 2D bilinear upsampling to an input signal composed of several input channels.

To specify the scale, it takes the scale_factor as it’s constructor argument.

scale_factor (float): multiplier for spatial size.

Parameters:: input (Tensor [N, H, W, C]) – the input data.
Returns:: Tensor [N, H_out, W_out, C].

class aitemplate.compiler.ops.upsampling2d_add(scale_factor, mode)[source]

Fused op for bilinear_upsampling + add.

Applies a 2D bilinear upsampling to an input signal composed of several input channels, and adds an residual.

To specify the scale, it takes the scale_factor as it’s constructor argument.

scale_factor (float): multiplier for spatial size.

Parameters:

input (Tensor [N, H, W, C]) – the input data.
r (Tensor [N, H_out, W_out, C]) – the residual.

Returns:

Tensor [N, H_out, W_out, C].

class aitemplate.compiler.ops.var(dim, unbiased, keepdim=False, dtype=None)[source]

Calculates the variance of all elements in the input tensor.

.attr.:dim : int or tuple of python:ints the dimension or dimensions to reduce
.attr.:unbiased : bool specifying whether to use Bessel’s correction or not
.attr.:keepdim : bool, optional keep the reduced dimensions if True, default is False
.attr.:dtype : str, optional the type of the return tensor. If it is not None, the input tensor is cast to dtype before reduction.

Parameters:: input (Tensor) – the input tensor.
Returns:: Tensor.

class aitemplate.compiler.ops.vector_norm(ord_kind=2, dim=None, keepdim=False, dtype=None)[source]

Vector_norm op implementation that simulates pytorch’s linalg.vector_norm. Currently, we only support L2 norm.

.attr.:ord_kind (int or float or str), optional specifies the vector norm to be computed. (default: 2)
.attr.:dim (None or int or tuple of python:ints), optional the dimension or dimensions to be normalized. (default: None, in this case the input tensor will be treated as a 1-D tensor)
.attr.:keepdim (bool), optional keep the normalized dimensions if True, default is False
.attr.:dtype (str), optional the type of the return tensor. If it is not None, the input tensor is cast to dtype before reduction.

Parameters:: input (Tensor) – the input tensor.
Returns:: Tensor.

class aitemplate.compiler.ops.where[source]

Return a tensor of elements selected from either input or other, depending on condition.

Parameters:

condition (A bool Tensor) – When True (nonzero), yield input, otherwise yield other
input_tensor (Tensor or Scalar) – value (if input is a scalar) or values selected at indices where condition is True
other_tensor (Tensor or Scalar) – value (if other is a scalar) or values selected at indices where condition is False
dtype – output dtype if both input_tensor and output_tensor is scalar

Returns:

A tensor of shape equal to the shape of condition

Return type:

Tensor

Methods:

gen_function()

Generates function source code string.

gen_function() → str[source]

Generates function source code string.

Returns:: str
Return type:: a string which contains C++ function implementation source code.
Raises:: NotImplementedError –