Python Runtime Note

Python Model

Model is a collection of Python bindings to the C++ AIT runtime. This section describes the API.

AITData

This class represents a contiguous blob of memory that AIT will use as a tensor. It is simply a named tuple with these fields:

  • data_ptr: int: An unowned pointer to GPU memory. In general, all of the APIs expect that this pointer will be valid for the entire duration of the call.

  • shape: List[int]: The shape of the tensor.

  • dtype: str: The tensor’s dtype; one of “float32”, “float16”, “int32”, “int64”. Note that most ops only support float16 at this stage.

When using AITemplate with PyTorch, AITData can be constructed with the torch_to_ait_data utility:

x = torch.randn(3, 3, 3).half().cuda()
# Equivalent to AITData(x.data_ptr(), [3, 3, 3], "float16")
x_ait = torch_to_ait_data(x)

If PyTorch is not available, Model provides a set of functions for copying, allocating, and freeing GPU memory. See the docstrings in compiler/model.py for more information.

run

run takes inputs and outputs as collections of AITData instances. Both arguments can be passed as either an ordered list or a dictionary (mapping name to tensor).

# Arguments as a dictionary
module.run(
  {"input0": in0_ait, "input1": in1_ait},
  {"output0": out0_ait, "output1": out1_ait},
)

# Arguments as an ordered list. Note that you might need to query
# the index mapping.
input_name_to_idx = module.get_input_name_to_index_map()
output_name_to_idx = module.get_output_name_to_index_map()

inputs = [None] * len(input_name_to_idx)
outputs = [None] * len(output_name_to_idx)

for name in input_name_to_idx:
  inputs[input_name_to_idx[name]] = ait_inputs[name]

for name in output_name_to_idx:
  outputs[output_name_to_idx[name]] = ait_outputs[name]

module.run(inputs, outputs)

One important caveat is that each output must be allocated with the maximum possible size. This is because of dynamic shapes: the size of an output may vary from run to run, and its actual shape is not known until inference time. The maximum shape can be queried with get_output_maximum_shape():

# Can use either name or index.
name_to_idx = module.get_output_name_to_index_map()
max_shape = module.get_output_maximum_shape(name_to_idx["output"])
max_shape = module.get_output_maximum_shape("output")

Model.run returns a dictionary of output AITData instances whose (possibly dynamic) shapes are inferred at runtime.
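
For example, a common pattern is to allocate each output at its maximum size and afterwards view only the valid region using the shape reported by run (a hedged sketch; the names input0/output0 are assumptions, and x is the tensor from the AITData example above):

max_shape = module.get_output_maximum_shape("output0")
out = torch.empty(max_shape, dtype=torch.float16, device="cuda")

outputs = module.run(
  {"input0": torch_to_ait_data(x)},
  {"output0": torch_to_ait_data(out)},
)

# The runtime reports the actual shape of each output, which may be
# smaller than max_shape when the model has dynamic dimensions.
real_shape = outputs["output0"].shape
out_valid = out[tuple(slice(0, d) for d in real_shape)]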

Nullptr Inputs/Outputs

In general, inputs are allowed to be null if their size is 0 (i.e. at least one dimension is 0). The runtime enforces this with a check before any kernels are launched.

if (input_name == nullptr && dim0 * dim1 * ... * dimN != 0) {
  throw std::runtime_error("input_name cannot be null!");
}

This is convenient since torch.Tensor.data_ptr() returns null for size-zero tensors. The dynamic shape computation is skipped if the lower bound of the tensor’s size is positive.
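
For instance, a zero-size PyTorch tensor has a null data pointer and is still a valid input, as long as its shape genuinely contains a zero dimension (a small sketch; the names input0/output0 and the buffer out0 are assumptions):

# data_ptr() is 0 here because the tensor has zero elements.
empty_in = torch.empty(0, 8, dtype=torch.float16, device="cuda")
module.run_with_tensors({"input0": empty_in}, {"output0": out0})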

Constants

There are two types of constants in AIT: bound and unbound constants. A bound constant is known at compile time and may participate in constant folding. Bound constants are copied into GPU memory when the model is loaded. Values for bound constants may be provided by passing a dictionary (mapping constant name to AIT tensor) to compile_model.
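
For example, a bound constant could be supplied at compile time roughly like this (a hedged sketch; the graph output, target, workdir, constant name, and the constants keyword are assumptions based on the description above):

w = torch.randn(16, 16).half().cuda()
module = compile_model(
  output_tensor,   # the graph's output Tensor (assumed to exist)
  target,          # a previously created Target (assumed to exist)
  "./tmp",
  "my_model",
  constants={"w": torch_to_ait_data(w)},  # bound constant; may be constant-folded
)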

Unbound constants, on the other hand, do not participate in constant folding and must be provided before running the model. These must be set via Model.set_constant:

module.set_constant("my_constant", AITData(...))
# The pointer in the tensor must remain valid for the entire duration of run()
module.run(...)

Constants are read-only and shared with all runtimes in the ModelContainer.

run_with_tensors

run_with_tensors is a convenience method with the same interface as run, except it can take lists (or dicts) of torch.Tensor instances:

input0 = torch.randn(input0_shape).cuda().half()
output0 = torch.empty(output0_shape).cuda().half()
# Returns a dictionary of reshaped outputs
result = module.run_with_tensors([input0], [output0])

Streams and Asynchronous Predictions

A pointer to a stream can optionally be passed to run. If none is given, the prediction happens on the default stream 0. If the sync argument is set to True, the stream is synchronized before run() returns. sync is True by default.
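
As an illustration, an asynchronous prediction on a PyTorch CUDA stream could look like this (a hedged sketch; the stream_ptr keyword name is an assumption, and the tensors are those from the run_with_tensors example above):

stream = torch.cuda.Stream()
module.run_with_tensors(
  [input0],
  [output0],
  stream_ptr=stream.cuda_stream,  # assumed keyword for the raw stream handle
  sync=False,                     # return immediately without synchronizing
)
# ... overlap other host work here ...
stream.synchronize()  # wait for the prediction to finish before reading output0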

Multiple predictions can happen at the same time (on the same or different streams). Under the hood, there is a fixed-size pool of runtime objects. When all the runtimes are used, run() blocks until one becomes available. The size of this pool can be configured with the num_runtimes option in Model’s constructor.
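
For example, the pool size could be raised when several threads issue predictions concurrently (a sketch; the shared library path and the constructor's first argument are assumptions):

# Allow up to four predictions to be in flight at the same time.
module = Model("./tmp/my_model/test.so", num_runtimes=4)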

CUDA Graph

run also takes a graph_mode option. If set to True, the runtime will try to use CUDA graphs (https://developer.nvidia.com/blog/cuda-graphs/) to run the model. graph_mode is not supported on ROCm.
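
Enabling it only requires passing the flag (a sketch using run_with_tensors with the tensors from the earlier example):

# The first graph-mode run captures and instantiates the CUDA graph;
# subsequent runs reuse (or update) the instantiated graph.
result = module.run_with_tensors([input0], [output0], graph_mode=True)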

The following is a high level overview of how graph mode works:

  1. Each Model has an internal stream used for graph capturing. The model first runs all ops on this stream in capture mode. No kernel launches happen during this stage.

  2. If this is the first run, a graph is instantiated via cudaGraphInstantiate.

  3. On subsequent runs, we try to avoid the relatively expensive cudaGraphInstantiate call by updating the graph executor (cudaGraphExecUpdate). However, a new graph may still be instantiated if the topology of the graph changes between runs.

  4. Once we have the graph executor, we launch a single kernel on the stream that the user provided to run().

Graph mode is mainly beneficial when there are many small kernel launches. A lot of overhead can be avoided since there is only a single kernel launch in graph mode.