# Scheduling and Allocation

QNN HTP targets high performance by exploiting parallelism and making
efficient use of hardware resources. The initial graph is constructed
through a series of `append_node` calls, after which the graph goes
through the `prepare` phase. With cost and dependency information
available, the op ordering can be determined algorithmically.
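
As a rough illustration of deriving an ordering from cost and dependency information, the following minimal sketch performs a topological sort that picks the cheapest ready op first. It is not the QNN HTP scheduler: the function `order_ops`, the `deps` and `cost` dictionaries, and the op names are all hypothetical and exist only for this example.

```python
import heapq

def order_ops(deps, cost):
    """Return a dependency-respecting order, cheapest ready op first.

    deps: hypothetical map from op name to the ops whose outputs it consumes.
    cost: hypothetical map from op name to a numeric cost estimate.
    """
    consumers = {op: [] for op in deps}
    pending = {op: len(inputs) for op, inputs in deps.items()}
    for op, inputs in deps.items():
        for src in inputs:
            consumers[src].append(op)

    # Ops with no unmet dependencies are ready; the heap orders them by cost.
    ready = [(cost[op], op) for op, n in pending.items() if n == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        _, op = heapq.heappop(ready)
        order.append(op)
        for nxt in consumers[op]:
            pending[nxt] -= 1
            if pending[nxt] == 0:
                heapq.heappush(ready, (cost[nxt], nxt))

    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order

deps = {
    "input": [],
    "conv_a": ["input"],
    "conv_b": ["input"],
    "add": ["conv_a", "conv_b"],
}
cost = {"input": 0, "conv_a": 5, "conv_b": 2, "add": 1}

order = order_ops(deps, cost)
print(order)  # ['input', 'conv_b', 'conv_a', 'add']
```

Here `conv_a` and `conv_b` are independent, so the dependency information alone allows either order, and the cost information breaks the tie.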

In QNN HTP, both scheduling and allocation are done in the
`Graph::prepare()` stage. As an overview, the following steps occur
with regard to *scheduling and allocation*:

1. Memory **blocks are registered** with the allocator.

    - During `prepare`, before scheduling and allocation, all blocks
of data are registered with the allocator, informing the allocator
of their memory type, minimum size, and alignment requirements.
The two types of memory blocks are `Plain` and `TCM`. `TCM`
here refers to `VTCM`.
2. The **pre-scheduler** fits as much data into `TCM` as possible.

    - At this point, the scheduler tries to develop a topological
ordering that reduces `TCM` usage by iteratively partitioning
the graph at `low-TCM` boundaries. This pass outputs a “runlist”,
the ordered sequence of ops that the later passes operate on.
3. **Spill/fill nodes are inserted** where necessary.

    - Based on the runlist output by the pre-scheduler, the spill pass
adds up the requested `VTCM` at each op. The requested `VTCM` at
an op can be much higher than what the op itself inputs and
outputs, because other blocks of data that were output earlier and
are not used as input until later might still be in `VTCM`. To
reduce the required `VTCM` usage across ranges of ops, the spill
pass inserts spill and fill ops, which temporarily copy data out
of `VTCM` to make room for other data and then copy it back in
before it is needed again.
4. Some **ops are split** into launch-wait pairs.

    - Some ops can run on background resources. In this step, each
such op is split into a pair: a launch op that starts the
operation on a background resource, and a wait op that blocks
until the operation completes, so that no other op starts before
its inputs are ready.
5. **Offsets are allocated** for blocks that reside in `VTCM`.

    - The allocator takes in the modified runlist after spills and fills
have been inserted. Using the requirements for each block of data
that were registered earlier, the allocator assigns offsets to
each `TCM` block within `VTCM`. If two blocks of data do not
have to be in `VTCM` at the same time, the allocator might
assign them offsets such that their address ranges overlap. As a
side effect, two ops that could otherwise have been run in either
order can no longer be swapped, because swapping them would
require blocks that were allocated at overlapping addresses to be
in `VTCM` at the same time. The allocator tries to avoid this
situation where it can, since these new restrictions can
constrain available parallelism.
6. Ops are **re-scheduled** to maximize parallelism.

    - The final scheduler moves some ops earlier or later to increase
parallelism while respecting dependencies within the allocated
graph. This pass takes in the existing runlist and outputs a new
runlist that has been optimized for parallelism. Because the
final scheduler runs after allocation has been performed, it must
obey the restrictions the allocator introduced by allocating some
blocks at overlapping address ranges within `VTCM`.
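
Steps 3 and 5 above can be sketched with a small model of a runlist. The sketch below is only an illustration under simplifying assumptions, not the QNN HTP spill pass or allocator: every block is assumed to live in `VTCM` from the op that outputs it until its last use, alignment is ignored, no spill or fill ops are inserted, and all names (`live_ranges`, `pressure`, `assign_offsets`, the runlist tuples, the block sizes) are hypothetical.

```python
def live_ranges(runlist):
    """Map each block to (first, last) runlist index at which it is live."""
    ranges = {}
    for i, (_op, outputs, inputs) in enumerate(runlist):
        for blk in list(outputs) + list(inputs):
            first, _ = ranges.get(blk, (i, i))
            ranges[blk] = (first, i)
    return ranges

def pressure(runlist, sizes):
    """Total size of all blocks live at each op (the requested VTCM)."""
    ranges = live_ranges(runlist)
    return [
        sum(sizes[b] for b, (lo, hi) in ranges.items() if lo <= i <= hi)
        for i in range(len(runlist))
    ]

def assign_offsets(ranges, sizes):
    """First-fit offsets; blocks with disjoint live ranges may share addresses."""
    placed = {}
    for blk in sorted(ranges, key=lambda b: ranges[b][0]):
        lo, hi = ranges[blk]
        # Address intervals held by already-placed blocks that are live
        # at the same time as this block.
        busy = sorted(
            (placed[o], placed[o] + sizes[o])
            for o in placed
            if ranges[o][0] <= hi and lo <= ranges[o][1]
        )
        offset = 0
        for start, end in busy:
            if offset + sizes[blk] <= start:
                break  # the block fits in the gap before this interval
            offset = max(offset, end)
        placed[blk] = offset
    return placed

# Hypothetical runlist entries: (op name, output blocks, input blocks).
runlist = [
    ("op0", ["a"], []),
    ("op1", ["b"], ["a"]),
    ("op2", ["c"], ["b"]),
    ("op3", ["d"], ["a", "c"]),
]
sizes = {"a": 4, "b": 2, "c": 2, "d": 2}

usage = pressure(runlist, sizes)
offsets = assign_offsets(live_ranges(runlist), sizes)
print(usage)    # [4, 6, 8, 8]
print(offsets)  # {'a': 0, 'b': 4, 'c': 6, 'd': 4}
```

At `op2` the op itself only touches blocks `b` and `c` (4 units in total), yet the requested amount is 8 because `a` is still waiting for its later use in `op3`; this is the situation in which a spill pass would consider moving `a` out temporarily. Blocks `b` and `d` are never live together, so both receive offset 4, which is exactly the kind of overlap that any later reordering has to respect.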

Last Published: Jun 04, 2026
