# Reducing TCM requirements for performance and functionality

- [Reducing the required input and output sizes](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_guidelines_tcm_requirements.html#reducing-the-required-input-and-output-sizes)
- [Padding choices](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_guidelines_tcm_requirements.html#padding-choices)
- [Reduce precision](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_guidelines_tcm_requirements.html#reduce-precision)
- [Reducing TCM pressure throughout the network](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_guidelines_tcm_requirements.html#reducing-tcm-pressure-throughout-the-network)
- [Generic rule of thumb guidelines for various op activations to fit in TCM](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_guidelines_tcm_requirements.html#generic-rule-of-thumb-guidelines-for-various-op-activations-to-fit-in-tcm)

The firmware divides work into smaller pieces (tiles) and attempts to schedule and allocate
those smaller work items in a way that improves performance and reduces traffic to external
memory as much as possible.  Working on horizontal strips of activation data is typically
the most efficient in execution, and so we try to execute in that format as much as possible.
However, for very large activations the amount of memory required to process a horizontal
strip of data can be very large.  This is especially true on devices with smaller TCM sizes.

## [Reducing the required input and output sizes](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_guidelines_tcm_requirements.html#id1)

To compute a block of output data, we need one or more blocks of input data.  Striding, dilation,
padding, and filter sizes all increase the amount of input data required.  By making the required
input data smaller, you can reduce the pressure on TCM.  Currently, this is especially
true for the width and depth dimensions; reducing width and depth can have a very strong impact on
TCM usage requirements.
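As a rough illustration of how these parameters grow the input footprint, the standard convolution size relation gives the input extent needed along one dimension to produce a block of outputs. This is a minimal sketch: the function name `required_input_extent` is hypothetical, and it does not model how the firmware actually tiles or schedules data.

```python
def required_input_extent(out_extent: int, kernel: int,
                          stride: int = 1, dilation: int = 1) -> int:
    """Input elements needed along one dimension to produce `out_extent` outputs."""
    # A dilated filter spans dilation * (kernel - 1) + 1 input elements.
    effective_kernel = dilation * (kernel - 1) + 1
    # Each additional output moves the filter window by `stride` elements.
    return (out_extent - 1) * stride + effective_kernel


print(required_input_extent(8, kernel=3))              # 10
print(required_input_extent(8, kernel=3, stride=2))    # 17
print(required_input_extent(8, kernel=3, dilation=2))  # 12
```

For the same 8 outputs, stride 2 or dilation 2 needs noticeably more input than a plain 3-wide filter, which is the effect described above.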

## [Padding choices](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_guidelines_tcm_requirements.html#id2)

When a convolution is padded by a small amount to produce the same output size as input size, we
typically require three tiles of input data to be available: the data above, the data below, and
the data in the same location.  If there is a long sequence of convolutions, consider applying a
large padding early in the sequence and then using “VALID” (unpadded) convolutions
for the subsequent operations.  Unpadded convolutions typically need two tiles of input
data instead of three.
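The amount of up-front padding can be estimated with simple arithmetic. The sketch below assumes stride-1 convolutions with odd filter sizes and no dilation; the helper names are hypothetical. Note that padding once and then running unpadded convolutions is not numerically identical to padding every layer, because values near the border are computed differently, so accuracy should be re-checked after such a change.

```python
def upfront_padding(kernel_sizes):
    """Per-side padding that stands in for same-size padding on each convolution.

    Assumes stride 1, no dilation, and odd kernel sizes.
    """
    return sum((k - 1) // 2 for k in kernel_sizes)


def valid_chain_output(extent, kernel_sizes):
    """Extent left after a chain of unpadded (VALID) stride-1 convolutions."""
    for k in kernel_sizes:
        extent -= k - 1  # each unpadded convolution trims k - 1 elements
    return extent


kernels = [3, 3, 5]
pad = upfront_padding(kernels)
print(pad)                                         # 4
print(valid_chain_output(64 + 2 * pad, kernels))   # 64
```

Padding a 64-wide activation by 4 on each side lets the three unpadded convolutions return it to 64 wide.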

## [Reduce precision](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_guidelines_tcm_requirements.html#id3)

16-bit activation data takes twice the amount of memory that 8-bit activation data does.  Consider
trying smaller data types if possible.
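To make the factor of two concrete, the raw size of an activation can be computed directly from its shape. This is only a sketch with a hypothetical helper name; it ignores any alignment or rounding applied on the device.

```python
def activation_bytes(height: int, width: int, depth: int, bits: int) -> int:
    """Raw (unpadded) size in bytes of a height x width x depth activation."""
    return height * width * depth * (bits // 8)


print(activation_bytes(112, 112, 64, bits=8))    # 802816
print(activation_bytes(112, 112, 64, bits=16))   # 1605632
```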

## [Reducing TCM pressure throughout the network](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_guidelines_tcm_requirements.html#id4)

When designing a network, consider the activation sizes throughout the body of the network,
not just the network's input and output sizes. If activations in the body of the
network are too large to fit in TCM, or create significant TCM pressure, the network can
fail to prepare and/or suffer a significant performance impact. Even a
network with a low input resolution can have TCM size issues if it has very large activations
in its body. The engine’s capability to handle larger activations continues to improve.

## [Generic rule of thumb guidelines for various op activations to fit in TCM](https://docs.qualcomm.com/doc/80-63442-10/topic/htp_guidelines_tcm_requirements.html#id5)

Weights are stored in TCM and take up TCM space.
The rule-of-thumb guidelines below, based on activation sizes, help estimate whether various kinds of ops
fit in different TCM configurations.

- Activation widths need to be rounded up to the nearest multiple of 8 (uint8) or 4 (uint16)
- Activation depths need to be rounded up to the nearest multiple of 32 for both uint8 and uint16
- Activation heights need to be rounded up to the nearest multiple of 8 for both uint8 and uint16
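The rounding rules above can be sketched as a small helper. The function and table names are hypothetical, and the dtype strings are just labels for 8-bit and 16-bit activations.

```python
# Width multiple depends on the data type; height and depth do not.
WIDTH_MULTIPLE = {"uint8": 8, "uint16": 4}
HEIGHT_MULTIPLE = 8
DEPTH_MULTIPLE = 32


def round_up(value: int, multiple: int) -> int:
    """Round `value` up to the nearest multiple of `multiple`."""
    return -(-value // multiple) * multiple


def padded_shape(height: int, width: int, depth: int, dtype: str):
    """Return (height, width, depth) after applying the rounding rules."""
    return (
        round_up(height, HEIGHT_MULTIPLE),
        round_up(width, WIDTH_MULTIPLE[dtype]),
        round_up(depth, DEPTH_MULTIPLE),
    )


print(padded_shape(30, 33, 3, "uint8"))    # (32, 40, 32)
print(padded_shape(30, 33, 3, "uint16"))   # (32, 36, 32)
```

A 3-channel activation is rounded up to depth 32, so shallow activations can occupy far more space than their nominal size suggests.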

The following equations can be used to get an approximate idea of TCM fit.

- Unary Elementwise Ops
    - ACTIVATION\_WIDTH \* 256 \* elsize \* 2 &lt;= TCM\_SIZE/2
    - ACTIVATION\_WIDTH needs to be rounded up to the nearest multiple of 8 (uint8) or 4 (uint16)
    - If the LHS above is greater than TCM\_SIZE/2 but less than TCM\_SIZE, it might still fit but will adversely affect performance
- Binary Elementwise Ops
    - ACTIVATION\_WIDTH \* 256 \* elsize \* 3 &lt;= TCM\_SIZE/2
    - ACTIVATION\_WIDTH needs to be rounded up to the nearest multiple of 8 (uint8) or 4 (uint16)
    - If the LHS above is greater than TCM\_SIZE/2 but less than TCM\_SIZE, it might still fit but may adversely affect performance
- Convolution
    - ACTIVATION\_INPUT\_WIDTH \* ACTIVATION\_INPUT\_DEPTH \* FILTER\_HEIGHT \* tiling\_dimension + FILTER\_WIDTH \* FILTER\_HEIGHT \* FILTER\_DEPTH \* number of channels + ACTIVATION\_OUTPUT\_WIDTH \* tiling\_dimension &lt;= TCM\_SIZE/2
    - ACTIVATION\_INPUT\_WIDTH and ACTIVATION\_OUTPUT\_WIDTH need to be rounded up to the nearest multiple of 8 (uint8) or 4 (uint16)
    - ACTIVATION\_INPUT\_DEPTH needs to be rounded up to the nearest multiple of 32
    - FILTER\_DEPTH needs to be rounded up to the nearest multiple of 32
    - Stride-2 and dilated convolutions may have higher VTCM requirements
    - tiling\_dimension: 8 (8 byte height as part of the tile dimension WxHxD = 8x8x32)
    - number of channels: 32 (32 channels as minimum chunk for the output)
- GlobalAvgPool
    - ACTIVATION\_WIDTH \* 256 \* elsize &lt;= TCM\_SIZE/2
    - ACTIVATION\_WIDTH needs to be rounded up to the nearest multiple of 8 (uint8) or 4 (uint16)
- Concat
    - If the dimension of every input except the last along the concatenation axis is a multiple of 8 (when concatenating on height), 8 for uint8 or 4 for uint16 (when concatenating on width), or 32 (when concatenating on depth), then the size is that of the concat’s output rounded up to 8 (uint8)/4 (uint16) in width, 8 in height, and 32 in depth
    - If the above constraint is not met, then the size is that of the concat’s output rounded up to 8 (uint8)/4 (uint16) in width, 8 in height, and 32 in depth, plus the sum of all the concat’s inputs, each rounded up to 8 (uint8)/4 (uint16) in width, 8 in height, and 32 in depth
- NMS (Non-Maxima Suppression)
    - Needs all the data to fit in &lt;= TCM\_SIZE/2
- TopK
    - For TopK Accuracy classification score computation, the following must be taken into consideration:
        - AXIS\_DIM \* 256 \* elsize \* 2 &lt;= TCM\_SIZE/2

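The elementwise, convolution, GlobalAvgPool, and TopK inequalities in this section can be evaluated with a short script. This is a sketch under stated assumptions rather than a tool from the SDK: all function names are hypothetical, `elsize` is assumed to be the element size in bytes (1 for uint8, 2 for uint16), the 2 MiB `TCM_SIZE` is only an example value, and treating a left-hand side at or above `TCM_SIZE` as "unlikely to fit" is an inference. The source states the middle band only for elementwise ops. The results are approximations, as the guidelines themselves are.

```python
WIDTH_MULTIPLE = {1: 8, 2: 4}  # elsize in bytes -> required width multiple


def round_up(value, multiple):
    """Round `value` up to the nearest multiple of `multiple`."""
    return -(-value // multiple) * multiple


def elementwise_lhs(width, elsize, buffers):
    """buffers: 2 for unary ops, 3 for binary ops, 1 for GlobalAvgPool."""
    return round_up(width, WIDTH_MULTIPLE[elsize]) * 256 * elsize * buffers


def conv_lhs(in_w, in_d, f_h, f_w, f_d, out_w, elsize=1,
             tiling_dimension=8, channels=32):
    """Left-hand side of the convolution inequality, as written in the guideline."""
    in_w = round_up(in_w, WIDTH_MULTIPLE[elsize])
    out_w = round_up(out_w, WIDTH_MULTIPLE[elsize])
    in_d = round_up(in_d, 32)
    f_d = round_up(f_d, 32)
    return (in_w * in_d * f_h * tiling_dimension
            + f_w * f_h * f_d * channels
            + out_w * tiling_dimension)


def topk_lhs(axis_dim, elsize):
    """Left-hand side of the TopK inequality (no rounding is stated for AXIS_DIM)."""
    return axis_dim * 256 * elsize * 2


def classify(lhs, tcm_size):
    """Compare a left-hand side against TCM_SIZE/2 and TCM_SIZE."""
    if lhs <= tcm_size // 2:
        return "fits"
    if lhs < tcm_size:
        return "may fit, slower"
    return "unlikely to fit"  # assumption: not stated explicitly in the guideline


TCM_SIZE = 2 * 1024 * 1024  # example value only; use the target device's TCM size

print(classify(elementwise_lhs(1920, 1, buffers=2), TCM_SIZE))  # fits
print(classify(elementwise_lhs(1920, 1, buffers=3), TCM_SIZE))  # may fit, slower
print(classify(elementwise_lhs(1920, 2, buffers=3), TCM_SIZE))  # unlikely to fit
print(conv_lhs(224, 3, 3, 3, 3, 224))                           # 183040
print(topk_lhs(1000, 1))                                        # 512000
```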
Last Published: Jun 04, 2026
