# Limitations

This chapter describes limitations discovered during testing of
this release. Future releases will provide fixes for the
discovered issues.

## General Qualcomm® Neural Processing SDK Limitations

- Qualcomm® Neural Processing SDK currently supports 4D input data, where the first
dimension is batch.
- Only batch of 1 is supported for RCNN networks like
Faster-RCNN. See [Layer
Limitations](https://docs.qualcomm.com/doc/80-63442-10/topic/limitations.html#limitations_layers) below.

## General Java API Limitations

- Confine INeuralNetwork instance usage to a single thread.
The current SDK INeuralNetwork class instances are meant to
be accessed from a single thread. Developers must make sure
that this is enforced within the application; otherwise,
unexpected errors may occur.

## General CPU Runtime Limitations

- Not all layers have been optimized for the CPU runtime. For
example, deconvolution is dramatically slower than
convolution.

## General GPU Runtime Limitations

- In GPU\_FLOAT32\_16\_HYBRID mode, the GPU kernels use
HALF\_FLOAT precision for all intermediate data handling and
FULL\_FLOAT precision for all computations. This does not
typically affect mAP for classification networks, but
intermediate values can overflow or underflow, which can
affect uses of the engine other than classification. If an
impact is observed, run the network with the CPU runtime,
which always uses FULL\_FLOAT, to check whether overflow or
underflow is the cause.
- In GPU\_FLOAT16 mode, the GPU kernels use HALF\_FLOAT
precision for all intermediate data handling and all
computations. Because the computation precision is lower
than in GPU\_FLOAT32\_16\_HYBRID mode, a negative impact on
the network’s accuracy (e.g. mAP score) is more likely.
Users are encouraged to test the accuracy of their network
in this mode to ensure that it meets the requirements of
their use case.
- For absolute size restrictions, the concept of “packed”
channels refers to the number of channels divided by 4, and
rounded up to the nearest integer:
**packed\_channels = ceil(channels / 4.0)**
- Whenever a layer has a 4-dimensional (i.e. batch x width x
height x channels) component, such as input, output, or
weight tensor, that component will have the following size
restrictions:

    - Number of packed channels \* width &lt; MaxPerGPUSize
- For all layers that have weights/biases, restrictions are:

    - Filter size \* filter size \* 4 &lt;= MaxPerGPUSize
    - Number of output channels / 4 &lt;= MaxPerGPUSize
- The MaxPerGPUSize depends on the Qualcomm Adreno™ GPU type;
the values are given below:

    - A330: 8192
    - A430, A530: 16384
- While loading a network, the GPU runtime may merge (squash)
a few layers into the preceding layers, depending on the
compatibility of the layers. As a result, performance
information for the squashed layers is missing.
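
The size restrictions above can be checked before deploying a model to the GPU runtime. The following is a minimal sketch, not SDK code: the function names and the `MAX_PER_GPU_SIZE` dictionary are hypothetical helpers, and only the formulas and limit values come from the list above.

```python
import math

# MaxPerGPUSize values from the list above, keyed by Adreno GPU model.
MAX_PER_GPU_SIZE = {"A330": 8192, "A430": 16384, "A530": 16384}


def packed_channels(channels: int) -> int:
    """Number of channels divided by 4, rounded up to the nearest integer."""
    return math.ceil(channels / 4.0)


def tensor_fits(width: int, channels: int, gpu: str) -> bool:
    """4D component rule: packed channels * width < MaxPerGPUSize."""
    return packed_channels(channels) * width < MAX_PER_GPU_SIZE[gpu]


def weights_fit(filter_size: int, out_channels: int, gpu: str) -> bool:
    """Weight/bias rules: filter size^2 * 4 and output channels / 4
    must both be <= MaxPerGPUSize."""
    limit = MAX_PER_GPU_SIZE[gpu]
    return filter_size * filter_size * 4 <= limit and out_channels / 4 <= limit
```

For example, a tensor of width 2048 with 16 channels has 4 packed channels, so `tensor_fits(2048, 16, "A330")` is false (4 \* 2048 is not strictly less than 8192), while the same tensor passes the check for the A530.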

## General DSP Runtime Limitations

- When using non-quantized models, the first network execution
after network initialization may be significantly slower
than subsequent executions. To avoid this, use a DLC file
that has been quantized by
[snpe-dlc-quantize](https://docs.qualcomm.com/doc/80-63442-10/topic/SNPE_general_tools.html#tools_snpe-dlc-quantize).
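
The quantization step mentioned above can be scripted. The sketch below is illustrative only: the file names are hypothetical, and the `--input_dlc`, `--input_list`, and `--output_dlc` options are assumed to match the installed version of the tool, so confirm them with `snpe-dlc-quantize --help` before use.

```python
import subprocess


def build_quantize_command(input_dlc: str, input_list: str, output_dlc: str) -> list:
    """Build the argument list for snpe-dlc-quantize (options are assumptions)."""
    return [
        "snpe-dlc-quantize",
        "--input_dlc", input_dlc,      # non-quantized DLC file
        "--input_list", input_list,    # text file listing raw input files
        "--output_dlc", output_dlc,    # quantized DLC file to write
    ]


def quantize(input_dlc: str, input_list: str, output_dlc: str) -> None:
    """Run the tool; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_quantize_command(input_dlc, input_list, output_dlc), check=True)
```

Calling `quantize("model.dlc", "input_list.txt", "model_quantized.dlc")` requires the SDK tools to be on the `PATH`.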

## General AIP Runtime Limitations

- If the input layer of a network is to be processed by the
HTA, the input must be a 4D tensor in NHWC format, where the
batch dimension N must be 1 and the number of channels C
cannot exceed 16. This limitation can be bypassed by manually
partitioning the network so that the input layer is
processed on the HVX instead.
- The AIP runtime supports batched input for models that run
entirely on the HTA, and for models in which all layers run
on the HTA except Softmax, which is partitioned to the HVX.
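
The HTA input-layer rule above can be expressed as a small shape check. This is a minimal sketch with a hypothetical function name; it models only the NHWC, N = 1, C <= 16 rule and does not model the batched-input case described in the last bullet.

```python
def hta_input_shape_ok(shape) -> bool:
    """Return True if an NHWC input shape satisfies the HTA input-layer rule:
    4 dimensions, batch N equal to 1, and channels C no greater than 16."""
    if len(shape) != 4:
        return False
    n, _height, _width, c = shape
    return n == 1 and c <= 16
```

A shape such as `(1, 224, 224, 3)` passes, while `(1, 224, 224, 32)` fails the channel limit and would need the input layer to be partitioned to the HVX.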

## Layer Limitations

- **ArgMax**

    - For DSP runtime, ArgMax only outputs float; its output
cannot be a quantized data type due to accuracy.
- **Color space conversion**

    - For NV21 input image encoding type, width or height must
be a multiple of 2, because each group of 4 Y samples (a
block 2 wide by 2 high) shares one UV pair.
- **Concatenation**

    - For GPU runtime, the number of input channels in each of
the inputs can assume arbitrary values. However, if one
or more of these are not a multiple of 4, performance of
the layer will be diminished.
- **Convolution**

    - For GPU runtime, when the number of groups is greater
than 1, the number of output channels must be a multiple
of 4 \* the number of groups. For example, with 2 groups,
the number of output channels must be a multiple of 8
(4\*2=8).
- **Crop**

    - For GPU runtime, the number of input channels in each of
the inputs must be a multiple of 4.
    - Crop on the DSP is not optimized in all cases. Spatial
cropping is optimized (cropping height and/or width,
leaving other dimensions unchanged).
- **Deconvolution**

    - For GPU and CPU runtime, the number of output channels
(i.e. number of filters) can be any value (not
necessarily a multiple of 4).
    - For GPU runtime the following limitations apply:

        - Number of packed input channels \* number of output
channels &lt;= MaxPerGPUSize
        - Filter size-X \* Filter size-Y &lt;= MaxPerGPUSize
        - Stride &lt;= filter size
    - For DSP runtime, deconvolutions with stride &gt; 4 are not
fully optimized.
- **Depthwise Convolution**

    - Depthwise Convolution on the DSP is not optimized for all
cases. The following case is optimized:

        - Horizontal stride is &lt;= 2.
        - Filter is 3x3.
        - Depth is a multiple of 32.
- **Detection Output**

    - keepTopK must be provided.
    - Output buffer must be of sufficient size and in Float
format.
    - For DSP runtime, batch &gt; 1 and DLC caching are not
supported.
- **Fully connected**

    - For GPU runtime, the following limitations apply:

        - Input width \* input height \* number of input
channels &lt;= MaxPerGPUSize
        - Number of output channels &lt;= MaxPerGPUSize
    - For DSP Runtime, batch &gt; 1 is optimized only when input
height \* width \* channel is a multiple of 16.
- **Input Image Scaling**

    - The DSP runtime image scaling performs well under the
conditions listed below. Other configurations are not
optimized.

        - Scale factor is an upscale by 2x AND
        - Depth is a power of 2 AND either

            - Depth is less than 128 with width equal to a power
of 2 OR
            - Depth is greater than 128.
- **Instance Normalization**

    - For certain models containing InstanceNorm layers, the
default value for the “epsilon” parameter could
overwhelm the standard deviation of the input tensor.
In such cases a numerical discrepancy between the
source framework and Qualcomm® Neural Processing SDK can happen. For such cases it
helps to override the value of epsilon in the source
model to a much smaller value.
- **Pad**

    - For the DSP runtime, the following are not supported:

        - Non-4D padding inputs.
        - Padding along batch.
        - Padding along depth for reflect padding.
- **Power**

    - Power layer is only supported on DSP.
- **Proposal**

    - Proposal layer is not supported on the GPU.
    - Only batch of 1 is supported.
- **ROI Pooling**

    - ROI Pooling is not supported on the GPU.
    - For DSP runtime, the input to the ROI Pooling layer must
be a Proposal layer or an OPAQUE Input layer.
    - Only batch of 1 is supported.
- **Scale**

    - Scale is only supported on the DSP.
    - For DSP runtime, only channel scaling is supported.
- **Slice**

    - Creating a slice layer without slice points defined is
currently not supported.
- **Tile**

    - The Tile layer will currently be displayed as a
“Concatenation” layer when the topology of a network
containing it is viewed using snpe-dlc-info.
- **UDO**

    - **DSP runtime**

        - Qualcomm® Neural Processing SDK DSP requires a quantized model if the UDO has at
least one quantized output.
        - The data types supported in DSP UDO layers are
FLOAT\_32 and UINT\_8 (quantized with TF schema).
    - **GPU runtime**

        - Only 16-bit floating point (OpenCL half) activations
are supported in the network.
        - The only data type supported for activation tensors in
GPU UDO layers is FLOAT\_16.
    - **CPU runtime**

        - CPU runtime always operates with full precision
(FP32) tensors.
        - The only data type supported for activation tensors in
CPU UDO layers is FLOAT\_32.
    - **Package Generation**

        - Multiple UDOs cannot be defined in a single config
file if they are intended to be used with core type =
DSP. In this case users are required to create one config
file per UDO and generate a separate package for each
op. This restriction does not apply to core types CPU
or GPU.
        - A tensor parameter in a UDO definition can be
expressed with only one data type (e.g. either
FLOAT\_32 or FLOAT\_16 but not both).
Users wanting to use their UDOs on multiple runtimes
with different data types may be required to create
separate config files per data type and generate
multiple corresponding packages.
    - **Application**

        - UDO integration is supported only with native C APIs.
Java extensions are not available in this release.
Users who want to integrate UDOs into Android
applications will have to interface with Qualcomm® Neural Processing SDK APIs at
the JNI level in order to take advantage of this
functionality.
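
Several of the layer rules above are simple arithmetic conditions that can be checked before conversion. The sketch below is illustrative: all function names are hypothetical, and the depthwise check assumes that the three listed conditions must hold together for the optimized DSP case.

```python
def conv_groups_ok_gpu(out_channels: int, groups: int) -> bool:
    """GPU convolution: with more than 1 group, the number of output
    channels must be a multiple of 4 * groups."""
    return groups <= 1 or out_channels % (4 * groups) == 0


def fc_ok_gpu(width: int, height: int, in_channels: int,
              out_channels: int, max_per_gpu_size: int) -> bool:
    """GPU fully connected: input volume and output channels are bounded."""
    return (width * height * in_channels <= max_per_gpu_size
            and out_channels <= max_per_gpu_size)


def fc_batch_optimized_dsp(height: int, width: int, channels: int) -> bool:
    """DSP fully connected: batch > 1 is optimized only when
    height * width * channel is a multiple of 16."""
    return (height * width * channels) % 16 == 0


def depthwise_optimized_dsp(stride_x: int, filter_h: int,
                            filter_w: int, depth: int) -> bool:
    """DSP depthwise convolution: optimized case (assumes all three hold)."""
    return stride_x <= 2 and (filter_h, filter_w) == (3, 3) and depth % 32 == 0
```

With 2 groups, `conv_groups_ok_gpu(8, 2)` is true and `conv_groups_ok_gpu(12, 2)` is false, matching the 4\*2=8 example in the Convolution entry.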

## Tool Limitations

- Default input raw datatype is float32.
- **snpe-net-run**

    - Default profiling level is detailed.
- **snpe\_bench.py**

    - Default profiling level is basic.
- **snpe-dlc-info**

    - For deconvolution layers, the num filters value shown is
actually num filters / group.

        - Example: snpe-dlc-info shows num filters as 1 for a
deconvolution layer with num\_output of 11 and group of
11.
- **snpe-tensorflow-to-dlc**

    - The TensorFlow converter does not support conversion of
TensorFlow graphs that have been quantized using
TensorFlow tools. In order to quantize a TensorFlow
model, run the TensorFlow converter
([snpe-tensorflow-to-dlc](https://docs.qualcomm.com/doc/80-63442-10/topic/SNPE_general_tools.html#tools_snpe-tensorflow-to-dlc))
first, then run
[snpe-dlc-quantize](https://docs.qualcomm.com/doc/80-63442-10/topic/SNPE_general_tools.html#tools_snpe-dlc-quantize)
on the DLC file generated by the TensorFlow converter.
    - Convolution

        - The BiasAdd node is optional; when it is missing, a
bias of zeros is added.
    - Concat

        - Concat node must have at least 2 non Const inputs.
    - ElementWise Sum/Mul/Max

        - Must be the only operation within its scope.
        - Does not support scalar operands.
    - Fully Connected

        - Inputs to MatMul operation must be 1D.
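
The snpe-dlc-info behavior for deconvolution layers described above comes down to a division by the group count. The following sketch uses hypothetical helper names to convert between the value the tool displays and the actual filter count.

```python
def displayed_num_filters(num_output: int, group: int) -> int:
    """Value shown by snpe-dlc-info for a deconvolution layer:
    num filters / group."""
    return num_output // group


def actual_num_filters(displayed: int, group: int) -> int:
    """Recover the real filter count from the displayed value."""
    return displayed * group
```

For the example above, a layer with num\_output of 11 and group of 11 is displayed with 1 filter, and multiplying the displayed value by the group count recovers 11.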

Last Published: Jul 02, 2026
