# Performance Tips

## Performance Tips for Using Tensors

### UserBuffer

By default, Qualcomm® Neural Processing SDK creates networks that accept tensors, which means
each SNPE::execute() call makes an additional copy to get data
into and out of Qualcomm® Neural Processing SDK. In addition, depending on the data format
required by the underlying target runtime, Qualcomm® Neural Processing SDK may perform
format conversion such as quantization or float expansion.

An alternative is to create networks that accept user buffers
by enabling the setUseUserSuppliedBuffers() setter before calling
SNPEBuilder::build(). This creates networks that
use UserBuffers for execute(). With a UserBuffer, a
user can specify the format (encoding) of the buffer and its
dimensionality. If the dimensions and strides of the buffer
match the network’s, Qualcomm® Neural Processing SDK can potentially read from and write
to the buffers directly, saving the data copies into and out of
tensors for each execute.
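Whether a buffer's strides match the network's depends on how the buffer is laid out in memory. As a minimal sketch, assuming a densely packed, row-major buffer with strides expressed in bytes, the strides can be derived from the dimensions as shown below. The helper name `packedStrides` is hypothetical and is not part of the SDK; check the SDK's UserBuffer documentation for the layout your network actually expects.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not an SDK function): compute byte strides for a
// densely packed, row-major buffer.
// dims:        buffer dimensions, outermost first (e.g. {1, 224, 224, 3})
// elementSize: size of one element in bytes (e.g. sizeof(float))
std::vector<std::size_t> packedStrides(const std::vector<std::size_t>& dims,
                                       std::size_t elementSize)
{
    std::vector<std::size_t> strides(dims.size());
    std::size_t stride = elementSize;
    // Walk from the innermost dimension outwards, accumulating the step size.
    for (std::size_t i = dims.size(); i > 0; --i)
    {
        strides[i - 1] = stride;
        stride *= dims[i - 1];
    }
    return strides;
}
```

For example, a 1 x 224 x 224 x 3 float buffer would have an innermost stride of 4 bytes (one float) and a stride of 12 bytes between adjacent pixels.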

### Copy Tensors

Qualcomm® Neural Processing SDK supports an STL-compatible tensor class that is used to
send data into the network and return the output. While this
provides a great deal of flexibility and ability to leverage
STL functions to manipulate the data, it does come at a cost.
For tensors that contain relatively little data, exactly how
the user manipulates the data inside a tensor or gets data into
the tensor doesn’t really matter. However, for tensors that
need to contain a large amount of data (e.g. a 1080p input
image or very large outputs), the user should be aware of the
following guideline when moving data into a tensor: std::copy()
is far more efficient for moving data into or out of a tensor
than direct usage of the iterators (by at least an order of
magnitude). So rather than doing something like the
following:

```cpp
// Assume we have access to the following two variables
// std::shared_ptr<zdl::DlSystem::ITensor> tensor;
// std::vector<float>& vec;
vec.resize(tensor->getSize());
size_t idx = 0;
for (auto it = tensor->begin(); it != tensor->end(); it++)
{
    vec[idx++] = *it;
}
```

The user should do this instead:

```cpp
// The destination must be large enough to hold the tensor data.
vec.resize(tensor->getSize());
std::copy(tensor->begin(), tensor->end(), vec.begin());
```

This is true whether getting data from a tensor (as in the
example above) or putting data into a tensor.

In addition, if the tensor data needs to be modified (e.g.
pre-processed before going into the network or post-processed
after), it is better to do the manipulation in a user buffer
than in the tensor directly using the iterators (and then just
use std::copy() to move the modified data in/out of the
tensor).
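The pre-processing advice above can be sketched as follows. This is an illustration under stated assumptions, not SDK code: the function name `preprocessAndCopy` is hypothetical, mean subtraction stands in for whatever pre-processing the application needs, and the destination is any output iterator (with the SDK it would presumably be `tensor->begin()`).

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper (not an SDK function): modify the data in a plain
// user-side buffer, then move it to its destination with one std::copy().
// OutputIt is any output iterator, e.g. the begin iterator of a tensor.
template <typename OutputIt>
OutputIt preprocessAndCopy(std::vector<float>& staging, float mean, OutputIt dest)
{
    // Do the arithmetic on the std::vector, not through tensor iterators.
    std::transform(staging.begin(), staging.end(), staging.begin(),
                   [mean](float v) { return v - mean; });
    // One bulk copy into the destination range.
    return std::copy(staging.begin(), staging.end(), dest);
}
```

The same pattern applies in reverse for post-processing: copy the output tensor into a `std::vector` with `std::copy()` first, then manipulate the vector.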

## Performance Tips for Executing Networks

- **Optimizing TensorFlow Graphs for Inference**

    - This applies only for TensorFlow.
    - TensorFlow provides a tool that can be used to convert a
model into one that is optimized for inference.
    - It is **strongly** recommended to optimize TensorFlow
graphs prior to converting them to a DLC file.
    - For an example of optimizing for inference, see
$SNPE\_ROOT/examples/Models/InceptionV3/scripts/setup\_inceptionv3\_snpe.py.
- **Balancing Performance and Power**

    - Qualcomm® Neural Processing SDK supports five performance profiles: “DEFAULT”,
“BALANCED”, “HIGH\_PERFORMANCE”, “POWER\_SAVER” and
“SYSTEM\_SETTINGS”. (See the
Snpe\_SNPEBuilder\_SetPerformanceProfile() API
description.)
    - The DEFAULT performance profile is less power intensive,
at the expense of performance.
    - The BALANCED performance profile is the same as DEFAULT.
(DEFAULT is going to be deprecated.)
    - The POWER\_SAVER performance profile attempts to provide
more power saving than the BALANCED performance profile,
which may result in lower performance.
    - For optimal performance, set the performance profile
to HIGH\_PERFORMANCE.

        - When HIGH\_PERFORMANCE is selected, Qualcomm® Neural Processing SDK will attempt
to maximize performance at the expense of increased
power consumption.
    - The SYSTEM\_SETTINGS profile causes Qualcomm® Neural Processing SDK to leave all
power and performance settings alone. No calls to any
power or performance related APIs will be invoked by
Qualcomm® Neural Processing SDK.

        - Users of this profile can use other APIs (out of the
scope of Qualcomm® Neural Processing SDK) if they want to control performance or
power.
- **Minimizing Profiling in Production Environments**

    - Qualcomm® Neural Processing SDK supports the
Snpe\_SNPEBuilder\_SetProfilingLevel()
API to configure the level of profiling information.
    - While the overhead of collecting profiling information is
small, it will still add to the inference time.
    - Disabling profiling in production
environments removes this overhead.
- **Running on the GPU**

    - Typically, running a network on the GPU yields a
6X-10X increase in inference speed compared to running
the same network on the CPU, and at lower power
consumption, so the GPU runtime is usually the obvious
choice for network execution unless the GPU is
likely to be heavily utilized by some other application
(e.g. gaming).
    - However, there is a roughly 4-6ms overhead for network
execution on the GPU that does not exist on the CPU, so
very small networks might execute quicker on the CPU. For
example, if a network runs in less than 10ms on the GPU,
it may run faster on the CPU as the GPU overhead might
eliminate any speed advantage to the actual network
execution that the GPU provides.
    - By default, the GPU runtime runs in GPU\_FLOAT32\_16\_HYBRID
mode (Please see C Snpe\_Runtime\_t Enum
description).
The GPU\_FLOAT16 mode may run some networks faster but may
incur accuracy loss as well. (Please see [GPU
Limitations](https://docs.qualcomm.com/doc/80-63442-2/topic/limitations.html#general-gpu-runtime-limitations) section
for more info.)
- **Running on the DSP**

    - The DSP offers an optimized execution environment for
supported layers. However, some layer operations are not
optimal on the DSP and may cause slow execution of the
model on the DSP.
    - The performance of input preprocessing layers is
currently not optimized on the DSP runtime. When using
the DSP runtime, it is recommended to do input
preprocessing (colour space conversion, scaling, cropping and
mean subtraction) before passing the image to Qualcomm® Neural Processing SDK.
    - The DSP runs 8-bit quantized math for most operations.
Some networks may be sensitive to this and may not be
suitable for the DSP runtime.
    - The default DSP runtime availability check performs
platform validation on the DSP to validate DSP runtime
support. The basic runtime availability check performs less
validation than the default check: it only
validates that the SoC platform should have DSP support.
    - Accelerator initialization times are significantly longer on DSP
V68 and above than on previous generation
platforms. The longer initialization times are due to
graph analysis and optimization.
    - For DSP V68 version and above, enabling the init cache
mode is recommended. Subsequent initialization times will
be greatly reduced and execution times will also be
improved, due to data locality.
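The GPU overhead trade-off described under Running on the GPU can be illustrated with a rough back-of-the-envelope model. The sketch below assumes, purely for illustration, that GPU time equals a fixed overhead plus the CPU time divided by a constant speedup factor. Both function names are hypothetical, the model is not how the SDK chooses a runtime, and measuring the actual network on both runtimes remains the only reliable test.

```cpp
// Assumed model (a simplification, not SDK behaviour):
//   gpuTime = fixed GPU overhead + cpuTime / speedup
double estimateGpuTimeMs(double cpuTimeMs, double gpuOverheadMs, double gpuSpeedup)
{
    return gpuOverheadMs + cpuTimeMs / gpuSpeedup;
}

// Under the same model, the CPU wins when its measured time is below the
// estimated GPU time.
bool cpuLikelyFaster(double cpuTimeMs, double gpuOverheadMs, double gpuSpeedup)
{
    return cpuTimeMs < estimateGpuTimeMs(cpuTimeMs, gpuOverheadMs, gpuSpeedup);
}
```

With an assumed 5 ms overhead and 6X speedup, a network that takes 60 ms on the CPU is estimated at 15 ms on the GPU, so the GPU is the better choice. A network that takes 5 ms on the CPU is estimated at roughly 5.8 ms on the GPU, so the CPU is likely faster.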

Last Published: Oct 02, 2025
