# Halide for HVX

Qualcomm Halide for HVX is used to target the HVX architecture in two ways:

- As a compiler for generating code to run on a Hexagon DSP with HVX
- As runtime libraries to support running the code compiled by the Halide
compiler on a Hexagon DSP with HVX

Compile Halide pipelines to generate a binary that is executed at runtime.
Halide for HVX supports two runtime targets:

- Qualcomm® Snapdragon™ devices (Device Offload and Device Standalone
modes)
- Hexagon Simulator (Simulator Offload and Simulator Standalone modes)

## [Halide execution modes](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id3)

You can generate two variants of Halide binaries: Offload mode and Standalone mode.

### [Offload mode](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id4)

The generated Halide binary is launched from a host processor.

On Snapdragon devices, the host processor is the ARM applications processor.
This runtime model is also called the *Device Offload* mode. In this mode,
you do not have to deal with the underlying details of FastRPC
communication between the host processor and the Hexagon compute DSP (cDSP).
The Halide Device Offload mode runtime libraries take care of these details.

On the Hexagon simulator runtime target, the host processor is the x86
processor. This runtime model is called the *Simulator Offload* mode.

### [Standalone mode](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id5)

The generated Halide binary is a standalone HVX object file. Use this file
to integrate Halide pipelines into an existing Hexagon application.

On Snapdragon devices, the generated object file can be launched on the HVX
processor. This runtime mode is called the *Device Standalone* mode. Unlike
the Device Offload mode, in this mode, you handle the details of
communication between the host processor and the cDSP. All stages of the
pipeline execute on the cDSP, and you are responsible for authoring IDL
files for FastRPC from the host processor to the Hexagon DSP.

On the Hexagon simulator runtime target, the generated object file can be
used to simulate an end-to-end vision algorithm running on the Hexagon DSP
with HVX. This mode is called the *Simulator Standalone* mode.

## [Relevant target features for HVX](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id6)

The following target features can be used in the `target` command-line
argument to a generator ([Generators](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#generators)).

### [hvx](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id7)

Use the `hvx` target feature to instruct the Halide compiler to compile
a pipeline for HVX. The natively supported vector length is assumed to be
128 bytes.

Note

`hvx` is required to target the HVX ISA in all execution modes,
even when the host architecture in the target is the Hexagon architecture.

### [hvx_v66](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id8)

Use the `hvx_v66` target feature to instruct the Halide compiler to generate
code for HVX v66 ISA.

### [hvx_v68](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id9)

New in version 2.3.0.

Use the `hvx_v68` target feature to instruct the Halide compiler to generate
code for HVX v68 ISA. The following features are enabled by using this target
feature.

#### User DMA

HVX ISA v68 provides support for user-space Direct Memory Access (DMA).

Halide supports this feature in all four execution modes. To request it, use
the `hexagon_user_dma` scheduling directive together with the `hvx_v68` target
feature.

#### Vector floating point

Halide supports HVX floating-point operations on the Hexagon DSP for both the
`float16_t` half-precision (16-bit) and `float_t` single-precision (32-bit)
data types, as well as conversion between floating-point and integer data types.
Floating-point vector support is available only on HVX ISA v68 or later and
only in 128-byte mode.

To enable this feature, add both the `hvx` and `hvx_v68` Halide target features
to the target when running your generator.
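
As an illustration of how features combine into one target string, here is a minimal sketch. It assumes features are joined with `-`, as in the `arm-64-android-hvx` string used later in this document. The helper name `make_target` and the base string `arm-64-android` in the usage note are hypothetical and are not part of Halide.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: appends each feature to a base target string,
// using '-' as the separator.
std::string make_target(const std::string &base,
                        const std::vector<std::string> &features) {
    assert(!base.empty());
    std::string target = base;
    for (const std::string &feature : features) {
        target += "-" + feature;
    }
    return target;
}
```

With this sketch, `make_target("arm-64-android", {"hvx", "hvx_v68"})` yields a string that carries both features required for vector floating point.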

The following functions support both 16- and 32-bit floating point vectors:

    sin(x)
    cos(x)
    tan(x)
    asin(x)
    acos(x)
    atan(x)
    sinh(x)
    cosh(x)
    tanh(x)
    exp(x)
    log(x)
    pow(x,y)
    sqrt(x)
    fast_inverse_sqrt(x)
    hypot(x,y)
    floor(x)
    ceil(x)
    trunc(x)
    round(x)
    is_inf(x)
    is_nan(x)
    is_finite(x)

The following functions support only 32-bit floating point vectors:

    fast_exp(x)
    fast_log(x)

Using other functions will result in scalar code generation.
Support for more functions will be added in a future release.

### [hexagon_dma](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id10)

Use the `hexagon_dma` target feature in the two Standalone modes to use
Universal Bandwidth Compression Direct Memory Access (UBWCDMA).

### [profile](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id11)

Use the `profile` target feature in Device Offload and Device Standalone
modes to profile the Halide pipeline. For more information, see
[Target feature for profiling](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#targetfeatureforprofiling).

### [debug](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id12)

Use the `debug` target feature to enable debug messages in the Halide
pipeline and at runtime.

### [trace_realizations](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id13)

Use the `trace_realizations` target feature in Device Standalone mode to
profile the complete Halide pipeline as well as each individual `Func` that
is not scheduled inline.

The Halide runtime on Hexagon uses the Instrumented Trace (ITRACE) library to
gather runtime statistics from the Performance Monitoring Unit (PMU) on the
cDSP.
For more information, see [Use Instrumented Trace (ITRACE) for profiling](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#itraceprofiling).

### [trace_pipeline](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id14)

Use the `trace_pipeline` target feature in Device Standalone mode to
profile a complete Halide pipeline only (not individual functions) using
the Instrumented Trace (ITRACE) library.
For more information, see [Use Instrumented Trace (ITRACE) for profiling](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#itraceprofiling).

## [Performance on HVX](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id15)

Halide is unique in its separation of the algorithm from the schedule. Such a
separation allows you to freeze the algorithm and search through the schedules
to improve performance. However, there are some things you can do to fine-tune
performance on the Hexagon DSP. In the following sections, the Halide pipeline
is simply called the pipeline or pipeline code. The C++ code that calls the
pipeline is called the application.

### [Alignment](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id16)

For best performance on the Hexagon DSP with HVX, align vector loads and stores
to the native vector width (128 bytes in 128-byte vector mode). For a vector
load to load from an aligned address, you need both the base address and the
offset from the base address to be multiples of the native vector width.
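
The condition above can be sketched in plain C++. The function names are hypothetical, and the 128-byte constant assumes 128-byte vector mode as described above.

```cpp
#include <cassert>
#include <cstdint>

// Native HVX vector width in bytes (128-byte vector mode).
constexpr std::uint64_t kVectorBytes = 128;

// True when `value` is a whole multiple of `n`.
bool is_multiple_of(std::uint64_t value, std::uint64_t n) {
    assert(n != 0);
    return value % n == 0;
}

// A vector access is aligned only when both the base address and the byte
// offset from that base are multiples of the vector width.
bool is_aligned_access(std::uint64_t base_address, std::uint64_t byte_offset) {
    return is_multiple_of(base_address, kVectorBytes) &&
           is_multiple_of(byte_offset, kVectorBytes);
}
```

For example, a base address of `0x1000` with an offset of 256 bytes is aligned, but the same base with an offset of 64 bytes is not.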

#### Alignment of external buffers

To ensure that a vector load (or store) from an external buffer is aligned, tell
the Halide compiler that the base address of the external buffer is aligned.
To do this, use the `set_host_alignment` directive in the pipeline. Further,
for a two-dimensional buffer, ensure that the stride of the outer dimension is
a multiple of the native vector width. In other words, ensure that every row
of a two-dimensional buffer is aligned to the native vector width by using
`set_stride`.

The following example shows the Halide pipeline code for aligning the host
pointer and strides of the external buffers.

    Input<Buffer<uint8_t>> input{"input", 2};
    Output<Buffer<uint8_t>> output{"output", 2};
    // Schedule
    constexpr int vector_size = 128;

    // Set the stride of dimension 1 (y dimension) to be a multiple of the native
    // vector size.
    Expr input_stride = input.dim(1).stride();
    input.dim(1).set_stride((input_stride/vector_size) * vector_size);
    Expr output_stride = output.dim(1).stride();
    output.dim(1).set_stride((output_stride/vector_size) * vector_size);

    // Set the expected alignment of the host pointer in bytes.
    input.set_host_alignment(vector_size);
    output.set_host_alignment(vector_size);

When the Halide compiler generates code for a pipeline that uses
`set_host_alignment` and `set_stride`, it can determine the alignment of
vector loads and stores more effectively. It also inserts runtime asserts into
the generated code that check the host alignment and strides of the buffers
passed at runtime. For example, if you pass a buffer at runtime that is not
aligned to a 128-byte boundary, the pipeline fails with the error code
`halide_error_code_unaligned_host_ptr`.

Remember that the caller of the pipeline allocates the memory for an external
buffer (see [External Buffers](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#externalbuffers)). Thus, before
calling the pipeline in the C++ application, use the `Halide::Runtime::Buffer`
utility class to allocate a buffer whose host pointer and strides are aligned
to the native vector width (as shown in the following example). When
memory is allocated using the device interface provided by the DSP, the host
pointer is automatically aligned to the native vector width.

The following example shows the application code for aligning the host pointer
and strides of the external buffers.

    #include "HalideRuntimeHexagonHost.h"
    #include "HalideBuffer.h"
    
    // Assume width and height are provided
    // already.
    // Align the stride
    const int VLEN=128;
    int stride_y = (width + (VLEN)-1) & (-(VLEN));
    
    // Define the dimensions
    halide_dimension_t x_dim{0, width, 1};
    halide_dimension_t y_dim{0, height, stride_y};
    halide_dimension_t io_shape[2] = {x_dim, y_dim};
    
    Halide::Runtime::Buffer<uint8_t> in(nullptr, 2, io_shape);
    Halide::Runtime::Buffer<uint8_t> out(nullptr, 2, io_shape);
    // The following call will use the device interface
    // provided by the DSP to allocate device memory which
    // is aligned to the natural vector width.
    in.device_malloc(halide_hexagon_device_interface());
    out.device_malloc(halide_hexagon_device_interface());
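
The stride expression in the example above rounds the row width up to the next multiple of the vector length. The following sketch isolates that arithmetic. The helper name `align_up` is hypothetical, and the bit trick assumes the alignment is a power of two, which holds for 128.

```cpp
#include <cassert>

// Rounds `value` up to the next multiple of `alignment`.
// Assumes `alignment` is a positive power of two.
int align_up(int value, int alignment) {
    assert(alignment > 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & -alignment;
}
```

A row that is 1000 bytes wide gets a stride of 1024 bytes, while a row that is already 1920 bytes wide keeps a stride of 1920 bytes because 1920 is a multiple of 128.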

#### Alignment of internal buffers

To align internal pipeline stages, use the `align_storage`
([align\_storage](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#alignstorage)) scheduling directive.

The following example shows the Halide pipeline code that aligns internal buffers.

    Input<Buffer<uint8_t>> input{"input", 2};

    widened_input(x, y) = cast<int16_t>(input(x, y));
    // Pad the storage extent of the 'x' dimension of the storage allocated
    // for 'widened_input' to be a multiple of 64. This ensures that the
    // strides of dimensions outside 'x' are multiples of the specified alignment.
    // Strides and alignment are viewed in terms of number of elements here.
    widened_input
      .compute_at(Func(output), y)
      .align_storage(x, 64)
      .vectorize(x, vector_size, TailStrategy::RoundUp);

In this example, `widened_input` will have its rows aligned to an integral
multiple of 64 elements, where each element is two bytes long. This will ensure
that the rows are aligned to a 128-byte boundary, the natural vector width for
HVX.
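
Because `align_storage` counts elements rather than bytes, it can help to check the resulting row size in bytes. The following sketch does that arithmetic; the helper name `padded_row_bytes` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Pads a row of `row_elems` elements to a multiple of `align_elems` elements
// and returns the padded row size in bytes.
std::size_t padded_row_bytes(std::size_t row_elems, std::size_t align_elems,
                             std::size_t elem_bytes) {
    assert(align_elems > 0);
    const std::size_t padded_elems =
        (row_elems + align_elems - 1) / align_elems * align_elems;
    return padded_elems * elem_bytes;
}
```

A row of 100 `int16_t` elements padded to a multiple of 64 elements occupies 128 elements, or 256 bytes, which is a multiple of the 128-byte vector width.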

#### Align when splitting dimensions

Use `TailStrategy::RoundUp` for a directive that splits a dimension regardless
of whether the split is explicit ([split](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#split)) or implicit ([vectorize](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#vectorize)).

The following example shows the Halide pipeline code that ensures alignment when
splitting a dimension.

    Input<Buffer<uint8_t>> input{"input", 2};

    widened_input(x, y) = cast<int16_t>(input(x, y));
    // Schedule:
    // vectorize splits the 'x' dimension into an inner dimension of size
    // vector_size and an outer dimension. If the extent of the 'x' dimension
    // is not a perfect multiple of 'vector_size', TailStrategy::RoundUp rounds
    // the extent up to the next vector boundary. If, however, this is used on
    // a stage that reads from or writes to an external buffer, it constrains
    // the size of the external buffer to be a multiple of the split factor
    // (vector_size in this case).
    widened_input
      .compute_at(Func(output), y)
      .align_storage(x, 64)
      .vectorize(x, vector_size, TailStrategy::RoundUp);

### [Memory locality](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id17)

Locality affects the latency of memory. Achieving good memory locality can
significantly improve the performance of a program.

#### Tiling

Use tiling to improve locality in Halide, as shown in the following example.

    Input<Buffer<uint8_t>> input{"input", 2};
    Func f{"f"}, g{"g"};
    Var x{"x"}, y{"y"};
    Var xi{"xi"}, yi{"yi"};

    f(x, y) = cast<uint16_t>(input(x-1, y)) - cast<uint16_t>(input(x+1, y));
    g(x, y) = f(x, y-1) + 2*f(x, y) + f(x, y+1);
    // Schedule:
    g.tile(x, y, xi, yi, 256, 32, TailStrategy::RoundUp);

This pipeline code has a producer-consumer relationship between `f` and `g`.
We have scheduled the computation of `g` in tiles of size 256 x 32 elements.
Further, `f` is now computed as required by a tile of `g`.
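
To make the loop structure concrete, here is a plain C++ sketch of the same stencil computed tile by tile, with the producer `f` evaluated only for the rows each tile of `g` needs. This is an illustration and not the code Halide emits. The input is assumed to be padded by one pixel on every side, partial tiles are clamped rather than rounded up, and all function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using Row = std::vector<std::uint16_t>;
using Image = std::vector<Row>;

// Hypothetical test input, padded by one pixel on every side so that pixel
// (x, y) is stored at in[y + 1][x + 1].
Image make_padded_input(int width, int height) {
    Image in(height + 2, Row(width + 2));
    for (int y = 0; y < height + 2; y++)
        for (int x = 0; x < width + 2; x++)
            in[y][x] = static_cast<std::uint16_t>((x * 7 + y * 13) % 251);
    return in;
}

// f(x, y) = input(x-1, y) - input(x+1, y), with uint16_t wraparound.
std::uint16_t f_at(const Image &in, int x, int y) {
    return static_cast<std::uint16_t>(in[y + 1][x] - in[y + 1][x + 2]);
}

// Untiled reference: g(x, y) = f(x, y-1) + 2*f(x, y) + f(x, y+1).
Image g_reference(const Image &in, int width, int height) {
    Image g(height, Row(width));
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            g[y][x] = static_cast<std::uint16_t>(
                f_at(in, x, y - 1) + 2 * f_at(in, x, y) + f_at(in, x, y + 1));
    return g;
}

// Tiled version: for each tile of g, first produce the rows of f that the
// tile consumes, then compute the tile of g from that small local buffer.
Image g_tiled(const Image &in, int width, int height, int tile_w, int tile_h) {
    assert(tile_w > 0 && tile_h > 0);
    Image g(height, Row(width));
    for (int ty = 0; ty < height; ty += tile_h) {
        for (int tx = 0; tx < width; tx += tile_w) {
            const int th = std::min(tile_h, height - ty);  // clamp partial tiles
            const int tw = std::min(tile_w, width - tx);
            // Producer: rows ty-1 .. ty+th of f, restricted to this tile.
            Image f(th + 2, Row(tw));
            for (int fy = 0; fy < th + 2; fy++)
                for (int fx = 0; fx < tw; fx++)
                    f[fy][fx] = f_at(in, tx + fx, ty + fy - 1);
            // Consumer: the tile of g reads only the local f buffer.
            for (int yi = 0; yi < th; yi++)
                for (int xi = 0; xi < tw; xi++)
                    g[ty + yi][tx + xi] = static_cast<std::uint16_t>(
                        f[yi][xi] + 2 * f[yi + 1][xi] + f[yi + 2][xi]);
        }
    }
    return g;
}
```

The tiled and untiled versions produce identical results; tiling changes only the order of computation and the size of the working set that must stay close to the processor.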

#### Line buffering

Two issues arise when tiling on HVX:

- Reasonably sized tiles can be smaller than two vectors in width because of the
large vector lengths supported by HVX.
- When tiling stencils, it is difficult for the producer functions to satisfy
native vector requirements to avoid scalarization.

The solution is to use line buffering to produce lines of the producer as
required by lines of the consumer.

The following example shows the Halide pipeline that uses line buffering.

    Input<Buffer<uint8_t>> input{"input", 2};
    Func f{"f"}, g{"g"};
    Var x{"x"}, y{"y"};
    Var xi{"xi"}, yi{"yi"};

    f(x, y) = cast<uint16_t>(input(x-1, y)) - cast<uint16_t>(input(x+1, y));
    g(x, y) = f(x, y-1) + 2*f(x, y) + f(x, y+1);
    // Schedule:
    f.store_root().compute_at(g, y);

As with tiling, this pipeline is a producer-consumer relationship between `f`
and `g`. For every line (row) of `g`, three lines of `f` are required. By
using `compute_at` ([compute\_at](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#computeat)), the pipeline is producing the number
of lines of `f` required per line of `g`. By allocating storage for `f`
at a higher level (loop in the loop nest) using `store_root`
([store\_root](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#storeroot)), these lines are buffered. This will ensure that in every
subsequent iteration of the loop traversing the `y` dimension of `g`, two
lines of `f` computed in the previous iteration are reused.

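The code that Halide generates corresponds to the following pseudo C/C++ code.

    int height = g.y.extent;
    int width = g.x.extent;
    int g[height][width];
    // Pseudocode: f is indexed from row -1 through row height.
    int f[height+2][width];

    // For loop over the y dimension of g
    for (int y = 0; y < height; y++) {
       // The first iteration also computes the two lines of f that later
       // iterations reuse from the previous iteration.
       if (y == 0) {
         for (int x = 0; x < width; x++) {
           f[y-1][x] = cast<uint16_t>(input[y-1][x-1]) - cast<uint16_t>(input[y-1][x+1]);
           f[y][x] = cast<uint16_t>(input[y][x-1]) - cast<uint16_t>(input[y][x+1]);
         }
       }
       // Every iteration computes one new line of f.
       for (int x = 0; x < width; x++) {
         f[y+1][x] = cast<uint16_t>(input[y+1][x-1]) - cast<uint16_t>(input[y+1][x+1]);
       }

       // For loop over the x dimension of g
       for (int x = 0; x < width; x++) {
         g[y][x] = f[y-1][x] + 2*f[y][x] + f[y+1][x];
       }
    }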
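
A runnable plain C++ sketch of the same idea follows. It keeps only three lines of `f` in a circular buffer, which is an illustrative choice and not necessarily the storage layout Halide selects. The input is assumed to be padded by one pixel on every side, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Row = std::vector<std::uint16_t>;
using Image = std::vector<Row>;

// Hypothetical test input, padded by one pixel on every side so that pixel
// (x, y) is stored at in[y + 1][x + 1].
Image make_padded_input(int width, int height) {
    Image in(height + 2, Row(width + 2));
    for (int y = 0; y < height + 2; y++)
        for (int x = 0; x < width + 2; x++)
            in[y][x] = static_cast<std::uint16_t>((x * 7 + y * 13) % 251);
    return in;
}

// f(x, y) = input(x-1, y) - input(x+1, y), with uint16_t wraparound.
std::uint16_t f_at(const Image &in, int x, int y) {
    return static_cast<std::uint16_t>(in[y + 1][x] - in[y + 1][x + 2]);
}

// Reference that recomputes f for every use.
Image g_reference(const Image &in, int width, int height) {
    Image g(height, Row(width));
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            g[y][x] = static_cast<std::uint16_t>(
                f_at(in, x, y - 1) + 2 * f_at(in, x, y) + f_at(in, x, y + 1));
    return g;
}

// Line-buffered version: line r of f lives in slot (r + 1) % 3.
Image g_line_buffered(const Image &in, int width, int height) {
    assert(width > 0 && height > 0);
    Image g(height, Row(width));
    Image f(3, Row(width));
    auto produce_line = [&](int fy) {
        Row &dst = f[(fy + 1) % 3];
        for (int x = 0; x < width; x++) dst[x] = f_at(in, x, fy);
    };
    for (int y = 0; y < height; y++) {
        if (y == 0) {
            // Warm-up: the first line of g needs lines -1 and 0 of f.
            produce_line(-1);
            produce_line(0);
        }
        // One new line of f per line of g; the other two are reused.
        produce_line(y + 1);
        for (int x = 0; x < width; x++)
            g[y][x] = static_cast<std::uint16_t>(
                f[y % 3][x] + 2 * f[(y + 1) % 3][x] + f[(y + 2) % 3][x]);
    }
    return g;
}
```

After the first iteration, each line of `g` costs one new line of `f` instead of three, and the result matches the version that recomputes `f` for every use.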

### [Zero-copy buffers](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id18)

Zero-copy buffers are allocated in memory that is visible to the host processor
and the Hexagon DSP. When working with zero-copy buffers, you do not pay the
penalty of copying data back and forth between the host CPU and the Hexagon DSP.

When the `halide_device_interface` provided by the Hexagon DSP is used to
allocate memory for a large external buffer, zero-copy memory is allocated by
default through the underlying ION memory manager. The C++ convenience class
`Halide::Runtime::Buffer` can be used for this purpose.

The following example shows the application code that allocates zero-copy
external buffers using `Halide::Runtime::Buffer`.

    #include "HalideRuntimeHexagonHost.h"
    #include "HalideBuffer.h"
    
    // Assume width and height are provided
    // already.
    // Align the stride
    const int VLEN=128;
    int stride_y = (width + (VLEN)-1) & (-(VLEN));
    
    // Define the dimensions
    halide_dimension_t x_dim{0, width, 1};
    halide_dimension_t y_dim{0, height, stride_y};
    halide_dimension_t io_shape[2] = {x_dim, y_dim};
    
    // The first argument is the pointer to data in main memory (host pointer)
    // If this is nullptr, then the halide_hexagon_device_interface
    // will use its device_malloc implementation to allocate an ion
    // memory buffer and also use it as the host pointer.
    Halide::Runtime::Buffer<uint8_t> in(nullptr, 2, io_shape);
    Halide::Runtime::Buffer<uint8_t> out(nullptr, 2, io_shape);
    // The following call will use the device interface
    // provided by the DSP to allocate device memory which
    // is aligned to the natural vector width.
    in.device_malloc(halide_hexagon_device_interface());
    out.device_malloc(halide_hexagon_device_interface());

#### rpcmem\_alloc and rpcmem\_free

If you are adding Halide to an existing application that already uses the
rpcmem library for allocating zero-copy buffers, use
`halide_hexagon_wrap_device_handle()` to attach the memory allocated with
`rpcmem_alloc` to a Halide buffer.

The following example illustrates using rpcmem with external buffers in Halide.

    #include "rpcmem.h"
    #include "HalideRuntimeHexagonHost.h"

    halide_buffer_t input_buf;
    halide_buffer_t output_buf;
    rpcmem_init(0);

    // Allocate buffers
    // Over-allocate by one vector
    const int bufsize = stride * height + VLEN;
    input_buf.host     = (uint8_t*)rpcmem_alloc(25, RPCMEM_DEFAULT_FLAGS, bufsize);
    output_buf.host    = (uint8_t*)rpcmem_alloc(25, RPCMEM_DEFAULT_FLAGS, bufsize);
    if (input_buf.host == NULL || output_buf.host == NULL) {
        printf("Error: Cannot allocate memory\n");
        return 1;
    }
    halide_hexagon_wrap_device_handle(nullptr, &input_buf, input_buf.host, bufsize);
    halide_hexagon_wrap_device_handle(nullptr, &output_buf, output_buf.host, bufsize);

    // Free buffers
    rpcmem_free(input_buf.host);
    rpcmem_free(output_buf.host);
    rpcmem_deinit();

Note

Remember to over-allocate by one vector. This is important because certain
optimizations in the Halide compiler assume they can read or write external
buffers one vector past the end of the buffer.

### [VTCM and scatter-gather operations](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id19)

Vector Tightly Coupled Memory (VTCM) is fast vector memory available on Hexagon
ISA V65 and later. In Halide 2.4.x, HVX v65 is the lowest supported
HVX ISA. As such, this feature is available simply by the use of the `hvx`
target feature.

The following sections describe the two main use cases.

#### Internal (intermediate) buffers

Because VTCM is memory with very low latency, performance can be improved by
storing small internal buffers in it. This must be done very carefully as VTCM
memory is not large and is shared. Therefore, it is possible that some
allocations might not fit in VTCM. In such cases, the pipeline will exit with
the error code, `halide_error_code_vtcm_out_of_memory`. Typically, such issues are
fixed by reducing tile sizes.

Following is an example of using VTCM for intermediate buffers.

    Input<Buffer<uint8_t>> input{"input", 2};
    Output<Buffer<uint8_t>> output{"output", 2};
    Func f{"f"};

    f(x, y) = 2*input(x, y);
    output(x, y) = f(x, y) + 2;

    f.compute_at(output, y)
     .store_in(MemoryType::VTCM)
     .vectorize(x, 128);

#### Scatter-gather operations

Scatter-gather operations in HVX work exclusively on data in the VTCM.

Note

Scatter instructions on the Hexagon DSP only operate on 16-bit and 32-bit
datatypes. To use these operations on 8-bit datatypes, add a stage
(`Func`) to cast the appropriate buffers to 16-bit datatypes before doing
the scatter. This can be followed by a cast back down
to an 8-bit datatype. For an example, see [Scatter operations on 8-bit data](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#eightbitscatter).

Gathers are simply table lookups in Halide. The following example shows the
Halide pipeline code doing a gather.

    Input<Buffer<int16_t>> input{"input", 1};
    Input<Buffer<int16_t>> lut{"lut", 1};
    Output<Buffer<int16_t>> output{"output", 1};
    Func lut_vtcm{"lut_vtcm"}, output_vtcm{"output_vtcm"};
    Var x{"x"};

    lut_vtcm(x) = lut(x);
    // Clamp each index to the valid range of the look up table.
    output_vtcm(x) = lut_vtcm(clamp(input(x), 0, lut.dim(0).extent()-1));
    output(x) = output_vtcm(x);

    lut_vtcm.compute_at(output, Var::outermost())
            .store_in(MemoryType::VTCM)
            .vectorize(x, 128);
    output_vtcm.compute_at(output, Var::outermost())
               .store_in(MemoryType::VTCM)
               .vectorize(x, 128);

This example is a typical lookup operation in Halide, where:

- `lut` is the look up table (LUT)
- `input` represents the indices to be looked up in the table
- `output` is the output buffer

For the Halide compiler to generate an HVX gather instruction, both the output
and the LUT (`lut`) are placed in VTCM by defining `output_vtcm` and
`lut_vtcm`, respectively. The `store_in` directive is then used to ensure that
these buffers are stored in VTCM.
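
For reference, the scalar semantics of a gather can be sketched in plain C++ as a clamped table lookup. This is only an illustration of what a gather computes, not of the HVX instruction sequence, and the function name `gather` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference for a gather: out[i] = lut[clamp(indices[i])].
// Each index is clamped into the valid range of the table.
std::vector<std::int16_t> gather(const std::vector<std::int16_t> &lut,
                                 const std::vector<std::int16_t> &indices) {
    assert(!lut.empty());
    const int last = static_cast<int>(lut.size()) - 1;
    std::vector<std::int16_t> out(indices.size());
    for (std::size_t i = 0; i < indices.size(); i++) {
        const int idx = std::min(std::max(static_cast<int>(indices[i]), 0), last);
        out[i] = lut[idx];
    }
    return out;
}
```

With the table `{10, 20, 30}` and the indices `{2, 0, 5, -1}`, the out-of-range indices are clamped to the last and first table entries.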

Scatters, on the other hand, simply invert the relationship between `lut_vtcm`
and `output_vtcm` from the previous example. However, because the output buffer
might not have values defined for all indices, a pure definition stage is
required and then an update definition ([Pure and Update definitions](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#pureandupdatedefinitions)) is used
for the scatter.

The following example shows the Halide pipeline code doing a scatter.

    Input<Buffer<int16_t>> input{"input", 1};
    Output<Buffer<int16_t>> output{"output", 1};
    Func output_vtcm{"output_vtcm"};
    Var x{"x"};

    // pure definition
    output_vtcm(x) = cast<int16_t>(0);
    // update definition
    output_vtcm(clamp(input(x), 0, input.dim(0).extent()-1)) = x*input(x);
    output(x) = output_vtcm(x);

    output_vtcm
        .allow_race_conditions()
        .compute_at(output, Var::outermost())
        .store_in(MemoryType::VTCM)
        .vectorize(x, 128);

    output_vtcm
        .update(0)
        .vectorize(x, 128);

If the values of `input` contain duplicates, then vectorizing the update
definition is essentially unsafe. However, if the programmer can guarantee that
there are no duplicates, such vectorization can be done by allowing the Halide
compiler to ignore race conditions (`allow_race_conditions`).
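
The following plain C++ sketch shows the scalar semantics of a scatter and one way an application could check the no-duplicates precondition before relying on it. Both function names are hypothetical, and the check is an illustration rather than something Halide performs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// True when any index occurs more than once. With duplicates, the final
// value at that index depends on the order in which the writes land.
bool has_duplicates(std::vector<int> indices) {
    std::sort(indices.begin(), indices.end());
    return std::adjacent_find(indices.begin(), indices.end()) != indices.end();
}

// Scalar reference for a scatter: out[indices[i]] = values[i], in order.
// Entries that are never written keep the initial value 0.
std::vector<std::int16_t> scatter(std::size_t size,
                                  const std::vector<int> &indices,
                                  const std::vector<std::int16_t> &values) {
    assert(indices.size() == values.size());
    std::vector<std::int16_t> out(size, 0);
    for (std::size_t i = 0; i < indices.size(); i++) {
        assert(indices[i] >= 0 && static_cast<std::size_t>(indices[i]) < size);
        out[indices[i]] = values[i];
    }
    return out;
}
```

When `has_duplicates` returns false, every output element is written at most once, so the order of the writes cannot change the result.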

By changing a scatter to accumulate values, you can generate scatter-accumulates,
which the HVX ISA supports. This is extremely useful in computing
vectorized histograms. The following example shows the Halide pipeline code for a
scatter-accumulate.

    Input<Buffer<int32_t>> input{"input", 1};
    Func histogram{"histogram"};

    // Pure definition.
    histogram(x) = cast<int16_t>(0);

    // Update definition - a reduction definition
    histogram(input(x)) += 1;

The following example shows how to schedule the histogram to generate
scatter-accumulates.

    histogram
       .allow_race_conditions()
       .compute_at(output, Var::outermost())
       .store_in(MemoryType::VTCM)
       .vectorize(x, 64);
    histogram
       .update(0)
       .vectorize(x, 64);
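
As a scalar reference for what the histogram computes, consider the following plain C++ sketch. Unlike a plain scatter, duplicate indices are expected here, and each one adds to the same bin. The function name and the explicit bin count are assumptions made for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference for a scatter-accumulate: counts[input[i]] += 1.
// Every input value must be a valid bin index.
std::vector<std::int16_t> histogram(const std::vector<std::int32_t> &input,
                                    std::size_t bins) {
    std::vector<std::int16_t> counts(bins, 0);
    for (std::int32_t v : input) {
        assert(v >= 0 && static_cast<std::size_t>(v) < bins);
        counts[v] += 1;
    }
    return counts;
}
```

For the input `{0, 2, 2, 3, 2}` with four bins, the value 2 occurs three times, so bin 2 ends with a count of 3.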

#### Scatter operations on 8-bit data

HVX supports scatter-gather instructions only on 16-bit and 32-bit data. Halide
uses 16-bit predicated vgather instructions to do gathers on 8-bit data.
However, to scatter 8-bit data, you must explicitly cast the data to a
wider type followed by casting to a narrower type after scattering.

Following is an example of a scatter operation on 8-bit data.

    Input<Buffer<uint8_t>> input{"input", 1};
    Output<Buffer<uint8_t>> output{"output", 1};
    Func f{"f"};
    Var x{"x"};

    // Cast to a wider type
    f(x) = cast<int16_t>(0);
    // Update definition
    f(clamp(input(x), 0, input.dim(0).extent()-1)) = 1;
    // Cast back to the narrower type
    output(x) = cast<uint8_t>(f(x));

    f.compute_at(output, Var::outermost())
     .store_in(MemoryType::VTCM)
     .vectorize(x, 128);

### [Power and performance APIs](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id20)

For increased control over the power and performance of your application,
specify a power level before powering on HVX. To do this, use the functions
provided by Halide to request (vote for) a specific performance mode. You can
also explicitly specify individual performance parameters if you do not want to
choose from one of the predefined power levels.

The following example shows the application-side API that specifies the power
level of the application (voting).

    #include "HalideRuntimeHexagonHost.h"
    int halide_hexagon_set_performance_mode(void *user_context, halide_hexagon_power_mode_t mode, bool dcvs_enable=true);
    // user_context - Ignore. Can be NULL
    // mode         - Power level value similar to selecting a voltage corner
    //                as documented in the Hexagon SDK documentation.
    // dcvs_enable  - Turn DCVS participation on or off. The default value is
    //                true. For details about DCVS, see the Hexagon SDK
    //                documentation.

The following table lists the power modes available through the Halide power
APIs and their equivalent voltage corners.

| Mode value | Equivalent voltage corner (see Hexagon SDK documentation) | Alternate name of mode |
| --- | --- | --- |
| `halide_hexagon_power_low` | `HAP_DCVS_VCORNER_SVS` | `halide_hexagon_power_svs` |
| `halide_hexagon_power_nominal` | `HAP_DCVS_VCORNER_NOM` |  |
| `halide_hexagon_power_turbo` | `HAP_DCVS_VCORNER_TURBO` |  |
| `halide_hexagon_power_turbo_plus` | `HAP_DCVS_VCORNER_TURBO_PLUS` |  |
| `halide_hexagon_power_turbo_l2` | `HAP_DCVS_VCORNER_TURBO_L2` |  |
| `halide_hexagon_power_turbo_l3` | `HAP_DCVS_VCORNER_TURBO_L3` |  |
| `halide_hexagon_power_max` | `HAP_DCVS_VCORNER_MAX` |  |
| `halide_hexagon_power_default` | `HAP_DCVS_VCORNER_DISABLE` |  |
| `halide_hexagon_power_low_plus` | `HAP_DCVS_VCORNER_SVSPLUS` | `halide_hexagon_power_svs_plus` |
| `halide_hexagon_power_low_2` | `HAP_DCVS_VCORNER_SVS2` | `halide_hexagon_power_svs2` |
| `halide_hexagon_power_nominal_plus` | `HAP_DCVS_VCORNER_NOMPLUS` |  |

The alternative to using predefined power levels is to set individual
performance parameters instead. However, we recommend using the predefined
power-level approach over this one.

To set individual performance parameters, use
`halide_hexagon_set_performance`, as shown in the following application-side
API example.

    #include "HalideRuntimeHexagonHost.h"
    halide_hexagon_power_t perf;
    perf.set_mips = 1;
    perf.mipsPerThread = 825;
    perf.mipsTotal = 1650;
    perf.set_bus_bw = 1;
    perf.bwMegabytesPerSec = 18750;
    perf.busbwUsagePercentage = 100;
    perf.set_latency = 1;
    perf.latency = 10;
    halide_hexagon_set_performance(NULL, &perf);

An application can consume even less power than it would in the
`halide_hexagon_power_low` mode if you let the device run at lower clock
settings when the application is idle. To do this, set the power level to
`halide_hexagon_power_default` after the application is finished using the
cDSP. This usage is shown in the following application-side API example:

    #include "HalideRuntimeHexagonHost.h"

    // To avoid the cost of powering HVX on in each call of the pipeline,
    // set performance mode to turbo and power HVX on once now.
    halide_hexagon_set_performance_mode(NULL, halide_hexagon_power_turbo);
    halide_hexagon_power_hvx_on(NULL);
    printf("Running pipeline...\n");
    double time = benchmark(iterations, 10, [&]() {
         int result = pipeline(&in, &out);
         if (result != 0) {
             // Report the failure instead of silently ignoring it.
             printf("pipeline failed: %d\n", result);
         }
    });
    // We're done with HVX for now, power it off.
    halide_hexagon_power_hvx_off(NULL);
    // Set performance mode back to the default for lower idle power usage
    halide_hexagon_set_performance_mode(NULL, halide_hexagon_power_default);
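
One way to keep the power-on and power-off calls paired, even when the code between them returns early, is a small scope guard. The following sketch is a generic C++ pattern and not part of the Halide runtime. The class name `ScopedPower` and the demo function are hypothetical. In a real application, the two callbacks would wrap the `halide_hexagon_set_performance_mode`, `halide_hexagon_power_hvx_on`, and `halide_hexagon_power_hvx_off` calls shown above.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical scope guard: runs `on` at construction and `off` at
// destruction, so the two calls stay paired on every exit path.
class ScopedPower {
public:
    ScopedPower(std::function<void()> on, std::function<void()> off)
        : off_(std::move(off)) {
        assert(on && off_);
        on();
    }
    ~ScopedPower() { off_(); }
    ScopedPower(const ScopedPower &) = delete;
    ScopedPower &operator=(const ScopedPower &) = delete;

private:
    std::function<void()> off_;
};

// Demo that records the order of events instead of calling the runtime.
std::string demo_power_sequence() {
    std::string log;
    {
        ScopedPower guard([&] { log += "on;"; }, [&] { log += "off;"; });
        log += "run;";  // the pipeline calls would go here
    }
    return log;
}
```

The demo records the events in the order power on, run, power off, which mirrors the sequence in the application example above.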

### [Profiling](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id21)

Profiling authored code allows you to understand the performance bottlenecks in
your programs. Halide for HVX provides several methods of profiling, depending
on the mode of execution.

#### Target feature for profiling

If the `profile` target feature is used in the target argument when running a
generator, the Halide compiler inserts code into the pipeline so that it can be
profiled at runtime.

This target feature is supported in both Device Offload and Device Standalone
modes. The Halide pipeline source code is not required to change. However, in
Device Standalone mode, you must call `halide_profiler_report` after the
call to your Halide pipeline.

For example, consider the following sample pipeline to profile and the
corresponding application code that calls the pipeline:

    Expr height = input.height();
    bounded_input(x, y) = repeat_edge_x(input)(x, y);
    max_y(x, y) = max(bounded_input(x, clamp(y-1, 0, height-1)),
                      bounded_input(x, clamp(y, 0, height-1)),
                      bounded_input(x, clamp(y+1, 0, height-1)));
    output(x, y) = max(max_y(x-1, y), max_y(x, y), max_y(x+1, y));

    // Schedule
    if (get_target().has_feature(Target::HVX)) {
      Expr ht = output.dim(1).extent();
      bounded_input
         .compute_at(Func(output), y)
         .align_storage(x, 128)
         .vectorize(x, vector_size*2, TailStrategy::RoundUp);
      output
         .hexagon()
         .split(y, yo, y, ht/2)
         .tile(x, y, xi, yi, vector_size*2, 8, TailStrategy::RoundUp)
         .vectorize(xi)
         .unroll(yi);
      output.prefetch(input, y, 2);
      output.parallel(yo);
    }

The following example shows the application code that enables Device Offload
mode profiling.

    // Set up buffers for the pipeline as usual
    // Then, call the pipeline
    dilate3x3(input, output);

    // Now, just before finishing up, call halide_profiler_report
    halide_profiler_report(nullptr);

Now recompile and run the generator. If the target argument to the generator
was `arm-64-android-hvx`, use `arm-64-android-hvx-profile` for
profiling. Rebuild your application and run it as usual. To see the profiling
data, use `adb`.

```text
$> adb logcat | grep halide
07-15 22:35:31.112 13729 13729 I halide  : dilate3x3
07-15 22:35:31.112 13729 13729 I halide  :  total time: 195.286346 ms  samples: 151  runs: 100  time/run: 1.952863 ms
07-15 22:35:31.112 13729 13729 I halide  :  average threads used: 1.192053
07-15 22:35:31.112 13729 13729 I halide  :  heap allocations: 0  peak heap usage: 0 bytes
07-15 22:35:31.112 13729 13729 I halide  :   output:                1.212ms   (62%)   threads: 0.729
07-15 22:35:31.112 13729 13729 I halide  :   bounded_input:         0.739ms   (37%)   threads: 2.000
```

Only the stages that have a schedule (not computed inline) appear in the
profile. In this profile example, `output` and `bounded_input` appear but
`max_y` does not because it is computed inline. In Device Standalone mode, the profile
output can be seen using QXDM or mini-dm.
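
The numbers in the sample report can be cross-checked with simple arithmetic. The sketch below recomputes `time/run` and the per-stage percentages from the values in the log. The function names are illustrative, and the assumption that percentages are truncated rather than rounded is inferred from this one sample report, not from the profiler source.

```cpp
#include <cmath>

// Average time per run, as reported in the "time/run" field.
double time_per_run_ms(double total_ms, int runs) {
    return total_ms / runs;
}

// Share of a stage in the per-run time, as a whole percent.
// Truncation (not rounding) is an assumption inferred from the sample log.
int stage_percent(double stage_ms, double run_ms) {
    return static_cast<int>(std::floor(100.0 * stage_ms / run_ms));
}
```

With the sample values, 195.286346 ms over 100 runs gives about 1.952863 ms per run, and the stage times 1.212 ms and 0.739 ms correspond to 62% and 37% of that per-run time.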

#### Use hexagon-profiler for profiling

The Hexagon Tools are shipped with a profiling utility called
`hexagon-profiler`. This utility provides instruction-level profiler information
such as the number of stalls, types of stalls, and number of packet commits. You
can use this tool when Halide pipelines are run in Simulator Offload and Simulator
Standalone modes.

In Simulator Offload mode, before running the application, you must set the
`HL_HEXAGON_PACKET_ANALYZE` environment variable to the name of the file in
which you want the simulator to write profiling information. In the following
example, `HL_HEXAGON_PACKET_ANALYZE` is set to `profile.json`, and the simulator
is set to use Timing mode by setting `HL_HEXAGON_TIMING`.

```shell
export HL_HEXAGON_PACKET_ANALYZE=profile.json
export HL_HEXAGON_TIMING=1
```

When you run your application, the simulator writes a profile to
`profile.json` and prints the shell command lines to run to process that
file, as shown in the following example.

```text
To generate profile: hexagon-profiler --packet_analyze --json=profile.json
--elf=libhalide_shared_runtimeT1530632564432956148P5181.so:0x68000 -o
profile.html
To generate profile: hexagon-profiler --packet_analyze --json=profile.json
--elf=libhalide_hexagon_codeT1530632562748212127P5181.so:0x5f000 -o
profile.html
```

To view profile information for the pipeline itself, run the second command in
the example. This will provide a visualization for the pipeline profile and not
the runtime library profile. To view the profile, open `profile.html` in a
browser.

In Simulator Standalone mode, use `--timing` and `--packet_analyze` when
running your application on `hexagon-sim` to get profile information that can
be fed to `hexagon-profiler`. To learn more about these options and the
`hexagon-profiler`, see the Hexagon Profiler User Guide that is part of the
Hexagon Document Bundle distributed with the Hexagon Tools.

#### Use Instrumented Trace (ITRACE) for profiling

New in version 2.5.0.

The ITRACE library allows you to monitor events in parallel on multiple domains
of Qualcomm devices. With the ITRACE library, you can register a set of Performance
Monitoring Unit (PMU) events on the cDSP. You can also identify one or more code
sections during which the registered PMU events will be monitored.

To profile Halide pipelines executing in Device Standalone mode, use the ITRACE
library by leveraging tracing in the Halide runtime. By using the
`trace_pipeline` target feature, you can use ITRACE to profile an entire
Halide pipeline on HVX.

You can also profile individual stages that are not scheduled inline. To do so,
enable profiling on individual `Func`s by using the `trace_realizations`
scheduling directive on them.

For example, consider the following sample pipeline to profile and the
application code:

```cpp
Expr height = input.height();
bounded_input(x, y) = repeat_edge(input)(x, y);
max_y(x, y) = max(bounded_input(x, clamp(y-1, 0, height-1)),
                  bounded_input(x, clamp(y, 0, height-1)),
                  bounded_input(x, clamp(y+1, 0, height-1)));
output(x, y) = max(max_y(x-1, y), max_y(x, y), max_y(x+1, y));

// Schedule
if (get_target().has_feature(Target::HVX)) {
  Expr ht = output.dim(1).extent();
  bounded_input
     .compute_at(Func(output), y)
     .align_storage(x, 128)
     .trace_realizations() // add to profile using ITRACE
     .vectorize(x, vector_size*2, TailStrategy::RoundUp);
  output
     .hexagon()
     .split(y, yo, y, ht/2)
     .tile(x, y, xi, yi, vector_size*2, 8, TailStrategy::RoundUp)
     .trace_realizations() // add to profile using ITRACE
     .vectorize(xi)
     .unroll(yi);
  output.prefetch(input, y, 2);
  output.parallel(yo);
}
```

As shown in this example, you can use the `trace_realizations` scheduling
directive to profile an individual `Func`. Alternatively, the
`trace_realizations` or `trace_pipeline` target feature can be added to the
target when a generator is run. For example:

- `hexagon-32-qurt-hvx-trace_realizations`: Profiles each individual
`Func` that has not been scheduled inline.
- `hexagon-32-qurt-hvx-trace_pipeline`: Profiles the entire pipeline as
one entity.

To learn more about ITRACE, see the Hexagon SDK documentation.

## [Halide HVX built-ins](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id22)

New in version 2.5.0.

Halide built-ins for HVX allow for finer control over the HVX instructions
generated by the Halide compiler. Use these built-ins to improve the performance
of your pipeline. They ensure the generation of certain HVX instructions, which
would otherwise be hard for the compiler to generate out of typical Halide
expressions written by a user. For example, it is hard for the compiler to
generate the HVX instruction, `vror`, without using Halide built-ins for HVX.
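
To make the intent of `vror` concrete, here is a scalar sketch of a byte-wise rotate. It assumes that the instruction rotates a vector by whole bytes with wraparound, which matches the `(x + 5) % 128` index pattern used in the usage table later in this document. The function name and the use of `std::vector` are illustrative only.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of a byte-wise rotate right: output byte k takes input byte
// (k + r) mod N. On HVX, N would be the vector length in bytes (e.g. 128).
std::vector<uint8_t> rotate_right_bytes(const std::vector<uint8_t> &v, std::size_t r) {
    const std::size_t n = v.size();
    std::vector<uint8_t> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        out[k] = v[(k + r) % n];
    }
    return out;
}
```

Expressing this wraparound indexing in ordinary Halide code requires a modulo on the index, which is the kind of pattern a compiler may not reliably map to a single rotate instruction.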

Similarly, certain multiply-accumulate instructions can be reliably generated by
using the built-ins. For example, the following expression, when vectorized with
an appropriate factor, should result in the generation of `vmpa`:

```cpp
// Assume input is of type UInt(8)
output(x, y) = i16(input(x, y)) * 2 + i16(input(x + 1, y)) * 3;
```

However, a more reliable way of generating the same `vmpa` instruction is by
using Halide built-ins for HVX:

```cpp
hvx_builtin(Int(16), "add_2mpy.vub.vub.b.b", {input(x, y), input(x + 1, y), 2, 3});
```

The `hvx_builtin` function is provided for embedding specific HVX instructions
into the Halide pipeline. This gives you fine-grained control over code generation
by the compiler, while also providing all the benefits of scheduling directives.
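
As a reference for what the two-term multiply-add computes per element, the following scalar sketch models the operation on a single lane. It is an illustration under the assumption that both 8-bit inputs are widened to 16 bits before multiplying; the function names are hypothetical, and the second function exists only to show what is lost when one product stays at 8 bits.

```cpp
#include <cstdint>

// Scalar model of a two-term multiply-add per lane: both 8-bit inputs are
// widened to 16 bits before multiplying, so the products cannot wrap at 8 bits.
int16_t mpa_u8(uint8_t a, uint8_t b, int8_t c0, int8_t c1) {
    return static_cast<int16_t>(static_cast<int16_t>(a) * c0 +
                                static_cast<int16_t>(b) * c1);
}

// The same computation with the second product kept at 8 bits, to show
// what is lost when an operand is not widened.
int16_t mpa_u8_narrow(uint8_t a, uint8_t b, int8_t c0, int8_t c1) {
    const uint8_t narrow = static_cast<uint8_t>(b * c1);  // wraps modulo 256
    return static_cast<int16_t>(static_cast<int16_t>(a) * c0 + narrow);
}
```

For inputs 100 and 200 with coefficients 2 and 3, the widened form gives 800, while the narrow form gives 288 because 600 wraps to 88 in 8 bits.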

For a complete example, see the `conv3x3` example in the Halide SDK
(`Halide/Examples/offload/apps/hexagon_benchmarks/conv3x3_generator.cpp`).
Run the generator by setting the `GeneratorParam` `use_builtins` to `true`.

Note

`hvx_builtin` is supported only for the `hvx` target feature.

### [Supported Halide HVX built-ins](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id23)

This section includes a table that lists all the supported Halide HVX built-ins and
corresponding HVX instructions generated by the Halide compiler.

Some HVX instructions produce interleaved outputs while some instructions expect
deinterleaved inputs. When using these built-ins, you do not need to be concerned about
interleaving outputs or deinterleaving inputs. For example, a widening multiply
followed by a store to memory results in the interleaving of the result of the
multiply before the store. Thus, the following expression:

```cpp
hvx_builtin(Int(16), "mpy.vb.vb", {i8_1, i8_2})
```

is lowered to this Halide Intermediate Representation (IR):

```text
(int16x128)halide.hexagon.interleave.vh((int16x128)halide.hexagon.mpy.vb.vb(i8_1, i8_2))
```

and subsequently generates the following assembly:

```text
v7:6.h = vmpy(v6.b,v7.b)
v3:2 = vshuff(v7,v6,r6)
```
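
The reason for the extra shuffle can be illustrated with a scalar model. The sketch below assumes that the widening multiply places the products of even-indexed elements in one half of the result pair and the products of odd-indexed elements in the other, so that an interleave is needed to restore natural order. The type aliases and function names are illustrative, and both inputs are assumed to have the same even length.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Vec8 = std::vector<int8_t>;
using Vec16 = std::vector<int16_t>;

// Models a widening multiply whose result pair holds the products of the
// even-indexed elements in one half and the odd-indexed ones in the other.
// Assumes a and b have the same, even, length.
std::pair<Vec16, Vec16> widening_mul_deinterleaved(const Vec8 &a, const Vec8 &b) {
    Vec16 even, odd;
    for (std::size_t i = 0; i + 1 < a.size(); i += 2) {
        even.push_back(static_cast<int16_t>(a[i] * b[i]));
        odd.push_back(static_cast<int16_t>(a[i + 1] * b[i + 1]));
    }
    return {even, odd};
}

// Models the interleave (shuffle) step that restores natural element order.
Vec16 interleave(const Vec16 &even, const Vec16 &odd) {
    Vec16 out;
    for (std::size_t i = 0; i < even.size(); ++i) {
        out.push_back(even[i]);
        out.push_back(odd[i]);
    }
    return out;
}
```

Multiplying `{1, 2, 3, 4}` by `{5, 6, 7, 8}` in this model yields the halves `{5, 21}` and `{12, 32}`; interleaving them gives `{5, 12, 21, 32}`, the element-wise products in their original order.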

Any stage using `hvx_builtin` must be vectorized.

In addition to the built-ins listed in the following table, there are two more:
`make_interleave` and `make_concat`, which are used to interleave and
concatenate vectors respectively. They can accept any number of arguments of
the same type and must have return types that are the same as the argument
types. These built-ins are meant to be used only inside arguments for other
`hvx_builtin` calls.
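
The difference between the two can be sketched in scalar code. This is an illustration only: it assumes that `make_interleave` performs the usual element-wise interleave and that `make_concat` joins its arguments end to end, as their names suggest. The function names below are hypothetical and all input vectors are assumed to have the same length.

```cpp
#include <cstddef>
#include <vector>

// Interleaves equally sized vectors element by element:
// out[i * k + j] = vecs[j][i], where k is the number of vectors.
template <typename T>
std::vector<T> interleave_all(const std::vector<std::vector<T>> &vecs) {
    std::vector<T> out;
    if (vecs.empty()) return out;
    for (std::size_t i = 0; i < vecs[0].size(); ++i) {
        for (const auto &v : vecs) out.push_back(v[i]);
    }
    return out;
}

// Concatenates the vectors end to end, in argument order.
template <typename T>
std::vector<T> concat_all(const std::vector<std::vector<T>> &vecs) {
    std::vector<T> out;
    for (const auto &v : vecs) out.insert(out.end(), v.begin(), v.end());
    return out;
}
```

Under this model, interleaving four stride-4 reads `{0, 4}`, `{1, 5}`, `{2, 6}`, `{3, 7}` reconstructs the contiguous sequence 0 through 7, whereas concatenating them gives `{0, 4, 1, 5, 2, 6, 3, 7}`.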

Types ending in `v` indicate that a vector is required.
For example: `i32` is a scalar type, while `i32v` is a vector of `i32`
elements.

Per the following table, to multiply a 16-bit signed value and a 16-bit unsigned
value and vectorize them, use one of the following approaches:

```cpp
// i(x) is of type i16 and j(x) is of type u16.
// Widen both operands so that the product is computed in 32 bits.
f(x) = i32(i(x)) * i32(j(x));
f.vectorize(x, 64);
```

or

```cpp
f(x) = hvx_builtin(Int(32), "mpy.vh.vuh", {i(x), j(x)});
f.vectorize(x, 64);
```

Both approaches should generate a `vmpy(vx.h,vy.uh)` HVX instruction.
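
As a scalar reference for the result expected from each lane, the following sketch models a signed-by-unsigned widening multiply. It assumes that the signed operand is sign-extended and the unsigned operand is zero-extended to 32 bits before multiplying; the function name is illustrative.

```cpp
#include <cstdint>

// Scalar model of a widening multiply of a signed 16-bit value by an
// unsigned 16-bit value: the signed operand is sign-extended and the
// unsigned operand is zero-extended, so the 32-bit product is exact.
int32_t mpy_h_uh(int16_t a, uint16_t b) {
    return static_cast<int32_t>(a) * static_cast<int32_t>(b);
}
```

For example, multiplying -2 by 65535 gives -131070. If 65535 were instead reinterpreted as the signed 16-bit value -1, the product would be 2, which is why the signedness of each operand matters.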

| HVX built-in name | HVX instruction | Return type | Argument types |
| --- | --- | --- | --- |
| `zxt.vub` | `vzb` | u16v | u8v |
| `zxt.vuh` | `vzh` | u32v | u16v |
| `sxt.vb` | `vsb` | i16v | i8v |
| `sxt.vh` | `vsh` | i32v | i16v |
| `unpack.vub` | `vunpackub` | u16v | u8v |
| `unpack.vuh` | `vunpackuh` | u32v | u16v |
| `unpack.vb` | `vunpackb` | i16v | i8v |
| `unpack.vh` | `vunpackh` | i32v | i16v |
| `trunc.vh` | `vshuffeb` | i8v | i16v |
| `trunc.vw` | `vshufeh` | i16v | i32v |
| `trunclo.vh` | `vshuffob` | i8v | i16v |
| `trunclo.vw` | `vshufoh` | i16v | i32v |
| `trunc_satub.vh` | `vsathub` | u8v | i16v |
| `trunc_sath.vw` | `vsatwh` | i16v | i32v |
| `trunc_satuh.vuw` | `vsatuwuh` | u16v | u32v |
| `trunc_satub_rnd.vh` | `vroundhub` | u8v | i16v |
| `trunc_satb_rnd.vh` | `vroundhb` | i8v | i16v |
| `trunc_satub_rnd.vuh` | `vrounduhub` | u8v | u16v |
| `trunc_satuh_rnd.vw` | `vroundwuh` | u16v | i32v |
| `trunc_sath_rnd.vw` | `vroundwh` | i16v | i32v |
| `trunc_satuh_rnd.vuw` | `vrounduwuh` | u16v | u32v |
| `pack_satub.vh` | `vpackhub_sat` | u8v | i16v |
| `pack_satuh.vw` | `vpackwuh_sat` | u16v | i32v |
| `pack_satb.vh` | `vpackhb_sat` | i8v | i16v |
| `pack_sath.vw` | `vpackwh_sat` | i16v | i32v |
| `pack.vh` | `vpackeb` | i8v | i16v |
| `pack.vw` | `vpackeh` | i16v | i32v |
| `packhi.vh` | `vpackob` | i8v | i16v |
| `packhi.vw` | `vpackoh` | i16v | i32v |
| `add_vuh.vub.vub` | `vaddubh` | u16v | u8v, u8v |
| `add_vw.vh.vh` | `vaddhw` | i32v | i16v, i16v |
| `add_vuw.vuh.vuh` | `vadduhw` | u32v | u16v, u16v |
| `sub_vh.vub.vub` | `vsububh` | i16v | u8v, u8v |
| `sub_vw.vh.vh` | `vsubhw` | i32v | i16v, i16v |
| `sub_vw.vuh.vuh` | `vsubuhw` | i32v | u16v, u16v |
| `sat_add.vub.vub` | `vaddubsat` | u8v | u8v, u8v |
| `sat_add.vuh.vuh` | `vadduhsat` | u16v | u16v, u16v |
| `sat_add.vuw.vuw` | `vadduwsat` | u32v | u32v, u32v |
| `sat_add.vh.vh` | `vaddhsat` | i16v | i16v, i16v |
| `sat_add.vw.vw` | `vaddwsat` | i32v | i32v, i32v |
| `sat_add.vub.vub.dv` | `vaddubsat_dv` | u8v | u8v, u8v |
| `sat_add.vuh.vuh.dv` | `vadduhsat_dv` | u16v | u16v, u16v |
| `sat_add.vuw.vuw.dv` | `vadduwsat_dv` | u32v | u32v, u32v |
| `sat_add.vh.vh.dv` | `vaddhsat_dv` | i16v | i16v, i16v |
| `sat_add.vw.vw.dv` | `vaddwsat_dv` | i32v | i32v, i32v |
| `sat_sub.vub.vub` | `vsububsat` | i8v | u8v, u8v |
| `sat_sub.vuh.vuh` | `vsubuhsat` | i16v | u16v, u16v |
| `sat_sub.vh.vh` | `vsubhsat` | i16v | i16v, i16v |
| `sat_sub.vw.vw` | `vsubwsat` | i32v | i32v, i32v |
| `sat_sub.vub.vub.dv` | `vsububsat_dv` | i8v | u8v, u8v |
| `sat_sub.vuh.vuh.dv` | `vsubuhsat_dv` | i16v | u16v, u16v |
| `sat_sub.vh.vh.dv` | `vsubhsat_dv` | i16v | i16v, i16v |
| `sat_sub.vw.vw.dv` | `vsubwsat_dv` | i32v | i32v, i32v |
| `abs.vh` | `vabsh` | u16v | i16v |
| `abs.vw` | `vabsw` | u32v | i32v |
| `abs.vb` | `vabsb` | u8v | i8v |
| `absd.vub.vub` | `vabsdiffub` | u8v | u8v, u8v |
| `absd.vuh.vuh` | `vabsdiffuh` | u16v | u16v, u16v |
| `absd.vh.vh` | `vabsdiffh` | u16v | i16v, i16v |
| `absd.vw.vw` | `vabsdiffw` | u32v | i32v, i32v |
| `avg.vub.vub` | `vavgub` | u8v | u8v, u8v |
| `avg.vuh.vuh` | `vavguh` | u16v | u16v, u16v |
| `avg.vuw.vuw` | `vavguw` | u32v | u32v, u32v |
| `avg.vb.vb` | `vavgb` | i8v | i8v, i8v |
| `avg.vh.vh` | `vavgh` | i16v | i16v, i16v |
| `avg.vw.vw` | `vavgw` | i32v | i32v, i32v |
| `avg_rnd.vub.vub` | `vavgubrnd` | u8v | u8v, u8v |
| `avg_rnd.vuh.vuh` | `vavguhrnd` | u16v | u16v, u16v |
| `avg_rnd.vuw.vuw` | `vavguwrnd` | u32v | u32v, u32v |
| `avg_rnd.vb.vb` | `vavgbrnd` | i8v | i8v, i8v |
| `avg_rnd.vh.vh` | `vavghrnd` | i16v | i16v, i16v |
| `avg_rnd.vw.vw` | `vavgwrnd` | i32v | i32v, i32v |
| `navg.vub.vub` | `vnavgub` | i8v | u8v, u8v |
| `navg.vb.vb` | `vnavgb` | i8v | i8v, i8v |
| `navg.vh.vh` | `vnavgh` | i16v | i16v, i16v |
| `navg.vw.vw` | `vnavgw` | i32v | i32v, i32v |
| `mul.vh.vh` | `vmpyih` | i16v | i16v, i16v |
| `mul.vh.b` | `vmpyihb` | i16v | i16v, i8 |
| `mul.vw.h` | `vmpyiwh` | i32v | i32v, i16 |
| `mul.vw.b` | `vmpyiwb` | i32v | i32v, i8 |
| `add_mul.vh.vh.vh` | `vmpyih_acc` | i16v | i16v, i16v, i16v |
| `add_mul.vh.vh.b` | `vmpyihb_acc` | i16v | i16v, i16v, i8 |
| `add_mul.vw.vw.h` | `vmpyiwh_acc` | i32v | i32v, i32v, i16 |
| `add_mul.vw.vw.b` | `vmpyiwb_acc` | i32v | i32v, i32v, i8 |
| `mpy.vub.vub` | `vmpyubv` | u16v | u8v, u8v |
| `mpy.vuh.vuh` | `vmpyuhv` | u32v | u16v, u16v |
| `mpy.vb.vb` | `vmpybv` | i16v | i8v, i8v |
| `mpy.vh.vh` | `vmpyhv` | i32v | i16v, i16v |
| `add_mpy.vuh.vub.vub` | `vmpyubv_acc` | u16v | u16v, u8v, u8v |
| `add_mpy.vuw.vuh.vuh` | `vmpyuhv_acc` | u32v | u32v, u16v, u16v |
| `add_mpy.vh.vb.vb` | `vmpybv_acc` | i16v | i16v, i8v, i8v |
| `add_mpy.vw.vh.vh` | `vmpyhv_acc` | i32v | i32v, i16v, i16v |
| `mpy.vub.vb` | `vmpybusv` | i16v | u8v, i8v |
| `mpy.vh.vuh` | `vmpyhus` | i32v | i16v, u16v |
| `add_mpy.vh.vub.vb` | `vmpybusv_acc` | i16v | i16v, u8v, i8v |
| `add_mpy.vw.vh.vuh` | `vmpyhus_acc` | i32v | i32v, i16v, u16v |
| `mpy.vub.ub` | `vmpyub` | u16v | u8v, u8 |
| `mpy.vuh.uh` | `vmpyuh` | u32v | u16v, u16 |
| `mpy.vh.h` | `vmpyh` | i32v | i16v, i16 |
| `mpy.vub.b` | `vmpybus` | i16v | u8v, i8 |
| `add_mpy.vuh.vub.ub` | `vmpyub_acc` | u16v | u16v, u8v, u8 |
| `add_mpy.vuw.vuh.uh` | `vmpyuh_acc` | u32v | u32v, u16v, u16 |
| `add_mpy.vh.vub.b` | `vmpybus_acc` | i16v | i16v, u8v, i8 |
| `satw_add_mpy.vw.vh.h` | `vmpyhsat_acc` | i32v | i32v, i16v, i16 |
| `add_4mpy.vub.vub` | `vrmpyubv` | u32v | u8v, u8v |
| `add_4mpy.vb.vb` | `vrmpybv` | i32v | i8v, i8v |
| `add_4mpy.vub.vb` | `vrmpybusv` | i32v | u8v, i8v |
| `acc_add_4mpy.vuw.vub.vub` | `vrmpyubv_acc` | u32v | u32v, u8v, u8v |
| `acc_add_4mpy.vw.vb.vb` | `vrmpybv_acc` | i32v | i32v, i8v, i8v |
| `acc_add_4mpy.vw.vub.vb` | `vrmpybusv_acc` | i32v | i32v, u8v, i8v |
| `add_2mpy.vub.b` | `vdmpybus` | i16v | u8v, i32 |
| `add_2mpy.vh.b` | `vdmpyhb` | i32v | i16v, i32 |
| `acc_add_2mpy.vh.vub.b` | `vdmpybus_acc` | i16v | i16v, u8v, i32 |
| `acc_add_2mpy.vw.vh.b` | `vdmpyhb_acc` | i32v | i32v, i16v, i32 |
| `add_2mpy.vh.h` | `vdmpyhsat` | i32v | i16v, i32 |
| `add_2mpy.vh.uh` | `vdmpyhsusat` | i32v | i16v, u32 |
| `add_2mpy.vh.vh` | `vdmpyhvsat` | i32v | i16v, i16v |
| `add_3mpy.vub.b` | `vtmpybus` | i16v | u8v, i32 |
| `add_3mpy.vb.b` | `vtmpyb` | i16v | i8v, i32 |
| `add_3mpy.vh.b` | `vtmpyhb` | i32v | u16v, i32 |
| `acc_add_3mpy.vh.vub.b` | `vtmpybus_acc` | i16v | i16v, u8v, i32 |
| `acc_add_3mpy.vh.vb.b` | `vtmpyb_acc` | i16v | i16v, i8v, i32 |
| `acc_add_3mpy.vw.vh.b` | `vtmpyhb_acc` | i32v | i32v, u16v, i32 |
| `add_4mpy.vub.b` | `vrmpybus` | i32v | u8v, i32 |
| `add_4mpy.vub.ub` | `vrmpyub` | u32v | u8v, u32 |
| `acc_add_4mpy.vw.vub.b` | `vrmpybus_acc` | i32v | i32v, u8v, i32 |
| `acc_add_4mpy.vuw.vub.ub` | `vrmpyub_acc` | u32v | u32v, u8v, u32 |
| `trunc_satw_mpy2_rnd.vh.vh` | `vmpyhvsrs` | i16v | i16v, i16v |
| `trunc_satw_mpy2.vh.h` | `vmpyhss` | i16v | i16v, i16 |
| `trunc_satw_mpy2_rnd.vh.h` | `vmpyhsrs` | i16v | i16v, i16 |
| `max.vub.vub` | `vmaxub` | u8v | u8v, u8v |
| `max.vuh.vuh` | `vmaxuh` | u16v | u16v, u16v |
| `max.vh.vh` | `vmaxh` | i16v | i16v, i16v |
| `max.vw.vw` | `vmaxw` | i32v | i32v, i32v |
| `max.vhf.vhf` | `vmax_hf` | f16v | f16v, f16v |
| `max.vsf.vsf` | `vmax_sf` | f32v | f32v, f32v |
| `min.vub.vub` | `vminub` | u8v | u8v, u8v |
| `min.vuh.vuh` | `vminuh` | u16v | u16v, u16v |
| `min.vh.vh` | `vminh` | i16v | i16v, i16v |
| `min.vw.vw` | `vminw` | i32v | i32v, i32v |
| `min.vhf.vhf` | `vmin_hf` | f16v | f16v, f16v |
| `min.vsf.vsf` | `vmin_sf` | f32v | f32v, f32v |
| `shr.vuh.vh` | `vlsrhv` | u16v | u16v, u16v |
| `shr.vuw.vw` | `vlsrwv` | u32v | u32v, u32v |
| `shr.vh.vh` | `vasrhv` | i16v | i16v, u16v |
| `shr.vw.vw` | `vasrwv` | i32v | i32v, u32v |
| `trunc_satub_shr_rnd.vh` | `vasrhubrndsat` | u8v | i16v, u16 |
| `trunc_satb_shr_rnd.vh` | `vasrhbrndsat` | i8v | i16v, u16 |
| `trunc_satub_shr_rnd.vuh` | `vasruhubrndsat` | u8v | u16v, u16 |
| `trunc_satuh_shr_rnd.vw` | `vasrwuhrndsat` | u16v | i32v, u32 |
| `trunc_sath_shr_rnd.vw` | `vasrwhrndsat` | i16v | i32v, u32 |
| `trunc_satuh_shr_rnd.vuw` | `vasruwuhrndsat` | u16v | u32v, u32 |
| `shl.vuh.vh` | `vaslhv` | u16v | u16v, u16v |
| `shl.vuw.vw` | `vaslwv` | u32v | u32v, u32v |
| `shl.vh.vh` | `vaslhv` | i16v | i16v, u16v |
| `shl.vw.vw` | `vaslwv` | i32v | i32v, u32v |
| `shr.vuh.h` | `vlsrh` | u16v | u16v, u16 |
| `shr.vuw.w` | `vlsrw` | u32v | u32v, u32 |
| `shr.vh.h` | `vasrh` | i16v | i16v, u16 |
| `shr.vw.w` | `vasrw` | i32v | i32v, u32 |
| `shl.vuh.h` | `vaslh` | u16v | u16v, u16 |
| `shl.vuw.w` | `vaslw` | u32v | u32v, u32 |
| `shl.vh.h` | `vaslh` | i16v | i16v, u16 |
| `shl.vw.w` | `vaslw` | i32v | i32v, u32 |
| `add_shr.vh.vh.uh` | `vasrh_acc` | i16v | i16v, i16v, i16 |
| `add_shl.vh.vh.uh` | `vaslh_acc` | i16v | i16v, i16v, i16 |
| `add_shr.vw.vw.uw` | `vasrw_acc` | i32v | i32v, i32v, i32 |
| `add_shl.vw.vw.uw` | `vaslw_acc` | i32v | i32v, i32v, i32 |
| `trunc_shr.vw.uw` | `vasrwh` | i16v | i32v, u32 |
| `trunc_satub_shr.vh.uh` | `vasrhubsat` | u8v | i16v, u16 |
| `trunc_satuh_shr.vw.uw` | `vasrwuhsat` | u16v | i32v, u32 |
| `trunc_sath_shr.vw.uw` | `vasrwhsat` | i16v | i32v, u32 |
| `vror` | `vror` | u8v | u8v, i32 |
| `cls.vh` | `vnormamth` | u16v | u16v |
| `cls.vw` | `vnormamtw` | u32v | u32v |
| `mul.vw.vw` | Custom Halide implementation: `vmpyieoh + vmpyiewuh_acc` | i32v | i32v, i32v |
| `mul.vw.vh` | Custom Halide implementation: `vaslw + vmpyiowh + vmpyiowh` | i32v | i32v, i16v |
| `mul.vw.vuh` | Custom Halide implementation: `vlsrw + vmpyiewuh + vmpyiewuh` | i32v | i32v, u16v |
| `mul.vuw.vuh` | Custom Halide implementation: `vshufeh + vshufoh + vmpyuhv + vmpyuhv + vaslw` | u32v | u32v, u16v |
| `mul.vuw.vuw` | Custom Halide implementation: `vshufeh + vshufeh + vshufoh + vshufoh + vmpyuhv + vmpyuhv + vmpyuhv_acc` | u32v | u32v, u32v |
| `trunc_mpy.vw.vw` | Custom Halide implementation: `vmpyewuh + vmpyowh_sacc + vasrw` | i32v | i32v, i32v |
| `trunc_satdw_mpy2.vw.vw` | Custom Halide implementation: `vmpyewuh + vmpyowh_sacc` | i32v | i32v, i32v |
| `trunc_satdw_mpy2_rnd.vw.vw` | Custom Halide implementation: `vmpyewuh + vmpyowh_rnd_sacc` | i32v | i32v, i32v |
| `shl.vub.b` | Custom Halide implementation: `vzb + vaslh + vaslh + vshuffeb` | i8v | u8v, i8 |
| `shl.vb.b` | Custom Halide implementation: `vzb + vaslh + vaslh + vshuffeb` | i8v | i8v, i8 |
| `shr.vub.b` | Custom Halide implementation: `vlsrh + vlsrh + vshuffeb` | i8v | u8v, i8 |
| `shr.vb.b` | Custom Halide implementation: `vasrh + vasrh + vshuffeb` | i8v | i8v, i8 |
| `shl.vub.vb` | Custom Halide implementation: `vzb + vsb + vaslhv + vaslhv + vshuffeb` | i8v | u8v, i8v |
| `shl.vb.vb` | Custom Halide implementation: `shl.vub.vb` | i8v | i8v, i8v |
| `shr.vub.vb` | Custom Halide implementation: `vzb + vsb + vlsrhv + vlsrhv + vshuffeb` | i8v | u8v, i8v |
| `shr.vb.vb` | Custom Halide implementation: `vsb + vsb + vasrhv + vasrhv + vshuffeb` | i8v | i8v, i8v |
| `add_2mpy.vub.vub.b.b` | Custom Halide implementation: `vmpabus` | i16v | u8v, u8v, i8, i8 |
| `acc_add_2mpy.vh.vub.vub.b.b` | Custom Halide implementation: `vmpabus_acc` | i16v | i16v, u8v, u8v, i8, i8 |
| `add_2mpy.vh.vh.b.b` | Custom Halide implementation: `vmpahb` | i32v | i16v, i16v, i8, i8 |
| `acc_add_2mpy.vw.vh.vh.b.b` | Custom Halide implementation: `vmpahb_acc` | i32v | i32v, i16v, i16v, i8, i8 |
| `trunc_satuh.vw` | Custom Halide implementation: `vasrwuhsat` | i16v | i32v |
| `vtmpy.vub.vub.b.b` | Custom Halide implementation: `vtmpybus` | i16v | u8v, u8v, i8, i8 |
| `vtmpy.vb.vb.b.b` | Custom Halide implementation: `vtmpyb` | i16v | i8v, i8v, i8, i8 |
| `vtmpy.vh.vh.b.b` | Custom Halide implementation: `vtmpyhb` | i32v | i16v, i16v, i8, i8 |
| `vrmpy_odd.vub.vub.w` | Custom Halide implementation: `vrmpybusi with immediate 1` | i32v | u8v, u8v, i32 |
| `acc_vrmpy_odd.vw.vub.vub.w` | Custom Halide implementation: `vrmpybusi_acc with immediate 1` | i32v | i32v, u8v, u8v, i32 |
| `vrmpy_even.vub.vub.w` | Custom Halide implementation: `vrmpybusi with immediate 0` | i32v | u8v, u8v, i32 |
| `acc_vrmpy_even.vw.vub.vub.w` | Custom Halide implementation: `vrmpybusi_acc with immediate 0` | i32v | i32v, u8v, u8v, i32 |
| `add_4mpy.vub.b.stencil` | Custom Halide implementation: `vrmpybusi + vrmpybusi` | i32v | u8v, i32 |
| `add_4mpy.vub.ub.stencil` | Custom Halide implementation: `vrmpyubi + vrmpyubi` | u32v | u8v, u32 |
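
To help read the table, the following scalar sketch models a few of the unsigned 8-bit entries on a single element. The semantics are assumptions based on the built-in names and on the equivalent Halide expressions given in the usage table later in this document. The function names are illustrative and are not part of Halide.

```cpp
#include <algorithm>
#include <cstdint>

// sat_add.vub.vub: add in a wider type, then clamp to the u8 range.
uint8_t sat_add_u8(uint8_t a, uint8_t b) {
    return static_cast<uint8_t>(std::min(static_cast<int>(a) + static_cast<int>(b), 255));
}

// absd.vub.vub: absolute difference of two u8 values.
uint8_t absd_u8(uint8_t a, uint8_t b) {
    return static_cast<uint8_t>(a > b ? a - b : b - a);
}

// avg.vub.vub: average computed in a wider type, truncated.
uint8_t avg_u8(uint8_t a, uint8_t b) {
    return static_cast<uint8_t>((static_cast<int>(a) + static_cast<int>(b)) / 2);
}

// avg_rnd.vub.vub: average with rounding (adds 1 before halving).
uint8_t avg_rnd_u8(uint8_t a, uint8_t b) {
    return static_cast<uint8_t>((static_cast<int>(a) + static_cast<int>(b) + 1) / 2);
}
```

In this model, `sat_add_u8(200, 100)` clamps to 255 instead of wrapping, and the average of 1 and 2 is 1 without rounding and 2 with rounding.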

### [Usage and examples for HVX built-ins](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id24)

Usage:

```text
hvx_builtin(<return type such as Int(8), UInt(8), Int(16), ...>,
            <builtin_name>, <vector of arguments with correct types>)
```

Use the following Halide expressions to interpret the table:

```cpp
Expr i8_1 = in_i8(x), i8_2 = in_i8(x + 16), i8_3 = in_i8(x + 32), i8_4 = in_i8(x + 48);
Expr u8_1 = in_u8(x), u8_2 = in_u8(x + 16), u8_3 = in_u8(x + 32), u8_4 = in_u8(x + 48);
Expr i16_1 = in_i16(x), i16_2 = in_i16(x + 16), i16_3 = in_i16(x + 32);
Expr u16_1 = in_u16(x), u16_2 = in_u16(x + 16), u16_3 = in_u16(x + 32);
Expr i32_1 = in_i32(x), i32_2 = in_i32(x + 16), i32_3 = in_i32(x + 32);
Expr u32_1 = in_u32(x), u32_2 = in_u32(x + 16), u32_3 = in_u32(x + 32);

Expr u8_4x_0 = in_u8(rfac * x + 0);
Expr u8_4x_1 = in_u8(rfac * x + 1);
Expr u8_4x_2 = in_u8(rfac * x + 2);
Expr u8_4x_3 = in_u8(rfac * x + 3);
Expr u8_4x_32 = in_u8(rfac * x + 32 + 0);
Expr u8_4x_33 = in_u8(rfac * x + 32 + 1);
Expr u8_4x_34 = in_u8(rfac * x + 32 + 2);
Expr u8_4x_35 = in_u8(rfac * x + 32 + 3);
Expr i8_4x_0 = in_i8(rfac * x + 0);
Expr i8_4x_1 = in_i8(rfac * x + 1);
Expr i8_4x_2 = in_i8(rfac * x + 2);
Expr i8_4x_3 = in_i8(rfac * x + 3);
Expr i8_4x_32 = in_i8(rfac * x + 32 + 0);
Expr i8_4x_33 = in_i8(rfac * x + 32 + 1);
Expr i8_4x_34 = in_i8(rfac * x + 32 + 2);
Expr i8_4x_35 = in_i8(rfac * x + 32 + 3);
Expr tmp = hvx_builtin(UInt(8), "make_interleave", {u8_4x_0, u8_4x_1, u8_4x_2, u8_4x_3});
Expr tmp_32 = hvx_builtin(UInt(8), "make_interleave", {u8_4x_32, u8_4x_33, u8_4x_34, u8_4x_35});
Expr tmp2 = hvx_builtin(Int(8), "make_interleave", {i8_4x_0, i8_4x_1, i8_4x_2, i8_4x_3});
Expr tmp34 = hvx_builtin(Int(8), "make_interleave", {i8_4x_32, i8_4x_33, i8_4x_34, i8_4x_35});
```

The following table shows sample uses for `hvx_builtin`.

| Assembly | Halide expr. | HVX built-in equivalent |
| --- | --- | --- |
| `vror(Vx,r)` | `in_u8(x / 128 + (x + 5) % 128)` | `hvx_builtin(UInt(8), "vror.ub", {u8_1, 5})` |
| `vror(Vx,r)` | `in_u16(x / 64 + (x + 7) % 64)` | `hvx_builtin(UInt(16), "vror.uh", {u8_1, 7})` |
| `vror(Vx,r)` | `in_u32(x / 32 + (x + 9) % 32)` | `hvx_builtin(UInt(32), "vror.uw", {u8_1, 9})` |
| `vror(Vx,r)` | `in_i8(x / 128 + (x + 2) % 128)` | `hvx_builtin(Int(8), "vror.b", {i8_1, 2})` |
| `vror(Vx,r)` | `in_i16(x / 64 + (x + 3) % 64)` | `hvx_builtin(Int(16), "vror.h", {i8_1, 3})` |
| `vror(Vx,r)` | `in_i32(x / 32 + (x + 4) % 32)` | `hvx_builtin(Int(32), "vror.w", {i8_1, 4})` |
| `vunpack(Vx.ub)` | `u16(u8_1)` | `hvx_builtin(UInt(16), "unpack.vub", {u8_1})` |
| `vunpack(Vx.ub)` | `i16(u8_1)` | `i16(hvx_builtin(UInt(16), "unpack.vub", {u8_1}))` |
| `vunpack(Vx.uh)` | `u32(u16_1)` | `hvx_builtin(UInt(32), "unpack.vuh", {u16_1})` |
| `vunpack(Vx.uh)` | `i32(u16_1)` | `i32(hvx_builtin(UInt(32), "unpack.vuh", {u16_1}))` |
| `vunpack(Vx.b)` | `u16(i8_1)` | `u16(hvx_builtin(Int(16), "unpack.vb", {i8_1}))` |
| `vunpack(Vx.b)` | `i16(i8_1)` | `hvx_builtin(Int(16), "unpack.vb", {i8_1})` |
| `vunpack(Vx.h)` | `u32(i16_1)` | `u32(hvx_builtin(Int(32), "unpack.vh", {i16_1}))` |
| `vunpack(Vx.h)` | `i32(i16_1)` | `hvx_builtin(Int(32), "unpack.vh", {i16_1})` |
| `vunpack(Vx.ub)` | `u32(u8_1)` | `hvx_builtin(UInt(32), "unpack.vub", {u8_1})` |
| `vunpack(Vx.ub)` | `i32(u8_1)` | `hvx_builtin(Int(32), "unpack.vub", {u8_1})` |
| `vunpack(Vx.b)` | `u32(i8_1)` | `u32(hvx_builtin(Int(16), "unpack.vb", {i8_1}))` |
| `vunpack(Vx.b)` | `i32(i8_1)` | `hvx_builtin(Int(32), "unpack.vb", {i8_1})` |
| `Vx.h = vadd(Vy.ub,Vz.ub)` | `u16(u8_1) + u16(u8_2)` | `hvx_builtin(UInt(16), "add_vuh.vub.vub", {u8_1, u8_2})` |
| `Vx.w = vadd(Vy.uh,Vz.uh)` | `u32(u16_1) + u32(u16_2)` | `hvx_builtin(UInt(32), "add_vuw.vuh.vuh", {u16_1, u16_2})` |
| `Vx.w = vadd(Vy.h,Vz.h)` | `i32(i16_1) + i32(i16_2)` | `hvx_builtin(Int(32), "add_vw.vh.vh", {i16_1, i16_2})` |
| `vadd(Vx.ub,Vy.ub):sat` | `u8_sat(u16(u8_1) + u16(u8_2))` | `hvx_builtin(UInt(8), "sat_add.vub.vub", {u8_1, u8_2})` |
| `vadd(Vx.uh,Vy.uh):sat` | `u16_sat(u32(u16_1) + u32(u16_2))` | `hvx_builtin(UInt(16), "sat_add.vuh.vuh", {u16_1, u16_2})` |
| `vadd(Vx.h,Vy.h):sat` | `i16_sat(i32(i16_1) + i32(i16_2))` | `hvx_builtin(Int(16), "sat_add.vh.vh", {i16_1, i16_2})` |
| `vadd(Vx.w,Vy.w):sat` | `i32_sat(i64(i32_1) + i64(i32_2))` | `hvx_builtin(Int(32), "sat_add.vw.vw", {i32_1, i32_2})` |
| `vadd(Vx.uw,Vy.uw):sat` | `u32_sat(u64(u32_1) + u64(u32_2))` | `hvx_builtin(UInt(32), "sat_add.vuw.vuw", {u32_1, u32_2})` |
| `Vx.h = vsub(Vy.ub,Vz.ub)` | `u16(u8_1) - u16(u8_2)` | `u16(hvx_builtin(Int(16), "sub_vh.vub.vub", {u8_1, u8_2}))` |
| `Vxx.h = vsub(Vy.ub,Vz.ub)` | `i16(u8_1) - i16(u8_2)` | `hvx_builtin(Int(16), "sub_vh.vub.vub", {u8_1, u8_2})` |
| `Vx.w = vsub(Vy.uh,Vz.uh)` | `u32(u16_1) - u32(u16_2)` | `u32(hvx_builtin(Int(32), "sub_vw.vuh.vuh", {u16_1, u16_2}))` |
| `Vxx.w = vsub(Vy.uh,Vz.uh)` | `i32(u16_1) - i32(u16_2)` | `hvx_builtin(Int(32), "sub_vw.vuh.vuh", {u16_1, u16_2})` |
| `Vx.w = vsub(Vy.h,Vz.h)` | `i32(i16_1) - i32(i16_2)` | `hvx_builtin(Int(32), "sub_vw.vh.vh", {i16_1, i16_2})` |
| `vsub(Vx.ub,Vy.ub):sat` | `u8_sat(i16(u8_1) - i16(u8_2))` | `u8(hvx_builtin(Int(8), "sat_sub.vub.vub", {u8_1, u8_2}))` |
| `vsub(Vx.uh,Vy.uh):sat` | `u16_sat(i32(u16_1) - i32(u16_2))` | `u16(hvx_builtin(Int(16), "sat_sub.vuh.vuh", {u16_1, u16_2}))` |
| `vsub(Vx.h,Vy.h):sat` | `i16_sat(i32(i16_1) - i32(i16_2))` | `hvx_builtin(Int(16), "sat_sub.vh.vh", {i16_1, i16_2})` |
| `vsub(Vx.w,Vy.w):sat` | `i32_sat(i64(i32_1) - i64(i32_2))` | `hvx_builtin(Int(32), "sat_sub.vw.vw", {i32_1, i32_2})` |
| `vadd(Vxx.ub,Vyy.ub):sat` | `u8_sat(u16(u8_1) + u16(u8_2))` | `hvx_builtin(UInt(8), "sat_add.vub.vub.dv", {u8_1, u8_2})` |
| `vadd(Vxx.uh,Vyy.uh):sat` | `u16_sat(u32(u16_1) + u32(u16_2))` | `hvx_builtin(UInt(16), "sat_add.vuh.vuh.dv", {u16_1, u16_2})` |
| `vadd(Vxx.h,Vyy.h):sat` | `i16_sat(i32(i16_1) + i32(i16_2))` | `hvx_builtin(Int(16), "sat_add.vh.vh.dv", {i16_1, i16_2})` |
| `vadd(Vxx.w,Vyy.w):sat` | `i32_sat(i64(i32_1) + i64(i32_2))` | `hvx_builtin(Int(32), "sat_add.vw.vw.dv", {i32_1, i32_2})` |
| `vadd(Vxx.uw,Vyy.uw):sat` | `u32_sat(u64(u32_1) + u64(u32_2))` | `hvx_builtin(UInt(32), "sat_add.vuw.vuw.dv", {u32_1, u32_2})` |
| `vsub(Vxx.ub,Vyy.ub):sat` | `u8_sat(i16(u8_1) - i16(u8_2))` | `u8(hvx_builtin(Int(8), "sat_sub.vub.vub.dv", {u8_1, u8_2}))` |
| `vsub(Vxx.uh,Vyy.uh):sat` | `u16_sat(i32(u16_1) - i32(u16_2))` | `u16(hvx_builtin(Int(16), "sat_sub.vuh.vuh.dv", {u16_1, u16_2}))` |
| `vsub(Vxx.h,Vyy.h):sat` | `i16_sat(i32(i16_1) - i32(i16_2))` | `hvx_builtin(Int(16), "sat_sub.vh.vh.dv", {i16_1, i16_2})` |
| `vsub(Vxx.w,Vyy.w):sat` | `i32_sat(i64(i32_1) - i64(i32_2))` | `hvx_builtin(Int(32), "sat_sub.vw.vw.dv", {i32_1, i32_2})` |
| `vavg(Vx.ub,Vy.ub)` | `u8((u16(u8_1) + u16(u8_2)) / 2)` | `hvx_builtin(UInt(8), "avg.vub.vub", {u8_1, u8_2})` |
| `vavg(Vx.ub,Vy.ub):rnd` | `u8((u16(u8_1) + u16(u8_2) + 1) / 2)` | `hvx_builtin(UInt(8), "avg_rnd.vub.vub", {u8_1, u8_2})` |
| `vavg(Vx.uh,Vy.uh)` | `u16((u32(u16_1) + u32(u16_2)) / 2)` | `hvx_builtin(UInt(16), "avg.vuh.vuh", {u16_1, u16_2})` |
| `vavg(Vx.uh,Vy.uh):rnd` | `u16((u32(u16_1) + u32(u16_2) + 1) / 2)` | `hvx_builtin(UInt(16), "avg_rnd.vuh.vuh", {u16_1, u16_2})` |
| `vavg(Vx.h,Vy.h)` | `i16((i32(i16_1) + i32(i16_2)) / 2)` | `hvx_builtin(Int(16), "avg.vh.vh", {i16_1, i16_2})` |
| `vavg(Vx.h,Vy.h):rnd` | `i16((i32(i16_1) + i32(i16_2) + 1) / 2)` | `hvx_builtin(Int(16), "avg_rnd.vh.vh", {i16_1, i16_2})` |
| `vavg(Vx.w,Vy.w)` | `i32((i64(i32_1) + i64(i32_2)) / 2)` | `hvx_builtin(Int(32), "avg.vw.vw", {i32_1, i32_2})` |
| `vavg(Vx.w,Vy.w):rnd` | `i32((i64(i32_1) + i64(i32_2) + 1) / 2)` | `hvx_builtin(Int(32), "avg_rnd.vw.vw", {i32_1, i32_2})` |
| `vnavg(Vx.ub,Vy.ub)` | `i8((i16(u8_1) - i16(u8_2)) / 2)` | `hvx_builtin(Int(8), "navg.vub.vub", {u8_1, u8_2})` |
| `vnavg(Vx.h,Vy.h)` | `i16((i32(i16_1) - i32(i16_2)) / 2)` | `hvx_builtin(Int(16), "navg.vh.vh", {i16_1, i16_2})` |
| `vnavg(Vx.w,Vy.w)` | `i32((i64(i32_1) - i64(i32_2)) / 2)` | `hvx_builtin(Int(32), "navg.vw.vw", {i32_1, i32_2})` |
| `vavg(Vx.b,Vy.b)` | `i8((i16(i8_1) + i16(i8_2)) / 2)` | `hvx_builtin(Int(8), "avg.vb.vb", {i8_1, i8_2})` |
| `vavg(Vx.b,Vy.b):rnd` | `i8((i16(i8_1) + i16(i8_2) + 1) / 2)` | `hvx_builtin(Int(8), "avg_rnd.vb.vb", {i8_1, i8_2})` |
| `vavg(Vx.uw,Vy.uw)` | `u32((u64(u32_1) + u64(u32_2)) / 2)` | `hvx_builtin(UInt(32), "avg.vuw.vuw", {u32_1, u32_2})` |
| `vavg(Vx.uw,Vy.uw):rnd` | `u32((u64(u32_1) + u64(u32_2) + 1) / 2)` | `hvx_builtin(UInt(32), "avg_rnd.vuw.vuw", {u32_1, u32_2})` |
| `vnavg(Vx.b,Vy.b)` | `i8((i16(i8_1) - i16(i8_2)) / 2)` | `hvx_builtin(Int(8), "navg.vb.vb", {i8_1, i8_2})` |
| `vlsr(Vx.h,Vy.h)` | `u8_1 >> (u8_2 % 8)` | `u8(hvx_builtin(UInt(16), "shr.vuh.vh", {u16(u8_1), u16(u8_2) % 8}))` |
| `vlsr(Vx.h,Vy.h)` | `u16_1 >> (u16_2 % 16)` | `hvx_builtin(UInt(16), "shr.vuh.vh", {u16_1, u16_2 % 16})` |
| `vlsr(Vx.w,Vy.w)` | `u32_1 >> (u32_2 % 32)` | `hvx_builtin(UInt(32), "shr.vuw.vw", {u32_1, u32_2 % 32})` |
| `vasr(Vx.h,Vy.h)` | `i8_1 >> (u8_2 % 8)` | `i8(hvx_builtin(Int(16), "shr.vh.vh", {i16(i8_1), u16(u8_2 % 8)}))` |
| `vasr(Vx.h,Vy.h)` | `i16_1 >> (u16_2 % 16)` | `hvx_builtin(Int(16), "shr.vh.vh", {i16_1, u16_2 % 16})` |
| `vasr(Vx.w,Vy.w)` | `i32_1 >> (u32_2 % 32)` | `hvx_builtin(Int(32), "shr.vw.vw", {i32_1, u32_2 % 32})` |
| `vasr(Vx.h,Vy.h,r):sat` | `u8_sat(i16_1 >> 4)` | `hvx_builtin(UInt(8), "trunc_satub_shr.vh.uh", {i16_1, 4})` |
| `vasr(Vx.w,Vy.w,r):sat` | `u16_sat(i32_1 >> 8)` | `hvx_builtin(UInt(16), "trunc_satuh_shr.vw.uw", {i32_1, 8})` |
| `vasr(Vx.w,Vy.w,r):sat` | `i16_sat(i32_1 >> 8)` | `hvx_builtin(Int(16), "trunc_sath_shr.vw.uw", {i32_1, 8})` |
| `vasr(Vx.w,Vy.w,r)` | `i16(i32_1 >> 8)` | `hvx_builtin(Int(16), "trunc_shr.vw.uw", {i32_1, 8})` |
| `vasl(Vx.h,Vy.h)` | `u8_1 << (u8_2 % 8)` | `u8(hvx_builtin(UInt(16), "shl.vuh.vh", {u8_1, u8_2 % 8}))` |
| `vasl(Vx.h,Vy.h)` | `u16_1 << (u16_2 % 16)` | `hvx_builtin(UInt(16), "shl.vuh.vh", {u16_1, u16_2 % 16})` |
| `vasl(Vx.w,Vy.w)` | `u32_1 << (u32_2 % 32)` | `hvx_builtin(UInt(32), "shl.vuw.vw", {u32_1, u32_2 % 32})` |
| `vasl(Vx.h,Vy.h)` | `i8_1 << (u8_2 % 8)` | `i8(hvx_builtin(Int(16), "shl.vh.vh", {i8_1, u8_2 % 8}))` |
| `vasl(Vx.h,Vy.h)` | `i16_1 << (u16_2 % 16)` | `hvx_builtin(Int(16), "shl.vh.vh", {i16_1, u16_2 % 16})` |
| `vasl(Vx.w,Vy.w)` | `i32_1 << (u32_2 % 32)` | `hvx_builtin(Int(32), "shl.vw.vw", {i32_1, u32_2 % 32})` |
| `vlsr(Vx.h,Vy.h)` | `u8_1 >> (i8_2 % 16 - 8)` | `u8(hvx_builtin(UInt(16), "shr.vuh.vh", {u8_1, u16(i8_2 % 16 - 8)}))` |
| `vlsr(Vx.h,Vy.h)` | `u16_1 >> (i16_2 % 32 - 16)` | `hvx_builtin(UInt(16), "shr.vuh.vh", {u16_1, u16(i16_2 % 32 - 16)})` |
| `vlsr(Vx.w,Vy.w)` | `u32_1 >> (i32_2 % 64 - 32)` | `hvx_builtin(UInt(32), "shr.vuw.vw", {u32_1, u32(i32_2 % 64 - 32)})` |
| `vasr(Vx.h,Vy.h)` | `i8_1 >> (i8_2 % 16 - 8)` | `i8(hvx_builtin(Int(16), "shr.vh.vh", {i8_1, u16(i8_2 % 16 - 8)}))` |
| `vasr(Vx.h,Vy.h)` | `i16_1 >> (i16_2 % 32 - 16)` | `hvx_builtin(Int(16), "shr.vh.vh", {i16_1, u16(i16_2 % 32 - 16)})` |
| `vasr(Vx.w,Vy.w)` | `i32_1 >> (i32_2 % 64 - 32)` | `hvx_builtin(Int(32), "shr.vw.vw", {i32_1, u32(i32_2 % 64 - 32)})` |
| `vasl(Vx.h,Vy.h)` | `u8_1 << (i8_2 % 16 - 8)` | `u8(hvx_builtin(UInt(16), "shl.vuh.vh", {u8_1, u16(i8_2 % 16 - 8)}))` |
| `vasl(Vx.h,Vy.h)` | `u16_1 << (i16_2 % 32 - 16)` | `hvx_builtin(UInt(16), "shl.vuh.vh", {u16_1, u16(i16_2 % 32 - 16)})` |
| `vasl(Vx.w,Vy.w)` | `u32_1 << (i32_2 % 64 - 32)` | `hvx_builtin(UInt(32), "shl.vuw.vw", {u32_1, u32(i32_2 % 64 - 32)})` |
| `vasl(Vx.h,Vy.h)` | `i8_1 << (i8_2 % 16 - 8)` | `i8(hvx_builtin(Int(16), "shl.vh.vh", {i8_1, u16(i8_2 % 16 - 8)}))` |
| `vasl(Vx.h,Vy.h)` | `i16_1 << (i16_2 % 32 - 16)` | `hvx_builtin(Int(16), "shl.vh.vh", {i16_1, u16(i16_2 % 32 - 16)})` |
| `vasl(Vx.w,Vy.w)` | `i32_1 << (i32_2 % 64 - 32)` | `hvx_builtin(Int(32), "shl.vw.vw", {i32_1, u32(i32_2 % 64 - 32)})` |
| `vlsr(Vx.uh,r)` | `u8_1 >> (u8(y) % 8)` | `u8(hvx_builtin(UInt(16), "shr.vuh.h", {u8_1, u8(y) % 8}))` |
| `vlsr(Vx.uh,r)` | `u16_1 >> (u16(y) % 16)` | `hvx_builtin(UInt(16), "shr.vuh.h", {u16_1, u16(y) % 16})` |
| `vlsr(Vx.uw,r)` | `u32_1 >> (u32(y) % 32)` | `hvx_builtin(UInt(32), "shr.vuw.w", {u32_1, u32(y) % 32})` |
| `vasr(Vx.h,r)` | `i8_1 >> (u8(y) % 8)` | `i8(hvx_builtin(Int(16), "shr.vh.h", {i8_1, u8(y) % 8}))` |
| `vasr(Vx.h,r)` | `i16_1 >> (u16(y) % 16)` | `hvx_builtin(Int(16), "shr.vh.h", {i16_1, u16(y) % 16})` |
| `vasr(Vx.w,r)` | `i32_1 >> (u32(y) % 32)` | `hvx_builtin(Int(32), "shr.vw.w", {i32_1, u32(y) % 32})` |
| `vasl(Vx.h,r)` | `u8_1 << (u8(y) % 8)` | `u8(hvx_builtin(UInt(16), "shl.vuh.h", {u8_1, u8(y) % 8}))` |
| `vasl(Vx.h,r)` | `u16_1 << (u16(y) % 16)` | `hvx_builtin(UInt(16), "shl.vuh.h", {u16_1, u16(y) % 16})` |
| `vasl(Vx.w,r)` | `u32_1 << (u32(y) % 32)` | `hvx_builtin(UInt(32), "shl.vuw.w", {u32_1, u32(y) % 32})` |
| `vasl(Vx.h,r)` | `i8_1 << (u8(y) % 8)` | `i8(hvx_builtin(Int(16), "shl.vh.h", {i8_1, u8(y) % 8}))` |
| `vasl(Vx.h,r)` | `i16_1 << (u16(y) % 16)` | `hvx_builtin(Int(16), "shl.vh.h", {i16_1, u16(y) % 16})` |
| `vasl(Vx.w,r)` | `i32_1 << (u32(y) % 32)` | `hvx_builtin(Int(32), "shl.vw.w", {i32_1, u32(y) % 32})` |
| `vlsr(Vx.uh,r)` | `u8_1 >> (i8(y) % 16 - 8)` | `u8(hvx_builtin(UInt(16), "shr.vuh.h", {u8_1, u8(i8(y) % 16 - 8)}))` |
| `vlsr(Vx.uh,r)` | `u16_1 >> (i16(y) % 32 - 16)` | `hvx_builtin(UInt(16), "shr.vuh.h", {u16_1, u16(i16(y) % 32 - 16)})` |
| `vlsr(Vx.uw,r)` | `u32_1 >> (i32(y) % 64 - 32)` | `hvx_builtin(UInt(32), "shr.vuw.w", {u32_1, u32(i32(y) % 64 - 32)})` |
| `vasr(Vx.h,r)` | `i8_1 >> (i8(y) % 16 - 8)` | `i8(hvx_builtin(Int(16), "shr.vh.h", {i8_1, u8(i8(y) % 16 - 8)}))` |
| `vasr(Vx.h,r)` | `i16_1 >> (i16(y) % 32 - 16)` | `hvx_builtin(Int(16), "shr.vh.h", {i16_1, u16(i16(y) % 32 - 16)})` |
| `vasr(Vx.w,r)` | `i32_1 >> (i32(y) % 64 - 32)` | `hvx_builtin(Int(32), "shr.vw.w", {i32_1, u32(i32(y) % 64 - 32)})` |
| `vasl(Vx.h,r)` | `u8_1 << (i8(y) % 16 - 8)` | `u8(hvx_builtin(UInt(16), "shl.vuh.h", {u8_1, u8(i8(y) % 16 - 8)}))` |
| `vasl(Vx.h,r)` | `u16_1 << (i16(y) % 32 - 16)` | `hvx_builtin(UInt(16), "shl.vuh.h", {u16_1, u16(i16(y) % 32 - 16)})` |
| `vasl(Vx.w,r)` | `u32_1 << (i32(y) % 64 - 32)` | `hvx_builtin(UInt(32), "shl.vuw.w", {u32_1, u32(i32(y) % 64 - 32)})` |
| `vasl(Vx.h,r)` | `i8_1 << (i8(y) % 16 - 8)` | `i8(hvx_builtin(Int(16), "shl.vh.h", {i8_1, u8(i8(y) % 16 - 8)}))` |
| `vasl(Vx.h,r)` | `i16_1 << (i16(y) % 32 - 16)` | `hvx_builtin(Int(16), "shl.vh.h", {i16_1, u16(i16(y) % 32 - 16)})` |
| `vasl(Vx.w,r)` | `i32_1 << (i32(y) % 64 - 32)` | `hvx_builtin(Int(32), "shl.vw.w", {i32_1, u32(i32(y) % 64 - 32)})` |
| `vpacke(Vx.h,Vy.h)` | `u8(u16_1)` | `u8(hvx_builtin(Int(8), "pack.vh", {i16(u16_1)}))` |
| `vpacke(Vx.h,Vy.h)` | `u8(i16_1)` | `u8(hvx_builtin(Int(8), "pack.vh", {i16_1}))` |
| `vpacke(Vx.h,Vy.h)` | `i8(u16_1)` | `hvx_builtin(Int(8), "pack.vh", {i16(u16_1)})` |
| `vpacke(Vx.h,Vy.h)` | `i8(i16_1)` | `hvx_builtin(Int(8), "pack.vh", {i16_1})` |
| `vpacke(Vx.w,Vy.w)` | `u16(u32_1)` | `u16(hvx_builtin(Int(16), "pack.vw", {i32(u32_1)}))` |
| `vpacke(Vx.w,Vy.w)` | `u16(i32_1)` | `u16(hvx_builtin(Int(16), "pack.vw", {i32_1}))` |
| `vpacke(Vx.w,Vy.w)` | `i16(u32_1)` | `hvx_builtin(Int(16), "pack.vw", {i32(u32_1)})` |
| `vpacke(Vx.w,Vy.w)` | `i16(i32_1)` | `hvx_builtin(Int(16), "pack.vw", {i32_1})` |
| `vpacko(Vx.h,Vy.h)` | `u8(u16_1 >> 8)` | `u8(hvx_builtin(Int(8), "packhi.vh", {i16(u16_1)}))` |
| `vpacko(Vx.h,Vy.h)` | `u8(i16_1 >> 8)` | `u8(hvx_builtin(Int(8), "packhi.vh", {i16_1}))` |
| `vpacko(Vx.h,Vy.h)` | `i8(u16_1 >> 8)` | `hvx_builtin(Int(8), "packhi.vh", {i16(u16_1)})` |
| `vpacko(Vx.h,Vy.h)` | `i8(i16_1 >> 8)` | `hvx_builtin(Int(8), "packhi.vh", {i16_1})` |
| `vpacko(Vx.w,Vy.w)` | `u16(u32_1 >> 16)` | `u16(hvx_builtin(Int(16), "packhi.vw", {i32(u32_1)}))` |
| `vpacko(Vx.w,Vy.w)` | `u16(i32_1 >> 16)` | `u16(hvx_builtin(Int(16), "packhi.vw", {i32_1}))` |
| `vpacko(Vx.w,Vy.w)` | `i16(u32_1 >> 16)` | `hvx_builtin(Int(16), "packhi.vw", {i32(u32_1)})` |
| `vpacko(Vx.w,Vy.w)` | `i16(i32_1 >> 16)` | `hvx_builtin(Int(16), "packhi.vw", {i32_1})` |
| `vshuffe(Vx.b,Vy.b)` | `u8(u16(u8_1) * 127)` | `u8(hvx_builtin(Int(8), "trunc.vh", {i16(u8_1) * 127}))` |
| `vshuffe(Vx.b,Vy.b)` | `u8(i16(i8_1) * 63)` | `u8(hvx_builtin(Int(8), "trunc.vh", {i16(i8_1) * 63}))` |
| `vshuffe(Vx.b,Vy.b)` | `i8(u16(u8_1) * 127)` | `hvx_builtin(Int(8), "trunc.vh", {i16(u8_1) * 127})` |
| `vshuffe(Vx.b,Vy.b)` | `i8(i16(i8_1) * 63)` | `hvx_builtin(Int(8), "trunc.vh", {i16(i8_1) * 63})` |
| `vshuffe(Vx.h,Vy.h)` | `u16(u32(u16_1) * 32767)` | `u16(hvx_builtin(Int(16), "trunc.vw", {i32(u16_1) * 32767}))` |
| `vshuffe(Vx.h,Vy.h)` | `u16(i32(i16_1) * 16383)` | `u16(hvx_builtin(Int(16), "trunc.vw", {i32(i16_1) * 16383}))` |
| `vshuffe(Vx.h,Vy.h)` | `i16(u32(u16_1) * 32767)` | `hvx_builtin(Int(16), "trunc.vw", {i32(u16_1) * 32767})` |
| `vshuffe(Vx.h,Vy.h)` | `i16(i32(i16_1) * 16383)` | `hvx_builtin(Int(16), "trunc.vw", {i32(i16_1) * 16383})` |
| `vshuffo(Vx.b,Vy.b)` | `u8((u16(u8_1) * 127) >> 8)` | `u8(hvx_builtin(Int(8), "trunclo.vh", {i16(u8_1) * 127}))` |
| `vshuffo(Vx.b,Vy.b)` | `u8((i16(i8_1) * 63) >> 8)` | `u8(hvx_builtin(Int(8), "trunclo.vh", {i16(i8_1) * 63}))` |
| `vshuffo(Vx.b,Vy.b)` | `i8((u16(u8_1) * 127) >> 8)` | `hvx_builtin(Int(8), "trunclo.vh", {i16(u8_1) * 127})` |
| `vshuffo(Vx.b,Vy.b)` | `i8((i16(i8_1) * 63) >> 8)` | `hvx_builtin(Int(8), "trunclo.vh", {i16(i8_1) * 63})` |
| `vshuffo(Vx.h,Vy.h)` | `u16((u32(u16_1) * 32767) >> 16)` | `u16(hvx_builtin(Int(16), "trunclo.vw", {i32(u16_1) * 32767}))` |
| `vshuffo(Vx.h,Vy.h)` | `u16((i32(i16_1) * 16383) >> 16)` | `u16(hvx_builtin(Int(16), "trunclo.vw", {i32(i16_1) * 16383}))` |
| `vshuffo(Vx.h,Vy.h)` | `i16((u32(u16_1) * 32767) >> 16)` | `hvx_builtin(Int(16), "trunclo.vw", {i32(u16_1) * 32767})` |
| `vshuffo(Vx.h,Vy.h)` | `i16((i32(i16_1) * 16383) >> 16)` | `hvx_builtin(Int(16), "trunclo.vw", {i32(i16_1) * 16383})` |
| `Vx.ub = vpack(Vy.h,Vz.h):sat` | `u8_sat(i16_1)` | `hvx_builtin(UInt(8), "pack_satub.vh", {i16_1})` |
| `Vx.b = vpack(Vy.h,Vz.h):sat` | `i8_sat(i16_1)` | `hvx_builtin(Int(8), "pack_satb.vh", {i16_1})` |
| `Vx.uh = vpack(Vy.w,Vz.w):sat` | `u16_sat(i32_1)` | `hvx_builtin(UInt(16), "pack_satuh.vw", {i32_1})` |
| `Vx.h = vpack(Vy.w,Vz.w):sat` | `i16_sat(i32_1)` | `hvx_builtin(Int(16), "pack_sath.vw", {i32_1})` |
| `Vx.ub = vsat(Vy.h,Vz.h)` | `u8_sat(i16(i8_1) << 1)` | `hvx_builtin(UInt(8), "trunc_satub.vh", {i16(i8_1) << 1})` |
| `Vx.uh = vasr(Vy.w,Vz.w,r):sat` | `u16_sat(i32(i16_1) << 1)` | `hvx_builtin(UInt(16), "trunc_satuh_shr.vw.uw", {i32, 8})` |
| `Vx.h = vsat(Vy.w,Vz.w)` | `i16_sat(i32(i16_1) << 1)` | `hvx_builtin(Int(16), "trunc_sath.vw", {i32(i16_1) << 1})` |
| `Vx.ub = vpack(Vy.h,Vz.h):sat` | `u8_sat(i32_1)` | `hvx_builtin(UInt(8), "pack_satub.vh", {hvx_builtin(Int(16), "pack_sath.vw", {i32_1})})` |
| `Vx.b = vpack(Vy.h,Vz.h):sat` | `i8_sat(i32_1)` | `hvx_builtin(Int(8), "pack_satb.vh", {hvx_builtin(Int(16), "pack_sath.vw", {i32_1})})` |
| `Vx.h = vsat(Vy.w,Vz.w)` | `u8_sat(i32(i16_1) << 8)` | `hvx_builtin(UInt(8), "pack_satub.vh", {hvx_builtin(Int(16), "trunc_sath.vw", {i32(i16_1) << 8})})` |
| `Vx.uh = vsat(Vy.uw, Vz.uw)` | `u16_sat(u32_1)` | `hvx_builtin(UInt(16), "trunc_satuh.vuw", {u32_1})` |
| `vround(Vx.h,Vy.h)` | `u8_sat((i32(i16_1) + 128) / 256)` | `hvx_builtin(UInt(8), "trunc_satub_rnd.vh", {i16_1})` |
| `vround(Vx.h,Vy.h)` | `i8_sat((i32(i16_1) + 128) / 256)` | `hvx_builtin(Int(8), "trunc_satb_rnd.vh", {i16_1})` |
| `vround(Vx.uh,Vy.uh)` | `u8_sat((u32(u16_1) + 128) / 256)` | `hvx_builtin(UInt(8), "trunc_satub_rnd.vuh", {u16_1})` |
| `vround(Vx.w,Vy.w)` | `u16_sat((i32_1 + 32768) / 65536)` | `hvx_builtin(UInt(16), "trunc_satuh_rnd.vw", {i32_1})` |
| `vround(Vx.w,Vy.w)` | `i16_sat((i32_1 + 32768) / 65536)` | `hvx_builtin(Int(16), "trunc_sath_rnd.vw", {i32_1})` |
| `vround(Vx.w,Vy.w)` | `u16_sat((i64(i32_1) + 32768) / 65536)` | `hvx_builtin(UInt(16), "trunc_satuh_rnd.vw", {i32_1})` |
| `vround(Vx.w,Vy.w)` | `i16_sat((i64(i32_1) + 32768) / 65536)` | `hvx_builtin(Int(16), "trunc_sath_rnd.vw", {i32_1})` |
| `vround(Vx.uw,Vy.uw)` | `u16_sat((u64(u32_1) + 32768) / 65536)` | `hvx_builtin(UInt(16), "trunc_satuh_rnd.vuw", {u32_1})` |
| `Vx.ub = vasr(Vy.h,Vz.h,r):rnd:sat` | `u8_sat((i32(i16_1) + 8) / 16)` | `hvx_builtin(UInt(8), "trunc_satub_shr_rnd.vh", {i16_1, 4})` |
| `Vx.b = vasr(Vy.h,Vz.h,r):rnd:sat` | `i8_sat((i32(i16_1) + 16) / 32)` | `hvx_builtin(Int(8), "trunc_satb_shr_rnd.vh", {i16_1, 5})` |
| `Vx.ub = vasr(Vy.uh,Vz.uh,r):rnd:sat` | `u8_sat((u32(u16_1) + 32) / 64)` | `hvx_builtin(UInt(8), "trunc_satub_shr_rnd.vuh", {u16_1, 6})` |
| `Vx.uh = vasr(Vy.w,Vz.w,r):rnd:sat` | `u16_sat((i32_1 + 64) / 128)` | `hvx_builtin(UInt(16), "trunc_satuh_shr_rnd.vw", {i32_1, 7})` |
| `Vx.h = vasr(Vy.w,Vz.w,r):rnd:sat` | `i16_sat((i32_1 + 128) / 256)` | `hvx_builtin(Int(16), "trunc_sath_shr_rnd.vw", {i32_1, 8})` |
| `Vx.uh = vasr(Vy.w,Vz.w,r):rnd:sat` | `u16_sat((i64(i32_1) + 256) / 512)` | `hvx_builtin(UInt(16), "trunc_satuh_shr_rnd.vw", {i32_1, 9})` |
| `Vx.h = vasr(Vy.w,Vz.w,r):rnd:sat` | `i16_sat((i64(i32_1) + 512) / 1024)` | `hvx_builtin(Int(16), "trunc_sath_shr_rnd.vw", {i32_1, 10})` |
| `Vx.uh = vasr(Vy.uw,Vz.uw,r):rnd:sat` | `u16_sat((u64(u32_1) + 1024) / 2048)` | `hvx_builtin(UInt(16), "trunc_satuh_shr_rnd.vuw", {u32_1, 11})` |
| `vmax(Vx.ub,Vy.ub)` | `max(u8_1, u8_2)` | `hvx_builtin(UInt(8), "max.vub.vub", {u8_1, u8_2})` |
| `vmax(Vx.uh,Vy.uh)` | `max(u16_1, u16_2)` | `hvx_builtin(UInt(16), "max.vuh.vuh", {u16_1, u16_2})` |
| `vmax(Vx.h,Vy.h)` | `max(i16_1, i16_2)` | `hvx_builtin(Int(16), "max.vh.vh", {i16_1, i16_2})` |
| `vmax(Vx.w,Vy.w)` | `max(i32_1, i32_2)` | `hvx_builtin(Int(32), "max.vw.vw", {i32_1, i32_2})` |
| `vmin(Vx.ub,Vy.ub)` | `min(u8_1, u8_2)` | `hvx_builtin(UInt(8), "min.vub.vub", {u8_1, u8_2})` |
| `vmin(Vx.uh,Vy.uh)` | `min(u16_1, u16_2)` | `hvx_builtin(UInt(16), "min.vuh.vuh", {u16_1, u16_2})` |
| `vmin(Vx.h,Vy.h)` | `min(i16_1, i16_2)` | `hvx_builtin(Int(16), "min.vh.vh", {i16_1, i16_2})` |
| `vmin(Vx.w,Vy.w)` | `min(i32_1, i32_2)` | `hvx_builtin(Int(32), "min.vw.vw", {i32_1, i32_2})` |
| `vabsdiff(Vx.ub,Vy.ub)` | `absd(u8_1, u8_2)` | `hvx_builtin(UInt(8), "absd.vub.vub", {u8_1, u8_2})` |
| `vabsdiff(Vx.uh,Vy.uh)` | `absd(u16_1, u16_2)` | `hvx_builtin(UInt(16), "absd.vuh.vuh", {u16_1, u16_2})` |
| `vabsdiff(Vx.h,Vy.h)` | `absd(i16_1, i16_2)` | `hvx_builtin(UInt(16), "absd.vh.vh", {i16_1, i16_2})` |
| `vabsdiff(Vx.w,Vy.w)` | `absd(i32_1, i32_2)` | `hvx_builtin(UInt(32), "absd.vw.vw", {i32_1, i32_2})` |
| `vmpa(Vx.h,r.b)` | `5 * (i32(i16_1) + 7 * i32(i16_2))` | `hvx_builtin(Int(32), "add_2mpy.vh.vh.b.b", {i16_1, i16_2, 5, 35})` |
| `vmpa(Vx.h,r.b)` | `5 * (i32(i16_1) - 7 * i32(i16_2))` | `hvx_builtin(Int(32), "add_2mpy.vh.vh.b.b", {i16_1, i16_2, 5, -35})` |
| `vabs(Vx.h)` | `abs(i16_1)` | `hvx_builtin(UInt(16), "abs.vh", {i16_1})` |
| `vabs(Vx.w)` | `abs(i32_1)` | `hvx_builtin(UInt(32), "abs.vw", {i32_1})` |
| `vabs(Vx.b)` | `abs(i8_1)` | `hvx_builtin(UInt(8), "abs.vb", {i8_1})` |
| `vmpy(Vx.ub,Vy.ub)` | `u16(u8_1) * u16(u8_2)` | `hvx_builtin(UInt(16), "mpy.vub.vub", {u8_1, u8_2})` |
| `vmpy(Vx.b,Vy.b)` | `i16(i8_1) * i16(i8_2)` | `hvx_builtin(Int(16), "mpy.vb.vb", {i8_1, i8_2})` |
| `vmpy(Vx.uh,Vy.uh)` | `u32(u16_1) * u32(u16_2)` | `hvx_builtin(UInt(32), "mpy.vuh.vuh", {u16_1, u16_2})` |
| `vmpy(Vx.h,Vy.h)` | `i32(i16_1) * i32(i16_2)` | `hvx_builtin(Int(32), "mpy.vh.vh", {i16_1, i16_2})` |
| `vmpyi(Vx.h,Vy.h)` | `i16_1 * i16_2` | `i16(hvx_builtin(Int(32), "mul.vh.vh", {i16_1, i16_2}))` |
| `vmpyio(Vx.w,Vy.h)` | `i32_1 * i32(i16_1)` | `hvx_builtin(Int(32), "mul.vw.vh", {i32_1, i16_1})` |
| `vmpyie(Vx.w,Vy.uh)` | `i32_1 * i32(u16_1)` | `hvx_builtin(Int(32), "mul.vw.vuh", {i32_1, u16_1})` |
| `vmpy(Vx.uh,Vy.uh)` | `u32_1 * u32(u16_1)` | `hvx_builtin(UInt(32), "mul.vuw.vuw", {u32_1, u16_1})` |
| `vmpyieo(Vx.h,Vy.h)` | `i32_1 * i32_2` | `hvx_builtin(Int(32), "mul.vw.vw", {i32_1, i32_2})` |
| `vmpy(Vx.ub,Vy.b)` | `i16(u8_1) * i16(i8_2)` | `hvx_builtin(Int(16), "mpy.vub.vb", {u8_1, i8_2})` |
| `vmpy(Vx.h,Vy.uh)` | `i32(u16_1) * i32(i16_2)` | `hvx_builtin(Int(32), "mpy.vh.vuh", {i16_2, u16_1})` |
| `vmpy(Vx.ub,Vy.b)` | `i16(i8_1) * i16(u8_2)` | `hvx_builtin(Int(16), "mpy.vub.vb", {u8_2, i8_1})` |
| `vmpy(Vx.h,Vy.uh)` | `i32(i16_1) * i32(u16_2)` | `hvx_builtin(Int(32), "mpy.vh.vuh", {i16_1, u16_2})` |
| `vmpy(Vx.ub,r.b)` | `i16(u8_1) * 3` | `hvx_builtin(Int(16), "mpy.vub.b", {u8_1, 3})` |
| `vmpy(Vx.h,r.h)` | `i32(i16_1) * 10` | `hvx_builtin(Int(32), "mpy.vh.h", {i16_1, 10})` |
| `vmpy(Vx.ub,r.ub)` | `u16(u8_1) * 3` | `hvx_builtin(UInt(16), "mpy.vub.ub", {u8_1, 3})` |
| `vmpy(Vx.uh,r.uh)` | `u32(u16_1) * 10` | `hvx_builtin(UInt(32), "mpy.vuh.uh", {u16_1, 10})` |
| `vmpy(Vx.ub,r.b)` | `3 * i16(u8_1)` | `hvx_builtin(Int(16), "mpy.vub.b", {u8_1, 3})` |
| `vmpy(Vx.h,r.h)` | `10 * i32(i16_1)` | `hvx_builtin(Int(32), "mpy.vh.h", {i16_1, 10})` |
| `vmpy(Vx.ub,r.ub)` | `3 * u16(u8_1)` | `hvx_builtin(UInt(16), "mpy.vub.ub", {u8_1, 3})` |
| `vmpy(Vx.uh,r.uh)` | `10 * u32(u16_1)` | `hvx_builtin(UInt(32), "mpy.vuh.uh", {u16_1, 10})` |
| `vmpyi(Vx.h,r.b)` | `i16_1 * 127` | `hvx_builtin(Int(16), "mul.vh.b", {i16_1, 127})` |
| `vmpyi(Vx.h,r.b)` | `127 * i16_1` | `hvx_builtin(Int(16), "mul.vh.b", {i16_1, 127})` |
| `vmpyi(Vx.w,r.h)` | `i32_1 * 32767` | `hvx_builtin(Int(32), "mul.vw.h", {i32_1, 32767})` |
| `vmpyi(Vx.w,r.h)` | `32767 * i32_1` | `hvx_builtin(Int(32), "mul.vw.h", {i32_1, 32767})` |
| `Vx.h += vmpyi(Vy.h,Vz.h)` | `i16_1 + i16_2 * i16_3` | `hvx_builtin(Int(16), "add_mul.vh.vh.vh", {i16_1, i16_2, i16_3})` |
| `Vx.h += vmpyi(Vy.h,r.b)` | `i16_1 + i16_2 * 127` | `hvx_builtin(Int(16), "add_mul.vh.vh.b", {i16_1, i16_2, 127})` |
| `Vx.w += vmpyi(Vy.w,r.h)` | `i32_1 + i32_2 * 32767` | `hvx_builtin(Int(32), "add_mul.vw.vw.h", {i32_1, i32_2, 32767})` |
| `Vx.h += vmpyi(Vy.h,r.b)` | `i16_1 + 127 * i16_2` | `hvx_builtin(Int(16), "add_mul.vh.vh.b", {i16_1, i16_2, 127})` |
| `Vx.w += vmpyi(Vy.w,r.h)` | `i32_1 + 32767 * i32_2` | `hvx_builtin(Int(32), "add_mul.vw.vw.h", {i32_1, i32_2, 32767})` |
| `Vx.uh += vmpy(Vy.ub,Vz.ub)` | `u16_1 + u16(u8_1) * u16(u8_2)` | `hvx_builtin(UInt(16), "add_mpy.vuh.vub.vub", {u16_1, u8_1, u8_2})` |
| `Vx.uw += vmpy(Vy.uh,Vz.uh)` | `u32_1 + u32(u16_1) * u32(u16_2)` | `hvx_builtin(UInt(32), "add_mpy.vuw.vuh.vuh", {u32_1, u16_1, u16_2})` |
| `Vx.h += vmpy(Vy.b,Vz.b)` | `i16_1 + i16(i8_1) * i16(i8_2)` | `hvx_builtin(Int(16), "add_mpy.vh.vb.vb", {i16_1, i8_1, i8_2})` |
| `Vx.w += vmpy(Vy.h,Vz.h)` | `i32_1 + i32(i16_1) * i32(i16_2)` | `hvx_builtin(Int(32), "add_mpy.vw.vh.vh", {i32_1, i16_1, i16_2})` |
| `Vx.h += vmpy(Vy.ub,Vz.b)` | `i16_1 + i16(u8_1) * i16(i8_2)` | `hvx_builtin(Int(16), "add_mpy.vh.vub.vb", {i16_1, u8_1, i8_2})` |
| `Vx.w += vmpy(Vy.h,Vz.uh)` | `i32_1 + i32(i16_1) * i32(u16_2)` | `hvx_builtin(Int(32), "add_mpy.vw.vh.vuh", {i32_1, i16_1, u16_2})` |
| `Vx.h += vmpy(Vy.ub,Vz.b)` | `i16_1 + i16(u8_1) * i16(i8_2)` | `hvx_builtin(Int(16), "add_mpy.vh.vub.vb", {i16_1, u8_1, i8_2})` |
| `Vx.w += vmpy(Vy.h,Vz.uh)` | `i32_1 + i32(i16_1) * i32(u16_2)` | `hvx_builtin(Int(32), "add_mpy.vw.vh.vuh", {i32_1, i16_1, u16_2})` |
| `Vx.h += vmpy(Vy.ub,Vz.b)` | `i16_1 + i16(i8_1) * i16(u8_2)` | `hvx_builtin(Int(16), "add_mpy.vh.vub.vb", {i16_1, u8_2, i8_1})` |
| `Vx.w += vmpy(Vy.h,Vz.uh)` | `i32_1 + i32(u16_1) * i32(i16_2)` | `hvx_builtin(Int(32), "add_mpy.vw.vh.vuh", {i32_1, i16_2, u16_1})` |
| `Vx.h += vmpy(Vy.ub,Vz.b)` | `i16_1 + i16(i8_1) * i16(u8_2)` | `hvx_builtin(Int(16), "add_mpy.vh.vub.vb", {i16_1, u8_2, i8_1})` |
| `Vx.w += vmpy(Vy.h,Vz.uh)` | `i32_1 + i32(u16_1) * i32(i16_2)` | `hvx_builtin(Int(32), "add_mpy.vw.vh.vuh", {i32_1, i16_2, u16_1})` |
| `Vx.w += vmpy(Vy.h, r.h):sat` | `i32_1 + i32(i16_1) * 32767` | `hvx_builtin(Int(32), "satw_add_mpy.vw.vh.h", {i32_1, i16_1, 32767})` |
| `Vx.w += vmpy(Vy.h, r.h):sat` | `i32_1 + 32767 * i32(i16_1)` | `hvx_builtin(Int(32), "satw_add_mpy.vw.vh.h", {i32_1, i16_1, 32767})` |
| `Vx.uh += vmpy(Vy.ub,r.ub)` | `u16_1 + u16(u8_1) * 255` | `hvx_builtin(UInt(16), "add_mpy.vuh.vub.ub", {u16_1, u8_1, 255})` |
| `Vx.h += vmpy(Vy.ub,r.b)` | `i16_1 + i16(u8_1) * 127` | `hvx_builtin(Int(16), "add_mpy.vh.vub.b", {i16_1, u8_1, 127})` |
| `Vx.uw += vmpy(Vy.uh,r.uh)` | `u32_1 + u32(u16_1) * 65535` | `hvx_builtin(UInt(32), "add_mpy.vuw.vuh.uh", {u32_1, u16_1, 65535})` |
| `Vx.uh += vmpy(Vy.ub,r.ub)` | `u16_1 + 255 * u16(u8_1)` | `hvx_builtin(UInt(16), "add_mpy.vuh.vub.ub", {u16_1, u8_1, 255})` |
| `Vx.h += vmpy(Vy.ub,r.b)` | `i16_1 + 127 * i16(u8_1)` | `hvx_builtin(Int(16), "add_mpy.vh.vub.b", {i16_1, u8_1, 127})` |
| `Vx.uw += vmpy(Vy.uh,r.uh)` | `u32_1 + 65535 * u32(u16_1)` | `hvx_builtin(UInt(32), "add_mpy.vuw.vuh.uh", {u32_1, u16_1, 65535})` |
| `Vx.h += vmpy(Vy.ub,r.b)` | `i16_1 + i16(u8_1) * -127` | `hvx_builtin(Int(16), "add_mpy.vh.vub.b", {i16_1, u8_1, -127})` |
| `Vx.h += vmpyi(Vy.h,r.b)` | `i16_1 + i16_2 * -127` | `hvx_builtin(Int(16), "add_mul.vh.vh.b", {i16_1, i16_2, -127})` |
| `Vx.w += vmpy(Vy.h,r.h)` | `i32_1 + i32(i16_1) * 32767` | `hvx_builtin(Int(32), "satw_add_mpy.vw.vh.h", {i32_1, i16_1, 32767})` |
| `Vx.w += vmpy(Vy.h,r.h)` | `i32_1 + 32767 * i32(i16_1)` | `hvx_builtin(Int(32), "satw_add_mpy.vw.vh.h", {i32_1, i16_1, 32767})` |
| `vmpy(Vx.h,Vy.h):<<1:rnd:sat` | Factor: {1, 2} in `i16_sat((i32(i16_1) * i32(i16_2 * factor) + 16384) / 32768)` | `hvx_builtin(Int(16), "trunc_satw_mpy2_rnd.vh.vh", {i16_1, i16_2 * factor})` |
| `vmpyo(Vx.w,Vy.h)` | Factor: {1, 2} in `i32((i64(i32_1) * i64(i32_2 * factor)) / (i64(1) << 32))` | `hvx_builtin(Int(32), "trunc_mpy.vw.vw", {i32_1, i32_2 * factor})` |
| `vmpyo(Vx.w,Vy.h):<<1:sat` | Factor: {1, 2} in `i32_sat((i64(i32_1 * factor) * i64(i32_2)) / (i64(1) << 31))` | `hvx_builtin(Int(32), "trunc_satdw_mpy2.vw.vw", {i32_1 * factor, i32_2})` |
| `vmpyo(Vx.w,Vy.h):<<1:rnd:sat` | Factor: {1, 2} in `i32_sat((i64(i32_1) * i64(i32_2 * factor) + (1 << 30)) / (i64(1) << 31))` | `hvx_builtin(Int(32), "trunc_satdw_mpy2_rnd.vw.vw", {i32_1, i32_2 * factor})` |
| `vmpy(Vx.h,r.h):<<1:sat` | Scalar: {32766, 32767} in `i16_sat((i32(i16_1) * scalar) / 32768)` | `hvx_builtin(Int(16), "trunc_satw_mpy2.vh.h", {i16_1, scalar})` |
| `vmpy(Vx.h,r.h):<<1:sat` | Scalar: {32766, 32767} in `i16_sat((scalar * i32(i16_1)) / 32768)` | `hvx_builtin(Int(16), "trunc_satw_mpy2.vh.h", {i16_1, scalar})` |
| `vmpy(Vx.h,r.h):<<1:rnd:sat` | Scalar: {32766, 32767} in `i16_sat((i32(i16_1) * scalar + 16384) / 32768)` | `hvx_builtin(Int(16), "trunc_satw_mpy2_rnd.vh.h", {i16_1, scalar})` |
| `vmpy(Vx.h,r.h):<<1:rnd:sat` | Scalar: {32766, 32767} in `i16_sat((scalar * i32(i16_1) + 16384) / 32768)` | `hvx_builtin(Int(16), "trunc_satw_mpy2_rnd.vh.h", {i16_1, scalar})` |
| `vmpyo(Vx.w,Vy.h)` | Scalar: {INT\_MAX - 1, INT\_MAX} in `i32((i64(i32_1) * scalar) / (i64(1) << 32))` | `hvx_builtin(Int(32), "trunc_mpy.vw.vw", {i32_1, scalar})` |
| `vmpyo(Vx.w,Vy.h)` | Scalar: {INT\_MAX - 1, INT\_MAX} in `i32((scalar * i64(i32_2)) / (i64(1) << 32))` | `hvx_builtin(Int(32), "trunc_mpy.vw.vw", {i32_2, scalar})` |
| `vmpyo(Vx.w,Vy.h):<<1:sat` | Scalar: {INT\_MAX - 1, INT\_MAX} in `i32_sat((i64(i32_1) * scalar) / (i64(1) << 31))` | `hvx_builtin(Int(32), "trunc_satdw_mpy2.vw.vw", {i32_1, scalar})` |
| `vmpyo(Vx.w,Vy.h):<<1:sat` | Scalar: {INT\_MAX - 1, INT\_MAX} in `i32_sat((scalar * i64(i32_2)) / (i64(1) << 31))` | `hvx_builtin(Int(32), "trunc_satdw_mpy2.vw.vw", {i32_2, scalar})` |
| `vmpyo(Vx.w,Vy.h):<<1:rnd:sat` | Scalar: {INT\_MAX - 1, INT\_MAX} in `i32_sat((i64(i32_1) * scalar + (1 << 30)) / (i64(1) << 31))` | `hvx_builtin(Int(32), "trunc_satdw_mpy2_rnd.vw.vw", {i32_1, scalar})` |
| `vmpyo(Vx.w,Vy.h):<<1:rnd:sat` | Scalar: {INT\_MAX - 1, INT\_MAX} in `i32_sat((scalar * i64(i32_2) + (1 << 30)) / (i64(1) << 31))` | `hvx_builtin(Int(32), "trunc_satdw_mpy2_rnd.vw.vw", {i32_2, scalar})` |
| `vmpa(Vx.ub,r.b)` | `i16(u8_1) * 127 + i16(u8_2) * -128` | `hvx_builtin(Int(16), "add_2mpy.vub.vub.b.b", {u8_1, u8_2, 127, -128})` |
| `vmpa(Vx.ub,r.b)` | `i16(u8_1) * 127 + 126 * i16(u8_2)` | `hvx_builtin(Int(16), "add_2mpy.vub.vub.b.b", {u8_1, u8_2, 127, 126})` |
| `vmpa(Vx.ub,r.b)` | `-100 * i16(u8_1) + 40 * i16(u8_2)` | `hvx_builtin(Int(16), "add_2mpy.vub.vub.b.b", {u8_1, u8_2, -100, 40})` |
| `Vx.h += vmpa(Vy.ub,r.b)` | `2 * i16(u8_1) + 3 * i16(u8_2) + i16_1` | `hvx_builtin(Int(16), "acc_add_2mpy.vh.vub.vub.b.b", {i16_1, u8_1, u8_2, 2, 3})` |
| `vmpa(Vx.h,r.b)` | `i32(i16_1) * 2 + i32(i16_2) * 3` | `hvx_builtin(Int(32), "add_2mpy.vh.vh.b.b", {i16_1, i16_2, 2, 3})` |
| `vmpa(Vx.h,r.b)` | `i32(i16_1) * 2 + 3 * i32(i16_2)` | `hvx_builtin(Int(32), "add_2mpy.vh.vh.b.b", {i16_1, i16_2, 2, 3})` |
| `vmpa(Vx.h,r.b)` | `2 * i32(i16_1) + 3 * i32(i16_2)` | `hvx_builtin(Int(32), "add_2mpy.vh.vh.b.b", {i16_1, i16_2, 2, 3})` |
| `Vx.w += vmpa(Vy.h,r.b)` | `2 * i32(i16_1) + 3 * i32(i16_2) + i32_1` | `hvx_builtin(Int(32), "acc_add_2mpy.vw.vh.vh.b.b", {i32_1, i16_1, i16_2, 2, 3})` |
| `vdmpy(Vx.ub,r.b)` | `i16(in_u8(2 * x)) * 127 + i16(in_u8(2 * x + 1)) * -128` | `hvx_builtin(Int(16), "add_2mpy.vub.b", {hvx_builtin(UInt(8), "make_interleave", {in_u8(2 * x), in_u8(2 * x + 1)}), (int32_t)0x807F807F})` |
| `vdmpy(Vx.h,r.b)` | `i32(in_i16(2 * x)) * 2 + i32(in_i16(2 * x + 1)) * 3` | `hvx_builtin(Int(32), "add_2mpy.vh.b", {hvx_builtin(Int(16), "make_interleave", {in_i16(2 * x), in_i16(2 * x + 1)}), (int32_t)0x03020302})` |
| `Vx.h += vdmpy(Vy.ub,r.b)` | `i16(in_u8(2 * x)) * 120 + i16(in_u8(2 * x + 1)) * -50 + i16_1` | `hvx_builtin(Int(16), "acc_add_2mpy.vh.vub.b", {i16_1, hvx_builtin(UInt(8), "make_interleave", {in_u8(2 * x), in_u8(2 * x + 1)}), (int32_t)0xCE78CE78})` |
| `Vx.w += vdmpy(Vy.h,r.b)` | `i32(in_i16(2 * x)) * 80 + i32(in_i16(2 * x + 1)) * 33 + i32_1` | `hvx_builtin(Int(32), "acc_add_2mpy.vw.vh.b", {i32_1, hvx_builtin(Int(16), "make_interleave", {in_i16(2 * x), in_i16(2 * x + 1)}), (int32_t)0x21502150})` |
| `vrmpy(Vx.ub,r.ub)` | `u32(u8_1) * 255 + u32(u8_2) * 254 + u32(u8_3) * 253 + u32(u8_4) * 252` | `hvx_builtin(UInt(32), "add_4mpy.vub.ub", {hvx_builtin(UInt(8), "make_interleave", {u8_1, u8_2, u8_3, u8_4}), Expr((uint32_t)0xFCFDFEFF)})` |
| `vrmpy(Vx.ub,r.b)` | `i32(u8_1) * 127 + i32(u8_2) * -128 + i32(u8_3) * 126 + i32(u8_4) * -127` | `hvx_builtin(Int(32), "add_4mpy.vub.b", {hvx_builtin(UInt(8), "make_interleave", {u8_1, u8_2, u8_3, u8_4}), (int32_t)0x817E807F})` |
| `Vx.uw += vrmpy(Vy.ub,r.ub)` | `u32_1 + u32(u8_1) * 2 + u32(u8_2) * 3 + u32(u8_3) * 4 + u32(u8_4) * 5` | `hvx_builtin(UInt(32), "acc_add_4mpy.vuw.vub.ub", {u32_1, hvx_builtin(UInt(8), "make_interleave", {u8_1, u8_2, u8_3, u8_4}), Expr((uint32_t)0x05040302)})` |
| `Vx.w += vrmpy(Vy.ub,r.b)` | `i32_1 + i32(u8_1) * 2 + i32(u8_2) * -3 + i32(u8_3) * -4 + i32(u8_4) * 5` | `hvx_builtin(Int(32), "acc_add_4mpy.vw.vub.b", {i32_1, hvx_builtin(UInt(8), "make_interleave", {u8_1, u8_2, u8_3, u8_4}), (int32_t)0x05FCFD02})` |
| `vrmpy(Vx.ub,r.b)` | `i32(u8_1) + i32(u8_2) * -2 + i32(u8_3) * 3 + i32(u8_4) * -4` | `hvx_builtin(Int(32), "add_4mpy.vub.b", {hvx_builtin(UInt(8), "make_interleave", {u8_1, u8_2, u8_3, u8_4}), (int32_t)0xFC03FE01})` |
| `Vx.w += vrmpy(Vy.ub,r.b)` | `i32_1 + i32(u8_1) + i32(u8_2) * 2 + i32(u8_3) * 3 + i32(u8_4) * 4` | `hvx_builtin(Int(32), "acc_add_4mpy.vw.vub.b", {i32_1, hvx_builtin(UInt(8), "make_interleave", {u8_1, u8_2, u8_3, u8_4}), (int32_t)0x04030201})` |
| `vrmpy(Vx.ub,r.ub)` | `u32(u16(u8_1) * 255) + u32(u16(u8_2) * 254) + u32(u16(u8_3) * 253) + u32(u16(u8_4) * 252)` | `hvx_builtin(UInt(32), "add_4mpy.vub.ub", {hvx_builtin(UInt(8), "make_interleave", {u8_1, u8_2, u8_3, u8_4}), Expr((uint32_t)0xFCFDFEFF)})` |
| `Vx.w += vrmpy(Vy.ub,r.b)` | `i32_1 + i32(i16(u8_1) * 2) + i32(i16(u8_2) * -3) + i32(i16(u8_3) * -4) + i32(i16(u8_4) * 5)` | `hvx_builtin(Int(32), "acc_add_4mpy.vw.vub.b", {i32_1, hvx_builtin(UInt(8), "make_interleave", {u8_1, u8_2, u8_3, u8_4}), (int32_t)0x05FCFD02})` |
| `vrmpy(Vx.ub,Vy.ub)` | `u32(u8_1) * u8_1 + u32(u8_2) * u8_2 + u32(u8_3) * u8_3 + u32(u8_4) * u8_4` | `hvx_builtin(UInt(32), "add_4mpy.vub.vub", {hvx_builtin(UInt(8), "make_interleave", {u8_1, u8_2, u8_3, u8_4}), hvx_builtin(UInt(8), "make_interleave", {u8_1, u8_2, u8_3, u8_4})})` |
| `vrmpy(Vx.b,Vy.b)` | `i32(i8_1) * i8_1 + i32(i8_2) * i8_2 + i32(i8_3) * i8_3 + i32(i8_4) * i8_4` | `hvx_builtin(Int(32), "add_4mpy.vb.vb", {hvx_builtin(Int(8), "make_interleave", {i8_1, i8_2, i8_3, i8_4}), hvx_builtin(Int(8), "make_interleave", {i8_1, i8_2, i8_3, i8_4})})` |
| `Vx.uw += vrmpy(Vy.ub,Vz.ub)` | `u32_1 + u32(u8_1) * u8_1 + u32(u8_2) * u8_2 + u32(u8_3) * u8_3 + u32(u8_4) * u8_4` | `hvx_builtin(UInt(32), "acc_add_4mpy.vuw.vub.vub", {u32_1, hvx_builtin(UInt(8), "make_interleave", {u8_1, u8_2, u8_3, u8_4}), hvx_builtin(UInt(8), "make_interleave", {u8_1, u8_2, u8_3, u8_4})})` |
| `Vx.w += vrmpy(Vy.b,Vz.b)` | `i32_1 + i32(i8_1) * i8_1 + i32(i8_2) * i8_2 + i32(i8_3) * i8_3 + i32(i8_4) * i8_4` | `hvx_builtin(Int(32), "acc_add_4mpy.vw.vb.vb", {i32_1, hvx_builtin(Int(8), "make_interleave", {i8_1, i8_2, i8_3, i8_4}), hvx_builtin(Int(8), "make_interleave", {i8_1, i8_2, i8_3, i8_4})})` |
| `vmpa(Vx.ub,r.b)` | `i16(u8_1) * 127 + i16(u8_2) * -126 + i16(u8_3) * 125 + i16(u8_4) * 124` | `hvx_builtin(Int(16), "acc_add_2mpy.vh.vub.vub.b.b", {hvx_builtin(Int(16), "add_2mpy.vub.vub.b.b", {u8_1, u8_2, 127, -126}), u8_3, u8_4, 125, 124})` |
| `Vx.w += vasl(Vy.w,r)` | `u32_1 + (u32_2 * 8)` | `u32(hvx_builtin(Int(32), "add_shl.vw.vw.uw", {i32(u32_1), i32(u32_2), 3}))` |
| `Vx.w += vasl(Vy.w,r)` | `i32_1 + (i32_2 * 8)` | `hvx_builtin(Int(32), "add_shl.vw.vw.uw", {i32_1, i32_2, 3})` |
| `Vx.w += vasr(Vy.w,r)` | `i32_1 + (i32_2 / 8)` | `hvx_builtin(Int(32), "add_shr.vw.vw.uw", {i32_1, i32_2, 3})` |
| `Vx.w += vasl(Vy.w,r)` | `i32_1 + (i32_2 << u32(y % 32))` | `hvx_builtin(Int(32), "add_shl.vw.vw.uw", {i32_1, i32_2, y % 32})` |
| `Vx.w += vasr(Vy.w,r)` | `i32_1 + (i32_2 >> u32(y % 32))` | `hvx_builtin(Int(32), "add_shr.vw.vw.uw", {i32_1, i32_2, y % 32})` |
| `Vx.h += vasl(Vy.h,r)` | `i16_1 + (i16_2 << u16(y % 16))` | `hvx_builtin(Int(16), "add_shl.vh.vh.uh", {i16_1, i16_2, i16(y % 16)})` |
| `Vx.h += vasr(Vy.h,r)` | `i16_1 + (i16_2 >> u16(y % 16))` | `hvx_builtin(Int(16), "add_shr.vh.vh.uh", {i16_1, i16_2, i16(y % 16)})` |
| `Vx.h += vasl(Vy.h,r)` | `u16_1 + (u16_2 * 16)` | `u16(hvx_builtin(Int(16), "add_shl.vh.vh.uh", {i16(u16_1), i16(u16_2), 4}))` |
| `Vx.h += vasl(Vy.h,r)` | `i16_1 + (i16_2 * 16)` | `hvx_builtin(Int(16), "add_shl.vh.vh.uh", {i16_1, i16_2, 4})` |
| `Vx.h += vasl(Vy.h,r)` | `u16_1 + (16 * u16_2)` | `u16(hvx_builtin(Int(16), "add_shl.vh.vh.uh", {i16(u16_1), i16(u16_2), 4}))` |
| `Vx.h += vasl(Vy.h,r)` | `i16_1 + (16 * i16_2)` | `hvx_builtin(Int(16), "add_shl.vh.vh.uh", {i16_1, i16_2, 4})` |
| `Vx.h += vasr(Vy.h,r)` | `i16_1 + (i16_2 / 16)` | `hvx_builtin(Int(16), "add_shr.vh.vh.uh", {i16_1, i16_2, 4})` |
| `vnormamt(Vx.h)` | `max(count_leading_zeros(i16_1), count_leading_zeros(~i16_1))` | `i16(hvx_builtin(UInt(16), "cls.vh", {u16(i16_1)}) + 1)` |
| `vnormamt(Vx.w)` | `max(count_leading_zeros(i32_1), count_leading_zeros(~i32_1))` | `i32(hvx_builtin(UInt(32), "cls.vw", {u32(i32_1)}) + 1)` |
| `Vx.uw = vrmpy(Vy.ub,r.ub)` | rfac = 4 and RDom r(0, 4) in `sum(u16(in_u8(rfac * x + r)))` | `u16(hvx_builtin(UInt(32), "add_4mpy.vub.ub", {tmp, Expr((uint32_t)0x01010101)}))` |
| `Vx.uw = vrmpy(Vy.ub,r.ub)` | rfac = 4 and RDom r(0, 4) in `sum(u16(in_u8(rfac * x + r)))` | `u16(hvx_builtin(UInt(32), "add_4mpy.vub.ub", {tmp, Expr((uint32_t)0x01010101)}))` |
| `Vx.uw = vrmpy(Vy.ub,r.ub)` | rfac = 4 and RDom r(0, 4) in `sum(u16(in_u8(rfac * x + r)) * u8(r))` | `u16(hvx_builtin(UInt(32), "add_4mpy.vub.ub", {tmp, Expr((uint32_t)0x03020100)}))` |
| `Vx.w = vrmpy(Vy.ub,r.b)` | rfac = 4 and RDom r(0, 4) in `sum(i16(in_u8(rfac * x + r)) * i8(r))` | `i16(hvx_builtin(Int(32), "add_4mpy.vub.b", {tmp, (int32_t)0x03020100}))` |
| `Vx.uw = vrmpy(Vy.ub,Vz.ub)` | rfac = 4 and RDom r(0, 4) in `sum(u16(in_u8(rfac * x + r)) * in_u8(rfac * x + r + 32))` | `u16(hvx_builtin(UInt(32), "add_4mpy.vub.vub", {tmp, tmp_32}))` |
| `Vx.w = vrmpy(Vy.ub,Vz.b)` | rfac = 4 and RDom r(0, 4) in `sum(i16(in_u8(rfac * x + r)) * in_i8(rfac * x + r + 32))` | `i16(hvx_builtin(Int(32), "add_4mpy.vub.vb", {tmp, tmp34}))` |
| `Vx.w = vrmpy(Vy.b,Vz.b)` | rfac = 4 and RDom r(0, 4) in `sum(i16(in_i8(rfac * x + r)) * in_i8(rfac * x + r + 32))` | `i16(hvx_builtin(Int(32), "add_4mpy.vb.vb", {tmp2, tmp34}))` |
| `Vx.uw = vrmpy(Vy.ub,r.ub)` | rfac = 4 and RDom r(0, 4) in `sum(u32(in_u8(rfac * x + r)) * 34)` | `hvx_builtin(UInt(32), "add_4mpy.vub.ub", {tmp, Expr((uint32_t)0x22222222)})` |
| `Vx.w = vrmpy(Vy.ub,r.b)` | rfac = 4 and RDom r(0, 4) in `sum(i32(in_u8(rfac * x + r)) * (-1))` | `hvx_builtin(Int(32), "add_4mpy.vub.b", {tmp, (int32_t)0xFFFFFFFF})` |
| `Vx.w = vrmpy(Vy.ub,r.b)` | rfac = 4 and RDom r(0, 4) in `sum(i16(in_u8(rfac * x + r)) * (-1))` | `i16(hvx_builtin(Int(32), "add_4mpy.vub.b", {tmp, (int32_t)0xFFFFFFFF}))` |
| `Vxx.uw = vrmpy(Vyy.ub, r.ub, #z)` | rfac = 4 and RDom r(0, 4) in `sum(u32(in_u8(x + r)))` | `hvx_builtin(UInt(32), "add_4mpy.vub.ub.stencil", {hvx_builtin(UInt(8), "make_concat", {u8_1, in_u8(x + 4)}), 0x01010101})` |
| `Vxx.uw = vrmpy(Vyy.ub, r.ub, #z)` | rfac = 4 and RDom r(0, 4) in `sum(u32(in_u8(x + r)) * 34)` | `hvx_builtin(UInt(32), "add_4mpy.vub.ub.stencil", {hvx_builtin(UInt(8), "make_concat", {u8_1, in_u8(x + 4)}), 0x22222222})` |
| `Vxx.w = vrmpy(Vyy.ub, r.b, #z)` | rfac = 4 and RDom r(0, 4) in `sum(u32(in_u8(x + r)) * i8(r))` | `hvx_builtin(Int(32), "add_4mpy.vub.b.stencil", {hvx_builtin(UInt(8), "make_concat", {u8_1, in_u8(x + 4)}), (int32_t)0x03020100})` |
| `Vxx.w = vrmpy(Vyy.ub, r.b, #z)` | rfac = 4 and RDom r(0, 4) in `sum(i32(in_u8(x + r)) * i8(-r))` | `hvx_builtin(Int(32), "add_4mpy.vub.b.stencil", {hvx_builtin(UInt(8), "make_concat", {u8_1, in_u8(x + 4)}), (int32_t)0xFDFEFF00})` |
| `Vxx.w = vrmpy(Vyy.ub, r.b, #z)` | rfac = 4 and RDom r(0, 4) in `sum(i32(in_u8(x + r)) * (-1))` | `hvx_builtin(Int(32), "add_4mpy.vub.b.stencil", {hvx_builtin(UInt(8), "make_concat", {u8_1, in_u8(x + 4)}), (int32_t)0xFFFFFFFF})` |
| `Vx.h = vdmpy(Vy.ub, r.b)` | rfac = 2 and RDom r(0, 2) in `sum(i16(in_u8(rfac * x + r2)) * 34)` | `hvx_builtin(Int(16), "add_2mpy.vub.b", {hvx_builtin(UInt(8), "make_interleave", {in_u8(2 * x), in_u8(2 * x + 1)}), (int32_t)0x22222222})` |
| `Vx.w = vdmpy(Vy.h, r.b)` | rfac = 2 and RDom r(0, 2) in `sum(i32(in_i16(rfac * x + r2)) * 15246)` | `hvx_builtin(Int(32), "add_2mpy.vh.b", {hvx_builtin(Int(16), "make_interleave", {in_i16(2 * x), in_i16(2 * x + 1)}), (int32_t)0x01010101}) * 15246` |
| `Vx.w = vdmpy(Vy.h, r.b)` | rfac = 2 and RDom r(0, 2) in `sum(i32(in_i16(rfac * x + r2)) * (-1246))` | `hvx_builtin(Int(32), "add_2mpy.vh.b", {hvx_builtin(Int(16), "make_interleave", {in_i16(2 * x), in_i16(2 * x + 1)}), (int32_t)0x01010101}) * (-1246)` |
| `Vx.w = vdmpy(Vy.h, r.b)` | rfac = 2 and RDom r(0, 2) in `sum(i32(in_i16(rfac * x + r2 + 2)) * (-1246))` | `hvx_builtin(Int(32), "add_2mpy.vh.b", {hvx_builtin(Int(16), "make_interleave", {in_i16(2 * x + 2), in_i16(2 * x + 3)}), (int32_t)0x01010101}) * (-1246)` |
| `vtmpy(Vxx.b, r.b)` | rfac = 3 and RDom r(0, 3) in `sum(i16(in_i8(x + r3)))` | `hvx_builtin(Int(16), "vtmpy.vb.vb.b.b", {i8_1, in_i8(x + 2), 1, 1})` |
| `vtmpy(Vxx.ub, r.b)` | rfac = 3 and RDom r(0, 3) in `sum(i16(in_u8(x + r3)))` | `hvx_builtin(Int(16), "vtmpy.vub.vub.b.b", {u8_1, in_u8(x + 2), 1, 1})` |
| `vtmpy(Vxx.h, r.b)` | rfac = 3 and RDom r(0, 3) in `sum(i32(in_i16(x + r3)))` | `hvx_builtin(Int(32), "vtmpy.vh.vh.b.b", {i16_1, in_i16(x + 2), 1, 1})` |
| `Vx = vror(Vy, r)` | `in_u8(x / 128 + ((x + 5) % 128))` | `hvx_builtin(UInt(8), "vror.ub", {u8_1, 5})` |
| `Vx = vror(Vy, r)` | `in_u16(x / 64 + ((x + 6) % 64))` | `hvx_builtin(UInt(16), "vror.uh", {u16_1, 6})` |
| `Vx = vror(Vy, r)` | `in_u32(x / 32 + ((x + 7) % 32))` | `hvx_builtin(UInt(32), "vror.uw", {u32_1, 7})` |
| `Vx = vror(Vy, r)` | `in_i8(x / 128 + ((x + 2) % 128))` | `hvx_builtin(Int(8), "vror.b", {i8_1, 2})` |
| `Vx = vror(Vy, r)` | `in_i16(x / 64 + ((x + 3) % 64))` | `hvx_builtin(Int(16), "vror.h", {i16_1, 3})` |
| `Vx = vror(Vy, r)` | `in_i32(x / 32 + ((x + 4) % 32))` | `hvx_builtin(Int(32), "vror.w", {i32_1, 4})` |

## [Troubleshooting](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id25)

### [User controllable exits](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id26)

By default, the Halide runtime does not trigger a crash dump when it detects an
error. To change this behavior, call `halide_hexagon_set_error_fault_mask` in
the application code before calling the Halide pipeline.

    // user_context      : Can be NULL.
    // enable_all_errors : If true, any error triggers a crash dump.
    // lev_list          : Zero-terminated list of Halide error codes that trigger a crash dump
    //                     if encountered. Ignored if enable_all_errors is true.
    halide_hexagon_set_error_fault_mask(void *user_context, bool enable_all_errors, int *lev_list);

The following example uses `halide_hexagon_set_error_fault_mask` to trigger
a crash dump when Halide is out of memory.

    int err_list[] = {halide_error_code_out_of_memory, 0};
    halide_hexagon_set_error_fault_mask(NULL, false, err_list);

The following example uses `halide_hexagon_set_error_fault_mask` to trigger a
crash dump for any error.

    halide_hexagon_set_error_fault_mask(NULL, true, NULL);

This API is supported only in the two Offload modes
([Halide execution modes](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#executionmodes)). It has no effect in Simulator Offload mode, so it is
useful only in Device Offload mode.

For more information on the error codes that Halide supports, see
`halide_error_code_t` in `Halide.h` in your Halide installation.
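The mask rules described above can be sketched in plain C++. The helper below is a hypothetical model written for illustration; it is not part of the Halide runtime and does not show how the runtime implements the check. It only restates the documented behavior: every error faults when `enable_all_errors` is true, and otherwise only the codes in the zero-terminated `lev_list` do.

```cpp
#include <cstddef>

// Hypothetical helper (not a Halide API): returns true if, under the rules
// described for halide_hexagon_set_error_fault_mask, the given error code
// would trigger a crash dump.
inline bool would_trigger_crash_dump(int error_code, bool enable_all_errors,
                                     const int *lev_list) {
    if (enable_all_errors) {
        return true;  // lev_list is ignored when all errors are enabled
    }
    if (lev_list == nullptr) {
        return false;  // no list and not enable-all: nothing faults
    }
    // Walk the list until the terminating zero entry.
    for (std::size_t i = 0; lev_list[i] != 0; ++i) {
        if (lev_list[i] == error_code) {
            return true;
        }
    }
    return false;
}
```

Because the list is zero-terminated, a code with the value 0 cannot appear in it; any entries after the first 0 are never examined.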

### [Target malloc tracing](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id27)

`halide_malloc` is a function in the Halide runtime that a pipeline calls when
it must allocate memory for [Internal Buffers](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#internalbuffers). In Device Offload mode,
calls to `halide_malloc` can optionally be traced by calling
`halide_hexagon_set_malloc_tracing`. In Simulator Offload mode, calling
`halide_hexagon_set_malloc_tracing` has no effect.

    // user_context      : Can be NULL.
    // tracing_level     : halide_hexagon_malloc_no_trace produces no malloc trace.
    //                     halide_hexagon_malloc_min_trace produces a minimal trace.
    //                     halide_hexagon_malloc_max_trace produces a very verbose trace.
    halide_hexagon_set_malloc_tracing(void *user_context, halide_hexagon_malloc_trace_level_t tracing_level);

The Halide runtime does not ask the system for memory every time
`halide_malloc` is called. It maintains a pre-allocated list of memory
buffers, and if a `halide_malloc` call can be satisfied by using one of
these buffers, that buffer is returned. If, however, a buffer is not available,
`halide_malloc` requests memory from the DSP (using the underlying malloc).

If the tracing level is set to `halide_hexagon_malloc_min_trace`, trace
messages are produced only when `halide_malloc` calls the underlying system
malloc. If the tracing level is set to `halide_hexagon_malloc_max_trace`, trace
messages are produced for every call to `halide_malloc`.
`halide_hexagon_malloc_min_trace` is the recommended trace message level.
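The difference between the trace levels can be summarized as a small decision function. This is only an illustrative model of the behavior described above: `TraceLevel` and `emits_trace` are hypothetical names that stand in for the runtime's trace levels, and the code does not reflect the runtime's actual implementation.

```cpp
// Hypothetical stand-ins for the no-trace, minimal-trace, and maximum-trace levels.
enum class TraceLevel { None, Min, Max };

// Returns true if a halide_malloc call would produce a trace message.
// served_from_pool is true when the request is satisfied from the
// pre-allocated buffer list, and false when the underlying malloc is called.
inline bool emits_trace(TraceLevel level, bool served_from_pool) {
    switch (level) {
    case TraceLevel::None:
        return false;
    case TraceLevel::Min:
        return !served_from_pool;  // only calls that reach the system malloc
    case TraceLevel::Max:
        return true;  // every call
    }
    return false;
}
```

Under this model, a pipeline whose allocations are all served from the pre-allocated list produces no output at the minimal level, which is why that level is less noisy.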

In Device Offload mode, `adb` can be used to view the tracing output. When you
set the tracing level to `halide_hexagon_malloc_min_trace`, you will see the
following output:

    $> adb logcat | grep halide
    01-09 20:11:24.169  4794  4794 I halide  : HexagonMallocTraceLogger: 37101:Allocate(ptr=0xc805e900,size=65536)
    01-09 20:11:24.170  4794  4794 I halide  : HexagonMallocTraceLogger: 37101:Total_Memory(before=0,after=65536)
    01-09 20:11:24.171  4794  4794 I halide  : HexagonMallocTraceLogger: 28913:Allocate(ptr=0xc806e980,size=65536)
    01-09 20:11:24.171  4794  4794 I halide  : HexagonMallocTraceLogger: 28913:Total_Memory(before=65536,after=131072)
    01-09 20:11:24.172  4794  4794 I halide  : HexagonMallocTraceLogger: 37101:Allocate(ptr=0xc2880200,size=65536)
    01-09 20:11:24.173  4794  4794 I halide  : HexagonMallocTraceLogger: 37101:Total_Memory(before=131072,after=196608)
    01-09 20:11:24.173  4794  4794 I halide  : HexagonMallocTraceLogger: 28913:Allocate(ptr=0xc2890280,size=65536)
    01-09 20:11:24.175  4794  4794 I halide  : HexagonMallocTraceLogger: 28913:Total_Memory(before=196608,after=262144)
    01-09 20:11:24.277  4794  4794 I halide  : HexagonMallocTraceLogger: 28918:Free(ptr=0xc805e900,size=65536)
    01-09 20:11:24.280  4794  4794 I halide  : HexagonMallocTraceLogger: 28918:Total_Memory(before=262144,after=196608)
    01-09 20:11:24.284  4794  4794 I halide  : HexagonMallocTraceLogger: 28918:Free(ptr=0xc806e980,size=65536)
    01-09 20:11:24.288  4794  4794 I halide  : HexagonMallocTraceLogger: 28918:Total_Memory(before=196608,after=131072)
    01-09 20:11:24.291  4794  4794 I halide  : HexagonMallocTraceLogger: 28918:Free(ptr=0xc2880200,size=65536)
    01-09 20:11:24.295  4794  4794 I halide  : HexagonMallocTraceLogger: 28918:Total_Memory(before=131072,after=65536)
    01-09 20:11:24.299  4794  4794 I halide  : HexagonMallocTraceLogger: 28918:Free(ptr=0xc2890280,size=65536)
    01-09 20:11:24.302  4794  4794 I halide  : HexagonMallocTraceLogger: 28918:Total_Memory(before=65536,after=0)

### [Set heap grow size](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id28)

When more space is required in Device Offload mode, call
`halide_hexagon_mem_set_grow_size` before running a pipeline to specify the
amount of memory by which to grow the heap. For example, to grow the heap by at
least 16 MB but no more than `MAX_INT32` (2 GB - 1), add the following to the
application-side (host-side) code.

    const long long int grow_min = 16*1024*1024;
    const long long int grow_max = MAX_INT32;
    halide_hexagon_mem_set_grow_size(NULL, grow_min, grow_max);

The default is equivalent to
`halide_hexagon_mem_set_grow_size(NULL, 0x100000/2, MAX_UINT64)`, which is a
minimum of 512 KB and the largest allowed maximum.
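The byte values quoted in this section can be verified at compile time. The constants below are hypothetical names introduced only for this check, and the sketch assumes that `MAX_INT32` corresponds to the largest signed 32-bit value, as the "2 GB - 1" remark suggests.

```cpp
#include <cstdint>
#include <limits>

constexpr long long kKiB = 1024;
constexpr long long kMiB = 1024 * kKiB;

// Values from the example: grow by at least 16 MB, at most 2 GB - 1.
constexpr long long kGrowMin = 16 * kMiB;
constexpr long long kGrowMax = std::numeric_limits<std::int32_t>::max();

// Default minimum quoted in the text: 0x100000 / 2.
constexpr long long kDefaultGrowMin = 0x100000 / 2;

static_assert(kGrowMin == 16777216, "16 MB expressed in bytes");
static_assert(kGrowMax == 2 * 1024 * kMiB - 1, "2 GB - 1 expressed in bytes");
static_assert(kDefaultGrowMin == 512 * kKiB, "0x100000 / 2 is 512 KB");
```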

### [Debug target feature](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id29)

In the two Offload modes, add the `debug` feature to your target when running a
generator to compile your pipeline. This feature produces debug messages that
can be viewed using `adb`. For example, if the generator is run with
`target=arm-64-android-hvx-debug`, you can view the debug messages generated
when running your application:

    $> adb logcat | grep halide
    01-13 12:02:23.091 13601 13601 I halide  : halide_hexagon_device_and_host_malloc called.
    01-13 12:02:23.091 13601 13601 I halide  : Hexagon: halide_hexagon_device_malloc (user_context: 0x0, buf: 0x7fe84168c8)
    01-13 12:02:23.091 13601 13601 I halide  :     allocating buffer of 152 bytes
    01-13 12:02:23.091 13601 13601 I halide  :     halide_malloc size=152 ->
    01-13 12:02:23.091 13601 13601 I halide  :         0x7a5364e0e0
    01-13 12:02:23.091 13601 13601 I halide  :     host <- 0x7a5364e0e0
    01-13 12:02:23.091 13601 13601 I halide  :     Time: 9.375000e-03 ms
    01-13 12:02:23.091 13601 13601 I halide  : halide_hexagon_device_and_host_malloc called.
    01-13 12:02:23.091 13601 13601 I halide  : Hexagon: halide_hexagon_device_malloc (user_context: 0x0, buf: 0x7fe8416918)
    01-13 12:02:23.091 13601 13601 I halide  :     allocating buffer of 256 bytes
    01-13 12:02:23.091 13601 13601 I halide  :     halide_malloc size=256 ->
    01-13 12:02:23.091 13601 13601 I halide  :         0x7a536432a0
    01-13 12:02:23.091 13601 13601 I halide  :     host <- 0x7a536432a0
    01-13 12:02:23.091 13601 13601 I halide  :     Time: 8.906000e-03 ms
    01-13 12:02:23.091 13601 13601 I halide  : Hexagon: halide_hexagon_device_malloc (user_context: 0x0, buf: 0x7fe8416918)
    01-13 12:02:23.091 13601 13601 I halide  : Hexagon: halide_hexagon_copy_to_device (user_context: 0x0, buf: 0x7fe8416918)
    01-13 12:02:23.091 13601 13601 I halide  :     Time: 1.875000e-03 ms
    01-13 12:02:23.091 13601 13601 I halide  : Hexagon: halide_hexagon_device_malloc (user_context: 0x0, buf: 0x7fe84168c8)
    01-13 12:02:23.091 13601 13601 I halide  : Hexagon: halide_hexagon_copy_to_device (user_context: 0x0, buf: 0x7fe84168c8)
    01-13 12:02:23.091 13601 13601 I halide  :     Time: 3.640000e-04 ms
    01-13 12:02:23.091 13601 13601 I halide  : Hexagon: halide_hexagon_device_malloc (user_context: 0x0, buf: 0x7fe8416e00)
    01-13 12:02:23.091 13601 13601 I halide  :     allocating buffer of 14745728 bytes
    01-13 12:02:23.091 13601 13601 I halide  :     host_malloc len=14745728 ->
    01-13 12:02:23.091 13601 13601 I halide  : host_malloc: Using_libion
    01-13 12:02:23.091 13601 13601 I halide  : host_malloc: ion_alloc_fd succeeded
    01-13 12:02:23.091 13601 13601 I halide  :         0x7a4fdef000
    01-13 12:02:23.091 13601 13601 I halide  :     Time: 7.411500e-02 ms
    01-13 12:02:23.091 13601 13601 I halide  : Hexagon: halide_hexagon_run (user_context: 0x0, state_ptr: 0x7a5362c060 (163624160), name: offload_rpc.curved.s0.__outermost_argv, function: 0x64344128e0 (0))
    01-13 12:02:23.091 13601 13601 I halide  :     halide_hexagon_remote_get_symbol offload_rpc.curved.s0.__outermost_argv

Information about memory allocation and other calls made to the Halide runtime
can be gleaned when the `debug` target feature is used. However, this feature
is not supported in the two Standalone modes.

### [Useful environment variables](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id30)

The following environment variables are the most useful ones when working with Halide.

- `HL_HEXAGON_SIM_REMOTE`
    - In Simulator Offload mode, this variable is required to be set. It is the path
to the Hexagon simulator wrapper. For most use cases, set it to
`Halide/bin/hexagon_sim_remote`.

- `HL_HEXAGON_SIM_VERBOSE`
    - In Simulator Offload mode, set this variable to 1 to generate verbose messages
from the simulator.

- `HL_HEXAGON_MEMFILL`
    - In Simulator Offload mode, set this variable to a one-byte value as the value
to fill all of memory (that is, the initial value of uninitialized memory).
The default value is 0x1f.

- `HL_HEXAGON_TIMING`
    - In Simulator Offload mode, set this variable to 1 to enable timing mode with
data-backed caches in the simulator.

- `HL_HEXAGON_SIM_MIN_TRACE`
    - In Simulator Offload mode, this variable enables a minimal PC trace. Set the
value to the name you want for the trace file.

- `HL_HEXAGON_SIM_TRACE`
    - In Simulator Offload mode, this variable enables a full PC trace. Set the
      value to the name you want for the trace file.

      This variable generates more output than `HL_HEXAGON_SIM_MIN_TRACE`.

- `HL_HEXAGON_SIM_MEM_TRACE`
    - In Simulator Offload mode, this variable enables a memory trace. Set the value
      to the name you want for the trace file.

      This variable generates more output than `HL_HEXAGON_SIM_TRACE`.

- `HL_HEXAGON_SIM_DBG_PORT`
    - In Simulator Offload mode, this variable sets up the simulator for remote
      debugging. Set the value to a TCP port number.

      The simulator listens on this port for remote debugging and writes status
      information during the simulation.

      Use this variable with `hexagon-lldb`, which is available as part of the
      Hexagon Tools.

- `HL_HEXAGON_PACKET_ANALYZE`
    - In Simulator Offload mode, set this variable to the name you want for the file
      to which the simulator will write packet analysis information.

      This data is in JSON format and is consumed by `hexagon-profiler`, which is
      available as part of the Hexagon Tools. `hexagon-profiler` visualizes
      instruction-packet-level performance data, such as packet commits and
      stalls, in HTML format.

- `HL_HEXAGON_SIM_CYCLES`
    - In Simulator Offload mode, set this variable to 1 to have the simulator write
the number of cycles taken to execute the pipeline. Output is written to
stdout.

- `HL_HEXAGON_SIM_STATS`
    - In Simulator Offload mode, set this variable to 1 to have the simulator write
performance statistics to stdout. The following example shows the format of
this data.

    Done!
    T0: Insns=6761 Tcycles=3530
    T1: Insns=0 Tcycles=0
    T2: Insns=0 Tcycles=0
    T3: Insns=0 Tcycles=0
    T4: Insns=0 Tcycles=0
    T5: Insns=0 Tcycles=0
    Total: Insns=6761 Pcycles=21192
    Simulator speed=0.419547 Mips
    Ratio to Real Time (600 MHz) = ~1/456
    (elapsed time = 0.016115s)

- `HL_HEXAGON_CODE_SIGNER`
    - In the two Offload modes, set this variable to the path to a script that the
      Halide compiler calls to sign the offloaded shared objects that the
      pipeline and the device-side runtime have been compiled to. For more
      information, see [Tools for signing Offload mode pipelines](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#toolsforsigningoffloadmodepipelines).

- `HL_DEBUG_CODEGEN`
    - This variable is available in all modes and can be used when running a
      generator to make the Halide compiler output verbose data about the
      compilation. It accepts integer values that signify increasing levels of
      verbosity.

      This variable is especially useful for inspecting the assembly code generated
      for your Halide pipeline when using one of the two Offload modes.
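Because `HL_HEXAGON_SIM_REMOTE` is required in Simulator Offload mode, a host application might check for it before running a pipeline. The following is a minimal sketch of such a check; `env_is_set` and `check_simulator_offload_env` are hypothetical helper names, not part of Halide, and the check covers only the one variable that the list above marks as required.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: true if the variable exists and is non-empty.
inline bool env_is_set(const char *name) {
    const char *value = std::getenv(name);
    return value != nullptr && value[0] != '\0';
}

// Hypothetical preflight check for Simulator Offload mode.
// Returns an empty string if the required variable is set,
// otherwise a message describing what is missing.
inline std::string check_simulator_offload_env() {
    if (!env_is_set("HL_HEXAGON_SIM_REMOTE")) {
        return "HL_HEXAGON_SIM_REMOTE must be set to the path of the "
               "Hexagon simulator wrapper";
    }
    return "";
}
```

An application could call `check_simulator_offload_env()` at startup and print the returned message if it is not empty.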

### [Common compile time or runtime errors](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id31)

#### Crashed thread due to TLBMISS X (execution)

In Device Standalone mode, a crash such as the following might show up in the
QXDM or mini-dm logs:

    [08500/04] 50:55.650 00: CDSP:############################### Process on cDSP CRASHED!!!!!!! ######################################## 0725 platform_qdi_driver.c
    [08500/04] 50:55.650 00: CDSP:--------------------- Crash Details are furnished below ------------------------------------ 0725 platform_qdi_driver.c
    [08500/04] 50:55.650 00: CDSP:process "/frpc/c04d4de0 main-conv3x3a16" crashed in thread "/frpc/c04d4de0 " due to TLBMISS X (execution) 0725 platform_qdi_driver.c
    [08500/04] 50:55.650 00: CDSP:Crashed Shared Object ./libconv3x3a16_skel.so load address : 0xE040C000 0725 platform_qdi_driver.c
    [08500/04] 50:55.650 00: CDSP:fastrpc_shell_3 load address : E6F00000 and size : D8130 0725 platform_qdi_driver.c
    [08500/04] 50:55.650 00: CDSP:Fault PC : 0x0 0725 platform_qdi_driver.c
    [08500/04] 50:55.650 00: CDSP:LR : 0xE041420C 0725 platform_qdi_driver.c
    [08500/04] 50:55.650 00: CDSP:SP : 0xC5E875B8 0725 platform_qdi_driver.c
    [08500/04] 50:55.650 00: CDSP:Bad va : 0xE0400282 0725 platform_qdi_driver.c
    [08500/04] 50:55.650 00: CDSP:FP : 0xC5E875F8 0725 platform_qdi_driver.c
    [08500/04] 50:55.650 00: CDSP:SSR : 0x21F70C60 0725 platform_qdi_driver.c
    [08500/04] 50:55.650 00: CDSP:Call trace: 0725 platform_qdi_driver.c
    [08500/04] 50:55.650 00: CDSP:[<E041420C>] halide_error_access_out_of_bounds+0x2AC: (./libconv3x3a16_skel.so) 0725 platform_qdi_driver.c
    [08500/04] 50:55.650 00: CDSP:[<E04192F8>] conv3x3a16_halide+0x1BB8: (./libconv3x3a16_skel.so) 0725 platform_qdi_driver.c
    [08500/04] 50:55.650 00: CDSP:[<E040FCB8>] conv3x3a16_run+0x124: (./libconv3x3a16_skel.so) 0725 platform_qdi_driver.c
    [08500/04] 50:55.650 00: CDSP:[<E040F74C>] conv3x3a16_skel_invoke+0x24C: (./libconv3x3a16_skel.so) 0725 platform_qdi_driver.c
    [08500/04] 50:55.650 00: CDSP:[<E6F76BCC>] mod_table_invoke+0x224: (fastrpc_shell_3) 0725 platform_qdi_driver.c
    [08500/04] 50:55.650 00: CDSP:[<E6F96EF0>] fastrpc_invoke_dispatch+0x170: (fastrpc_shell_3) 0725 platform_qdi_driver.c
    [08500/04] 50:55.650 00: CDSP:[<E6F7135C>] HAP_proc_adaptive_qos+0x3D0: (fastrpc_shell_3) 0725 platform_qdi_driver.c
    [08500/04] 50:55.650 00: CDSP:[<E6F72F30>] _pl_fastrpc_uprocess+0x758: (fastrpc_shell_3) 0725 platform_qdi_driver.c
    [08500/04] 50:55.650 00: CDSP:----------------------------- End of Crash Report -------------------------------------------------- 0725 platform_qdi_driver.c

The log reports a TLB miss exception, specifically TLBMISS X (execution). The
call trace shows that it occurred in `halide_error_access_out_of_bounds`, which
is one of the error handler functions in the Halide runtime. These functions are
called when the pipeline fails an assert during execution. A handler prints a
message and returns an error code, and it does so through calls to
`halide_error` and `halide_print`. A fault PC of 0x0, as in this log, is
consistent with a call to a function that has no implementation.

In the two Standalone modes, you must provide implementations for these functions
in the DSP code. Following is an example implementation of the functions that
will work in Device Standalone mode.

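    #include "HAP_farf.h"
    // Enable FARF messages at the LOW level.
    #undef FARF_LOW
    #define FARF_LOW 1

    // Prints a message from the Halide runtime to the DSP log.
    void halide_print(void *user_context, const char *str) {
        FARF(LOW, "%s", str);
    }

    // Reports an error message by forwarding it to halide_print.
    void halide_error(void *user_context, const char *msg) {
        halide_print(user_context, msg);
    }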

If you run the pipeline again with these implementations in place, it still
fails, because the pipeline was already in an error handler function when the
TLB miss exception occurred. This time, however, the log reports the underlying
error, as shown in the following example. In this case, the output buffer is
accessed beyond its extent in dimension 1.

    [08500/02] 57:00.071 00: CDSP:fastrpc_spawn: Successfully spawned user PD for /frpc/c04d4de0 main-conv3x3a16 (pidA 27003) with asid 12 and load addr 0xe6f00000 0959 fastrpc_loader.c
    [08500/02] 57:00.073 00: CDSP:HVX power request successful for client 11 0677 fastrpc_kpower.c
    [08500/02] 57:00.085 770df:0c: CDSP: open_mod_table_open_dynamic: Module libconv3x3a16_skel.so opened successfully with handle 0x7203fa0 0540 mod_table.c
    [08500/02] 57:00.086 770df:0c: CDSP: perf_turbo 0051 conv3x3a16_i.c
    [08500/02] 57:00.086 00: CDSP:Max MIPS = 2147483647 1167 fastrpc_kpower.c
    [08500/02] 57:00.086 00: CDSP:Max bus bw = 1636000000 1177 fastrpc_kpower.c
    [08500/02] 57:00.086 00: CDSP:Setting app type to COMPUTE_CLIENT_CLASS 0731 fastrpc_kpower.c
    [08500/02] 57:00.086 00: CDSP:Setting clock - mipsTotal: -2, mipsPerThread: 2147483647 0327 fastrpc_kpower.c
    [08500/02] 57:00.086 00: CDSP:Setting bus_bw - bwBytePerSec: 6543114240, usagePercentage: 100 0342 fastrpc_kpower.c
    [08500/02] 57:00.086 00: CDSP:Setting latency 10 0361 fastrpc_kpower.c
    [08500/02] 57:00.088 770df:0c: CDSP: power_on 0031 conv3x3a16_i.c
    [08500/02] 57:00.090 770df:0c: CDSP: Before Call 0099 conv3x3a16_i.c
    [08500/00] 57:00.090 770df:0c: CDSP: In halide_error 0115 conv3x3a16_i.c
    [08500/00] 57:00.090 770df:0c: CDSP: Output buffer output is accessed at 1023, which is beyond the max (1021) in dimension 1 0111 conv3x3a16_i.c
    [08500/02] 57:00.090 770df:0c: CDSP: After Call = -4 0101 conv3x3a16_i.c
    [08500/02] 57:00.090 770df:0c: CDSP: error = -4 0104 conv3x3a16_i.c
    [08500/03] 57:00.090 770df:0c: CDSP: Error 0xffffffff: open_mod_table_handle_invoke failed for handle 0x7203fa0, sc 0x5030100 0785 mod_table.c
    [08500/02] 57:00.091 00: CDSP:HVX power released for client 11 0681 fastrpc_kpower.c
    [08500/02] 57:00.092 00: CDSP:fastrpc_kill done for pidA 27003 1143 fastrpc_loader.c

#### Accessing buffers out of bounds and violating constraints

A Halide pipeline might fail with the error code
`halide_error_code_access_out_of_bounds`. Typically, this error occurs because
the pipeline tried to access a buffer beyond the extent of one of the
dimensions of the buffer passed during execution.

In Device Offload mode, this error message can be seen using `adb`.
In Simulator Offload mode, this error is seen on stdout.

In the two Standalone modes, you must provide implementations of `halide_error`
and `halide_print`. Then, in Device Standalone mode, these messages can be seen
using QXDM or mini-dm, and in Simulator Standalone mode these messages will appear
on stdout.

Following is an example in Device Standalone mode (mini-dm).

    [08500/02] 57:00.071 00: CDSP:fastrpc_spawn: Successfully spawned user PD for /frpc/c04d4de0 main-conv3x3a16 (pidA 27003) with asid 12 and load addr 0xe6f00000 0959 fastrpc_loader.c
    [08500/02] 57:00.073 00: CDSP:HVX power request successful for client 11 0677 fastrpc_kpower.c
    [08500/02] 57:00.085 770df:0c: CDSP: open_mod_table_open_dynamic: Module libconv3x3a16_skel.so opened successfully with handle 0x7203fa0 0540 mod_table.c
    [08500/02] 57:00.086 770df:0c: CDSP: perf_turbo 0051 conv3x3a16_i.c
    [08500/02] 57:00.086 00: CDSP:Max MIPS = 2147483647 1167 fastrpc_kpower.c
    [08500/02] 57:00.086 00: CDSP:Max bus bw = 1636000000 1177 fastrpc_kpower.c
    [08500/02] 57:00.086 00: CDSP:Setting app type to COMPUTE_CLIENT_CLASS 0731 fastrpc_kpower.c
    [08500/02] 57:00.086 00: CDSP:Setting clock - mipsTotal: -2, mipsPerThread: 2147483647 0327 fastrpc_kpower.c
    [08500/02] 57:00.086 00: CDSP:Setting bus_bw - bwBytePerSec: 6543114240, usagePercentage: 100 0342 fastrpc_kpower.c
    [08500/02] 57:00.086 00: CDSP:Setting latency 10 0361 fastrpc_kpower.c
    [08500/02] 57:00.088 770df:0c: CDSP: power_on 0031 conv3x3a16_i.c
    [08500/02] 57:00.090 770df:0c: CDSP: Before Call 0099 conv3x3a16_i.c
    [08500/00] 57:00.090 770df:0c: CDSP: In halide_error 0115 conv3x3a16_i.c
    [08500/00] 57:00.090 770df:0c: CDSP: Output buffer output is accessed at 1023, which is beyond the max (1021) in dimension 1 0111 conv3x3a16_i.c
    [08500/02] 57:00.090 770df:0c: CDSP: After Call = -4 0101 conv3x3a16_i.c
    [08500/02] 57:00.090 770df:0c: CDSP: error = -4 0104 conv3x3a16_i.c
    [08500/03] 57:00.090 770df:0c: CDSP: Error 0xffffffff: open_mod_table_handle_invoke failed for handle 0x7203fa0, sc 0x5030100 0785 mod_table.c
    [08500/02] 57:00.091 00: CDSP:HVX power released for client 11 0681 fastrpc_kpower.c
    [08500/02] 57:00.092 00: CDSP:fastrpc_kill done for pidA 27003 1143 fastrpc_loader.c

This message indicates that the buffer `output` was accessed at index 1023,
which is beyond the maximum valid index (1021) in dimension 1. In other words,
the pipeline tried to access the buffer beyond the number of rows in the
external buffer
([External Buffers](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide.html#externalbuffers)) passed for `output`.

When a Halide pipeline is compiled into object code, the Halide compiler adds
asserts based on the schedule used. For example, consider the following
schedule.

    Func(output)
      .tile(x, y, xi, yi, 128, 4, TailStrategy::RoundUp)
      .vectorize(xi)
      .unroll(yi);

This schedule uses `TailStrategy::RoundUp`. While tiling `output`, the
remainder loops over `x` and `y` are rounded up to the next multiple of the
split factor in that dimension (128 in `x` and 4 in `y`).

If, at runtime, an external buffer with only 1022 rows is passed for `output`,
the loop over `y`, which is rounded up to the next multiple of 4, covers 1024
rows (indices 0 through 1023). The last valid row has index 1021, so the loop
runs beyond the extent of the buffer passed for `output`.
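The numbers in this example can be reproduced with a few lines of integer arithmetic. The sketch below is not Halide code; `round_up` and the constant names are hypothetical and serve only to show where the 1023 and 1021 in the error message come from.

```cpp
// Rounds extent up to the next multiple of factor (factor must be positive).
constexpr int round_up(int extent, int factor) {
    return ((extent + factor - 1) / factor) * factor;
}

constexpr int kRows = 1022;   // rows in the external buffer passed for output
constexpr int kSplitY = 4;    // split factor in y from the tile() call

constexpr int kLoopRows = round_up(kRows, kSplitY);  // rows the y loop covers
constexpr int kMaxValidIndex = kRows - 1;            // last row in the buffer
constexpr int kMaxAccessedIndex = kLoopRows - 1;     // last row the loop touches

static_assert(kLoopRows == 1024, "1022 rounded up to a multiple of 4");
static_assert(kMaxValidIndex == 1021, "matches the max in the error message");
static_assert(kMaxAccessedIndex == 1023, "matches the accessed index in the error message");
```

A buffer whose row count is already a multiple of 4 gives `kLoopRows == kRows`, in which case the rounded-up loop stays within bounds.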

Similarly, the following code example will cause asserts (checked during execution
of the pipeline) to be added to the compiled Halide pipeline.

    input.dim(0).set_min(0);
    input.dim(1).set_min(0);
    Expr input_stride = input.dim(1).stride();
    input.dim(1).set_stride((input_stride/128) * 128);
    input.set_host_alignment(128);

If, during execution, the stride in dimension 1 of the external buffer passed
for `input` is not a multiple of 128, `halide_error_code_constraint_violated`
is generated.

If the host pointer of the external buffer is not aligned to 128 bytes,
`halide_error_code_unaligned_host_ptr` is generated.
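The two conditions can be expressed as ordinary C++ predicates, which might be useful for validating a buffer on the host before calling the pipeline. This is a sketch under that assumption; `stride_satisfies_constraint` and `host_is_aligned` are hypothetical helpers, not Halide APIs. The first mirrors the `(input_stride/128) * 128` expression: integer division discards the remainder, so the result equals the original stride only when the stride is a multiple of 128.

```cpp
#include <cstdint>

// True if stride == (stride / 128) * 128, that is, stride is a multiple of 128.
constexpr bool stride_satisfies_constraint(int stride) {
    return (stride / 128) * 128 == stride;
}

// True if the host pointer is aligned to the given number of bytes.
inline bool host_is_aligned(const void *host, std::uintptr_t alignment = 128) {
    return reinterpret_cast<std::uintptr_t>(host) % alignment == 0;
}

static_assert(stride_satisfies_constraint(384), "384 = 3 * 128");
static_assert(!stride_satisfies_constraint(292), "292 is not a multiple of 128");
```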

## [Signing for deployment](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id32)

Software applications can use the cDSP only if one of the following holds true.

- The code executing on the cDSP is signed.
- The device is a test device and has a testsig installed.
- The code that is offloaded to the DSP is offloaded by setting up an unsigned
User PD (available only on SM8150 or later devices).

### [Tools for signing Offload mode pipelines](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id33)

In Device Offload mode, Halide pipelines are host (Arm) executables containing
embedded Hexagon shared objects that are offloaded to the cDSP at runtime.
Several tools are provided to help extract these shared objects for signing.
The method for signing the objects itself is not defined here because it
depends on how the target device has been configured.

If signing can be done during Halide compilation (running your generator), use
`hl_signnow` from your Halide installation (it is in the `Halide/tools`
directory). Replace the `cp $1 $2` step in the script with the method
required to sign a shared object.

The following example shows how to sign a Halide pipeline during compilation,
that is, while running the generator to produce the pipeline code.

    $> export HL_HEXAGON_CODE_SIGNER=/path/to/hl_signnow
    $> <build and run the generator>
     ...
     ...
     hl_signnow: signing /tmp/hvx_unsignedy3t97n.so as /tmp/hvx_signedYkst0t.so
     hl_signnow: signing /tmp/hvx_unsignedFOHXXK.so as /tmp/hvx_signedCZksV1.so

If signing must be done separately from compilation, use `hl_signsav` and
`hl_signuse` from your Halide installation (they are in the `Halide/tools`
directory). Signing is now a three-step process.

1. Extract and save the shared objects when running the generator by setting
the `HL_HEXAGON_CODE_SIGNER` environment variable to `hl_signsav`.
2. Sign the extracted shared objects.
3. Plug the signed shared objects back into the application by rebuilding the
application (running the generator again).

For example:

    $> /bin/rm -rf /tmp/hl_sign_$USER
    
    $> export HL_HEXAGON_CODE_SIGNER=/path/to/hl_signsav
    
    $> <build and run the generator>
     ...
     ...
     ...
     saving /tmp/hvx_unsignedjAQVqB.so as /tmp/hl_sign_$USER/lib000.so
     saving /tmp/hvx_unsignedFOHXXK.so as /tmp/hl_sign_$USER/lib001.so
    
    $> ls /tmp/hl_sign_$USER
     lib000.so    lib001.so
    
    $> <sign the libraries>
    
    $> export HL_HEXAGON_CODE_SIGNER=/path/to/hl_signuse
    
    $> <build and run the generator>
     ...
     ...
     ...
     hl_signuse: copying /tmp/hl_sign_$USER/lib000.so to /tmp/hvx_signedSRbsjF.so
     hl_signuse: copying /tmp/hl_sign_$USER/lib001.so to /tmp/hvx_signedXLAWUD.so

The `/tmp/hl_sign_$USER` directory will be empty after the second build step
because the libraries are moved to a subdirectory called `done` as they are
used. If the application is to be built again, move these objects back up one
level.

You can also use `hl_signsav` to simply obtain the Hexagon shared objects for
examination whenever needed.

Note

On Windows devices, the shared objects are saved in `/temp/hl_sign_%USERNAME%`.

For Windows usage, see `Halide/tools/hl-sign-notes.cmd.txt`.

### [Unsigned mode execution](https://docs.qualcomm.com/doc/80-PD002-1/topic/halide_for_hvx.html#id34)

Halide pipelines can be run on the Hexagon DSP in Unsigned mode (unsigned User
PD) by calling the `halide_hexagon_init_unsigned_mode()` function from the
application before calling the Halide pipeline. This function must be called
before any Hexagon-related function that uses FastRPC. Calling this function
is the first thing that an application should do to ensure that an unsigned
User PD is set up for offloading to the cDSP.

The following example shows the application-side (host-side) code used to set up
Unsigned mode execution.

    int main(int argc, char **argv) {
      const int width = 292;
      const int height = 274;
      int iterations = 10;
    
      if (halide_hexagon_init_unsigned_mode(NULL) == -1) {
        printf("Unable to set halide_hexagon_init_unsigned_mode\n");
      }
    
      const int VLEN=128;
      int stride_y = (width + (VLEN)-1) & (-(VLEN));
      halide_dimension_t x_dim{0, width, 1};
      halide_dimension_t y_dim{0, height, stride_y};
      halide_dimension_t io_shape[2] = {x_dim, y_dim};
    
      ...
      ...
    }

Note

This feature is available only on Snapdragon SM8150 devices and later.
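The `stride_y` expression in the host-side example above rounds the image width up to a multiple of the vector length. The following sketch isolates that computation so the result can be checked; `align_up_pow2` is a hypothetical helper name, and the bit trick relies on the alignment being a power of two.

```cpp
// Rounds value up to the next multiple of alignment.
// alignment must be a power of two: -alignment then has all bits set
// except the low log2(alignment) bits, so the AND clears the remainder.
constexpr int align_up_pow2(int value, int alignment) {
    return (value + alignment - 1) & -alignment;
}

constexpr int kVlen = 128;   // VLEN in the example
constexpr int kWidth = 292;  // width in the example

// 292 + 127 = 419, and clearing the low 7 bits of 419 gives 384.
static_assert(align_up_pow2(kWidth, kVlen) == 384, "stride_y for width 292");
static_assert(align_up_pow2(256, kVlen) == 256, "multiples are unchanged");
```

With these values, each row of the buffer in the example starts 384 elements after the previous one, although only 292 elements per row hold image data.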

Last Published: Jul 08, 2024
