# Vector instructions

This chapter provides an overview of the HVX load/store instructions,
compute instructions, VLIW packet rules, dependency, and scheduling
rules.

[Slot resource latency](https://docs.qualcomm.com/doc/80-N2040-61/topic/vector-instructions.html#v79-hvx-slot-resource-latency-summary)
summarizes Hexagon slot, HVX resource, and instruction latency for
the instruction categories.

## VLIW packing rules

HVX provides the following resources for vector instruction execution:

- load
- store
- shift
- permute/shift
- two multiply

Each HVX instruction consumes some combination of these resources, as
defined in [vector instruction resource usage](https://docs.qualcomm.com/doc/80-N2040-61/topic/vector-instructions.html#v79-hvx-vector-instruction-resource-usage).
VLIW packets cannot oversubscribe resources.

An instruction packet can contain up to four instructions, plus an
end loop. The instructions inside the packet must obey the packet
grouping rules described in [vector instruction](https://docs.qualcomm.com/doc/80-N2040-61/topic/vector-instructions.html#v79-hvx-vector-instruction).

The permute resource can also execute instructions that require the
shift resource.

Note: The assembler should check and flag invalid packet
combinations. When an invalid packet executes, the behavior is
undefined.

### Double vector instructions

Certain instructions consume a pair of resources, either both the
shift and permute as a pair or both multiply resources as another
pair. Such instructions are referred to as double vector instructions
because they use two vector compute resources.

Halfword by halfword multiplies are double vector instructions,
because they consume both the multiply resources.

### Vector instruction resource usage

[HVX execution resource usage](https://docs.qualcomm.com/doc/80-N2040-61/topic/vector-instructions.html#v79-tbl-hvx-execution-resource-usage)
summarizes the resources that an HVX instruction uses during execution.
The table is also ordered the way the Hexagon assembler tries to build
an instruction packet: from the most to the least stringent resource
requirement.

HVX execution resource usage

| **Instruction** | **Used resources** |
| --- | --- |
| Histogram | All |
| Unaligned memory access | Load, store, and permute |
| Double vector cross-lane permute | Permute and shift |
| Cross-lane permute | Permute |
| Shift | Shift |
| Double vector & halfword multiplies | Both multiply |
| Single vector multiplies | Either multiply |
| Double vector ALU operation | Either shift and permute or both multiply |
| Single vector ALU operation | Shift, permute, or multiply |
| Aligned memory | Shift, permute, or multiply and one of load or store |
| Aligned memory (.tmp/.new) | Load or store only |
| Scatter (single vector indexing) | Store and one of shift, permute, or multiply |
| Scatter (double vector indexing) | Store and either shift and permute or both multiply |
| Gather (single vector indexing) | Load and one of shift, permute, or multiply |
| Gather (double vector indexing) | Load and either shift and permute or both multiply |
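
The oversubscription rule can be illustrated with a small Python model. This is only a sketch of a subset of the table rows: the class names and the labels `mpy0` and `mpy1` for the two multiply resources are hypothetical, and the `shift` class follows its table row only (it does not model the permute resource accepting shift instructions).

```python
from itertools import product

# Compute resources named in the text. "mpy0" and "mpy1" are hypothetical
# labels for the two multiply resources.
COMPUTE = ["shift", "permute", "mpy0", "mpy1"]

# Each instruction class maps to the alternative resource sets it may occupy.
# Only a subset of the table rows is modeled here.
USAGE = {
    "cross_lane_permute": [{"permute"}],
    "shift": [{"shift"}],
    "double_multiply": [{"mpy0", "mpy1"}],
    "single_multiply": [{"mpy0"}, {"mpy1"}],
    "double_alu": [{"shift", "permute"}, {"mpy0", "mpy1"}],
    "single_alu": [{c} for c in COMPUTE],
    "aligned_load": [{"load", c} for c in COMPUTE],
    "aligned_load_tmp": [{"load"}],
}


def packet_fits(instructions):
    """Return True if some choice of alternatives uses each resource at most once."""
    for choice in product(*(USAGE[i] for i in instructions)):
        used = [resource for option in choice for resource in option]
        if len(used) == len(set(used)):
            return True
    return False
```

For example, two single vector multiplies plus two single vector ALU operations fit in this model (one resource each), while a double vector multiply cannot share a packet with any other multiply.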

### Vector instruction

Vector instructions map to certain Hexagon slots. A special subset of
ALU instructions (including lookup table, splat, insert, and
addition/subtraction with Rt) that require the full 32 bits of
the scalar Rt register (or 64 bits of Rtt) map to slots 2 and 3.

HVX instruction to Hexagon slots mapping

| **Instruction** | **Used Hexagon slots** | **Additional restrictions** |
| --- | --- | --- |
| Aligned memory load | 0 or 1 |  |
| Aligned memory store | 0 |  |
| Unaligned memory load/store | 0 | Slot 1 must be empty. Maximum of 3 instructions allowed in the packet. |
| Scatter | 0 |  |
| Gather | 1 | .new store in slot 0 |
| Vextract |  | Only instruction in packet |
| Histogram | 0, 1, 2, or 3 | .tmp load in same packet |
| Multiplies | 2 or 3 |  |
| Using full 32-64 bit R | 2 or 3 |  |
| Simple ALU, permute, shift | 0, 1, 2, or 3 |  |
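
As a rough illustration of the slot mapping, the following Python sketch searches for an assignment of distinct slots to the instructions of a packet. The class names are hypothetical, only rows with a plain slot list are modeled, and the "Additional restrictions" column and the vector resource limits are ignored.

```python
from itertools import permutations

# Allowed Hexagon slots per instruction class, taken from the table rows
# that have no additional restrictions. The class names are hypothetical.
ALLOWED_SLOTS = {
    "aligned_load": {0, 1},
    "aligned_store": {0},
    "scatter": {0},
    "multiply": {2, 3},
    "simple_alu": {0, 1, 2, 3},
}


def assign_slots(instructions):
    """Return one distinct slot per instruction (same order), or None if impossible."""
    if len(instructions) > 4:
        return None
    for slots in permutations(range(4), len(instructions)):
        if all(slot in ALLOWED_SLOTS[instr] for instr, slot in zip(instructions, slots)):
            return list(slots)
    return None
```

In this model, three multiplies can never share a packet because only slots 2 and 3 accept them, and an aligned store cannot be grouped with a scatter because both need slot 0.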

## Vector load/store

VMEM instructions move data between VRF and memory. VMEM instructions
support the following addressing modes.

- Indirect
- Indirect with offset
- Indirect with auto-increment (immediate and register/modifier
register)

For example:

    V2 = vmem(R1+#4)    // Address R1 + 4 * (vector-size) bytes
    V2 = vmem(R1++M1)   // Address R1, post-modify by the value of M1

The immediate offset and post-increment values are vector counts,
so the byte offset is a multiple of the vector length.
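
The scaling of an immediate offset can be written out as a short Python sketch. The 128-byte vector length is an assumption made for illustration, since this section does not state the vector size, and the function name is hypothetical.

```python
VECTOR_BYTES = 128  # assumed vector length in bytes; not stated in this section


def vmem_offset_address(base, vector_count, vector_bytes=VECTOR_BYTES):
    """Byte address for vmem(Rx+#vector_count): the immediate counts whole vectors."""
    return base + vector_count * vector_bytes
```

Under that assumption, `vmem(R1+#4)` with R1 = 0x1000 addresses byte 0x1200.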

To facilitate unaligned memory access, unaligned loads and stores are
available. The VMEMU instructions generate multiple accesses to the
L2 cache and use the permute network to align the data.

### Load-temp and load-current

The load-temp and load-current forms allow immediate use of load data
within the same packet. A load-temp instruction does not write the
load data into the register file. A register must be specified, but
it is not overwritten. Because the load-temp instruction does not
write to the register file, it does not consume a vector ALU
resource.

A load-temp destination register must not be used as an accumulator
register within the same packet. Doing so results in undefined behavior.

    {
        V2.tmp = vmem(R1+#1)                 // Data loaded to a tmp
        V5:4.ub = vadd(V3.ub,V2.ub)          // Use loaded data as V2 source
        V7:6.uw = vrmpy(V5:4.ub,R5.ub,#0)
    }

Load-current is similar to load-temp, but it consumes a vector ALU
resource because the loaded data is written to the register file.

    {
        V2.cur = vmem(R1+#1)                 // Data loaded into V2
        V3 = valign(V1,V2,R4)                // Load data used immediately
        V7:6.ub = vrmpy(V5:4.ub,R5.ub,#0)
    }

### New-value store

VMEM store instructions can store a newly generated value from a vector register
in the same packet. These instructions do not consume a vector ALU resource because
they neither read nor write the register file. This feature is expressed in assembly
language by appending the suffix `.new` to the source register. The store must be in slot 0.

    {
        V20.w = vmax(V0.w,V1.w)
        vmem(R1+#1) = V20.new    // Store V20 that was generated in the current packet
    }

### Predicate stores

An entire VMEM write can also be suppressed by a scalar predicate.

    if P0 vmem(R1++M1) = V20    // Store V20 if P0 is true

A vector predicate register can issue and control a partial
byte-enabled store.

    if Q0 vmem(R1++M1) = V20    // Store bytes of V20 where Q0 is true
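
The byte-enabled behavior can be modeled in a few lines of Python. This is an illustrative sketch, not the hardware definition: memory is a `bytearray`, the vector and the predicate are plain lists, and the function name is hypothetical. A scalar-predicated store corresponds to a mask that is all true or all false.

```python
def predicated_store(memory, address, vector, byte_enable):
    """Write vector[i] to memory[address + i] only where byte_enable[i] is true."""
    for i, (value, enabled) in enumerate(zip(vector, byte_enable)):
        if enabled:
            memory[address + i] = value
    return memory
```

Bytes whose predicate bit is false keep their previous memory contents.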

## Scatter and gather

Unlike vector loads and stores that access contiguous vectors in
memory, scatter and gather allow for noncontiguous memory access of
vector data. With scatter and gather, each element can independently
index into a region of memory. This allows for applications that
otherwise do not map well to the SIMD parallelism that HVX provides.

A scatter transfers data from a contiguous vector to noncontiguous
memory locations. Similarly, gather transfers data from noncontiguous
memory locations to a contiguous vector. In HVX, scatter is a vector
register to noncontiguous memory transfer, and gather is a
noncontiguous memory to contiguous memory transfer. Additionally, HVX
supports scatter-accumulate instructions that atomically add the
vector data to the addressed memory locations.

To maximize performance and efficiency, the scatter and gather
instructions define a bounded region that must contain noncontiguous
accesses. This region must be within VTCM (scatter/gather capable)
and be within one translatable page. A vector specifies offsets from
the base of the region for each element access.
[Sources for noncontiguous accesses](https://docs.qualcomm.com/doc/80-N2040-61/topic/vector-instructions.html#v79-tbl-sources-for-noncontiguous-accesses)
lists the three sources that specify the noncontiguous accesses of a scatter or gather:

Sources for noncontiguous accesses: (Rt, Mu, Vv)

| **Source** | **Meaning** |
| --- | --- |
| Rt | Base address of the region |
| Mu | Byte offset of the last valid byte of the region (that is, region size - 1) |
| Vv or Vvv | Vector of byte offsets for the accesses. A double vector is used when the offset width is double the data width. |

To form an HVX gather (memory to memory), vgather is paired with a
vector store that specifies the destination address. A scatter is
specified with a single instruction. Ignoring element sizes,
[Basic scatter and gather instructions](https://docs.qualcomm.com/doc/80-N2040-61/topic/vector-instructions.html#v79-tbl-basic-scatter-and-gather-instructions)
describes the basic forms of scatter and gather instructions:

Basic scatter and gather instructions

| **Instruction** | **Behavior** |
| --- | --- |
| vscatter(Rt,Mu,Vv)=Vw | Write data in Vw to noncontiguous addresses specified by (Rt,Mu,Vv) |
| vscatter(Rt,Mu,Vv)+=Vw | Atomically add data in Vw to noncontiguous addresses specified by (Rt,Mu,Vv) |
| { vtmp=vgather(Rt,Mu,Vv); vmem(Addr)=vtmp.new } | Read data from noncontiguous addresses specified by (Rt,Mu,Vv) and write the data contiguously to the aligned address |
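
The three forms can be sketched as a Python reference model over a `bytearray`. Several points are assumptions made for illustration and are not stated in this section: elements whose offset falls outside the region 0..Mu are skipped (a skipped gather element is reported as `None`), elements are little-endian, and accumulation wraps at the element width. The function names mirror the instructions but are otherwise hypothetical.

```python
def in_region(offset, width, mu):
    """True if an element of `width` bytes at `offset` lies within bytes 0..mu."""
    return 0 <= offset and offset + width - 1 <= mu


def vscatter(memory, rt, mu, vv, vw, width=1, accumulate=False):
    """Model vscatter(Rt,Mu,Vv)=Vw and, with accumulate=True, the += form."""
    for offset, value in zip(vv, vw):
        if not in_region(offset, width, mu):
            continue  # assumption: out-of-region accesses are skipped
        addr = rt + offset
        if accumulate:
            value += int.from_bytes(memory[addr:addr + width], "little")
        value %= 1 << (8 * width)  # assumption: wrap at the element width
        memory[addr:addr + width] = value.to_bytes(width, "little")


def vgather(memory, rt, mu, vv, width=1):
    """Model vgather(Rt,Mu,Vv): collect elements into a contiguous list."""
    return [
        int.from_bytes(memory[rt + o:rt + o + width], "little")
        if in_region(o, width, mu) else None  # None marks a skipped element
        for o in vv
    ]
```

The model processes elements one at a time, so two scatter-accumulate elements that target the same address are both applied, which is the visible effect of the atomic add.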

## Memory instruction slot combinations

VMEM load/store instructions and scatter/gather instructions can be
grouped with normal scalar load/store instructions.

[Valid VMEM load/store and scatter/gather combinations](https://docs.qualcomm.com/doc/80-N2040-61/topic/vector-instructions.html#v79-tbl-valid-vmem-load-store-and-scatter-gather-combinations)
provides the valid grouping combinations for HVX memory instructions.
A combination that is not present in the table is invalid and should
be rejected by the assembler. If such a packet executes anyway, the
hardware generates an invalid packet error exception.

Valid VMEM load/store and scatter/gather combinations

| **Slot 0 instruction** | **Slot 1 instruction** |
| --- | --- |
| VMEM Ld | Nonmemory |
| VMEM St | Nonmemory |
| VMEM Ld | Scalar Ld |
| Scalar St | VMEM Ld |
| Scalar Ld | VMEM Ld |
| VMEM St | Scalar St |
| VMEM St | Scalar Ld |
| VMEM St | VMEM Ld |
| VMEMU Ld | Empty |
| VMEMU St | Empty |
| .new VMEM St | Gather |
| Scatter | Nonmemory |
| Scatter | Scalar St |
| Scatter | Scalar Ld |
| Scatter | VMEM Ld |
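
A check against this table reduces to a set lookup. The Python sketch below copies the table rows verbatim; the function name is hypothetical, and the strings are simply the labels used in the table.

```python
# (slot 0 instruction, slot 1 instruction) pairs copied from the table above.
VALID_COMBINATIONS = {
    ("VMEM Ld", "Nonmemory"),
    ("VMEM St", "Nonmemory"),
    ("VMEM Ld", "Scalar Ld"),
    ("Scalar St", "VMEM Ld"),
    ("Scalar Ld", "VMEM Ld"),
    ("VMEM St", "Scalar St"),
    ("VMEM St", "Scalar Ld"),
    ("VMEM St", "VMEM Ld"),
    ("VMEMU Ld", "Empty"),
    ("VMEMU St", "Empty"),
    (".new VMEM St", "Gather"),
    ("Scatter", "Nonmemory"),
    ("Scatter", "Scalar St"),
    ("Scatter", "Scalar Ld"),
    ("Scatter", "VMEM Ld"),
}


def is_valid_memory_pair(slot0, slot1):
    """Return True if the slot 0 / slot 1 pairing appears in the table."""
    return (slot0, slot1) in VALID_COMBINATIONS
```

Note that the table is not symmetric: a VMEM store in slot 0 may pair with a VMEM load in slot 1, but the reverse pairing is not listed.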

## Special instructions

### Histogram

HVX contains a specialized histogram instruction. The vector register
file is divided into four histogram tables, each of 256 entries (32
registers by 8 halfwords). A temporary VMEM load instruction fetches
a line from memory. The top five bits of each byte select a register,
and the bottom three bits provide an element index. The value of
the addressed element in the register file is incremented. The
programmer must clear the registers before use.

Example:

    {
        V31.tmp = vmem(R2)  // Load a vector of data from memory
        vhist               // Perform histogram using counters in VRF and indexes
                            // from temp load
    }
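
The indexing described above can be illustrated with a Python sketch that models a single 256-entry table. How the input bytes are distributed across the four tables is not described in this section and is left out, and the 16-bit wraparound of a counter is an assumption based on the counters being halfwords. The function name is hypothetical.

```python
def vhist_one_table(counters, data_bytes):
    """Increment counters[register][element] once for each input byte.

    counters: 32 lists of 8 halfword counters (one 256-entry table).
    """
    for byte in data_bytes:
        register = byte >> 3   # top five bits select one of 32 registers
        element = byte & 0x7   # bottom three bits select one of 8 halfwords
        # Assumption: a halfword counter wraps at 16 bits.
        counters[register][element] = (counters[register][element] + 1) & 0xFFFF
    return counters
```

The table must start cleared, matching the requirement that the programmer clears the registers before use, for example `counters = [[0] * 8 for _ in range(32)]`.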

## Instruction latency

Latencies are implementation-defined and can change with future
versions.

HVX packets execute over multiple clock cycles, but execution is
typically pipelined so that a packet can issue and complete on every
context cycle. The contexts are time-interleaved to share the hardware,
so reaching peak compute bandwidth might require using all contexts.

With a few exceptions (for example, histogram and extract), the results
of a packet are generated within a fixed time after execution starts.
However, the time at which sources are required varies. Instructions
that need more pipelining require their sources early. Only HVX
registers are early source registers. Early source operands include:

- Input to the multiplier. For example, V3.h = vmpyh(V2.h, V4.h). V2
and V4 are multiplier inputs. For multiply instructions with
accumulation, the accumulator is not considered an early source
multiplier input.
- Input to shift/bit count instructions. Only the shifted or counted
register is considered early source. Accumulators are not early
sources.
- Input to permute instructions. Only permuted registers are considered
early source (not an accumulator).
- Unaligned store data is an early source.

An early source register produced in the previous vector packet can
incur an interlock stall. Software should strive to schedule an
intervening packet between the producer and an early source consumer.

The following example shows interlock cases:

    V8 = vadd(V0,V0)
    V0 = vadd(V8,V9)     // NO STALL
    V1 = vmpy(V0,R0)     // STALL due to V0
    V2 = vsub(V2,V1)     // NO STALL on V1
    V5:4 = vunpack(V2)   // STALL due to V2
    V2 = vadd(V0,V4)     // NO STALL on V4
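
The interlock rule can be expressed as a small Python checker. This is a simplified sketch, not a cycle-accurate model: each packet is reduced to its destination registers and its early source registers, and a stall is flagged whenever an early source was produced by the immediately preceding packet. The function and variable names are hypothetical.

```python
def find_interlock_stalls(packets):
    """packets: list of (destinations, early_sources) pairs, one per vector packet.

    Returns one bool per packet: True if an early source was produced
    by the immediately preceding packet.
    """
    stalls = []
    previous_dests = set()
    for dests, early_sources in packets:
        stalls.append(bool(previous_dests & set(early_sources)))
        previous_dests = set(dests)
    return stalls


# The sequence from the example above, one instruction per packet.
EXAMPLE = [
    ({"V8"}, set()),         # V8 = vadd(V0,V0)
    ({"V0"}, set()),         # V0 = vadd(V8,V9): ALU inputs are not early
    ({"V1"}, {"V0"}),        # V1 = vmpy(V0,R0): multiplier input is early
    ({"V2"}, set()),         # V2 = vsub(V2,V1)
    ({"V4", "V5"}, {"V2"}),  # V5:4 = vunpack(V2): permuted register is early
    ({"V2"}, set()),         # V2 = vadd(V0,V4)
]
```

Running the checker on `EXAMPLE` flags the third and fifth packets, matching the STALL comments in the listing.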

### Avoiding accumulator stalls

When an instruction with three vector sources uses an accumulator (Vx),
the accumulator must be produced in the prior packet to avoid stalling.

The HVX\_ACC\_ORDER PMU event indicates stalling caused by not following this rule.

The following example shows an accumulator stall with 3 vector source instructions.

    {
        V18.w += vrmpy(V0.b,V1.b)
    }
    {
        V24.w += vrmpy(V2.b,V3.b)    // Previous packet does not produce V24
    }

The following examples show non-stalling accumulator use with 3 vector source instructions.

    // Example 1:
    {
        V24.w += vrmpy(V0.b,V1.b)
    }
    {
        V24.w += vrmpy(V2.b,V3.b)    // Previous packet produces V24
    }

    // Example 2:
    {
        V24 = #0
    }
    {
        V24.w += vrmpy(V2.b,V3.b)    // Previous packet produces V24
    }
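
The accumulator rule can be checked the same way. In this Python sketch, each packet is reduced to its destination registers and the accumulator it uses (or `None`). The first packet is never flagged because nothing is known about what preceded it; that choice is an assumption of the model, as is the function name.

```python
def accumulator_stalls(packets):
    """packets: list of (destinations, accumulator) pairs; accumulator may be None.

    Returns True for each packet whose accumulator was not produced
    by the immediately preceding packet.
    """
    stalls = []
    previous_dests = None  # None: no prior packet is modeled
    for dests, acc in packets:
        stalls.append(
            acc is not None and previous_dests is not None and acc not in previous_dests
        )
        previous_dests = set(dests)
    return stalls
```

For the stalling example, the second packet accumulates into V24 while the prior packet produced V18, so it is flagged. Both non-stalling examples produce V24 in the prior packet and are not flagged.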

## Slot/resource/latency summary

[HVX slot/resource/latency summary](https://docs.qualcomm.com/doc/80-N2040-61/topic/vector-instructions.html#v79-tbl-hvx-slot-resource-latency-summary)
summarizes the Hexagon slot, HVX  resource, and latency requirements for HVX
instruction types.

| **Category** | **Variation** | **Core slot 3** | **Core slot 2** | **Core slot 1** | **Core slot 0** | **Vector resource: ld** | **Vector resource: mpy** | **Vector resource: mpy** | **Vector resource: shift** | **Vector resource: xlane** | **Vector resource: st** | **Input latency** |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **ALU** | **1 vector** | any | any | any | any |  | any | any | any | any |  | 1 |
| **ALU** | **2 vectors** | any | any | any | any |  | either pair | either pair | either pair | either pair |  | 1 |
| **ALU** | **Rt** | either | either |  |  |  | either | either |  |  |  | 1 |
| **Abs-diff** | **1 vector** | either | either |  |  |  | either | either |  |  |  | 2 |
| **Abs-diff** | **2 vector** | either | either |  |  |  |  |  |  |  |  | 2 |
| **Multiply** | **by 8 bits; 1 vector** | either | either |  |  |  | either | either |  |  |  | 2 |
| **Multiply** | **by 8 bits; 2 vector** | either | either |  |  |  |  |  |  |  |  | 2 |
| **Multiply** | **by 16 bits** | either | either |  |  |  |  |  |  |  |  | 2 |
| **Cross-lane** | **1 vector** | any | any | any | any |  |  |  |  |  |  | 2 |
| **Cross-lane** | **2 vectors** | any | any | any | any |  |  |  |  |  |  | 2 |
| **Shift or count** | **1 vector** | any | any | any | any |  |  |  |  |  |  | 2 |
| **load** | **aligned** |  |  | either | either |  | any | any | any | any |  | - |
| **load** | **aligned; .tmp** |  |  | either | either |  |  |  |  |  |  | - |
| **load** | **aligned; .cur** |  |  | either | either |  | any | any | any | any |  | - |
| **load** | **unaligned** |  |  |  |  |  |  |  |  |  |  | - |
| **store** | **aligned** |  |  |  |  |  | any | any | any | any |  | 1 |
| **store** | **aligned; .new** |  |  |  |  |  |  |  |  |  |  | 0 |
| **store** | **unaligned** |  |  |  |  |  |  |  |  |  |  | 2 |
| **gather (needs .new store)** | **1 vector** |  |  |  |  |  | any | any | any | any |  | 1 |
| **gather (needs .new store)** | **2 vector** |  |  |  |  |  | either pair | either pair | either pair | either pair |  | 1 |
| **scatter** | **1 vector** |  |  |  |  |  | any | any | any | any |  | 1 |
| **scatter** | **2 vector** |  |  |  |  |  | either pair | either pair | either pair | either pair |  | 1 |
| **histogram (needs .tmp load)** | **histogram (needs .tmp load)** | any | any | any | any |  |  |  |  |  |  | 2 |
| **extract** |  |  |  |  |  |  |  |  |  |  |  | 1 |

Last Published: Jan 16, 2025


Source: [https://docs.qualcomm.com/doc/80-N2040-61/topic/vector-instructions.html](https://docs.qualcomm.com/doc/80-N2040-61/topic/vector-instructions.html)