# Memory

The Hexagon unified, byte-addressable memory has a single 32-bit
virtual address space in little-endian format. All addresses,
whether used by a scalar or a vector operation, go through the memory
management unit (MMU) for address translation and protection.

## Alignment

Unlike on the scalar processor, an unaligned pointer (a pointer that
is not a multiple of the vector size) does not cause a memory fault
or exception. When using a general VMEM load or store, the
least-significant bits of the address are ignored.

    VMEM(R0) = V1   // Store to R0 & ~(0x3F)

The intra-vector addressing bits are ignored.
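
The masking shown in the store example can be modeled on the host as follows. This is a minimal sketch rather than vendor code: the function name `vmem_effective_address` and the `vec_bytes` parameter are illustrative assumptions, and a vector size of 64 bytes corresponds to the `0x3F` mask in the example.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the effective address used by an aligned VMEM access.
   vec_bytes is the vector size in bytes and must be a power of two
   (64 corresponds to the 0x3F mask in the store example). */
uint32_t vmem_effective_address(uint32_t addr, uint32_t vec_bytes)
{
    assert(vec_bytes != 0 && (vec_bytes & (vec_bytes - 1u)) == 0);
    /* Clear the intra-vector addressing bits. */
    return addr & ~(vec_bytes - 1u);
}
```

With a 64-byte vector, any address from `0x1000` through `0x103F` maps to the same vector at `0x1000`.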

Unaligned loads and stores are also explicitly supported through the
VMEMU instruction.

    V0 = VMEMU(R0)   // Load a vector from R0 regardless of alignment

## HVX local memory: VTCM

HVX supports a local memory called vector tightly coupled memory
(VTCM) for scratch buffers and scatter/gather operations. The size of
the memory is implementation-defined and discoverable from the
configuration table. Like other memory, VTCM requires normal
virtual-to-physical translation. Compared with the L2 cache, it offers
higher performance at lower power.

Use VTCM for intermediate vector data, or as a temporary buffer. It
serves as the input or output of the scatter/gather instructions. The
following are advantages of using VTCM as the intermediate buffer:

- Guarantees no eviction (unlike L2, which can evict data when a set is full)
- Faster than L2$ (no cache-management overhead, such as associative
lookup)
- Reduces L2$ pressure
- Lower power than L2$
- Supports continuous read and write for every packet without
contention

In addition to HVX VMEM access, normal Hexagon memory access
instructions can access this memory.

The following conditions are invalid for VTCM access:

- Using a page size larger than the VTCM size.
- Attempting to execute instructions from VTCM, including speculative
access.
- Scalar VTCM access when the HVX fuse is blown (disabled).
- Load-locked or store-conditional to VTCM.
- A memw\_phys instruction load from VTCM while more than one thread is
active.
- Accessing VTCM while HVX is not fully powered up or VTCM banks are
asleep.
- Unaligned access crossing between VTCM and non-VTCM pages.

## Scatter and gather

Scatter and gather instructions allow for per-element random access
of VTCM. Each element can specify an independent address to read
(gather) or write (scatter). Gather for HVX is a vector copy from
noncontiguous addresses to an aligned contiguous vector location.
Gather operations use slot 0 + slot 1 on the scalar side, and HVX
load + store resources.

Gather is formed by two instructions, one for reading from VTCM and
one for storing to VTCM:

    {
        Vtmp.h = vgather(Rt,Mu,Vv.h)
        vmem(Rs+#1) = Vtmp.new
    }

If the input data of gather is in DDR, it must first be copied to
VTCM and gathered from there. Gather cannot be performed directly on
DDR or L2$ contents.

Vector gather (vgather) operations copy individual elements from a
large region in VTCM to a smaller vector-sized region in VTCM. Each
instruction can gather up to 64 elements. Gather supports halfword
and word granularity. Byte gather can be emulated in two packets by
using vector predicate instructions.

Use gather for large lookup tables (up to VTCM size).
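
To make the data movement concrete, a halfword gather can be described by a scalar reference model such as the one below. This is an illustrative sketch for host-side checking, not the hardware definition: the function name, the use of byte offsets, and the choice to skip out-of-range elements are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar reference model of a halfword gather (illustrative only).
   region:       start of the gather region (stands in for VTCM)
   region_bytes: size of the region in bytes
   offsets:      per-element byte offsets into the region
   dst:          contiguous destination, one halfword per element
   Assumption: an element whose halfword does not fit inside the region
   is skipped and its destination slot keeps its previous value. */
void gather_halfwords_model(const uint8_t *region, size_t region_bytes,
                            const uint16_t *offsets, uint16_t *dst,
                            size_t count)
{
    assert(region != NULL && offsets != NULL && dst != NULL);
    for (size_t i = 0; i < count; ++i) {
        size_t off = offsets[i];
        if (off + sizeof(uint16_t) > region_bytes) {
            continue; /* out of range: leave dst[i] untouched */
        }
        /* Each element reads from its own, independent address. */
        memcpy(&dst[i], region + off, sizeof(uint16_t));
    }
}
```

A model like this is mainly useful for validating a lookup-table kernel against expected results before moving it onto the vector unit.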

Scatter and gather instructions are ordered with respect to the
operations that follow them, except that a scatter is not ordered with
respect to following scatters. However, accesses from elements of the
same scatter or gather instruction are not ordered with one another.
The primary ordered case is loading from a gather result or from a
scatter region.

Operations that can be implemented with either scatter or gather
usually perform better with scatter.

The following conditions are invalid for scatter or gather access:

- The scatter (write) or gather (read) region covers more than one
page, or the M source (length - 1) is negative. An exception is generated in this case.
- Any of the accesses fall outside VTCM, including the gather
target addresses. An exception is generated in this case.
- A gather region instruction and a scatter instruction are in the
same packet.

## Memory type

HVX memory instructions (VMEM or scatter/gather) that target
device-type memory raise a VMEM address error exception. It is also
illegal to use HVX memory instructions while the MMU is off.

Note: HVX is designed to work with L2 cache, L2TCM, or VTCM. Mark
memory as L2-cacheable for L2 cache data and uncached for data that
resides in L2TCM or VTCM.

## Nontemporal

A VMEM instruction can have an optional nontemporal attribute,
specified in assembly with a :nt suffix. Marking an instruction
nontemporal indicates to the microarchitecture that the data is no
longer needed after the instruction. The cache memory system uses
this information to inform replacement and allocation decisions.

## Permissions

VMEMU instructions whose address happens to be naturally aligned
require MMU permissions only for the accessed line. The hardware
suppresses the access to the unused portion.

The byte-enabled conditional VMEM store instruction requires MMU
permissions regardless of whether any bytes are actually written. In
other words, the state of the Q register is not considered when
checking permissions.

## Ordering

The HVX coprocessor follows the same sequentially consistent memory
model as the scalar core for coprocessor packets. Coprocessor threads
interleave their coprocessor memory operations with one another in an
arbitrary but fair manner. This results in a consistent program order
that is globally observable by threads in the same order.

The only exception to this rule is the scatter operations. Scatter
operation memory updates are unordered with respect to each other.
Their internal transactions are also unordered.

Direct memory access (DMA) transfers through the external AXI slave
port are also considered noncoherent with the coprocessor threads.
They require explicit memory synchronization, either through a store
release or through polling of the DMA descriptor by the scalar core.

## Atomicity

[Atomicity of types of memory accesses](https://docs.qualcomm.com/doc/80-N2040-61/topic/memory.html#v79-tbl-atomicity-of-types-of-memory-accesses)
describes the size or alignment of decomposed atomic operations for
different types of memory accesses. When an access is not fully atomic,
an observer can see atomic components of the access.

Atomicity of types of memory accesses

| **Access type** | **Atomic size** |
| --- | --- |
| Scalar (a mem-op is two accesses) | Access size |
| Aligned vector | Base vector size |
| Unaligned vector | 1 B |
| Scatter | 1 B |
| Scatter-accumulate (read-modify-write) | 1 B (a larger read-modify-write can decompose into multiple equivalent smaller read-modify-writes) |
| Gather read | 1 B |
| Gather write | 1 B |

Individual scatter and gather accesses are only guaranteed atomic
with other scatter or gather accesses.

## Maximizing performance of the vector memory system

The HVX vector processor is attached directly to the L2 cache. VMEM
loads/stores move data to/from L2 and do not use L1 data cache. To
ensure coherency with L1, VMEM stores check L1 and invalidate on hit.

### Minimize VMEM access

Accessing data from the VRF is far cheaper in cycles and power than
accessing data from memory. The simplest way to improve memory system
performance is to reduce the number of VMEM instructions. Avoid
moving data to/from memory when VRF can host it instead.

### Use aligned data

VMEMU instructions access multiple L2 cache lines and are expensive in
bandwidth and power. Where possible, align data structures to vector
boundaries. Padding the image is often the most effective technique
to provide aligned data.
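
As a rough illustration of padding, the helpers below round a row width up to a whole number of vectors and report how many vector-sized blocks a single vector access touches. They are a sketch under stated assumptions: the function names and the `vec_bytes` parameter are illustrative, and the buffer base is assumed to be vector-aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rounds a row width up to the next multiple of the vector size so that
   every row starts on a vector boundary when the buffer base is aligned. */
size_t padded_stride(size_t row_bytes, size_t vec_bytes)
{
    assert(vec_bytes != 0 && (vec_bytes & (vec_bytes - 1)) == 0);
    return (row_bytes + vec_bytes - 1) & ~(vec_bytes - 1);
}

/* Number of vector-sized blocks that a one-vector access at addr touches:
   1 when addr is aligned, 2 otherwise. */
unsigned vectors_touched(uintptr_t addr, size_t vec_bytes)
{
    assert(vec_bytes != 0 && (vec_bytes & (vec_bytes - 1)) == 0);
    return (addr & (vec_bytes - 1)) == 0 ? 1u : 2u;
}
```

For example, with 128-byte vectors a 1000-byte row padded to a 1024-byte stride keeps every row start aligned, so each row can be read with aligned VMEM loads instead of VMEMU.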

### Avoid store to load stalls

A VMEM load instruction that follows a VMEM store to the same address
incurs a store-to-load penalty. The store must fully reach L2 before
the load starts, thus the penalty can be quite large. To avoid
store-to-load stalls, there should be approximately 15 packets of
intervening work.

### L2FETCH

Use the L2FETCH instruction to prepopulate the L2 cache with data
prior to using VMEM loads.

L2FETCH is best performed in sizes less than 8 KB and issued at least
several hundred cycles prior to using the data. If the L2FETCH
instruction is issued too early, data can be evicted before use. In
general, prefetching and processing on image rows or tiles works
best.

Prefetch all L2-cacheable data that VMEM loads touch, even data that
is not used in the computation. Software-pipelined loops often load
extra pad data. Even though the pad data is not used in the
computation, the VMEM load stalls if the data has not been prefetched
into L2.
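
One way to apply the size guidance is to compute how many image rows fit in a single prefetch. The helper below is a sketch, not part of any SDK: the name `rows_per_prefetch` and the constant are assumptions, with the 8 KB budget taken from the guidance above.

```c
#include <assert.h>
#include <stddef.h>

/* Budget taken from the guidance above: keep each prefetch under 8 KB. */
#define L2FETCH_BUDGET_BYTES (8u * 1024u)

/* Returns how many rows of row_bytes fit in one prefetch while staying
   strictly below the budget. Returns 0 if a single row is already too
   large and must be split into narrower tiles. */
size_t rows_per_prefetch(size_t row_bytes)
{
    assert(row_bytes > 0);
    return (L2FETCH_BUDGET_BYTES - 1u) / row_bytes;
}
```

Remember to count pad bytes in `row_bytes` when the loop loads them, since those bytes must also be resident in L2.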

### Access data contiguously

Whenever possible, arrange data in memory so that it is accessed
contiguously. For example, instead of repeatedly striding through
memory, data might be first tiled, striped, or decimated to enable
contiguous access.

Better spatial locality in memory helps avoid the following
performance hazards:

- Bank conflicts. Lower address bits are typically used to select among
parallel banks of memory. Accessing data contiguously achieves a good
distribution of these address bits. If address bits [7:1] are unique
across elements within a vector, the operation is conflict-free. Use
a vector predicate to mask out “don’t care” values.
- Set aliasing. Caches hold a number of sets identified by lower address
bits. Each set has a small number of ways (typically 4 to 8) to help
manage aliasing and multi-threading.
- Micro-TLB misses. A limited number of pages are remembered for fast
translation. Containing data to a smaller number of pages helps
translation performance.
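
The bank-conflict condition can be tested offline on a list of element addresses. The function below is a minimal sketch of that check: its name is an assumption, and it encodes only the bits [7:1] uniqueness rule stated in the list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true when bits [7:1] of every address are unique, which the
   list gives as the conflict-free condition. Bits [7:1] take 128
   distinct values, so more than 128 elements can never be conflict-free. */
bool bank_bits_unique(const uint32_t *addrs, size_t count)
{
    bool seen[128] = { false };
    assert(addrs != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        unsigned key = (addrs[i] >> 1) & 0x7Fu;
        if (seen[key]) {
            return false; /* two elements share the same bank bits */
        }
        seen[key] = true;
    }
    return true;
}
```

Contiguous halfword addresses pass this check, whereas addresses that differ only by multiples of 256 bytes all collide.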

### Use nontemporal for final data

On the last use of data, use the :nt attribute. The cache uses this
hint to optimize the replacement algorithm.

### Scalar processing of vector data

When a VMEM store instruction produces data, that data is placed into
the L2 cache and L1 does not contain a valid copy. Thus, if scalar
loads must access the data, it first must be fetched into L1.

Some algorithms use the vector engine to produce results that the
scalar core must process further. The best practice is to use VMEM
stores to get the data into L2, then use DCFETCH to get the data into
L1, followed by scalar load instructions. The DCFETCH can execute any
time after the VMEM store; however, software should budget at least 30
cycles before issuing the scalar load instruction.

### Avoid scatter/gather stalls

Scatter and gather operations compete for memory and can result in
long latency, therefore take care to avoid stalls. The following
techniques improve performance around scatter and gather:

- Distribute accesses across the intra-vector address range (lower
address bits). Even distribution across the least significant
inter-vector address bits can also be beneficial. Address bits [10:3]
are important to avoid conflicts. Ideally this applies per vector
instruction, but distributing these accesses out between vector
instructions can help absorb conflicts within a vector instruction.
- Minimize the density of scatter and gather instructions. Spread out
these instructions in a larger loop rather than concentrating them in
a tight loop. The hardware can process a small number of these
instructions in parallel. If it is difficult to spread these
instructions out, limit bursts to four for a specified thread.
- Defer loading from a gather result or a scatter store release. If the
in-flight scatters and gathers (including from other threads) avoid
conflicts, generally a distance of 12 or more packets is sufficient.
Double that distance if the addresses of in-flight accesses are not
correlated.

Peak scatter/gather performance for v79

| **Operation** | **Addressing** | **Vector bandwidth (per packet)** | **Latency (packets)** |
| --- | --- | --- | --- |
| Scatter | Conflict-free | 1/2 | 18 |
| Gather | Conflict-free | 1/2 | 24 |
| Scatter | Random | 1/6 | 30 |
| Gather | Random | 1/6 | 48 |
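
The peak figures can be turned into a rough lower bound for a burst of scatter or gather instructions. The model below is an assumption made for illustration, not a documented formula: it treats the unit as a simple pipeline in which the first vector completes after the listed latency and each further vector completes one bandwidth interval later. The type and function names are hypothetical.

```c
#include <assert.h>

/* Peak figures copied from the table above. */
typedef struct {
    unsigned packets_per_vector; /* reciprocal of the vector bandwidth */
    unsigned latency_packets;
} sg_peak;

static const sg_peak SCATTER_CONFLICT_FREE = { 2u, 18u };
static const sg_peak GATHER_CONFLICT_FREE  = { 2u, 24u };
static const sg_peak SCATTER_RANDOM        = { 6u, 30u };
static const sg_peak GATHER_RANDOM         = { 6u, 48u };

/* Lower-bound estimate under the assumed pipeline model: latency for the
   first vector plus one bandwidth interval for each additional vector. */
unsigned estimate_packets(sg_peak p, unsigned vectors)
{
    assert(vectors > 0);
    return p.latency_packets + (vectors - 1u) * p.packets_per_vector;
}
```

Real code rarely reaches these peaks, so treat the result as a floor when deciding how much independent work to schedule around a burst.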

Last Published: Jan 16, 2025
