# HVX floating point

V79 replaces the prior internal HVX floating point format for
floating-point arithmetic. The new internal HVX floating-point format
yields results that are identical to IEEE-754 round-to-even mode. The
new format contains more bits than IEEE-754, which optionally produces
results with greater range and accuracy.

Only the HVX vector registers use the HVX floating-point format. Memory
maintains floating-point data in IEEE-754 format, and all loads and stores
use the IEEE-754 format. A subset of HVX floating-point operations
transforms IEEE-754 floating-point data to HVX floating-point data.
Subsequent HVX floating-point instructions can consume operands in the
HVX floating-point format without conversion to IEEE-754, which allows for
performant and energy-efficient code because the program does not need to
continuously switch between formats. The program must convert the HVX
floating-point results to IEEE-754 before storing them to memory.

HVX floating-point achieves IEEE-754 compliance through normalization.
The program can skip normalization when faster calculation is needed
and IEEE-754 compliance is not required.

## Programming with HVX floating point

HVX floating-point contains two input types:

- qf32: single precision floating point
- qf16: half precision floating point

In Hexagon, IEEE-754 contains two input types:

- sf: single precision floating point
- hf: half precision floating point

HVX floating point instructions use the same shift and multiply
resources as other HVX instructions.

### Handling the extended state of HVX floating-point

Only instructions with HVX floating-point sources or destinations use HVX
floating-point values. Instructions specify the HVX floating-point
format with the `qf16` and `qf32` identifiers. A source vector register drops
the extended state of an HVX floating-point value when an instruction
reads the register without the `qf16` or `qf32` identifier. A
destination vector register resets its extended state when an
instruction writes to the register without the `qf16` or `qf32`
identifier. When the extended state is dropped, the floating-point value
loses accuracy. The program can preserve the floating-point value by
converting HVX floating-point values to IEEE-754 values. Software must
convert HVX floating-point values to IEEE-754 values before using them as
inputs to stores, permutes, shifts, and any other operations that do not
source the HVX floating-point format.

#### Examples of handling HVX floating-point values

Example of dropping extended state

    V0.qf32 = vadd(V1.sf,V2.sf) // V0.qf32 holds extended state
    vmem(R0) = V0               // Extended state dropped, incorrect floating-point value stored

Example of resetting extended state

    V0.qf32 = vadd(V1.sf,V2.sf) // V0.qf32 holds extended state
    V0 = V0                     // Extended state reset, floating-point value lost

Example of preserving floating-point value

    V0.qf32 = vadd(V1.sf,V2.sf)  // V0.qf32 holds extended state
    V0.sf = V0.qf32              // Extended state converted to IEEE-754
    vmem(R0) = V0                // Value preserved and properly stored to memory

### Rules to achieve IEEE-754 compliance

Depending on the desired results, HVX floating-point operations have
requirements on the input sources. The HVX floating-point values require
normalization to achieve IEEE-754 compliance, while faster operations
can skip normalization. The program normalizes HVX floating-point values
before subsequent HVX floating-point operations, so the floating-point
value does not lose precision.

The program also obtains results identical to IEEE-754 by converting
all HVX floating-point results to IEEE-754 format before they are consumed
by any subsequent operation. However, there are cases where this conversion
is redundant, or where the differences between IEEE-754 and HVX
floating-point are not a concern.

The table below describes when an HVX floating-point operation can directly
consume a floating-point value as a source, when the floating-point values
need normalization, and when the floating-point values must be converted to
IEEE-754 before an operation.

| **Desired result** | **Add/subtract input: IEEE-754** | **Add/subtract input: HVX floating point from multiplier** | **Add/subtract input: HVX floating point from adder** | **Multiply input: sf** | **Multiply input: qf32 from multiplier** | **Multiply input: qf32 from adder** | **Multiply input: hf** | **Multiply input: qf16 from multiplier** | **Multiply input: qf16 from adder** | **Non-HVX floating-point instruction input: IEEE-754** | **Non-HVX floating-point instruction input: HVX floating point from multiplier** | **Non-HVX floating-point instruction input: HVX floating point from adder** |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Strict IEEE-754 compliance | Direct use | Convert to IEEE | Convert to IEEE | Normalize | Convert to IEEE then normalize | Convert to IEEE then normalize | Widening multiply then convert to IEEE | Convert to IEEE, widening multiply, then convert to IEEE | Convert to IEEE, widening multiply, then convert to IEEE | Direct use | Convert to IEEE | Convert to IEEE |
| IEEE-754 compliance^a^ | Direct use | Direct use | Direct use | Normalize | Direct use | Normalize | Widening multiply | Direct use | Widening multiply | Direct use | Convert to IEEE | Convert to IEEE |
| Lossy subnormals^b^ | Direct use | Direct use | Direct use | Direct use | Direct use | Normalize | Direct use | Direct use | Widening multiply | Direct use | Convert to IEEE | Convert to IEEE |
| Similar accuracy to prior HVX floating-point^c^ | Direct use | Direct use | Direct use | Direct use | Direct use | Direct use | Direct use | Direct use | Direct use | Direct use | Convert to IEEE | Convert to IEEE |

^a^ Excludes IEEE-754 overflows and lower-precision subnormals because
HVX floating-point has a larger dynamic range than IEEE-754. All
subnormals have extra precision. Results that would be infinity in
IEEE-754 can be represented as finite values in HVX floating-point.

^b^ Using IEEE-754 subnormals without normalization results in a loss of
accuracy. This provides greater precision than clamping subnormals
to 0. When the data set excludes subnormals, the behavior is the same
as the IEEE-754 compliance row.

^c^ Loss of 1 bit of accuracy compared to IEEE-754.

#### HVX floating point normalization

Unnormal inputs for multiplication operands naturally yield
non-IEEE-754-compliant results. A loss of up to a half unit of least
precision (ULP) of input precision can occur. For more performant code,
skip normalization when IEEE-754 compliance is not a requirement.

Use the following sequences to normalize.

##### qf16 and hf normalization

For qf16 and hf operands, use a widening multiply to qf32, then convert
back to qf16. Otherwise, a non-widening multiply has an input error on
unnormal qf16 and hf inputs.

The following examples are Hexagon assembly code for a widening
multiply.

Using two hf inputs:

    {
       v0.hf = some IEEE-754 number
       v1.hf = some IEEE-754 number
    }
    {
       v3:2.qf32 = vmpy(v0.hf, v1.hf) // Widening multiply to qf32
    }
    {
       v4.hf = v3:2.qf32 // (Optional) Convert back to hf for strict IEEE-754 compliance
    }

Using two qf16 inputs:

    {
       v0.qf16 = some qf16 number
       v1.qf16 = some qf16 number
    }
    {
       v3:2.qf32 = vmpy(v0.qf16, v1.qf16) // Widening multiply to qf32
    }
    {
       v4.hf = v3:2.qf32 // (Optional) Convert back to hf for strict IEEE-754 compliance
    }

##### qf32 and sf normalization

Add a calculated zero value, defined as -0 with a minimum exponent
(-255), to an unnormal number to create a normalized HVX floating-point
value. Store the calculated zero in a vector register for reuse in
future operations that require normalization.
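The arithmetic behind the calculated zero can be checked on a host machine with ordinary IEEE-754 doubles. The sketch below is plain Python, not HVX code, and the helper names `sign_bit` and `add_calculated_zero` are hypothetical. It only demonstrates two IEEE-754 rules: multiplying 0 by -0 yields -0, and adding -0 leaves every value unchanged, including both signed zeros. That HVX normalization relies on exactly these properties is an assumption.

```python
import math

def sign_bit(x: float) -> int:
    """Return 1 if the sign bit of x is set (this includes -0.0), else 0."""
    return 1 if math.copysign(1.0, x) < 0 else 0

# Multiplying +0 by -0 yields -0 under the IEEE-754 sign rules.
calculated_zero = 0.0 * -0.0

def add_calculated_zero(x: float) -> float:
    """Add the calculated -0 to x; the numeric value of x is unchanged."""
    return calculated_zero + x
```

Adding +0 instead would turn a -0 input into +0, which is one plausible reason to prefer -0 as the addend.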

The following examples are Hexagon assembly code to calculate -0 and
normalize.

Using two sf inputs:

    {
       v0.sf = some IEEE-754 number
       v1.sf = some IEEE-754 number
    }
    {
       v3.sf = 0x0        // IEEE-754 0
       v4.sf = 0x80000000 // IEEE-754 -0
    }
    {
       v3.qf32 = vmpy(v3.sf, v4.sf) // Create -0 with exponent at qf32 emin
    }
    {
       v0.qf32 = vadd(v3.qf32, v0.sf) // Normalize to HVX floating point
       v1.qf32 = vadd(v3.qf32, v1.sf) // Normalize to HVX floating point
    }
    {
       v2.qf32 = vmpy(v0.qf32, v1.qf32) // Multiply normalized HVX floating point values
    }

Using two qf32 inputs:

    {
       v0.qf32 = some qf32 number
       v1.qf32 = some qf32 number
    }
    {
       v3.sf = 0x0        // IEEE-754 0
       v4.sf = 0x80000000 // IEEE-754 -0
    }
    {
       v3.qf32 = vmpy(v3.sf, v4.sf)     // Create -0 with exponent at qf32 emin
    }
    {
       v0.qf32 = vadd(v3.qf32, v0.qf32) // Normalize to HVX floating point
       v1.qf32 = vadd(v3.qf32, v1.qf32) // Normalize to HVX floating point
    }
    {
       v2.qf32 = vmpy(v0.qf32, v1.qf32) // Multiply normalized HVX floating point values
    }

## HVX floating point behavior

The table below shows the characteristics of each floating point format
alongside the IEEE-754 format.

HVX floating-point format with IEEE-754 format

| **Type** | **Precision** | **Maximum exponent** | **Minimum exponent** |
| --- | --- | --- | --- |
| hf half precision IEEE 754 | 11 | 15 | -14 |
| qf16 half precision HVX format | 11 | 15 | -15 |
| sf single precision IEEE 754 | 24 | 127 | -126 |
| qf32 single precision HVX format | 24 | 255 | -255 |
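To make the range difference concrete, the following host-side Python sketch computes the largest magnitude implied by a precision and maximum exponent, then checks a product that overflows sf. It is an illustration only. The function names and the `FORMATS` dictionary are hypothetical, and applying the IEEE-754 formula (a significand just below 2 at the maximum exponent) to the qf16 and qf32 rows is an assumption, since the internal HVX encoding is not described here.

```python
def max_finite(precision_bits: int, max_exponent: int) -> float:
    """Largest value with a significand just below 2 at the maximum exponent."""
    significand = 2.0 - 2.0 ** (1 - precision_bits)
    return significand * 2.0 ** max_exponent

# (precision, maximum exponent) pairs copied from the table above.
FORMATS = {"hf": (11, 15), "qf16": (11, 15), "sf": (24, 127), "qf32": (24, 255)}

def exceeds_range(fmt: str, value: float) -> bool:
    """True if |value| is larger than the modeled maximum for fmt."""
    precision, emax = FORMATS[fmt]
    return abs(value) > max_finite(precision, emax)

# 2**100 * 2**100 = 2**200, which a Python double holds exactly.
product = 2.0 ** 100 * 2.0 ** 100
```

Under this model, `exceeds_range("sf", product)` is true while `exceeds_range("qf32", product)` is false, which matches the statement that a result that is infinity in IEEE-754 can remain finite in HVX floating-point.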

### Represented values of HVX floating-point

#### Normal and unnormal numbers

A normal number is a value where the significand is within [-2, -1) for
negative values and [1, 2) for positive values.

[1.0, 2.0) \* 2^exp^ for positive values

[-2.0, -1.0) \* 2^exp^ for negative values

An unnormal number is a value where the significand is out of the normal
number range. An unnormal number has less than full precision compared
to a normal number.

[-1.0, 1.0) \* 2^exp^ for unnormal values

Unlike IEEE-754, an HVX floating-point unnormal number is not a special
number. IEEE-754 unnormal, or subnormal, is a number at the minimum
exponent whereas HVX floating-point unnormal can be at any exponent.
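The distinction can be modeled with a short host-side Python sketch. The function names are hypothetical, and the precision model (each leading zero in a fixed-width significand costs one bit) is a simplifying assumption used to illustrate why an unnormal number has less than full precision. It is not a description of the hardware encoding.

```python
import math

def is_normal_significand(significand: float) -> bool:
    """True if the significand lies in [1, 2) or [-2, -1)."""
    return 1.0 <= abs(significand) < 2.0

def significant_bits(significand: float, precision_bits: int) -> int:
    """Bits of precision left when the significand has leading zeros.

    Models a fixed-width significand field: every leading zero, starting
    at the units position, removes one usable bit.
    """
    if significand == 0.0:
        return 0
    # abs(significand) == m * 2**exp with 0.5 <= m < 1
    _, exp = math.frexp(abs(significand))
    leading_zeros = max(0, 1 - exp)
    return max(0, precision_bits - leading_zeros)
```

For example, `0.25 * 2**4` and `1.0 * 2**2` denote the same value, but in this model the unnormal form with significand 0.25 keeps only 22 of 24 bits.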

#### Exact and inexact numbers

A number is exact when the HVX floating-point value precisely represents
the intended floating-point value.

A number is inexact when the intended value cannot be precisely
represented. Only qf32 indicates an inexact number. Conversion from qf32
to IEEE-754 uses the inexact indication to prevent double rounding errors.

All special values have an exact and inexact representation.
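A double rounding error occurs when a value is rounded to an intermediate precision and then rounded again, and the second rounding can no longer tell that the first one discarded bits. The Python sketch below reproduces the effect on a host machine using standard IEEE-754 single and half precision through the `struct` module. It does not model qf32 itself; the helper names and the chosen value are assumptions made for illustration.

```python
import struct

def to_single(x: float) -> float:
    """Round a Python double to IEEE-754 single precision (nearest even)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_half(x: float) -> float:
    """Round a Python double to IEEE-754 half precision (nearest even)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Slightly above the midpoint between the hf neighbors 1.0 and 1.0 + 2**-10.
exact = 1.0 + 2.0 ** -11 + 2.0 ** -30

# One rounding step: the value is above the midpoint, so it rounds up.
rounded_once = to_half(exact)

# Two steps: the sf rounding drops the 2**-30 term and lands exactly on the
# midpoint, and the tie then rounds down to the even neighbor 1.0.
rounded_twice = to_half(to_single(exact))
```

Here `rounded_once` is `1.0 + 2**-10` and `rounded_twice` is `1.0`. Recording that the intermediate value was inexact is the kind of information that lets a converter avoid the second result.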

#### Sign of numbers

HVX floating-point values represent negative and positive numbers like
IEEE-754. Both polarities are available for non-zero finites, zero,
infinity, and Not a Number (NaN).

### Special values

#### Zero

Zero values in HVX floating-point can have different exponent
representation, ranging from the minimum exponent to the maximum
exponent of the HVX floating-point exponent range.

HVX floating-point zero produces equivalent behavior to IEEE-754 zeros.

#### Infinity

Infinity results from a number too large to represent.

HVX floating-point infinity produces equivalent behavior to IEEE-754
infinities.

#### NaN

NaN describes a number that is undefined.

HVX floating-point NaN produces equivalent behavior to IEEE-754 NaN.

### Rounding modes

HVX floating point only supports IEEE-754 round-to-even mode.

#### Round to nearest even

For any HVX floating point operation, the result rounds to the nearest
value. When the result is exactly halfway between two representable
values, the result rounds to the nearest even value.
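The tie-breaking rule can be observed on a host machine with IEEE-754 half precision, which Python exposes through the `struct` format code `e`. The sketch below is an illustration on standard hf values rather than HVX code; `round_to_hf` and `ULP` are hypothetical names.

```python
import struct

def round_to_hf(x: float) -> float:
    """Round a Python double to IEEE-754 half precision (nearest even)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Spacing between adjacent hf values in [1, 2).
ULP = 2.0 ** -10

below_half = round_to_hf(1.0 + 0.25 * ULP)  # nearest value is 1.0
above_half = round_to_hf(1.0 + 0.75 * ULP)  # nearest value is 1.0 + ULP
tie_low = round_to_hf(1.0 + 0.5 * ULP)      # tie: even neighbor is 1.0
tie_high = round_to_hf(1.0 + 1.5 * ULP)     # tie: even neighbor is 1.0 + 2*ULP
```

The two ties round in opposite directions (one down, one up) because each goes to the neighbor whose last significand bit is 0.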

For HVX floating point, when a value overflows in the normal range, the
exponent increments to keep the significand in [-2, 2). For IEEE-754, the
exponent adjusts to keep the significand normalized.

When the exponent overflows, the result saturates to infinity.

When the exponent underflows, the value results in an inexact zero. This
differs from IEEE-754, which underflows gradually and produces
subnormals.

## IEEE intrinsics

The user is encouraged to use the IEEE intrinsics for floating point
operations. The IEEE intrinsics provide a standardized mechanism to
operate on IEEE-754 data and eliminate the need to understand the
underlying details of the HVX floating point format. The IEEE
intrinsics round to the nearest even.

The compiler flags enable levels of compliance to the IEEE-754
specification. For details on the flags for IEEE-754 compliance, see
the *Qualcomm Hexagon LLVM C/C++ Compiler User Guide* (80-VB419-8986).

### IEEE absolute value

Single precision and half precision absolute value.

IEEE absolute value instruction intrinsics

| **Instruction syntax** | **Intrinsic** |
| --- | --- |
| Vd.hf = vabs(Vu.hf) | HVX\_Vector Q6\_Vhf\_vabs\_Vhf(HVX\_Vector Vu) |
| Vd.sf = vabs(Vu.sf) | HVX\_Vector Q6\_Vsf\_vabs\_Vsf(HVX\_Vector Vu) |

### IEEE addition/subtraction

Single precision and half precision addition/subtraction.

IEEE addition/subtraction instruction intrinsics

| **Instruction syntax** | **Intrinsic** |
| --- | --- |
| Vd.hf = vadd(Vu.hf,Vv.hf) | HVX\_Vector Q6\_Vhf\_vadd\_VhfVhf(HVX\_Vector Vu, HVX\_Vector Vv) |
| Vd.hf = vsub(Vu.hf,Vv.hf) | HVX\_Vector Q6\_Vhf\_vsub\_VhfVhf(HVX\_Vector Vu, HVX\_Vector Vv) |
| Vdd.sf = vadd(Vu.hf,Vv.hf) | HVX\_VectorPair Q6\_Wsf\_vadd\_VhfVhf(HVX\_Vector Vu, HVX\_Vector Vv) |
| Vdd.sf = vsub(Vu.hf,Vv.hf) | HVX\_VectorPair Q6\_Wsf\_vsub\_VhfVhf(HVX\_Vector Vu, HVX\_Vector Vv) |
| Vd.sf = vadd(Vu.sf,Vv.sf) | HVX\_Vector Q6\_Vsf\_vadd\_VsfVsf(HVX\_Vector Vu, HVX\_Vector Vv) |
| Vd.sf = vsub(Vu.sf,Vv.sf) | HVX\_Vector Q6\_Vsf\_vsub\_VsfVsf(HVX\_Vector Vu, HVX\_Vector Vv) |

### IEEE min/max/negate/copy

Min/max: IEEE comparison of the inputs that returns the minimum or maximum
value. If either operand is NaN, the result is NaN.

Negate: IEEE single precision and half precision negation. Only the sign
bit is flipped.

Copy: IEEE copy with no change in bits.

IEEE min/max/negation/move instruction intrinsics

| **Instruction syntax** | **Intrinsic** |
| --- | --- |
| Vd.w = vfmv(Vu.w) | HVX\_Vector Q6\_Vw\_vfmv\_Vw(HVX\_Vector Vu) |
| Vd.hf = vfmax(Vu.hf,Vv.hf) | HVX\_Vector Q6\_Vhf\_vfmax\_VhfVhf(HVX\_Vector Vu, HVX\_Vector Vv) |
| Vd.hf = vfmin(Vu.hf,Vv.hf) | HVX\_Vector Q6\_Vhf\_vfmin\_VhfVhf(HVX\_Vector Vu, HVX\_Vector Vv) |
| Vd.sf = vfmax(Vu.sf,Vv.sf) | HVX\_Vector Q6\_Vsf\_vfmax\_VsfVsf(HVX\_Vector Vu, HVX\_Vector Vv) |
| Vd.sf = vfmin(Vu.sf,Vv.sf) | HVX\_Vector Q6\_Vsf\_vfmin\_VsfVsf(HVX\_Vector Vu, HVX\_Vector Vv) |
| Vd.hf = vfneg(Vu.hf) | HVX\_Vector Q6\_Vhf\_vfneg\_Vhf(HVX\_Vector Vu) |
| Vd.sf = vfneg(Vu.sf) | HVX\_Vector Q6\_Vsf\_vfneg\_Vsf(HVX\_Vector Vu) |
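The NaN and sign-bit behavior described above can be modeled per lane in a few lines of host-side Python. The function names are hypothetical and the model covers only what the text states: NaN propagation for max, and negation as a flip of bit 31 of an sf bit pattern. The ordering of +0 and -0 is not specified here and is not modeled.

```python
import math
import struct

def vfmax_lane(a: float, b: float) -> float:
    """Model of one max lane: NaN if either input is NaN, else the larger value."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return a if a > b else b

def sf_bits(x: float) -> int:
    """Return the IEEE-754 single precision bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def vfneg_sf_bits(bits: int) -> int:
    """Negate an sf bit pattern by flipping only the sign bit (bit 31)."""
    return (bits ^ 0x80000000) & 0xFFFFFFFF
```

Note that Python's built-in `max(1.0, math.nan)` returns `1.0`, so it cannot stand in for the NaN-propagating behavior. Because negation touches only the sign bit, it also applies unchanged to zeros, infinities, and NaN patterns.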

### IEEE multiplication

IEEE single precision and half precision multiplication.

IEEE multiply instruction intrinsics

| **Instruction syntax** | **Intrinsic** |
| --- | --- |
| Vd.hf = vmpy(Vu.hf,Vv.hf) | HVX\_Vector Q6\_Vhf\_vmpy\_VhfVhf(HVX\_Vector Vu, HVX\_Vector Vv) |
| Vdd.sf = vmpy(Vu.hf,Vv.hf) | HVX\_VectorPair Q6\_Wsf\_vmpy\_VhfVhf(HVX\_Vector Vu, HVX\_Vector Vv) |
| Vx.hf += vmpy(Vu.hf,Vv.hf) | HVX\_Vector Q6\_Vhf\_vmpyacc\_VhfVhfVhf(HVX\_Vector Vx, HVX\_Vector Vu, HVX\_Vector Vv) |
| Vxx.sf += vmpy(Vu.hf,Vv.hf) | HVX\_VectorPair Q6\_Wsf\_vmpyacc\_WsfVhfVhf(HVX\_VectorPair Vxx, HVX\_Vector Vu, HVX\_Vector Vv) |
| Vd.sf = vmpy(Vu.sf,Vv.sf) | HVX\_Vector Q6\_Vsf\_vmpy\_VsfVsf(HVX\_Vector Vu, HVX\_Vector Vv) |

### IEEE fused multiplication

Half precision to single precision fused multiply reduce.

    sf = Vu.hf[0]*Vv.hf[0] + Vu.hf[1]*Vv.hf[1]

The multiply operations and the addition operation round independently.

IEEE fused multiply instruction intrinsics

| **Instruction syntax** | **Intrinsic** |
| --- | --- |
| Vd.sf = vdmpy(Vu.hf,Vv.hf) | HVX\_Vector Q6\_Vsf\_vdmpy\_VhfVhf(HVX\_Vector Vu, HVX\_Vector Vv) |
| Vx.sf += vdmpy(Vu.hf,Vv.hf) | HVX\_Vector Q6\_Vsf\_vdmpyacc\_VsfVhfVhf(HVX\_Vector Vx, HVX\_Vector Vu, HVX\_Vector Vv) |
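A host-side Python model of the multiply reduce formula with independently rounded steps might look like the sketch below. It is not the instruction's definition. The name `vdmpy_model`, the use of Python lists for vectors, and the assumption that each adjacent pair of hf elements reduces into one sf result are all illustrative choices.

```python
import struct

def to_single(x: float) -> float:
    """Round a Python double to IEEE-754 single precision (nearest even)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def vdmpy_model(u, v):
    """Reduce adjacent pairs of hf-valued elements into sf results.

    u and v are equal-length lists with an even number of elements,
    each element being a value representable in half precision.
    """
    out = []
    for i in range(0, len(u), 2):
        p0 = to_single(u[i] * v[i])          # first product, rounded to sf
        p1 = to_single(u[i + 1] * v[i + 1])  # second product, rounded to sf
        out.append(to_single(p0 + p1))       # sum, rounded to sf separately
    return out
```

With `u = v = [1.0, 2**-12]`, the exact sum is `1 + 2**-24`, which sits on a tie in single precision, so the modeled result is `[1.0]`. This shows the rounding applied by the final addition.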

### IEEE converts

Convert IEEE single/half precision to byte/halfword/unsigned
byte/unsigned halfword. Convert byte/halfword/unsigned byte/unsigned
halfword to IEEE single/half precision. All operations round to
nearest even.

IEEE convert instruction intrinsics

| **Instruction syntax** | **Intrinsic** |
| --- | --- |
| Vd.b = vcvt(Vu.hf,Vv.hf) | HVX\_Vector Q6\_Vb\_vcvt\_VhfVhf(HVX\_Vector Vu, HVX\_Vector Vv) |
| Vd.h = vcvt(Vu.hf) | HVX\_Vector Q6\_Vh\_vcvt\_Vhf(HVX\_Vector Vu) |
| Vd.hf = vcvt(Vu.h) | HVX\_Vector Q6\_Vhf\_vcvt\_Vh(HVX\_Vector Vu) |
| Vd.hf = vcvt(Vu.sf,Vv.sf) | HVX\_Vector Q6\_Vhf\_vcvt\_VsfVsf(HVX\_Vector Vu, HVX\_Vector Vv) |
| Vd.hf = vcvt(Vu.uh) | HVX\_Vector Q6\_Vhf\_vcvt\_Vuh(HVX\_Vector Vu) |
| Vd.ub = vcvt(Vu.hf,Vv.hf) | HVX\_Vector Q6\_Vub\_vcvt\_VhfVhf(HVX\_Vector Vu, HVX\_Vector Vv) |
| Vd.uh = vcvt(Vu.hf) | HVX\_Vector Q6\_Vuh\_vcvt\_Vhf(HVX\_Vector Vu) |
| Vdd.hf = vcvt(Vu.b) | HVX\_VectorPair Q6\_Whf\_vcvt\_Vb(HVX\_Vector Vu) |
| Vdd.hf = vcvt(Vu.ub) | HVX\_VectorPair Q6\_Whf\_vcvt\_Vub(HVX\_Vector Vu) |
| Vdd.sf = vcvt(Vu.hf) | HVX\_VectorPair Q6\_Wsf\_vcvt\_Vhf(HVX\_Vector Vu) |

Last Published: Jan 16, 2025


Source: [https://docs.qualcomm.com/doc/80-N2040-61/topic/hvx-floating-point.html](https://docs.qualcomm.com/doc/80-N2040-61/topic/hvx-floating-point.html)