# Performance optimization

## Clocks and counters

It is common to need timing functionality when optimizing software.
It is best to leverage the existing OS-supplied timing functionality.
These will typically use the underlying hardware counters of the system and provide the easiest way to obtain accurate information.
In those situations where direct underlying access to counter registers are required, Snapdragon X supports both the ARM standard cntpct and cntvct system counters, as well as a cycle counter via the PMUv3 architecture.

`cntpct_el0` is a physical register that holds the current value of the system counter.
The `cntvct_el0` register is a virtual counter that holds the value of the physical system counter minus an offset set by the OS.
Both the `cntpct_el0` and `cntvct_el0` registers are available to be read from user space.
This counter is incremented at the frequency set in the `cntfrq_el0` register.
The `cntfrq_el0` register is not populated by hardware, but rather set by software at the highest level exception of the system.
It can be read and used by the by the OS and is also readable from user space.
Timing in software can be determined by using the physical counter registers and the value in the `cntfrq` register.
The value in `cntfrq`, and the frequency if the system counters, is not the raw system clock, but rather a lower frequency, usually between 1 MHz and 50 MHz.
The use of system counters is the most commonly used method for timing functions. To query the `cntvct_el0` register, the MRS assembly instruction can be used – `mrs x0, cntvct_el0`.
It is preferable to precede the reading of the counter with an ISB instruction to ensure that it the reading is not speculatively executed.

The `pmccntr_el0` register is a 64-bit wide cycle counter register, implemented as part of the PMUv3 architecture.
It is subject to any changes in the clock frequency, meaning that it will not be incremented when waiting in a WFI or WFE.
This should be kept in mind when attempting to time any portions of code that may be subject to `sleep()` calls or interrupt waiting.
To query the counter, the MRS assembly instruction can be used – `mrs X0, pmccntr_el0`.

Note that this register can only be read from user space if it is enabled in the `pmuserenr` register.
This register can be read from user space through the `pmuserenr_el0` register – `mrs X0, pmuserenr_el0`.
If bit 2 is set, then the `pmccntr_el0` cycle counter can be read from user space.

## Vectorization

Vectorization is important in game programming, as most 2D and 3D games heavily resort to using vector mathematics during gameplay.
Optimizing vector calculations is an important aspect of optimizing the performance of the overall game.
This section highlights various ways that vectorization can be used and optimized on Snapdragon X CPUs.

### Vector support in Snapdragon X

The Snapdragon X CPU supports the use of vectorized code by employing NEON SIMD instructions.
NEON was designed as an additional load/store architecture to provide good vectorizing compiler support for languages such as C/C++.
It implements the ability to perform vector operations on data to increase performance by providing:

- 32 128-bit vector registers, each capable of containing multiple lanes of data.
- SIMD instructions to operate simultaneously on those multiple lanes of data.

These SIMD instructions can improve the speed of many computationally intensive algorithms, including audio and video processing and 3D space transformations.

NEON instructions allow up to:

- 16x8-bit, 8x16-bit, 4x32-bit, and 2x64-bit integer operations
- 8x16-bit, 4x32-bit, and 2x64-bit floating point operations

This allows, for example, the simultaneous addition of 4 32-bit floating point numbers to another 4 32-bit floating point numbers in a single operation.
This can greatly improve performance of computation heavy routines.

Note that the Snapdragon X CPU does not support the SVE extensions from ARM.

There are a number of ways to exploit NEON support in software:

- Using specialized libraries that provide vectorized versions of common functions.
- Using auto-vectorization features of compilers that automatically optimize code to take advantage of NEON.
- Leveraging NEON intrinsics, which are function calls that the compiler replaces with appropriate NEON instructions.
- Using specialized compilers, libraries, and coding techniques, such as the Intel ISPC, which allows a C programmer to specify parts of the code that should be vectorized.
ISPC leverages NEON to create performant code on ARM systems.

### Specialized libraries

Specialized libraries that leverage vectorized NEON optimizations exist for some common math, audio, and video functions.
Some of these are open-source libraries and some are royalty-free licenses.
Some of these libraries can be found on the [ARM website](https://developer.arm.com/documentation/den0018/a/Compiling-NEON-Instructions/NEON-libraries?lang=en).

### NEON intrinsics

NEON intrinsics provide a way to write NEON code that is easier to maintain than assembler code, while still enabling control of the generated NEON instructions.
NEON intrinsics are function calls that the compiler replaces with an appropriate NEON instruction or sequence of NEON instructions.
Intrinsic functions and data types, or intrinsics in the shortened form, provide access to low-level NEON functionality from C or C++ source code.
However, the code must be optimized to take full advantage of the speed increases offered by the NEON unit.
For example, instead of a single uint32\_t variable, which holds single 32-bit integer, with intrinsics there is a single uint32x4\_t variable, which can hold 4 32-bit integers in a single 128-bit vector register.

Software can pass NEON vectors as function arguments or return values and declare them as normal variables.
In the following example, two vectors are added together, with the function taking two pointers to memory that each point to four consecutive 32-bit integers and returning essentially four 32-bit numbers.
The intrinsic call to `vld1q_u32()` takes a pointer and loads four 32-bit integers into a single NEON vector register.
Likewise, the `vaddq_u32()` function adds two NEON vector registers together, each register holding four 32-bit integers, and puts the resulting four 32-bit numbers into another NEON register.

Source code

uint32x4_t sumVector(uint32_t* A, uint32_t* B)
    {
        uint32x4_t temp = vld1q_u32(A);
        uint32x4_t temp1 = vld1q_u32(B);
        uint32x4_t vec128 = vaddq_u32(temp, temp1);
    
        return vec128;
    }
    Copy to clipboard

| GCC | Clang | MSVC |
| --- | --- | --- |
| ldr     q0, [x0]<br>    ldr     q31, [x1]<br>    add     v0.4s, v0.4s, v31.4s<br>    ret<br>    Copy to clipboard | ldr     q0, [x0]<br>    ldr     q1, [x1]<br>    add     v0.4s, v1.4s, v0.4s<br>    ret<br>    Copy to clipboard | ldr         q17,[x1]<br>    ldr         q16,[x0]<br>    add         v0.4s,v16.4s,v17.4s<br>    ret<br>    Copy to clipboard |

In this case, all compilers give the same assembly due to the use of intrinsics.
This allows the programmer more control over the final assembly, at the cost of requiring a deeper understanding of the instruction architecture.

Intrinsics provide almost as much control as writing assembly language, but leave the allocation of registers to the compiler, so that programmers can focus on the algorithms.
Also, the compiler can optimize the intrinsics such as normal C or C++ code, replacing them with more efficient sequences if possible.
It can also perform instruction scheduling to remove pipeline stalls for the specified target processor.
This leads to more maintainable source code than using assembly language.
The downside to intrinsics is that they are specific to the ARM64 architecture and are not cross platform.

For game programming, the use of intrinsics may be good in those cases where already-written code could take advantage of specific intrinsic routines
or there are specific areas in existing ARM64 code that could be targeted for performance improvements with intrinsics.

### Auto-vectorization

Auto-vectorization is the generation of NEON code by a compiler by analyzing the source code of a program.
This is an architecture independent approach that allows the game programmer to focus at the algorithmic level and not be concerned with specific architectural details of the target.
Most of the major compilers (i.e., GCC, Clang, MSVC) will do auto-vectorization when the appropriate flags are set.
Different compilers will vectorize to different degrees and the same code may get vectorized by one compiler and not get vectorized by another.
Sometimes the compiler can generate NEON code without source modification, but using certain coding styles can promote more optimal output.
A rule of thumb is to try not to do the compiler’s job.
It is best to use straightforward code and let the compiler’s optimizer create the best code.
There are times, however, that restructuring loops can help the compiler better optimize.

The generation of NEON instructions is enabled in the GCC compiler with the use of a `-march` flag of ARMv8-a and above.
However, the Snapdragon X CPU is 8.7 compliant, so the `-march=armv8.7-a` flag can be used in the latest versions of GCC.
This will not only enable the generation of NEON instructions but provide support for other features in the Snapdragon X CPU.
GCC feature flags that should not be enabled include `+profile`, `+memtag`, and `+sve`.

As an example of auto-vectorization, consider the sum of two 3D vectors.
Because this operation is so fundamental within gameplay, it is important to optimize these types of vector operations as much as possible.
The table below shows a simple vector addition function and the resultant assembly using GCC, Clang, and MSVC.

Source code

struct FloatVector
    {
        float V[3];
    };
    
    void sumVector(FloatVector& __restrict A, FloatVector& __restrict B, FloatVector& R)
    {
        for (int i = 0; i < 3; i++)
        {
            R.V[i] = A.V[i] + B.V[i];
        }
    }
    Copy to clipboard

| GCC 13.2<br><br>`-O2 -march=armv8.7-a` | Clang 17.0<br><br>`-O2 -march=armv8.7-a` | MSVC 19.0<br><br>`/O2 /arch:armv8.7` |
| --- | --- | --- |
| ldr     d28, [x0]<br>    ldr     d30, [x1]<br>    ldr     s31, [x0, 8]<br>    ldr     s29, [x1, 8]<br>    fadd    v30.2s, v30.2s, v28.2s<br>    fadd    s31, s31, s29<br>    str     d30, [x2]<br>    str     s31, [x2, 8]<br>    ret<br>    Copy to clipboard | ldr     d0, [x0]<br>    ldr     d1, [x1]<br>    ldr     s2, [x0, #8]<br>    ldr     s3, [x1, #8]<br>    fadd    v0.2s, v0.2s, v1.2s<br>    fadd    s1, s2, s3<br>    str     d0, [x2]<br>    str     s1, [x2, #8]<br>    ret<br>    Copy to clipboard | ldp         s19,s18,[x1]<br>    ldp         s17,s16,[x0]<br>    fadd        s16,s16,s18<br>    fadd        s17,s17,s19<br>    stp         s17,s16,[x2]<br>    ldr         s17,[x0,#8]<br>    ldr         s16,[x1,#8]<br>    fadd        s16,s17,s16<br>    str         s16,[x2,#8]<br>    ret<br>    Copy to clipboard |

In this example, GCC and Clang produce identical code.
MSVC declined to use any SIMD instructions, as the compiler indicated that there were not enough instructions to warrant vectorization.
However, note that it required an extra fadd instruction over GCC and Clang.

If the number of floats in the vector is increased from three to four, then all compilers will vectorize the code the same way.

| GCC | Clang | MSVC |
| --- | --- | --- |
| ldr q31, [x0]<br>    ldr q30, [x1]<br>    fadd v31.4s, v31.4s, v30.4s<br>    str  q31, [x2]<br>    ret<br>    Copy to clipboard | ldr  q0, [x0]<br>    ldr  q1, [x1]<br>    fadd v0.4s, v0.4s, v1.4s<br>    str  q0, [x2]<br>    ret<br>    Copy to clipboard | ldr  q17,[x0]<br>    ldr  q16,[x1]<br>    fadd v16.4s,v17.4s,v16.4s<br>    str  q16,[x2]<br>    ret<br>    Copy to clipboard |

This simple example helps highlight the importance of not only using vectorized code, but packing as many vector operations as possible to gain as much performance as possible.

This example still only adds two vectors at a time.
By using forethought in the design of the software, the performance for doing multiple vector additions can be increased.
In the next example, a structure of arrays approach is used to pack four vectors into a single struct and the computations are done on all four vectors at a time.

Source code

struct FloatVectorSoA
    {
        float X[4];
        float Y[4];
        float Z[4];
        float W[4];
    };
    void sumSoAFloat(FloatVectorSoA& __restrict A, FloatVectorSoA& __restrict B, FloatVectorSoA& R)
    {
        for (int i=0;i<4;i++)
        {
            R.X[i] = A.X[i] + B.X[i];
            R.Y[i] = A.Y[i] + B.Y[i];
            R.Z[i] = A.Z[i] + B.Z[i];
            R.W[i] = A.W[i] + B.W[i];
        }
    
    }
    Copy to clipboard

| GCC | Clang | MSVC |
| --- | --- | --- |
| ldp     q24, q29, [x0]<br>    ldp     q28, q25, [x1]<br>    ldp     q30, q31, [x0, 32]<br>    ldp     q26, q27, [x1, 32]<br>    fadd    v28.4s, v28.4s, v24.4s<br>    fadd    v29.4s, v29.4s, v25.4s<br>    fadd    v30.4s, v30.4s, v26.4s<br>    fadd    v31.4s, v31.4s, v27.4s<br>    stp     q28, q29, [x2]<br>    stp     q30, q31, [x2, 32]<br>    ret<br>    Copy to clipboard | ldp     q0, q3, [x1]<br>    ldp     q1, q2, [x0]<br>    fadd    v0.4s, v1.4s, v0.4s<br>    ldp     q1, q5, [x0, #32]<br>    fadd    v2.4s, v2.4s, v3.4s<br>    ldp     q4, q3, [x1, #32]<br>    fadd    v1.4s, v1.4s, v4.4s<br>    fadd    v3.4s, v5.4s, v3.4s<br>    stp     q0, q2, [x2]<br>    stp     q1, q3, [x2, #32]<br>    ret<br>    Copy to clipboard | add         x9,x2,#0x20<br>     add         x8,x1,#0x10<br>     sub         x13,x0,x1<br>     sub         x12,x2,x1<br>     sub         x11,x0,x2<br>     mov         x10,#4<br>    |$LL22@S|<br>     ldur        s17,[x8,#-0x10]<br>     sub         x10,x10,#1<br>     ldr         s16,[x0],#4<br>     fadd        s16,s17,s16<br>     ldr         s17,[x13,x8]<br>     stur        s16,[x9,#-0x20]<br>     ldr         s16,[x8]<br>     fadd        s16,s17,s16<br>     ldr         s17,[x11,x9]<br>     str         s16,[x12,x8]<br>     ldr         s16,[x8,#0x10]<br>     fadd        s16,s17,s16<br>     ldr         s17,[x8,#0x20]<br>     add         x8,x8,#4<br>     str         s16,[x9],#4<br>     ldr         s16,[x0,#0x2C]<br>     fadd        s16,s17,s16<br>     str         s16,[x9,#0xC]<br>     cbnz        x10,|$LL22@sumS|<br>     ret<br>    Copy to clipboard |

Again, GCC and Clang generated essentially the same vectorized code.
However, note that MSVC could not vectorize the code due to it the vectorizer determining that the *loop contains loop-carried data dependencies that prevent vectorization*.
There are two important takeaways in this example.
The first is that the exact same code may produce very different results, depending on the compiler and the flags used.
It is important to check the produced code to see how well the compiler vectorized the code.
The second is that careful architectural planning in the software can dramatically improve performance.
In this case, structuring the software and underlying data in such a fashion that it allows the compiler to leverage the underlying vector capabilities of the hardware increases performance.

While auto-vectorization can produce some good results in certain cases, it is compiler dependent and may not always result in the expected vector improvements.
It does require careful software planning to obtain the best utilization of the vector capabilities of the system.

### ISPC

The Intel Implicit SPMD Program Compiler (ISPC) is a vectorizing compiler using code syntax very similar to, and intended to integrate with, C and C++.
This compiler is NOT an auto-vectorizing compiler, but rather a compiler for code intentionally written to be vectorized.
The advantage of this approach is that it forces the software developer to think about how vectorization fits into the entire game design.
With this forethought in the design, the ISPC creates very optimized code in an architecture independent manner.
ISPC v1.19 and later generates NEON code for ARM64 CPUs on Windows, as well as vectorized code for Intel platforms, making it suitable for cross-platform development.
It is an open-source compiler with a BSD license and leverages the LLVM compiler for its backend.

With ISPC, code is written with a C syntax with added features to support the ability to write Single Program Multiple Data (SPMD) programs in a straightforward manner.
The ISPC compiler generates regular object files using standard C/C++ calling conventions, allowing straightforward integration with existing C/C++ files.

ISPC leverages the width of the vector unit of the CPU to maximize the benefit of vector code.
For Snapdragon X, the vector unit is 128-bits wide.
This allows it to perform calculations on multiple lanes of data at the same time.
For example, an add instruction could operate on four lanes of 32-bit numbers at the same time, or two lanes of 64-bit numbers at the same time.
Pipelining inside the Snapdragon X increases throughput by allowing back-to-back instructions to be scheduled, increasing the performance impact of well-created vectorized code.

The data should be laid out correctly to gain the most benefit from ISPC.
Similar to the SoA example shown above, ISPC prefers data laid in SoA or AoSoA to produce the best use from the vector instructions.
There are gather/scatter instructions, but rearranging data can be costly if done often.
It is best to keep data rearrangement to a minimum and leverage successive vector instructions as much as possible to obtain the best performance.

Consider the following example, which is equivalent to the SoA vectorization approach above.

Source code

struct FloatVector
    {
        float V[4];
    };
    
    FloatVector operator+(const FloatVector& A, const FloatVector& B)
    {
        FloatVector R;
    
        for (uniform int i=0; i< 4; i++)
        {
            R.V[i] = A.V[i] + B.V[i];
        }
    
        return R;
    }
    Copy to clipboard

| ISPC<br><br>`--target=neon --arch=aarch64` |
| --- |
| ldp     q0, q1, [x0]<br>    ldp     q2, q3, [x1]<br>    ldp     q4, q5, [x0, #32]<br>    ldp     q6, q7, [x1, #32]<br>    fadd    v0.4s, v0.4s, v2.4s<br>    fadd    v1.4s, v1.4s, v3.4s<br>    fadd    v2.4s, v4.4s, v6.4s<br>    fadd    v3.4s, v5.4s, v7.4s<br>    ret<br>    Copy to clipboard |

Note that in this case, the resultant assembly code is essentially the same as GCC and Clang assembly code for the SoA case.
ISPC will automatically try to generate code such that all lanes of the SIMD unit will be filled.
This makes intentionally writing vectorized code easier, as the programmer can focus on single vector operations and the compiler will attempt to vectorize the code to use all available lanes.
While the code appears to be written as only computing one (x,y,z,w) vector, the code actually can do four of them at the same time.
This does assume, however, that the data arrangement allows this approach.

### Unreal Engine

The Unreal Engine from Epic Games uses ISPC within the game engine to provide vectorized implementations of computationally intensive routines.
Currently, it is used within the Chaos physics system and in the animation system.
Epic supports using it in custom code as well.
All that is needed is to add the ISPC module to the `build.cs` file, add ISPC files to the project, include the auto-generated C++ headers, and the Unreal build system will take care of the rest.
Epic recommends using it for dense compute-bound workloads such as physics and cloth simulations or vertex transformations.
It is best used with contiguous memory loads, manipulations, and stores, for example, the Unreal TArray structures.
It is best used when there are no data dependencies between operations, and can be especially useful when combined with ParallelFor constructs and batching.

## Topology and threading/affinity

Properly understanding, detecting, and using the system topology is critical to obtaining the best gaming performance.
In the current desktop and mobile market, processors come in a variety of different core configurations.
Some configurations may be homogeneous, where every core is the same, or heterogeneous, where there are different core types within the same chip.
Additionally, there may be a difference in the clocking speeds of the cores, regardless of whether the cores are the same across all processors.

The Snapdragon X processor comes in symmetrical or asymmetrical configurations.
In a symmetrical configuration, all cores are the same and run at the same speed.
In an asymmetrical configuration, all cores are of the same type, but there may be a mix of lower speed efficiency cores and higher speed performance cores.
An example configuration might be a 12-core version which has 4 cores clocked at 2.5 GHz and 8 cores clocked at 3.4 GHz.
To optimize gaming, it may be best to run rendering threads or time-critical code on the fastest processors.

The number of physical cores on an Snapdragon X system is always the same as the number of logical cores.
This is important to keep in mind as gaming code may consider both the number of physical cores and the number of virtual cores when determining how many and what types of threads are created.
For example, the Unreal Engine determines the number of foreground and background threads based on the number of virtual cores, which for Snapdragon X is the same as the number of physical cores.

The process to determine how many cores there are and the speed at which each is running may differ across operating systems.
Generally, the best guidance is to use the least complex solution that obtains the desired information and consistently use it across the codebase.
For Windows, there are several methods to determine the target CPU’s topology.
The Windows API provides several methods, such as `GetLogicalProcessorInformation` and `GetSystemCPUSetInformation`, that allow users to fully enumerate logical processors.

### Example

The following example shows how to use `GetLogicalProcessorInformationEx` to query the number of processors and the speeds at which they run.

#include <Windows.h>
    #include <iostream>
    
    extern "C" {
    #include <Powrprof.h>
    }
    
    #include <vector>
    
    #pragma comment(lib, "Powrprof.lib")
    
    typedef struct _PROCESSOR_POWER_INFORMATION {
        ULONG  Number;
        ULONG  MaxMhz;
        ULONG  CurrentMhz;
        ULONG  MhzLimit;
        ULONG  MaxIdleState;
        ULONG  CurrentIdleState;
    } PROCESSOR_POWER_INFORMATION, * PPROCESSOR_POWER_INFORMATION;
    
    typedef BOOL(WINAPI* LPFN_GLPI)(PSYSTEM_LOGICAL_PROCESSOR_INFORMATION, PDWORD);
    typedef BOOL(WINAPI* LPFN_GLPIEX)(LOGICAL_PROCESSOR_RELATIONSHIP, PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX, PDWORD);
    typedef DWORD(WINAPI* LPFN_GAPC)(WORD);
    
    #define ALL_PROCESSOR_GROUPS 0xffff
    
    // Helper function to count set bits in the processor mask.
    DWORD CountSetBits(ULONG_PTR bitMask)
    {
        DWORD LSHIFT = sizeof(ULONG_PTR) * 8 - 1;
        DWORD bitSetCount = 0;
        ULONG_PTR bitTest = (ULONG_PTR)1 << LSHIFT;
        DWORD i;
    
        for (i = 0; i <= LSHIFT; ++i)
        {
            bitSetCount += ((bitMask & bitTest) ? 1 : 0);
            bitTest /= 2;
        }
    
        return bitSetCount;
    }
    
    DWORD GetInstalledProcessorCount()
    {
        // on Windows 7 and later, use GetActiveProcessorCount() ...
    
        LPFN_GAPC gapc = (LPFN_GAPC)GetProcAddress(GetModuleHandle(TEXT("kernel32")), "GetActiveProcessorCount");
        if (gapc)
            return gapc(ALL_PROCESSOR_GROUPS);
    
        // on Vista and later, try GetLogicalProcessorInformationEx() next ...
    
        LPFN_GLPIEX glpiex = (LPFN_GLPIEX)GetProcAddress(GetModuleHandle(TEXT("kernel32")), "GetLogicalProcessorInformationEx");
        if (glpiex)
        {
            std::vector<BYTE> buffer;
            PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX info = NULL;
            DWORD bufsize = 0;
    
            // not using RelationGroup because it does not return accurate info under WOW64...
            while (!glpiex(RelationProcessorCore, info, &bufsize))
            {
                if (GetLastError() != ERROR_INSUFFICIENT_BUFFER)
                    return 0;
    
                buffer.resize(bufsize);
                info = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)&buffer[0];
            }
    
            DWORD logicalProcessorCount = 0;
    
            while (bufsize >= sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX))
            {
                // for RelationProcessorCore, info->Processor.GroupCount is always 1...
                logicalProcessorCount += CountSetBits(info->Processor.GroupMask[0].Mask);
                bufsize -= info->Size;
                info = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(((LPBYTE)info) + info->Size);
            }
    
            return logicalProcessorCount;
        }
    
        // on XP and later, try GetLogicalProcessorInformation() next...
    
        LPFN_GLPI glpi = (LPFN_GLPI)GetProcAddress(GetModuleHandle(TEXT("kernel32")), "GetLogicalProcessorInformation");
        if (glpi)
        {
            std::vector<BYTE> buffer;
            PSYSTEM_LOGICAL_PROCESSOR_INFORMATION info = NULL;
            DWORD bufsize = 0;
    
            while (!glpi(info, &bufsize))
            {
                if (GetLastError() != ERROR_INSUFFICIENT_BUFFER)
                    return 0;
    
                buffer.resize(bufsize);
                info = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION)&buffer[0];
            }
    
            DWORD logicalProcessorCount = 0;
    
            while (bufsize >= sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION))
            {
                if (info->Relationship == RelationProcessorCore)
                {
                    // A hyperthreaded core supplies more than one logical processor.
                    logicalProcessorCount += CountSetBits(info->ProcessorMask);
                }
    
                bufsize -= sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);
                ++info;
            }
    
            return logicalProcessorCount;
        }
    
        // fallback to GetSystemInfo() last ...
    
        SYSTEM_INFO si = { 0 };
        GetSystemInfo(&si);
    
        return si.dwNumberOfProcessors;
    }
    
    int main(int argc, char* argv[])
    {
        DWORD dwNumProcessors = GetInstalledProcessorCount();
    
        std::vector<PROCESSOR_POWER_INFORMATION> a(dwNumProcessors);
        DWORD dwSize = sizeof(PROCESSOR_POWER_INFORMATION) * dwNumProcessors;
        std::cout << "Number of processors: " << dwNumProcessors << std::endl;
        CallNtPowerInformation(ProcessorInformation, NULL, 0, &a[0], dwSize);
        std::cout << "Processor\tMHz\tMaxMHz\tMHz Limit\tIdle State\tMax Idle State" << std::endl;
        for (int i = 0; i < dwNumProcessors; i++)
        {
            std::cout << a[i].Number << "\t\t";
            std::cout << a[i].CurrentMhz << '\t';
            std::cout << a[i].MaxMhz << '\t';
            std::cout << a[i].MhzLimit << "\t\t\t";
            std::cout << a[i].CurrentIdleState << "\t\t";
            std::cout << a[i].MaxIdleState << "\t" << std::endl;
        }

        a.clear();
    
        system("pause");
        return 0;
    }
    Copy to clipboard

### Thread priority

Once the topology has been determined, developers can decide if certain threads should be run on higher performance cores.
Thread priority has historically been used by game developers to accomplish two things: high priority, interrupt-style work and work directly related to the frame rate.
Typically, these kinds of tasks are put on high priority threads to ensure that things like audio processing (bursty, but time-dependent work)
or rendering (directly impacting framerate) work gets as many resources as possible.
For processors with homogeneous, symmetrical cores, thread priority by itself may work fine.
However, in Snapdragon X processors, which have asymmetrical cores (cores running at different frequencies), using thread priority alone may not result in optimum performance.
Though a thread may be given high priority to run, the OS may wind up choosing to place that thread on a lower frequency core.

### Thread affinity

Thread affinity attempts to restrict a particular thread to a set of cores, while still allowing the OS to move threads around the indicated subset.
It is recommended that the developer allows the OS to determine the affinity of threads and programs run on Snapdragon X.
Today’s operating systems do a good job of balancing performance and power consumption and allowing them to control where processes are run is typically the best course of action.

In those special cases where more control is needed, some major game development platforms may already have software infrastructure to set the affinity for certain threads or tasks.
Unreal Engine 5 has implemented functions to set affinity for all the platforms it supports.
The Unreal GameThread, for example, can have its affinity mask set via the `FPlatformAffinity::GetMainGameMask()` function, which gets called during the initialization of the engine.
There are also specific calls for setting the affinity mask for the rendering thread, RHI thread, audio thread, and foreground and background threads.
This can be overwritten in a platform specific file to return the appropriate mask for the specific Snapdragon X chip in use.

Last Published: Mar 03, 2026

[Previous Topic
Differences from X64](https://docs.qualcomm.com/bundle/publicresource/80-78185-2/topics/architecture.md) [Next Topic
Best Practices](https://docs.qualcomm.com/bundle/publicresource/80-78185-2/topics/cpu_best_practices.md)