177 lines
7.3 KiB
ReStructuredText
177 lines
7.3 KiB
ReStructuredText
=======================
|
|
NUMA Memory Performance
|
|
=======================
|
|
|
|
NUMA Locality
|
|
=============
|
|
|
|
Some platforms may have multiple types of memory attached to a compute
|
|
node. These disparate memory ranges may share some characteristics, such
|
|
as CPU cache coherence, but may have different performance. For example,
|
|
different media types and buses affect bandwidth and latency.
|
|
|
|
A system supports such heterogeneous memory by grouping each memory type
|
|
under different domains, or "nodes", based on locality and performance
|
|
characteristics. Some memory may share the same node as a CPU, and others
|
|
are provided as memory only nodes. While memory only nodes do not provide
|
|
CPUs, they may still be local to one or more compute nodes relative to
|
|
other nodes. The following diagram shows one such example of two compute
|
|
nodes with local memory and a memory only node for each of compute node::
|
|
|
|
+------------------+ +------------------+
|
|
| Compute Node 0 +-----+ Compute Node 1 |
|
|
| Local Node0 Mem | | Local Node1 Mem |
|
|
+--------+---------+ +--------+---------+
|
|
| |
|
|
+--------+---------+ +--------+---------+
|
|
| Slower Node2 Mem | | Slower Node3 Mem |
|
|
+------------------+ +--------+---------+
|
|
|
|
A "memory initiator" is a node containing one or more devices such as
|
|
CPUs or separate memory I/O devices that can initiate memory requests.
|
|
A "memory target" is a node containing one or more physical address
|
|
ranges accessible from one or more memory initiators.
|
|
|
|
When multiple memory initiators exist, they may not all have the same
|
|
performance when accessing a given memory target. Each initiator-target
|
|
pair may be organized into different ranked access classes to represent
|
|
this relationship. The highest performing initiator to a given target
|
|
is considered to be one of that target's local initiators, and given
|
|
the highest access class, 0. Any given target may have one or more
|
|
local initiators, and any given initiator may have multiple local
|
|
memory targets.
|
|
|
|
To aid applications matching memory targets with their initiators, the
|
|
kernel provides symlinks to each other. The following example lists the
|
|
relationship for the access class "0" memory initiators and targets::
|
|
|
|
# symlinks -v /sys/devices/system/node/nodeX/access0/targets/
|
|
relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY
|
|
|
|
# symlinks -v /sys/devices/system/node/nodeY/access0/initiators/
|
|
relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX
|
|
|
|
A memory initiator may have multiple memory targets in the same access
|
|
class. The target memory's initiators in a given class indicate the
|
|
nodes' access characteristics share the same performance relative to other
|
|
linked initiator nodes. Each target within an initiator's access class,
|
|
though, do not necessarily perform the same as each other.
|
|
|
|
The access class "1" is used to allow differentiation between initiators
|
|
that are CPUs and hence suitable for generic task scheduling, and
|
|
IO initiators such as GPUs and NICs. Unlike access class 0, only
|
|
nodes containing CPUs are considered.
|
|
|
|
NUMA Performance
|
|
================
|
|
|
|
Applications may wish to consider which node they want their memory to
|
|
be allocated from based on the node's performance characteristics. If
|
|
the system provides these attributes, the kernel exports them under the
|
|
node sysfs hierarchy by appending the attributes directory under the
|
|
memory node's access class 0 initiators as follows::
|
|
|
|
/sys/devices/system/node/nodeY/access0/initiators/
|
|
|
|
These attributes apply only when accessed from nodes that have the
|
|
are linked under the this access's initiators.
|
|
|
|
The performance characteristics the kernel provides for the local initiators
|
|
are exported are as follows::
|
|
|
|
# tree -P "read*|write*" /sys/devices/system/node/nodeY/access0/initiators/
|
|
/sys/devices/system/node/nodeY/access0/initiators/
|
|
|-- read_bandwidth
|
|
|-- read_latency
|
|
|-- write_bandwidth
|
|
`-- write_latency
|
|
|
|
The bandwidth attributes are provided in MiB/second.
|
|
|
|
The latency attributes are provided in nanoseconds.
|
|
|
|
The values reported here correspond to the rated latency and bandwidth
|
|
for the platform.
|
|
|
|
Access class 1 takes the same form but only includes values for CPU to
|
|
memory activity.
|
|
|
|
NUMA Cache
|
|
==========
|
|
|
|
System memory may be constructed in a hierarchy of elements with various
|
|
performance characteristics in order to provide large address space of
|
|
slower performing memory cached by a smaller higher performing memory. The
|
|
system physical addresses memory initiators are aware of are provided
|
|
by the last memory level in the hierarchy. The system meanwhile uses
|
|
higher performing memory to transparently cache access to progressively
|
|
slower levels.
|
|
|
|
The term "far memory" is used to denote the last level memory in the
|
|
hierarchy. Each increasing cache level provides higher performing
|
|
initiator access, and the term "near memory" represents the fastest
|
|
cache provided by the system.
|
|
|
|
This numbering is different than CPU caches where the cache level (ex:
|
|
L1, L2, L3) uses the CPU-side view where each increased level is lower
|
|
performing. In contrast, the memory cache level is centric to the last
|
|
level memory, so the higher numbered cache level corresponds to memory
|
|
nearer to the CPU, and further from far memory.
|
|
|
|
The memory-side caches are not directly addressable by software. When
|
|
software accesses a system address, the system will return it from the
|
|
near memory cache if it is present. If it is not present, the system
|
|
accesses the next level of memory until there is either a hit in that
|
|
cache level, or it reaches far memory.
|
|
|
|
An application does not need to know about caching attributes in order
|
|
to use the system. Software may optionally query the memory cache
|
|
attributes in order to maximize the performance out of such a setup.
|
|
If the system provides a way for the kernel to discover this information,
|
|
for example with ACPI HMAT (Heterogeneous Memory Attribute Table),
|
|
the kernel will append these attributes to the NUMA node memory target.
|
|
|
|
When the kernel first registers a memory cache with a node, the kernel
|
|
will create the following directory::
|
|
|
|
/sys/devices/system/node/nodeX/memory_side_cache/
|
|
|
|
If that directory is not present, the system either does not provide
|
|
a memory-side cache, or that information is not accessible to the kernel.
|
|
|
|
The attributes for each level of cache is provided under its cache
|
|
level index::
|
|
|
|
/sys/devices/system/node/nodeX/memory_side_cache/indexA/
|
|
/sys/devices/system/node/nodeX/memory_side_cache/indexB/
|
|
/sys/devices/system/node/nodeX/memory_side_cache/indexC/
|
|
|
|
Each cache level's directory provides its attributes. For example, the
|
|
following shows a single cache level and the attributes available for
|
|
software to query::
|
|
|
|
# tree /sys/devices/system/node/node0/memory_side_cache/
|
|
/sys/devices/system/node/node0/memory_side_cache/
|
|
|-- index1
|
|
| |-- indexing
|
|
| |-- line_size
|
|
| |-- size
|
|
| `-- write_policy
|
|
|
|
The "indexing" will be 0 if it is a direct-mapped cache, and non-zero
|
|
for any other indexed based, multi-way associativity.
|
|
|
|
The "line_size" is the number of bytes accessed from the next cache
|
|
level on a miss.
|
|
|
|
The "size" is the number of bytes provided by this cache level.
|
|
|
|
The "write_policy" will be 0 for write-back, and non-zero for
|
|
write-through caching.
|
|
|
|
See Also
|
|
========
|
|
|
|
[1] https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
|
|
- Section 5.2.27
|