175 lines
6.5 KiB
ReStructuredText
175 lines
6.5 KiB
ReStructuredText
|
Using XSTATE features in user space applications
|
||
|
================================================
|
||
|
|
||
|
The x86 architecture supports floating-point extensions which are
|
||
|
enumerated via CPUID. Applications consult CPUID and use XGETBV to
|
||
|
evaluate which features have been enabled by the kernel XCR0.
|
||
|
|
||
|
Up to AVX-512 and PKRU states, these features are automatically enabled by
|
||
|
the kernel if available. Features like AMX TILE_DATA (XSTATE component 18)
|
||
|
are enabled by XCR0 as well, but the first use of related instruction is
|
||
|
trapped by the kernel because by default the required large XSTATE buffers
|
||
|
are not allocated automatically.
|
||
|
|
||
|
The purpose for dynamic features
|
||
|
--------------------------------
|
||
|
|
||
|
Legacy userspace libraries often have hard-coded, static sizes for
|
||
|
alternate signal stacks, often using MINSIGSTKSZ which is typically 2KB.
|
||
|
That stack must be able to store at *least* the signal frame that the
|
||
|
kernel sets up before jumping into the signal handler. That signal frame
|
||
|
must include an XSAVE buffer defined by the CPU.
|
||
|
|
||
|
However, that means that the size of signal stacks is dynamic, not static,
|
||
|
because different CPUs have differently-sized XSAVE buffers. A compiled-in
|
||
|
size of 2KB with existing applications is too small for new CPU features
|
||
|
like AMX. Instead of universally requiring larger stack, with the dynamic
|
||
|
enabling, the kernel can enforce userspace applications to have
|
||
|
properly-sized altstacks.
|
||
|
|
||
|
Using dynamically enabled XSTATE features in user space applications
|
||
|
--------------------------------------------------------------------
|
||
|
|
||
|
The kernel provides an arch_prctl(2) based mechanism for applications to
|
||
|
request the usage of such features. The arch_prctl(2) options related to
|
||
|
this are:
|
||
|
|
||
|
-ARCH_GET_XCOMP_SUPP
|
||
|
|
||
|
arch_prctl(ARCH_GET_XCOMP_SUPP, &features);
|
||
|
|
||
|
ARCH_GET_XCOMP_SUPP stores the supported features in userspace storage of
|
||
|
type uint64_t. The second argument is a pointer to that storage.
|
||
|
|
||
|
-ARCH_GET_XCOMP_PERM
|
||
|
|
||
|
arch_prctl(ARCH_GET_XCOMP_PERM, &features);
|
||
|
|
||
|
ARCH_GET_XCOMP_PERM stores the features for which the userspace process
|
||
|
has permission in userspace storage of type uint64_t. The second argument
|
||
|
is a pointer to that storage.
|
||
|
|
||
|
-ARCH_REQ_XCOMP_PERM
|
||
|
|
||
|
arch_prctl(ARCH_REQ_XCOMP_PERM, feature_nr);
|
||
|
|
||
|
ARCH_REQ_XCOMP_PERM allows to request permission for a dynamically enabled
|
||
|
feature or a feature set. A feature set can be mapped to a facility, e.g.
|
||
|
AMX, and can require one or more XSTATE components to be enabled.
|
||
|
|
||
|
The feature argument is the number of the highest XSTATE component which
|
||
|
is required for a facility to work.
|
||
|
|
||
|
When requesting permission for a feature, the kernel checks the
|
||
|
availability. The kernel ensures that sigaltstacks in the process's tasks
|
||
|
are large enough to accommodate the resulting large signal frame. It
|
||
|
enforces this both during ARCH_REQ_XCOMP_SUPP and during any subsequent
|
||
|
sigaltstack(2) calls. If an installed sigaltstack is smaller than the
|
||
|
resulting sigframe size, ARCH_REQ_XCOMP_SUPP results in -ENOSUPP. Also,
|
||
|
sigaltstack(2) results in -ENOMEM if the requested altstack is too small
|
||
|
for the permitted features.
|
||
|
|
||
|
Permission, when granted, is valid per process. Permissions are inherited
|
||
|
on fork(2) and cleared on exec(3).
|
||
|
|
||
|
The first use of an instruction related to a dynamically enabled feature is
|
||
|
trapped by the kernel. The trap handler checks whether the process has
|
||
|
permission to use the feature. If the process has no permission then the
|
||
|
kernel sends SIGILL to the application. If the process has permission then
|
||
|
the handler allocates a larger xstate buffer for the task so the large
|
||
|
state can be context switched. In the unlikely cases that the allocation
|
||
|
fails, the kernel sends SIGSEGV.
|
||
|
|
||
|
AMX TILE_DATA enabling example
|
||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
|
||
|
Below is the example of how userspace applications enable
|
||
|
TILE_DATA dynamically:
|
||
|
|
||
|
1. The application first needs to query the kernel for AMX
|
||
|
support::
|
||
|
|
||
|
#include <asm/prctl.h>
|
||
|
#include <sys/syscall.h>
|
||
|
#include <stdio.h>
|
||
|
#include <unistd.h>
|
||
|
|
||
|
#ifndef ARCH_GET_XCOMP_SUPP
|
||
|
#define ARCH_GET_XCOMP_SUPP 0x1021
|
||
|
#endif
|
||
|
|
||
|
#ifndef ARCH_XCOMP_TILECFG
|
||
|
#define ARCH_XCOMP_TILECFG 17
|
||
|
#endif
|
||
|
|
||
|
#ifndef ARCH_XCOMP_TILEDATA
|
||
|
#define ARCH_XCOMP_TILEDATA 18
|
||
|
#endif
|
||
|
|
||
|
#define MASK_XCOMP_TILE ((1 << ARCH_XCOMP_TILECFG) | \
|
||
|
(1 << ARCH_XCOMP_TILEDATA))
|
||
|
|
||
|
unsigned long features;
|
||
|
long rc;
|
||
|
|
||
|
...
|
||
|
|
||
|
rc = syscall(SYS_arch_prctl, ARCH_GET_XCOMP_SUPP, &features);
|
||
|
|
||
|
if (!rc && (features & MASK_XCOMP_TILE) == MASK_XCOMP_TILE)
|
||
|
printf("AMX is available.\n");
|
||
|
|
||
|
2. After that, determining support for AMX, an application must
|
||
|
explicitly ask permission to use it::
|
||
|
|
||
|
#ifndef ARCH_REQ_XCOMP_PERM
|
||
|
#define ARCH_REQ_XCOMP_PERM 0x1023
|
||
|
#endif
|
||
|
|
||
|
...
|
||
|
|
||
|
rc = syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, ARCH_XCOMP_TILEDATA);
|
||
|
|
||
|
if (!rc)
|
||
|
printf("AMX is ready for use.\n");
|
||
|
|
||
|
Note this example does not include the sigaltstack preparation.
|
||
|
|
||
|
Dynamic features in signal frames
|
||
|
---------------------------------
|
||
|
|
||
|
Dynamcally enabled features are not written to the signal frame upon signal
|
||
|
entry if the feature is in its initial configuration. This differs from
|
||
|
non-dynamic features which are always written regardless of their
|
||
|
configuration. Signal handlers can examine the XSAVE buffer's XSTATE_BV
|
||
|
field to determine if a features was written.
|
||
|
|
||
|
Dynamic features for virtual machines
|
||
|
-------------------------------------
|
||
|
|
||
|
The permission for the guest state component needs to be managed separately
|
||
|
from the host, as they are exclusive to each other. A coupled of options
|
||
|
are extended to control the guest permission:
|
||
|
|
||
|
-ARCH_GET_XCOMP_GUEST_PERM
|
||
|
|
||
|
arch_prctl(ARCH_GET_XCOMP_GUEST_PERM, &features);
|
||
|
|
||
|
ARCH_GET_XCOMP_GUEST_PERM is a variant of ARCH_GET_XCOMP_PERM. So it
|
||
|
provides the same semantics and functionality but for the guest
|
||
|
components.
|
||
|
|
||
|
-ARCH_REQ_XCOMP_GUEST_PERM
|
||
|
|
||
|
arch_prctl(ARCH_REQ_XCOMP_GUEST_PERM, feature_nr);
|
||
|
|
||
|
ARCH_REQ_XCOMP_GUEST_PERM is a variant of ARCH_REQ_XCOMP_PERM. It has the
|
||
|
same semantics for the guest permission. While providing a similar
|
||
|
functionality, this comes with a constraint. Permission is frozen when the
|
||
|
first VCPU is created. Any attempt to change permission after that point
|
||
|
is going to be rejected. So, the permission has to be requested before the
|
||
|
first VCPU creation.
|
||
|
|
||
|
Note that some VMMs may have already established a set of supported state
|
||
|
components. These options are not presumed to support any particular VMM.
|