557 lines
22 KiB
Plaintext
557 lines
22 KiB
Plaintext
This document gives an overview of the categories of memory-ordering
|
|
operations provided by the Linux-kernel memory model (LKMM).
|
|
|
|
|
|
Categories of Ordering
|
|
======================
|
|
|
|
This section lists LKMM's three top-level categories of memory-ordering
|
|
operations in decreasing order of strength:
|
|
|
|
1. Barriers (also known as "fences"). A barrier orders some or
|
|
all of the CPU's prior operations against some or all of its
|
|
subsequent operations.
|
|
|
|
2. Ordered memory accesses. These operations order themselves
|
|
against some or all of the CPU's prior accesses or some or all
|
|
of the CPU's subsequent accesses, depending on the subcategory
|
|
of the operation.
|
|
|
|
3. Unordered accesses, as the name indicates, have no ordering
|
|
properties except to the extent that they interact with an
|
|
operation in the previous categories. This being the real world,
|
|
some of these "unordered" operations provide limited ordering
|
|
in some special situations.
|
|
|
|
Each of the above categories is described in more detail by one of the
|
|
following sections.
|
|
|
|
|
|
Barriers
|
|
========
|
|
|
|
Each of the following categories of barriers is described in its own
|
|
subsection below:
|
|
|
|
a. Full memory barriers.
|
|
|
|
b. Read-modify-write (RMW) ordering augmentation barriers.
|
|
|
|
c. Write memory barrier.
|
|
|
|
d. Read memory barrier.
|
|
|
|
e. Compiler barrier.
|
|
|
|
Note well that many of these primitives generate absolutely no code
|
|
in kernels built with CONFIG_SMP=n. Therefore, if you are writing
|
|
a device driver, which must correctly order accesses to a physical
|
|
device even in kernels built with CONFIG_SMP=n, please use the
|
|
ordering primitives provided for that purpose. For example, instead of
|
|
smp_mb(), use mb(). See the "Linux Kernel Device Drivers" book or the
|
|
https://lwn.net/Articles/698014/ article for more information.
|
|
|
|
|
|
Full Memory Barriers
|
|
--------------------
|
|
|
|
The Linux-kernel primitives that provide full ordering include:
|
|
|
|
o The smp_mb() full memory barrier.
|
|
|
|
o Value-returning RMW atomic operations whose names do not end in
|
|
_acquire, _release, or _relaxed.
|
|
|
|
o RCU's grace-period primitives.
|
|
|
|
First, the smp_mb() full memory barrier orders all of the CPU's prior
|
|
accesses against all subsequent accesses from the viewpoint of all CPUs.
|
|
In other words, all CPUs will agree that any earlier action taken
|
|
by that CPU happened before any later action taken by that same CPU.
|
|
For example, consider the following:
|
|
|
|
WRITE_ONCE(x, 1);
|
|
smp_mb(); // Order store to x before load from y.
|
|
r1 = READ_ONCE(y);
|
|
|
|
All CPUs will agree that the store to "x" happened before the load
|
|
from "y", as indicated by the comment. And yes, please comment your
|
|
memory-ordering primitives. It is surprisingly hard to remember their
|
|
purpose after even a few months.
|
|
|
|
Second, some RMW atomic operations provide full ordering. These
|
|
operations include value-returning RMW atomic operations (that is, those
|
|
with non-void return types) whose names do not end in _acquire, _release,
|
|
or _relaxed. Examples include atomic_add_return(), atomic_dec_and_test(),
|
|
cmpxchg(), and xchg(). Note that conditional RMW atomic operations such
|
|
as cmpxchg() are only guaranteed to provide ordering when they succeed.
|
|
When RMW atomic operations provide full ordering, they partition the
|
|
CPU's accesses into three groups:
|
|
|
|
1. All code that executed prior to the RMW atomic operation.
|
|
|
|
2. The RMW atomic operation itself.
|
|
|
|
3. All code that executed after the RMW atomic operation.
|
|
|
|
All CPUs will agree that any operation in a given partition happened
|
|
before any operation in a higher-numbered partition.
|
|
|
|
In contrast, non-value-returning RMW atomic operations (that is, those
|
|
with void return types) do not guarantee any ordering whatsoever. Nor do
|
|
value-returning RMW atomic operations whose names end in _relaxed.
|
|
Examples of the former include atomic_inc() and atomic_dec(),
|
|
while examples of the latter include atomic_cmpxchg_relaxed() and
|
|
atomic_xchg_relaxed(). Similarly, value-returning non-RMW atomic
|
|
operations such as atomic_read() do not guarantee full ordering, and
|
|
are covered in the later section on unordered operations.
|
|
|
|
Value-returning RMW atomic operations whose names end in _acquire or
|
|
_release provide limited ordering, and will be described later in this
|
|
document.
|
|
|
|
Finally, RCU's grace-period primitives provide full ordering. These
|
|
primitives include synchronize_rcu(), synchronize_rcu_expedited(),
|
|
synchronize_srcu() and so on. However, these primitives have orders
|
|
of magnitude greater overhead than smp_mb(), atomic_xchg(), and so on.
|
|
Furthermore, RCU's grace-period primitives can only be invoked in
|
|
sleepable contexts. Therefore, RCU's grace-period primitives are
|
|
typically instead used to provide ordering against RCU read-side critical
|
|
sections, as documented in their comment headers. But of course if you
|
|
need a synchronize_rcu() to interact with readers, it costs you nothing
|
|
to also rely on its additional full-memory-barrier semantics. Just please
|
|
carefully comment this, otherwise your future self will hate you.
|
|
|
|
|
|
RMW Ordering Augmentation Barriers
|
|
----------------------------------
|
|
|
|
As noted in the previous section, non-value-returning RMW operations
|
|
such as atomic_inc() and atomic_dec() guarantee no ordering whatsoever.
|
|
Nevertheless, a number of popular CPU families, including x86, provide
|
|
full ordering for these primitives. One way to obtain full ordering on
|
|
all architectures is to add a call to smp_mb():
|
|
|
|
WRITE_ONCE(x, 1);
|
|
atomic_inc(&my_counter);
|
|
smp_mb(); // Inefficient on x86!!!
|
|
r1 = READ_ONCE(y);
|
|
|
|
This works, but the added smp_mb() adds needless overhead for
|
|
x86, on which atomic_inc() provides full ordering all by itself.
|
|
The smp_mb__after_atomic() primitive can be used instead:
|
|
|
|
WRITE_ONCE(x, 1);
|
|
atomic_inc(&my_counter);
|
|
smp_mb__after_atomic(); // Order store to x before load from y.
|
|
r1 = READ_ONCE(y);
|
|
|
|
The smp_mb__after_atomic() primitive emits code only on CPUs whose
|
|
atomic_inc() implementations do not guarantee full ordering, thus
|
|
incurring no unnecessary overhead on x86. There are a number of
|
|
variations on the smp_mb__*() theme:
|
|
|
|
o smp_mb__before_atomic(), which provides full ordering prior
|
|
to an unordered RMW atomic operation.
|
|
|
|
o smp_mb__after_atomic(), which, as shown above, provides full
|
|
ordering subsequent to an unordered RMW atomic operation.
|
|
|
|
o smp_mb__after_spinlock(), which provides full ordering subsequent
|
|
to a successful spinlock acquisition. Note that spin_lock() is
|
|
always successful but spin_trylock() might not be.
|
|
|
|
o smp_mb__after_srcu_read_unlock(), which provides full ordering
|
|
subsequent to an srcu_read_unlock().
|
|
|
|
It is bad practice to place code between the smp__*() primitive and the
|
|
operation whose ordering that it is augmenting. The reason is that the
|
|
ordering of this intervening code will differ from one CPU architecture
|
|
to another.
|
|
|
|
|
|
Write Memory Barrier
|
|
--------------------
|
|
|
|
The Linux kernel's write memory barrier is smp_wmb(). If a CPU executes
|
|
the following code:
|
|
|
|
WRITE_ONCE(x, 1);
|
|
smp_wmb();
|
|
WRITE_ONCE(y, 1);
|
|
|
|
Then any given CPU will see the write to "x" has having happened before
|
|
the write to "y". However, you are usually better off using a release
|
|
store, as described in the "Release Operations" section below.
|
|
|
|
Note that smp_wmb() might fail to provide ordering for unmarked C-language
|
|
stores because profile-driven optimization could determine that the
|
|
value being overwritten is almost always equal to the new value. Such a
|
|
compiler might then reasonably decide to transform "x = 1" and "y = 1"
|
|
as follows:
|
|
|
|
if (x != 1)
|
|
x = 1;
|
|
smp_wmb(); // BUG: does not order the reads!!!
|
|
if (y != 1)
|
|
y = 1;
|
|
|
|
Therefore, if you need to use smp_wmb() with unmarked C-language writes,
|
|
you will need to make sure that none of the compilers used to build
|
|
the Linux kernel carry out this sort of transformation, both now and in
|
|
the future.
|
|
|
|
|
|
Read Memory Barrier
|
|
-------------------
|
|
|
|
The Linux kernel's read memory barrier is smp_rmb(). If a CPU executes
|
|
the following code:
|
|
|
|
r0 = READ_ONCE(y);
|
|
smp_rmb();
|
|
r1 = READ_ONCE(x);
|
|
|
|
Then any given CPU will see the read from "y" as having preceded the read from
|
|
"x". However, you are usually better off using an acquire load, as described
|
|
in the "Acquire Operations" section below.
|
|
|
|
Compiler Barrier
|
|
----------------
|
|
|
|
The Linux kernel's compiler barrier is barrier(). This primitive
|
|
prohibits compiler code-motion optimizations that might move memory
|
|
references across the point in the code containing the barrier(), but
|
|
does not constrain hardware memory ordering. For example, this can be
|
|
used to prevent to compiler from moving code across an infinite loop:
|
|
|
|
WRITE_ONCE(x, 1);
|
|
while (dontstop)
|
|
barrier();
|
|
r1 = READ_ONCE(y);
|
|
|
|
Without the barrier(), the compiler would be within its rights to move the
|
|
WRITE_ONCE() to follow the loop. This code motion could be problematic
|
|
in the case where an interrupt handler terminates the loop. Another way
|
|
to handle this is to use READ_ONCE() for the load of "dontstop".
|
|
|
|
Note that the barriers discussed previously use barrier() or its low-level
|
|
equivalent in their implementations.
|
|
|
|
|
|
Ordered Memory Accesses
|
|
=======================
|
|
|
|
The Linux kernel provides a wide variety of ordered memory accesses:
|
|
|
|
a. Release operations.
|
|
|
|
b. Acquire operations.
|
|
|
|
c. RCU read-side ordering.
|
|
|
|
d. Control dependencies.
|
|
|
|
Each of the above categories has its own section below.
|
|
|
|
|
|
Release Operations
|
|
------------------
|
|
|
|
Release operations include smp_store_release(), atomic_set_release(),
|
|
rcu_assign_pointer(), and value-returning RMW operations whose names
|
|
end in _release. These operations order their own store against all
|
|
of the CPU's prior memory accesses. Release operations often provide
|
|
improved readability and performance compared to explicit barriers.
|
|
For example, use of smp_store_release() saves a line compared to the
|
|
smp_wmb() example above:
|
|
|
|
WRITE_ONCE(x, 1);
|
|
smp_store_release(&y, 1);
|
|
|
|
More important, smp_store_release() makes it easier to connect up the
|
|
different pieces of the concurrent algorithm. The variable stored to
|
|
by the smp_store_release(), in this case "y", will normally be used in
|
|
an acquire operation in other parts of the concurrent algorithm.
|
|
|
|
To see the performance advantages, suppose that the above example read
|
|
from "x" instead of writing to it. Then an smp_wmb() could not guarantee
|
|
ordering, and an smp_mb() would be needed instead:
|
|
|
|
r1 = READ_ONCE(x);
|
|
smp_mb();
|
|
WRITE_ONCE(y, 1);
|
|
|
|
But smp_mb() often incurs much higher overhead than does
|
|
smp_store_release(), which still provides the needed ordering of "x"
|
|
against "y". On x86, the version using smp_store_release() might compile
|
|
to a simple load instruction followed by a simple store instruction.
|
|
In contrast, the smp_mb() compiles to an expensive instruction that
|
|
provides the needed ordering.
|
|
|
|
There is a wide variety of release operations:
|
|
|
|
o Store operations, including not only the aforementioned
|
|
smp_store_release(), but also atomic_set_release(), and
|
|
atomic_long_set_release().
|
|
|
|
o RCU's rcu_assign_pointer() operation. This is the same as
|
|
smp_store_release() except that: (1) It takes the pointer to
|
|
be assigned to instead of a pointer to that pointer, (2) It
|
|
is intended to be used in conjunction with rcu_dereference()
|
|
and similar rather than smp_load_acquire(), and (3) It checks
|
|
for an RCU-protected pointer in "sparse" runs.
|
|
|
|
o Value-returning RMW operations whose names end in _release,
|
|
such as atomic_fetch_add_release() and cmpxchg_release().
|
|
Note that release ordering is guaranteed only against the
|
|
memory-store portion of the RMW operation, and not against the
|
|
memory-load portion. Note also that conditional operations such
|
|
as cmpxchg_release() are only guaranteed to provide ordering
|
|
when they succeed.
|
|
|
|
As mentioned earlier, release operations are often paired with acquire
|
|
operations, which are the subject of the next section.
|
|
|
|
|
|
Acquire Operations
|
|
------------------
|
|
|
|
Acquire operations include smp_load_acquire(), atomic_read_acquire(),
|
|
and value-returning RMW operations whose names end in _acquire. These
|
|
operations order their own load against all of the CPU's subsequent
|
|
memory accesses. Acquire operations often provide improved performance
|
|
and readability compared to explicit barriers. For example, use of
|
|
smp_load_acquire() saves a line compared to the smp_rmb() example above:
|
|
|
|
r0 = smp_load_acquire(&y);
|
|
r1 = READ_ONCE(x);
|
|
|
|
As with smp_store_release(), this also makes it easier to connect
|
|
the different pieces of the concurrent algorithm by looking for the
|
|
smp_store_release() that stores to "y". In addition, smp_load_acquire()
|
|
improves upon smp_rmb() by ordering against subsequent stores as well
|
|
as against subsequent loads.
|
|
|
|
There are a couple of categories of acquire operations:
|
|
|
|
o Load operations, including not only the aforementioned
|
|
smp_load_acquire(), but also atomic_read_acquire(), and
|
|
atomic64_read_acquire().
|
|
|
|
o Value-returning RMW operations whose names end in _acquire,
|
|
such as atomic_xchg_acquire() and atomic_cmpxchg_acquire().
|
|
Note that acquire ordering is guaranteed only against the
|
|
memory-load portion of the RMW operation, and not against the
|
|
memory-store portion. Note also that conditional operations
|
|
such as atomic_cmpxchg_acquire() are only guaranteed to provide
|
|
ordering when they succeed.
|
|
|
|
Symmetry being what it is, acquire operations are often paired with the
|
|
release operations covered earlier. For example, consider the following
|
|
example, where task0() and task1() execute concurrently:
|
|
|
|
void task0(void)
|
|
{
|
|
WRITE_ONCE(x, 1);
|
|
smp_store_release(&y, 1);
|
|
}
|
|
|
|
void task1(void)
|
|
{
|
|
r0 = smp_load_acquire(&y);
|
|
r1 = READ_ONCE(x);
|
|
}
|
|
|
|
If "x" and "y" are both initially zero, then either r0's final value
|
|
will be zero or r1's final value will be one, thus providing the required
|
|
ordering.
|
|
|
|
|
|
RCU Read-Side Ordering
|
|
----------------------
|
|
|
|
This category includes read-side markers such as rcu_read_lock()
|
|
and rcu_read_unlock() as well as pointer-traversal primitives such as
|
|
rcu_dereference() and srcu_dereference().
|
|
|
|
Compared to locking primitives and RMW atomic operations, markers
|
|
for RCU read-side critical sections incur very low overhead because
|
|
they interact only with the corresponding grace-period primitives.
|
|
For example, the rcu_read_lock() and rcu_read_unlock() markers interact
|
|
with synchronize_rcu(), synchronize_rcu_expedited(), and call_rcu().
|
|
The way this works is that if a given call to synchronize_rcu() cannot
|
|
prove that it started before a given call to rcu_read_lock(), then
|
|
that synchronize_rcu() must block until the matching rcu_read_unlock()
|
|
is reached. For more information, please see the synchronize_rcu()
|
|
docbook header comment and the material in Documentation/RCU.
|
|
|
|
RCU's pointer-traversal primitives, including rcu_dereference() and
|
|
srcu_dereference(), order their load (which must be a pointer) against any
|
|
of the CPU's subsequent memory accesses whose address has been calculated
|
|
from the value loaded. There is said to be an *address dependency*
|
|
from the value returned by the rcu_dereference() or srcu_dereference()
|
|
to that subsequent memory access.
|
|
|
|
A call to rcu_dereference() for a given RCU-protected pointer is
|
|
usually paired with a call to a call to rcu_assign_pointer() for that
|
|
same pointer in much the same way that a call to smp_load_acquire() is
|
|
paired with a call to smp_store_release(). Calls to rcu_dereference()
|
|
and rcu_assign_pointer are often buried in other APIs, for example,
|
|
the RCU list API members defined in include/linux/rculist.h. For more
|
|
information, please see the docbook headers in that file, the most
|
|
recent LWN article on the RCU API (https://lwn.net/Articles/777036/),
|
|
and of course the material in Documentation/RCU.
|
|
|
|
If the pointer value is manipulated between the rcu_dereference()
|
|
that returned it and a later dereference(), please read
|
|
Documentation/RCU/rcu_dereference.rst. It can also be quite helpful to
|
|
review uses in the Linux kernel.
|
|
|
|
|
|
Control Dependencies
|
|
--------------------
|
|
|
|
A control dependency extends from a marked load (READ_ONCE() or stronger)
|
|
through an "if" condition to a marked store (WRITE_ONCE() or stronger)
|
|
that is executed only by one of the legs of that "if" statement.
|
|
Control dependencies are so named because they are mediated by
|
|
control-flow instructions such as comparisons and conditional branches.
|
|
|
|
In short, you can use a control dependency to enforce ordering between
|
|
an READ_ONCE() and a WRITE_ONCE() when there is an "if" condition
|
|
between them. The canonical example is as follows:
|
|
|
|
q = READ_ONCE(a);
|
|
if (q)
|
|
WRITE_ONCE(b, 1);
|
|
|
|
In this case, all CPUs would see the read from "a" as happening before
|
|
the write to "b".
|
|
|
|
However, control dependencies are easily destroyed by compiler
|
|
optimizations, so any use of control dependencies must take into account
|
|
all of the compilers used to build the Linux kernel. Please see the
|
|
"control-dependencies.txt" file for more information.
|
|
|
|
|
|
Unordered Accesses
|
|
==================
|
|
|
|
Each of these two categories of unordered accesses has a section below:
|
|
|
|
a. Unordered marked operations.
|
|
|
|
b. Unmarked C-language accesses.
|
|
|
|
|
|
Unordered Marked Operations
|
|
---------------------------
|
|
|
|
Unordered operations to different variables are just that, unordered.
|
|
However, if a group of CPUs apply these operations to a single variable,
|
|
all the CPUs will agree on the operation order. Of course, the ordering
|
|
of unordered marked accesses can also be constrained using the mechanisms
|
|
described earlier in this document.
|
|
|
|
These operations come in three categories:
|
|
|
|
o Marked writes, such as WRITE_ONCE() and atomic_set(). These
|
|
primitives required the compiler to emit the corresponding store
|
|
instructions in the expected execution order, thus suppressing
|
|
a number of destructive optimizations. However, they provide no
|
|
hardware ordering guarantees, and in fact many CPUs will happily
|
|
reorder marked writes with each other or with other unordered
|
|
operations, unless these operations are to the same variable.
|
|
|
|
o Marked reads, such as READ_ONCE() and atomic_read(). These
|
|
primitives required the compiler to emit the corresponding load
|
|
instructions in the expected execution order, thus suppressing
|
|
a number of destructive optimizations. However, they provide no
|
|
hardware ordering guarantees, and in fact many CPUs will happily
|
|
reorder marked reads with each other or with other unordered
|
|
operations, unless these operations are to the same variable.
|
|
|
|
o Unordered RMW atomic operations. These are non-value-returning
|
|
RMW atomic operations whose names do not end in _acquire or
|
|
_release, and also value-returning RMW operations whose names
|
|
end in _relaxed. Examples include atomic_add(), atomic_or(),
|
|
and atomic64_fetch_xor_relaxed(). These operations do carry
|
|
out the specified RMW operation atomically, for example, five
|
|
concurrent atomic_inc() operations applied to a given variable
|
|
will reliably increase the value of that variable by five.
|
|
However, many CPUs will happily reorder these operations with
|
|
each other or with other unordered operations.
|
|
|
|
This category of operations can be efficiently ordered using
|
|
smp_mb__before_atomic() and smp_mb__after_atomic(), as was
|
|
discussed in the "RMW Ordering Augmentation Barriers" section.
|
|
|
|
In short, these operations can be freely reordered unless they are all
|
|
operating on a single variable or unless they are constrained by one of
|
|
the operations called out earlier in this document.
|
|
|
|
|
|
Unmarked C-Language Accesses
|
|
----------------------------
|
|
|
|
Unmarked C-language accesses are normal variable accesses to normal
|
|
variables, that is, to variables that are not "volatile" and are not
|
|
C11 atomic variables. These operations provide no ordering guarantees,
|
|
and further do not guarantee "atomic" access. For example, the compiler
|
|
might (and sometimes does) split a plain C-language store into multiple
|
|
smaller stores. A load from that same variable running on some other
|
|
CPU while such a store is executing might see a value that is a mashup
|
|
of the old value and the new value.
|
|
|
|
Unmarked C-language accesses are unordered, and are also subject to
|
|
any number of compiler optimizations, many of which can break your
|
|
concurrent code. It is possible to used unmarked C-language accesses for
|
|
shared variables that are subject to concurrent access, but great care
|
|
is required on an ongoing basis. The compiler-constraining barrier()
|
|
primitive can be helpful, as can the various ordering primitives discussed
|
|
in this document. It nevertheless bears repeating that use of unmarked
|
|
C-language accesses requires careful attention to not just your code,
|
|
but to all the compilers that might be used to build it. Such compilers
|
|
might replace a series of loads with a single load, and might replace
|
|
a series of stores with a single store. Some compilers will even split
|
|
a single store into multiple smaller stores.
|
|
|
|
But there are some ways of using unmarked C-language accesses for shared
|
|
variables without such worries:
|
|
|
|
o Guard all accesses to a given variable by a particular lock,
|
|
so that there are never concurrent conflicting accesses to
|
|
that variable. (There are "conflicting accesses" when
|
|
(1) at least one of the concurrent accesses to a variable is an
|
|
unmarked C-language access and (2) when at least one of those
|
|
accesses is a write, whether marked or not.)
|
|
|
|
o As above, but using other synchronization primitives such
|
|
as reader-writer locks or sequence locks.
|
|
|
|
o Use locking or other means to ensure that all concurrent accesses
|
|
to a given variable are reads.
|
|
|
|
o Restrict use of a given variable to statistics or heuristics
|
|
where the occasional bogus value can be tolerated.
|
|
|
|
o Declare the accessed variables as C11 atomics.
|
|
https://lwn.net/Articles/691128/
|
|
|
|
o Declare the accessed variables as "volatile".
|
|
|
|
If you need to live more dangerously, please do take the time to
|
|
understand the compilers. One place to start is these two LWN
|
|
articles:
|
|
|
|
Who's afraid of a big bad optimizing compiler?
|
|
https://lwn.net/Articles/793253
|
|
Calibrating your fear of big bad optimizing compilers
|
|
https://lwn.net/Articles/799218
|
|
|
|
Used properly, unmarked C-language accesses can reduce overhead on
|
|
fastpaths. However, the price is great care and continual attention
|
|
to your compiler as new versions come out and as new optimizations
|
|
are enabled.
|