.. _atomics-ref:

=========================
Atomic operations in QEMU
=========================

CPUs perform independent memory operations effectively in random order,
but this can be a problem for CPU-CPU interaction (including interactions
between QEMU and the guest). Multi-threaded programs use various tools
to instruct the compiler and the CPU to restrict the order to something
that is consistent with the expectations of the programmer.

The most basic tool is locking. Mutexes, condition variables and
semaphores are used in QEMU, and should be the default approach to
synchronization. Anything else is considerably harder, but it's
also justified more often than one would like; the most
performance-critical parts of QEMU in particular require a very low
level approach to concurrency, involving memory barriers and atomic
operations. The semantics of concurrent memory accesses are governed
by the C11 memory model.

QEMU provides a header, ``qemu/atomic.h``, which wraps C11 atomics to
provide better portability and a less verbose syntax. ``qemu/atomic.h``
provides macros that fall in three camps:

- compiler barriers: ``barrier()``;

- weak atomic access and manual memory barriers: ``qatomic_read()``,
  ``qatomic_set()``, ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
  ``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``,
  ``smp_mb__before_rmw()``, ``smp_mb__after_rmw()``;

- sequentially consistent atomic access: everything else.

In general, use of ``qemu/atomic.h`` should be wrapped with more easily
used data structures (e.g. the lock-free singly-linked list operations
``QSLIST_INSERT_HEAD_ATOMIC`` and ``QSLIST_MOVE_ATOMIC``) or synchronization
primitives (such as RCU, ``QemuEvent`` or ``QemuLockCnt``). Bare use of
atomic operations and memory barriers should be limited to inter-thread
checking of flags and documented thoroughly.

Compiler memory barrier
=======================

``barrier()`` prevents the compiler from moving the memory accesses on
either side of it to the other side. The compiler barrier has no direct
effect on the CPU, which may then reorder things however it wishes.

``barrier()`` is mostly used within ``qemu/atomic.h`` itself. On some
architectures, CPU guarantees are strong enough that blocking compiler
optimizations already ensures the correct order of execution. In this
case, ``qemu/atomic.h`` will reduce stronger memory barriers to simple
compiler barriers.

Still, ``barrier()`` can be useful when writing code that can be interrupted
by signal handlers.

Sequentially consistent atomic access
=====================================

Most of the operations in the ``qemu/atomic.h`` header ensure *sequential
consistency*, where "the result of any execution is the same as if the
operations of all the processors were executed in some sequential order,
and the operations of each individual processor appear in this sequence
in the order specified by its program".

``qemu/atomic.h`` provides the following set of atomic read-modify-write
operations::

    void qatomic_inc(ptr)
    void qatomic_dec(ptr)
    void qatomic_add(ptr, val)
    void qatomic_sub(ptr, val)
    void qatomic_and(ptr, val)
    void qatomic_or(ptr, val)

    typeof(*ptr) qatomic_fetch_inc(ptr)
    typeof(*ptr) qatomic_fetch_dec(ptr)
    typeof(*ptr) qatomic_fetch_add(ptr, val)
    typeof(*ptr) qatomic_fetch_sub(ptr, val)
    typeof(*ptr) qatomic_fetch_and(ptr, val)
    typeof(*ptr) qatomic_fetch_or(ptr, val)
    typeof(*ptr) qatomic_fetch_xor(ptr, val)
    typeof(*ptr) qatomic_fetch_inc_nonzero(ptr)
    typeof(*ptr) qatomic_xchg(ptr, val)
    typeof(*ptr) qatomic_cmpxchg(ptr, old, new)

all of which (except the ``void`` ones) return the old value of ``*ptr``.
These operations are polymorphic; they operate on any type that is as
wide as a pointer or smaller.

Similar operations return the new value of ``*ptr``::

    typeof(*ptr) qatomic_inc_fetch(ptr)
    typeof(*ptr) qatomic_dec_fetch(ptr)
    typeof(*ptr) qatomic_add_fetch(ptr, val)
    typeof(*ptr) qatomic_sub_fetch(ptr, val)
    typeof(*ptr) qatomic_and_fetch(ptr, val)
    typeof(*ptr) qatomic_or_fetch(ptr, val)
    typeof(*ptr) qatomic_xor_fetch(ptr, val)

``qemu/atomic.h`` also provides an optimized shortcut for
``qatomic_set`` followed by ``smp_mb``::

    void qatomic_set_mb(ptr, val)

Weak atomic access and manual memory barriers
=============================================

Compared to sequentially consistent atomic access, programming with
weaker consistency models can be considerably more complicated.
The only guarantees that you can rely upon in this case are:

- atomic accesses will not cause data races (and hence undefined behavior);
  ordinary accesses instead cause data races if they are concurrent with
  other accesses of which at least one is a write. In order to ensure this,
  the compiler will not optimize accesses out of existence, create unsolicited
  accesses, or perform other similar optimizations.

- acquire operations will appear to happen, with respect to the other
  components of the system, before all the LOAD or STORE operations
  specified afterwards.

- release operations will appear to happen, with respect to the other
  components of the system, after all the LOAD or STORE operations
  specified before.

- release operations will *synchronize with* acquire operations;
  see :ref:`acqrel` for a detailed explanation.

When using this model, variables are accessed with:

- ``qatomic_read()`` and ``qatomic_set()``; these prevent the compiler from
  optimizing accesses out of existence and creating unsolicited
  accesses, but do not otherwise impose any ordering on loads and
  stores: both the compiler and the processor are free to reorder
  them.

- ``qatomic_load_acquire()``, which guarantees the LOAD to appear to
  happen, with respect to the other components of the system,
  before all the LOAD or STORE operations specified afterwards.
  Operations coming before ``qatomic_load_acquire()`` can still be
  reordered after it.

- ``qatomic_store_release()``, which guarantees the STORE to appear to
  happen, with respect to the other components of the system,
  after all the LOAD or STORE operations specified before.
  Operations coming after ``qatomic_store_release()`` can still be
  reordered before it.

Restrictions to the ordering of accesses can also be specified
using the memory barrier macros: ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``.

Memory barriers control the order of references to shared memory.
They come in six kinds:

- ``smp_rmb()`` guarantees that all the LOAD operations specified before
  the barrier will appear to happen before all the LOAD operations
  specified after the barrier with respect to the other components of
  the system.

  In other words, ``smp_rmb()`` puts a partial ordering on loads, but is not
  required to have any effect on stores.

- ``smp_wmb()`` guarantees that all the STORE operations specified before
  the barrier will appear to happen before all the STORE operations
  specified after the barrier with respect to the other components of
  the system.

  In other words, ``smp_wmb()`` puts a partial ordering on stores, but is not
  required to have any effect on loads.

- ``smp_mb_acquire()`` guarantees that all the LOAD operations specified before
  the barrier will appear to happen before all the LOAD or STORE operations
  specified after the barrier with respect to the other components of
  the system.

- ``smp_mb_release()`` guarantees that all the STORE operations specified *after*
  the barrier will appear to happen after all the LOAD or STORE operations
  specified *before* the barrier with respect to the other components of
  the system.

- ``smp_mb()`` guarantees that all the LOAD and STORE operations specified
  before the barrier will appear to happen before all the LOAD and
  STORE operations specified after the barrier with respect to the other
  components of the system.

  ``smp_mb()`` puts a partial ordering on both loads and stores. It is
  stronger than both a read and a write memory barrier; it implies both
  ``smp_mb_acquire()`` and ``smp_mb_release()``, but it also prevents STOREs
  coming before the barrier from overtaking LOADs coming after the
  barrier and vice versa.

- ``smp_read_barrier_depends()`` is a weaker kind of read barrier. On
  most processors, whenever two loads are performed such that the
  second depends on the result of the first (e.g., the first load
  retrieves the address to which the second load will be directed),
  the processor will guarantee that the first LOAD will appear to happen
  before the second with respect to the other components of the system.
  Therefore, unlike ``smp_rmb()`` or ``qatomic_load_acquire()``,
  ``smp_read_barrier_depends()`` can be just a compiler barrier on
  weakly-ordered architectures such as Arm or PPC\ [#alpha]_.

  Note that the first load really has to have a *data* dependency and not
  a control dependency. If the address for the second load is dependent
  on the first load, but the dependency is through a conditional rather
  than actually loading the address itself, then it's a *control*
  dependency and a full read barrier or better is required.

.. [#alpha] The DEC Alpha is an exception, because ``smp_read_barrier_depends()``
    needs a processor barrier. On strongly-ordered architectures such
    as x86 or s390, ``smp_rmb()`` and ``qatomic_load_acquire()`` can
    also be compiler barriers only.

Memory barriers and ``qatomic_load_acquire``/``qatomic_store_release`` are
mostly used when a data structure has one thread that is always a writer
and one thread that is always a reader:

+----------------------------------+----------------------------------+
| thread 1                         | thread 2                         |
+==================================+==================================+
| ::                               | ::                               |
|                                  |                                  |
|   qatomic_store_release(&a, x);  |   y = qatomic_load_acquire(&b);  |
|   qatomic_store_release(&b, y);  |   x = qatomic_load_acquire(&a);  |
+----------------------------------+----------------------------------+

In this case, correctness is easy to check for using the "pairing"
trick that is explained below.

Sometimes, a thread is accessing many variables that are otherwise
unrelated to each other (for example because, apart from the current
thread, exactly one other thread will read or write each of these
variables). In this case, it is possible to "hoist" the barriers
outside a loop. For example:

+------------------------------------------+----------------------------------+
| before                                   | after                            |
+==========================================+==================================+
| ::                                       | ::                               |
|                                          |                                  |
|   n = 0;                                 |   n = 0;                         |
|   for (i = 0; i < 10; i++)               |   for (i = 0; i < 10; i++)       |
|     n += qatomic_load_acquire(&a[i]);    |     n += qatomic_read(&a[i]);    |
|                                          |   smp_mb_acquire();              |
+------------------------------------------+----------------------------------+
| ::                                       | ::                               |
|                                          |                                  |
|                                          |   smp_mb_release();              |
|   for (i = 0; i < 10; i++)               |   for (i = 0; i < 10; i++)       |
|     qatomic_store_release(&a[i], false); |     qatomic_set(&a[i], false);   |
+------------------------------------------+----------------------------------+

Splitting a loop can also be useful to reduce the number of barriers:

+------------------------------------------+----------------------------------+
| before                                   | after                            |
+==========================================+==================================+
| ::                                       | ::                               |
|                                          |                                  |
|   n = 0;                                 |   smp_mb_release();              |
|   for (i = 0; i < 10; i++) {             |   for (i = 0; i < 10; i++)       |
|     qatomic_store_release(&a[i], false); |     qatomic_set(&a[i], false);   |
|     smp_mb();                            |   smp_mb();                      |
|     n += qatomic_read(&b[i]);            |   n = 0;                         |
|   }                                      |   for (i = 0; i < 10; i++)       |
|                                          |     n += qatomic_read(&b[i]);    |
+------------------------------------------+----------------------------------+

In this case, the ``smp_mb_release()`` can also be replaced with a
(possibly cheaper, and clearer as well) ``smp_wmb()``:

+------------------------------------------+----------------------------------+
| before                                   | after                            |
+==========================================+==================================+
| ::                                       | ::                               |
|                                          |                                  |
|                                          |   smp_mb_release();              |
|   for (i = 0; i < 10; i++) {             |   for (i = 0; i < 10; i++)       |
|     qatomic_store_release(&a[i], false); |     qatomic_set(&a[i], false);   |
|     qatomic_store_release(&b[i], false); |   smp_wmb();                     |
|   }                                      |   for (i = 0; i < 10; i++)       |
|                                          |     qatomic_set(&b[i], false);   |
+------------------------------------------+----------------------------------+

.. _acqrel:

Acquire/release pairing and the *synchronizes-with* relation
------------------------------------------------------------

Atomic operations other than ``qatomic_set()`` and ``qatomic_read()`` have
either *acquire* or *release* semantics\ [#rmw]_. This has two effects:

.. [#rmw] Read-modify-write operations can have both---acquire applies to the
    read part, and release to the write.

- within a thread, they are ordered either before subsequent operations
  (for acquire) or after previous operations (for release).

- if a release operation in one thread *synchronizes with* an acquire operation
  in another thread, the ordering constraints propagate from the first to the
  second thread. That is, everything before the release operation in the
  first thread is guaranteed to *happen before* everything after the
  acquire operation in the second thread.

The concept of acquire and release semantics is not exclusive to atomic
operations; almost all higher-level synchronization primitives also have
acquire or release semantics. For example:

- ``pthread_mutex_lock`` has acquire semantics, ``pthread_mutex_unlock`` has
  release semantics and synchronizes with a ``pthread_mutex_lock`` for the
  same mutex.

- ``pthread_cond_signal`` and ``pthread_cond_broadcast`` have release semantics;
  ``pthread_cond_wait`` has both release semantics (synchronizing with
  ``pthread_mutex_lock``) and acquire semantics (synchronizing with
  ``pthread_mutex_unlock`` and signaling of the condition variable).

- ``pthread_create`` has release semantics and synchronizes with the start
  of the new thread; ``pthread_join`` has acquire semantics and synchronizes
  with the exiting of the thread.

- ``qemu_event_set`` has release semantics, ``qemu_event_wait`` has
  acquire semantics.

For example, in the following code there are no atomic accesses, but still
thread 2 is relying on the *synchronizes-with* relation between ``pthread_exit``
(release) and ``pthread_join`` (acquire):

+----------------------+-------------------------------+
| thread 1             | thread 2                      |
+======================+===============================+
| ::                   | ::                            |
|                      |                               |
|   *a = 1;            |                               |
|   pthread_exit(a);   |   pthread_join(thread1, &a);  |
|                      |   x = *a;                     |
+----------------------+-------------------------------+

Synchronization between threads basically descends from this pairing of
a release operation and an acquire operation. Therefore, atomic operations
other than ``qatomic_set()`` and ``qatomic_read()`` will almost always be
paired with another operation of the opposite kind: an acquire operation
will pair with a release operation and vice versa. This rule of thumb is
extremely useful; in the case of QEMU, however, note that the other
operation may actually be in a driver that runs in the guest!

``smp_read_barrier_depends()``, ``smp_rmb()``, ``smp_mb_acquire()``,
``qatomic_load_acquire()`` and ``qatomic_rcu_read()`` all count
as acquire operations. ``smp_wmb()``, ``smp_mb_release()``,
``qatomic_store_release()`` and ``qatomic_rcu_set()`` all count as release
operations. ``smp_mb()`` counts as both acquire and release, therefore
it can pair with any other atomic operation. Here is an example:

+------------------------+------------------------------+
| thread 1               | thread 2                     |
+========================+==============================+
| ::                     | ::                           |
|                        |                              |
|   qatomic_set(&a, 1);  |                              |
|   smp_wmb();           |                              |
|   qatomic_set(&b, 2);  |   x = qatomic_read(&b);      |
|                        |   smp_rmb();                 |
|                        |   y = qatomic_read(&a);      |
+------------------------+------------------------------+

Note that a load-store pair only counts if the two operations access the
same variable: that is, a store-release on a variable ``x`` *synchronizes
with* a load-acquire on a variable ``x``, while a release barrier
synchronizes with any acquire operation. The following example shows
correct synchronization:

+--------------------------------+--------------------------------+
| thread 1                       | thread 2                       |
+================================+================================+
| ::                             | ::                             |
|                                |                                |
|   qatomic_set(&a, 1);          |                                |
|   qatomic_store_release(&b, 2);|   x = qatomic_load_acquire(&b);|
|                                |   y = qatomic_read(&a);        |
+--------------------------------+--------------------------------+

Acquire and release semantics of higher-level primitives can also be
relied upon for the purpose of establishing the *synchronizes with*
relation.

Note that the "writing" thread is accessing the variables in the
opposite order as the "reading" thread. This is expected: stores
before a release operation will normally match the loads after
the acquire operation, and vice versa. In fact, this happened already
in the ``pthread_exit``/``pthread_join`` example above.

Finally, this more complex example has more than two accesses and data
dependency barriers. It also does not use atomic accesses whenever there
cannot be a data race:

+------------------------+--------------------------------+
| thread 1               | thread 2                       |
+========================+================================+
| ::                     | ::                             |
|                        |                                |
|   b[2] = 1;            |                                |
|   smp_wmb();           |                                |
|   x->i = 2;            |                                |
|   smp_wmb();           |                                |
|   qatomic_set(&a, x);  |   x = qatomic_read(&a);        |
|                        |   smp_read_barrier_depends();  |
|                        |   y = x->i;                    |
|                        |   smp_read_barrier_depends();  |
|                        |   z = b[y];                    |
+------------------------+--------------------------------+

Comparison with Linux kernel primitives
=======================================

Here is a list of differences between Linux kernel atomic operations
and memory barriers, and the equivalents in QEMU:

- atomic operations in Linux are always on a 32-bit int type and
  use a boxed ``atomic_t`` type; atomic operations in QEMU are polymorphic
  and use normal C types.

- Originally, ``atomic_read`` and ``atomic_set`` in Linux gave no guarantee
  at all. Linux 4.1 updated them to implement volatile
  semantics via ``ACCESS_ONCE`` (or the more recent ``READ_ONCE``/``WRITE_ONCE``).

  QEMU's ``qatomic_read`` and ``qatomic_set`` implement C11 atomic relaxed
  semantics if the compiler supports it, and volatile semantics otherwise.
  Both semantics prevent the compiler from doing certain transformations;
  the difference is that atomic accesses are guaranteed to be atomic,
  while volatile accesses aren't. Thus, in the volatile case we just cross
  our fingers hoping that the compiler will generate atomic accesses,
  since we assume the variables passed are machine-word sized and
  properly aligned.

  No barriers are implied by ``qatomic_read`` and ``qatomic_set`` in either
  Linux or QEMU.

- atomic read-modify-write operations in Linux are of three kinds:

  ===================== =========================================
  ``atomic_OP``         returns void
  ``atomic_OP_return``  returns new value of the variable
  ``atomic_fetch_OP``   returns the old value of the variable
  ``atomic_cmpxchg``    returns the old value of the variable
  ===================== =========================================

  In QEMU, the second kind is named ``qatomic_OP_fetch``.

- different atomic read-modify-write operations in Linux imply
  a different set of memory barriers. In QEMU, all of them enforce
  sequential consistency: there is a single order in which the
  program sees them happen.

  However, according to the C11 memory model that QEMU uses, this order
  does not propagate to other memory accesses on either side of the
  read-modify-write operation. As far as those are concerned, the
  operation consists of just a load-acquire followed by a store-release.
  Stores that precede the RMW operation, and loads that follow it, can
  still be reordered and will happen *in the middle* of the read-modify-write
  operation!

  Therefore, the following example is correct in Linux but not in QEMU:

  +----------------------------------+--------------------------------+
  | Linux (correct)                  | QEMU (incorrect)               |
  +==================================+================================+
  | ::                               | ::                             |
  |                                  |                                |
  |   a = atomic_fetch_add(&x, 2);   |   a = qatomic_fetch_add(&x, 2);|
  |   b = READ_ONCE(&y);             |   b = qatomic_read(&y);        |
  +----------------------------------+--------------------------------+

  because the read of ``y`` can be moved (by either the processor or the
  compiler) before the write of ``x``.

  Fixing this requires a full memory barrier between the write of ``x`` and
  the read of ``y``. QEMU provides ``smp_mb__before_rmw()`` and
  ``smp_mb__after_rmw()``; they act both as an optimization,
  avoiding the memory barrier on processors where it is unnecessary,
  and as a clarification of this corner case of the C11 memory model:

  +--------------------------------+
  | QEMU (correct)                 |
  +================================+
  | ::                             |
  |                                |
  |   a = qatomic_fetch_add(&x, 2);|
  |   smp_mb__after_rmw();         |
  |   b = qatomic_read(&y);        |
  +--------------------------------+

  In the common case where only one thread writes ``x``, it is also possible
  to write it like this:

  +--------------------------------+
  | QEMU (correct)                 |
  +================================+
  | ::                             |
  |                                |
  |   a = qatomic_read(&x);        |
  |   qatomic_set_mb(&x, a + 2);   |
  |   b = qatomic_read(&y);        |
  +--------------------------------+

Sources
=======

- ``Documentation/memory-barriers.txt`` from the Linux kernel