Using RCU (Read-Copy-Update) for synchronization
================================================

Read-copy update (RCU) is a synchronization mechanism that is used to
protect read-mostly data structures. RCU is very efficient and scalable
on the read side (it is wait-free), and thus can make the read paths
extremely fast.

RCU supports concurrency between a single writer and multiple readers,
thus it is not used alone. Typically, the write-side will use a lock to
serialize multiple updates, but other approaches are possible (e.g.,
restricting updates to a single task). In QEMU, when a lock is used,
this will often be the "iothread mutex", also known as the "big QEMU
lock" (BQL). Also, restricting updates to a single task is done in
QEMU using the "bottom half" API.

RCU is fundamentally a "wait-to-finish" mechanism. The read side marks
sections of code with "critical sections", and the update side will wait
for the execution of all *currently running* critical sections before
proceeding, or before asynchronously executing a callback.

The key point here is that only the currently running critical sections
are waited for; critical sections that are started **after** the beginning
of the wait do not extend the wait, despite running concurrently with
the updater. This is the reason why RCU is more scalable than,
for example, reader-writer locks. It is so much more scalable that
the system will have a single instance of the RCU mechanism; a single
mechanism can be used for an arbitrary number of "things", without
having to worry about things such as contention or deadlocks.

How is this possible? The basic idea is to split updates in two phases,
"removal" and "reclamation". During removal, we ensure that subsequent
readers will not be able to get a reference to the old data. After
removal has completed, a critical section will not be able to access
the old data. Therefore, critical sections that begin after removal
do not matter; as soon as all previous critical sections have finished,
there cannot be any readers who hold references to the data structure,
and these can now be safely reclaimed (e.g., freed or unref'ed).

Here is a picture::

        thread 1                thread 2                   thread 3
    -------------------    ------------------------    -------------------
    enter RCU crit.sec.
           |                finish removal phase
           |                begin wait
           |                      |                     enter RCU crit.sec.
    exit RCU crit.sec             |                            |
                            complete wait                      |
                            begin reclamation phase            |
                                                        exit RCU crit.sec.

Note how thread 3 is still executing its critical section when thread 2
starts reclaiming data. This is possible, because the old version of the
data structure was not accessible at the time thread 3 began executing
that critical section.

RCU API
-------

The core RCU API is small:

``void rcu_read_lock(void);``
    Used by a reader to inform the reclaimer that the reader is
    entering an RCU read-side critical section.

``void rcu_read_unlock(void);``
    Used by a reader to inform the reclaimer that the reader is
    exiting an RCU read-side critical section. Note that RCU
    read-side critical sections may be nested and/or overlapping.

``void synchronize_rcu(void);``
    Blocks until all pre-existing RCU read-side critical sections
    on all threads have completed. This marks the end of the removal
    phase and the beginning of the reclamation phase.

    Note that it would be valid for another update to come while
    ``synchronize_rcu`` is running. Because of this, it is better that
    the updater releases any locks it may hold before calling
    ``synchronize_rcu``. If this is not possible (for example, because
    the updater is protected by the BQL), you can use ``call_rcu``.

``void call_rcu1(struct rcu_head *head, void (*func)(struct rcu_head *head));``
    This function invokes ``func(head)`` after all pre-existing RCU
    read-side critical sections on all threads have completed. This
    marks the end of the removal phase, with ``func`` taking care
    asynchronously of the reclamation phase.

    The ``foo`` struct needs to have an ``rcu_head`` structure added,
    perhaps as follows::

        struct foo {
            struct rcu_head rcu;
            int a;
            char b;
            long c;
        };

    so that the reclaimer function can fetch the ``struct foo`` address
    and free it::

        void foo_reclaim(struct rcu_head *rp)
        {
            struct foo *fp = container_of(rp, struct foo, rcu);
            g_free(fp);
        }

        call_rcu1(&foo.rcu, foo_reclaim);

    ``call_rcu1`` is typically used via either the ``call_rcu`` or
    ``g_free_rcu`` macros, which handle the common case where the
    ``rcu_head`` member is the first field of the struct.

``void call_rcu(T *p, void (*func)(T *p), field-name);``
    If the ``struct rcu_head`` is the first field in the struct, you can
    use this macro instead of ``call_rcu1``.

``void g_free_rcu(T *p, field-name);``
    This is a special-case version of ``call_rcu`` where the callback
    function is ``g_free``.

    In the example given in ``call_rcu1``, one could have written simply::

        g_free_rcu(&foo, rcu);

``typeof(*p) qatomic_rcu_read(p);``
    ``qatomic_rcu_read()`` is similar to ``qatomic_load_acquire()``, but
    it makes some assumptions on the code that calls it. This allows a
    more optimized implementation.

    ``qatomic_rcu_read`` assumes that whenever a single RCU critical
    section reads multiple shared data, these reads are either
    data-dependent or need no ordering. This is almost always the
    case when using RCU, because read-side critical sections typically
    navigate one or more pointers (the pointers that are changed on
    every update) until reaching a data structure of interest,
    and then read from there.

    RCU read-side critical sections must use ``qatomic_rcu_read()`` to
    read data, unless concurrent writes are prevented by another
    synchronization mechanism.

    Furthermore, RCU read-side critical sections should traverse the
    data structure in a single direction, opposite to the direction
    in which the updater initializes it.
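
    For instance, a reader typically fetches a published pointer and then
    dereferences it; the dependent reads need no extra barriers. A minimal
    sketch, assuming a hypothetical RCU-protected global ``struct node *root``::

        int read_value(void)
        {
            struct node *n;
            int val;

            rcu_read_lock();
            n = qatomic_rcu_read(&root);   /* fetch the currently published version */
            val = n->value;                /* data-dependent read; no ordering needed */
            rcu_read_unlock();
            return val;
        }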

``void qatomic_rcu_set(p, typeof(*p) v);``
    ``qatomic_rcu_set()`` is similar to ``qatomic_store_release()``,
    though it also makes assumptions on the code that calls it in
    order to allow a more optimized implementation.

    In particular, ``qatomic_rcu_set()`` suffices for synchronization
    with readers, if the updater never mutates a field within a
    data item that is already accessible to readers. This is the
    case when initializing a new copy of the RCU-protected data
    structure; just ensure that initialization of ``*p`` is carried out
    before ``qatomic_rcu_set()`` makes the data item visible to readers.
    If this rule is observed, writes will happen in the opposite
    order as reads in the RCU read-side critical sections (or there
    will be just one update), and there will be no need for another
    synchronization mechanism to coordinate the accesses.
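
    A minimal publication sketch, continuing the hypothetical ``struct node``
    example from above (the names are illustrative, not an existing API)::

        struct node *new_node = g_new0(struct node, 1);

        new_node->value = 42;               /* initialize the new copy first... */
        qatomic_rcu_set(&root, new_node);   /* ...then make it visible to readers */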

The following APIs must be used before RCU is used in a thread:

``void rcu_register_thread(void);``
    Mark a thread as taking part in the RCU mechanism. Such a thread
    will have to report quiescent points regularly, either manually
    or through the ``QemuCond``/``QemuSemaphore``/``QemuEvent`` APIs.

``void rcu_unregister_thread(void);``
    Mark a thread as not taking part anymore in the RCU mechanism.
    It is not a problem if such a thread reports quiescent points,
    either manually or by using the
    ``QemuCond``/``QemuSemaphore``/``QemuEvent`` APIs.

Note that these APIs are relatively heavyweight, and should **not** be
nested.
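
As a rough sketch, a thread that wants to use RCU registers itself once at
startup and unregisters before exiting (the worker function and its loop are
hypothetical)::

    static void *my_worker_thread(void *opaque)
    {
        rcu_register_thread();          /* before the first read-side critical section */

        while (!worker_should_stop()) {
            rcu_read_lock();
            /* ... read RCU-protected data ... */
            rcu_read_unlock();
        }

        rcu_unregister_thread();        /* before the thread exits */
        return NULL;
    }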

Convenience macros
------------------

Two macros are provided that automatically release the RCU read lock at the
end of the scope.

``RCU_READ_LOCK_GUARD()``
    Takes the lock and will release it at the end of the block it's
    used in.

``WITH_RCU_READ_LOCK_GUARD() { code }``
    Is used at the head of a block to protect the code within the block.

Note that a ``goto`` out of the guarded block will also drop the lock.
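
For example (a sketch; the surrounding functions and data are hypothetical)::

    int get_value(void)
    {
        RCU_READ_LOCK_GUARD();   /* dropped automatically when the function returns */
        return qatomic_rcu_read(&root)->value;
    }

    void show_value(void)
    {
        WITH_RCU_READ_LOCK_GUARD() {
            printf("%d\n", qatomic_rcu_read(&root)->value);
        }                        /* the lock is dropped when the block is left */
    }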

Differences with Linux
----------------------

- Waiting on a mutex is possible, though discouraged, within an RCU critical
  section. This is because spinlocks are rarely (if ever) used in userspace
  programming; not allowing this would prevent upgrading an RCU read-side
  critical section to become an updater.

- ``qatomic_rcu_read`` and ``qatomic_rcu_set`` replace ``rcu_dereference`` and
  ``rcu_assign_pointer``. They take a **pointer** to the variable being
  accessed, as the example below shows.

- ``call_rcu`` is a macro that has an extra argument (the name of the first
  field in the struct, which must be a struct ``rcu_head``), and expects the
  type of the callback's argument to be the type of the first argument.
  ``call_rcu1`` is the same as Linux's ``call_rcu``.
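
For example, where Linux code would write ``p = rcu_dereference(foo)`` and
``rcu_assign_pointer(foo, new)``, the QEMU equivalents pass the address of
``foo`` (a sketch; ``foo`` and ``new`` are placeholders)::

    p = qatomic_rcu_read(&foo);
    qatomic_rcu_set(&foo, new);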

RCU Patterns
------------

Many patterns using reader-writer locks translate directly to RCU, with
the advantages of higher scalability and deadlock immunity.

In general, RCU can be used whenever it is possible to create a new
"version" of a data structure every time the updater runs. This may
sound like a very strict restriction, however:

- the updater does not mean "everything that writes to a data structure",
  but rather "everything that involves a reclamation step". See the
  array example below.

- in some cases, creating a new version of a data structure may actually
  be very cheap. For example, modifying the "next" pointer of a singly
  linked list is effectively creating a new version of the list; see the
  sketch after this list.
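
As an illustration of the last point, here is a sketch of unlinking an element
from a hand-rolled singly linked list (this is not an existing QEMU list API;
``elem_free`` and the field names are hypothetical, with ``rcu`` being a
``struct rcu_head`` placed first in the element)::

    /* prev->next currently points to elem; readers traverse the list
     * with qatomic_rcu_read().
     */
    qatomic_rcu_set(&prev->next, elem->next);   /* removal: a new "version" of the list */
    call_rcu(elem, elem_free, rcu);             /* reclamation, after a grace period */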

Here are some frequently-used RCU idioms that are worth noting.


RCU list processing
^^^^^^^^^^^^^^^^^^^

TBD (not yet used in QEMU)

RCU reference counting
^^^^^^^^^^^^^^^^^^^^^^

Because grace periods are not allowed to complete while there is an RCU
read-side critical section in progress, the RCU read-side primitives
may be used as a restricted reference-counting mechanism. For example,
consider the following code fragment::

    rcu_read_lock();
    p = qatomic_rcu_read(&foo);
    /* do something with p. */
    rcu_read_unlock();

The RCU read-side critical section ensures that the value of ``p`` remains
valid until after the ``rcu_read_unlock()``. In some sense, it is acquiring
a reference to ``p`` that is later released when the critical section ends.
The write side looks simply like this (with appropriate locking)::

    qemu_mutex_lock(&foo_mutex);
    old = foo;
    qatomic_rcu_set(&foo, new);
    qemu_mutex_unlock(&foo_mutex);
    synchronize_rcu();
    free(old);

If the processing cannot be done purely within the critical section, it
is possible to combine this idiom with a "real" reference count::

    rcu_read_lock();
    p = qatomic_rcu_read(&foo);
    foo_ref(p);
    rcu_read_unlock();
    /* do something with p. */
    foo_unref(p);

The write side can be like this::

    qemu_mutex_lock(&foo_mutex);
    old = foo;
    qatomic_rcu_set(&foo, new);
    qemu_mutex_unlock(&foo_mutex);
    synchronize_rcu();
    foo_unref(old);

or with ``call_rcu``::

    qemu_mutex_lock(&foo_mutex);
    old = foo;
    qatomic_rcu_set(&foo, new);
    qemu_mutex_unlock(&foo_mutex);
    call_rcu(old, foo_unref, rcu);

In both cases, the write side only performs removal. Reclamation
happens when the last reference to a ``foo`` object is dropped.
Using ``synchronize_rcu()`` is undesirably expensive, because the
last reference may be dropped on the read side. Hence you can
use ``call_rcu()`` instead::

    void foo_unref(struct foo *p)
    {
        if (qatomic_fetch_dec(&p->refcount) == 1) {
            call_rcu(p, foo_destroy, rcu);
        }
    }

Note that the same idioms would be possible with reader/writer
locks::

    read_lock(&foo_rwlock);         write_mutex_lock(&foo_rwlock);
    p = foo;                        p = foo;
    /* do something with p. */      foo = new;
    read_unlock(&foo_rwlock);       write_mutex_unlock(&foo_rwlock);
                                    free(p);

    ------------------------------------------------------------------

    read_lock(&foo_rwlock);         write_mutex_lock(&foo_rwlock);
    p = foo;                        old = foo;
    foo_ref(p);                     foo = new;
    read_unlock(&foo_rwlock);       foo_unref(old);
    /* do something with p. */      write_mutex_unlock(&foo_rwlock);
    read_lock(&foo_rwlock);
    foo_unref(p);
    read_unlock(&foo_rwlock);

``foo_unref`` could use a mechanism such as bottom halves to move deallocation
out of the write-side critical section.

RCU resizable arrays
^^^^^^^^^^^^^^^^^^^^

Resizable arrays can be used with RCU. The expensive RCU synchronization
(or ``call_rcu``) only needs to take place when the array is resized.
The two items to take care of are:

- ensuring that the old version of the array is available between removal
  and reclamation;

- avoiding mismatches in the read side between the array data and the
  array size.

The first problem is avoided simply by not using ``realloc``. Instead,
each resize will allocate a new array and copy the old data into it.
The second problem would arise if the size and the data pointers were
two members of a larger struct::

    struct mystuff {
        ...
        int data_size;
        int data_alloc;
        T   *data;
        ...
    };

Instead, we store the size of the array with the array itself::

    struct arr {
        int size;
        int alloc;
        T   data[];
    };
    struct arr *global_array;

    read side:
        rcu_read_lock();
        struct arr *array = qatomic_rcu_read(&global_array);
        x = i < array->size ? array->data[i] : -1;
        rcu_read_unlock();
        return x;

    write side (running under a lock):
        if (global_array->size == global_array->alloc) {
            /* Creating a new version. */
            new_array = g_malloc(sizeof(struct arr) +
                                 global_array->alloc * 2 * sizeof(T));
            new_array->size = global_array->size;
            new_array->alloc = global_array->alloc * 2;
            memcpy(new_array->data, global_array->data,
                   global_array->alloc * sizeof(T));

            /* Removal phase. */
            old_array = global_array;
            qatomic_rcu_set(&global_array, new_array);
            synchronize_rcu();

            /* Reclamation phase. */
            g_free(old_array);
        }

References
----------

* The `Linux kernel RCU documentation <https://docs.kernel.org/RCU/>`__