@@ -0,0 +1,350 @@
+Copyright (c) 2015-2016 Linaro Ltd.
+
+This work is licensed under the terms of the GNU GPL, version 2 or
+later. See the COPYING file in the top-level directory.
+
+Introduction
+============
+
+This document outlines the design for multi-threaded TCG system-mode
+emulation. The current user-mode emulation mirrors the thread
+structure of the translated executable. Some of the work will be
+applicable to both system and linux-user emulation.
+
+The original system-mode TCG implementation was single threaded and
+dealt with multiple CPUs with simple round-robin scheduling. This
+simplified a lot of things but became increasingly limited as systems
+being emulated gained additional cores and per-core performance gains
+for host systems started to level off.
+
+vCPU Scheduling
+===============
+
+We introduce a new running mode where each vCPU will run on its own
+user-space thread. This will be enabled by default for all FE/BE
+combinations that have had the required work done to support this
+safely.
+
+In the general case of running translated code there should be no
+inter-vCPU dependencies and all vCPUs should be able to run at full
+speed. Synchronisation will only be required while accessing internal
+shared data structures or when the emulated architecture requires a
+coherent representation of the emulated machine state.
+
+Shared Data Structures
+======================
+
+Main Run Loop
+-------------
+
+Even when there is no code being generated there are a number of
+structures associated with the hot-path through the main run-loop.
+These are associated with looking up the next translation block to
+execute. These include:
+
+  tb_jmp_cache (per-vCPU, cache of recent jumps)
+  tb_ctx.htable (global hash table, phys address->tb lookup)
+
+As TB linking only occurs when blocks are in the same page, this code
+is critical to performance: looking up the next TB to execute is the
+most common reason to exit the generated code.
+
+DESIGN REQUIREMENT: Make access to lookup structures safe with
+multiple reader/writer threads. Minimise any lock contention to do it.
+
+The hot-path avoids using locks where possible. The tb_jmp_cache is
+updated with atomic accesses to ensure consistent results. The
+fall-back QHT-based hash table is also designed for lockless lookups.
+Locks are only taken when code generation is required or
+TranslationBlocks have their block-to-block jumps patched.
+
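+A rough sketch of the lookup order is shown below. The calls are
+simplified (the real lookups and tb_gen_code() also take cs_base,
+flags and cflags parameters) but the structure mirrors the code: a
+lockless per-vCPU cache probe, a lockless QHT probe, and only then a
+locked call into the code generator.
+
+    /* Illustrative sketch, not the exact QEMU implementation. */
+    static TranslationBlock *lookup_tb_for_pc(CPUState *cpu, target_ulong pc)
+    {
+        unsigned int h = tb_jmp_cache_hash_func(pc);
+        /* 1. Lockless probe of the per-vCPU jump cache. */
+        TranslationBlock *tb = atomic_read(&cpu->tb_jmp_cache[h]);
+
+        if (tb == NULL || tb->pc != pc) {
+            /* 2. Lockless probe of the global QHT hash table. */
+            tb = tb_htable_lookup(cpu, pc);
+            if (tb == NULL) {
+                /* 3. Slow path: serialise code generation. */
+                tb_lock();
+                tb = tb_gen_code(cpu, pc);
+                tb_unlock();
+            }
+            atomic_set(&cpu->tb_jmp_cache[h], tb);
+        }
+        return tb;
+    }
+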
+Global TCG State
+----------------
+
+We need to protect the entire code generation cycle including any
+post-generation patching of the translated code. This also implies a
+shared translation buffer which contains code running on all cores.
+Any execution path that comes to the main run loop will need to hold a
+mutex for code generation. This also includes times when we need to
+flush code or entries from any shared lookups/caches. Structures held
+on a per-vCPU basis won't need locking unless other vCPUs will need to
+modify them.
+
+DESIGN REQUIREMENT: Add locking around all code generation and TB
+patching.
+
+(Current solution)
+
+Mainly as part of the linux-user work all code generation is
+serialised with a tb_lock(). For SoftMMU, tb_lock() also takes the
+place of mmap_lock() in linux-user.
+
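+The locking pattern around generation is roughly the following sketch
+(simplified; the real tb_gen_code() also handles translation buffer
+exhaustion by triggering a flush and retrying):
+
+    /* Sketch of the serialised code generation slow path. */
+    tb_lock();
+    tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
+    /* ...insert into tb_ctx.htable, patch block-to-block jumps... */
+    tb_unlock();
+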
+Translation Blocks
+------------------
+
+Currently the whole system shares a single code generation buffer
+which, when full, will force a flush of all translations and start
+from scratch again. Some operations also force a full flush of
+translations, including:
+
+ - debugging operations (breakpoint insertion/removal)
+ - some CPU helper functions
+
+This is done with the async_safe_run_on_cpu() mechanism to ensure all
+vCPUs are quiescent when changes are being made to shared global
+structures.
+
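+As a rough sketch (the real code also passes the current flush count
+so that stale requests can be detected and dropped), queuing a global
+flush as "safe work" looks like:
+
+    /* Sketch: do_tb_flush() runs once all vCPUs are quiescent. */
+    static void do_tb_flush(CPUState *cpu, run_on_cpu_data data)
+    {
+        /* ...reset the code generation buffer, clear the lookup
+         * structures and restart translation from scratch... */
+    }
+
+    void tb_flush(CPUState *cpu)
+    {
+        async_safe_run_on_cpu(cpu, do_tb_flush, RUN_ON_CPU_NULL);
+    }
+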
+More granular translation invalidation events are typically due
+to a change of the state of a physical page:
+
+ - code modification (self-modifying code, patching code)
+ - page changes (new page mapping in linux-user mode)
+
+While setting the invalid flag in a TranslationBlock will stop it
+being used when looked up in the hot-path there are a number of other
+book-keeping structures that need to be safely cleared.
+
+Any TranslationBlocks which have been patched to jump directly to a
+now invalid block need that patching reversed so they will return
+to the C code.
+
+There are a number of look-up caches that need to be properly updated
+including:
+
+ - the jump lookup cache
+ - the physical-to-tb lookup hash table
+ - the global page table
+
+The global page table (l1_map) provides a multi-level look-up for
+PageDesc structures, which contain pointers to the start of a linked
+list of all TranslationBlocks in that page (see page_next).
+
+Both the jump patching and the page cache involve linked lists that
+the invalidated TranslationBlock needs to be removed from.
+
+DESIGN REQUIREMENT: Safely handle invalidation of TBs
+ - safely patch/revert direct jumps
+ - remove central PageDesc lookup entries
+ - ensure lookup caches/hashes are safely updated
+
+(Current solution)
+
+The direct jumps themselves are updated atomically by the TCG
+tb_set_jmp_target() code. Modifications to the linked lists that allow
+searching for linked pages are done under the protection of the
+tb_lock().
+
+The global page table is protected by the tb_lock() in system-mode and
+the mmap_lock() in linux-user mode.
+
+The lookup caches are updated atomically and the lookup hash uses QHT,
+which is designed for concurrent safe lookup.
+
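+A sketch of the clean-up performed while invalidating a TB is shown
+below. It is simplified from the shape of tb_phys_invalidate(): the
+hash calculations are elided, the invalid-flag field is approximate,
+and the PageDesc/jump-list walking is reduced to a comment.
+
+    /* Sketch: called with tb_lock() held. */
+    tb->invalid = true;                        /* stop hot-path reuse */
+    qht_remove(&tb_ctx.htable, tb, phys_hash); /* drop the QHT entry */
+    CPU_FOREACH(cpu) {
+        /* clear any stale per-vCPU jump cache entries atomically */
+        if (atomic_read(&cpu->tb_jmp_cache[h]) == tb) {
+            atomic_set(&cpu->tb_jmp_cache[h], NULL);
+        }
+    }
+    /* ...unlink from the PageDesc lists and revert any direct jumps
+     * that point into this block... */
+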
+
+Memory maps and TLBs
+--------------------
+
+The memory handling code is fairly critical to the speed of memory
+access in the emulated system. The SoftMMU code is designed so the
+hot-path can be handled entirely within translated code. This is
+handled with a per-vCPU TLB structure which, once populated, will
+allow a series of accesses to the page to occur without exiting the
+translated code. It is possible to set flags in the TLB address which
+will ensure the slow-path is taken for each access (see the sketch
+after the list below). This can be done to support:
+
+ - Memory regions (dividing up access to PIO, MMIO and RAM)
+ - Dirty page tracking (for code gen, SMC detection, migration and display)
+ - Virtual TLB (for translating guest address->real address)
+
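+A rough sketch of the effect of those flag bits is shown below. The
+helper name is hypothetical; TLB_MMIO, TLB_NOTDIRTY, TLB_INVALID_MASK
+and TARGET_PAGE_MASK are the definitions used by the SoftMMU code.
+
+    /* Sketch: can this access be completed directly against host RAM? */
+    static inline bool tlb_entry_is_plain_ram(target_ulong tlb_addr)
+    {
+        /* Any flag bit (TLB_MMIO, TLB_NOTDIRTY, TLB_INVALID_MASK, ...)
+         * left in the low bits of the entry forces the slow path,
+         * even though the address translation itself is present. */
+        return (tlb_addr & ~TARGET_PAGE_MASK) == 0;
+    }
+
+The generated fast path gets the same effect implicitly: its address
+comparison includes the low bits of the TLB entry, so a set flag makes
+the compare fail and control drops back into the C helpers.
+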
+When the TLB tables are updated by a vCPU thread other than its own
+we need to ensure it is done in a safe way so no inconsistent state is
+seen by the vCPU thread.
+
+Some operations require updating a number of vCPUs' TLBs at the same
+time in a synchronised manner.
+
+DESIGN REQUIREMENTS:
+
+ - TLB Flush All/Page
+   - can be across-vCPUs
+   - cross-vCPU TLB flush may need other vCPUs brought to a halt
+   - change may need to be visible to the calling vCPU immediately
+ - TLB Flag Update
+   - usually cross-vCPU
+   - want change to be visible as soon as possible
+ - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
+   - This is a per-vCPU table - by definition can't race
+   - updated by its own thread when the slow-path is forced
+
+(Current solution)
+
+We have updated cputlb.c to defer cross-vCPU operations with
+async_run_on_cpu(), which ensures each vCPU sees a coherent state when
+it next runs its work (in a few instructions' time).
+
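+The deferral pattern is roughly the sketch below. tlb_flush_work() and
+tlb_flush_other() are illustrative names rather than the cputlb.c API;
+the real code also avoids queuing a flush while one is still pending.
+
+    /* Sketch: flush another vCPU's TLB by queuing work on it rather
+     * than touching its tables directly from this thread. */
+    static void tlb_flush_work(CPUState *cpu, run_on_cpu_data data)
+    {
+        tlb_flush(cpu);   /* runs in the target vCPU's own thread */
+    }
+
+    static void tlb_flush_other(CPUState *other_cpu)
+    {
+        if (qemu_cpu_is_self(other_cpu)) {
+            tlb_flush(other_cpu);
+        } else {
+            async_run_on_cpu(other_cpu, tlb_flush_work, RUN_ON_CPU_NULL);
+        }
+    }
+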
+A new set of operations (tlb_flush_*_all_cpus) takes an additional
+flag which, when set, will force synchronisation by setting the source
+vCPU's work as "safe work" and exiting the cpu run loop. This ensures
+that by the time execution restarts all flush operations have
+completed.
+
+TLB flag updates are all done atomically and are also protected by the
+tb_lock() which is used by the functions that update the TLB in bulk.
+
+(Known limitation)
+
+Not really a limitation but the wait mechanism is overly strict for
+some architectures which only need flushes completed by a barrier
+instruction. This could be a future optimisation.
+
+Emulated hardware state
+-----------------------
+
+Currently, thanks to KVM work, any access to IO memory is
+automatically protected by the global iothread mutex, also known as
+the BQL (Big QEMU Lock). Any IO region that doesn't use the global
+mutex is expected to do its own locking.
+
+However IO memory isn't the only way emulated hardware state can be
+modified. Some architectures have model specific registers that
+trigger hardware emulation features. Generally any translation helper
+that needs to update more than a single vCPU's state should take the
+BQL.
+
+As the BQL, or global iothread mutex, is shared across the system we
+push the use of the lock as far down into the TCG code as possible to
+minimise contention.
+
+(Current solution)
+
+MMIO access automatically serialises hardware emulation by way of the
+BQL. Currently ARM targets serialise all ARM_CP_IO register accesses
+and also defer the reset/startup of vCPUs to the vCPU context by way
+of async_run_on_cpu().
+
+Updates to interrupt state are also protected by the BQL as they can
+often be cross-vCPU.
+
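+For a helper that touches cross-vCPU hardware state the pattern is
+roughly the sketch below. The helper name is illustrative;
+qemu_mutex_lock_iothread()/qemu_mutex_unlock_iothread() are the BQL
+entry points.
+
+    /* Sketch: a translation helper that updates shared hardware
+     * state takes the BQL for the duration of the update. */
+    void HELPER(set_some_device_state)(CPUArchState *env, uint64_t val)
+    {
+        qemu_mutex_lock_iothread();
+        /* ...update device/interrupt state visible to other vCPUs... */
+        qemu_mutex_unlock_iothread();
+    }
+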
+Memory Consistency
+==================
+
+Between emulated guests and host systems there are a range of memory
+consistency models. Even emulating weakly ordered systems on strongly
+ordered hosts needs to ensure things like store-after-load re-ordering
+can be prevented when the guest wants to.
+
+Memory Barriers
+---------------
+
+Barriers (sometimes known as fences) provide a mechanism for software
+to enforce a particular ordering of memory operations from the point
+of view of external observers (e.g. another processor core). They can
+apply to all memory operations or be restricted to just loads or
+stores.
+
+The Linux kernel has an excellent write-up on the various forms of
+memory barrier and the guarantees they can provide [1].
+
+Barriers are often wrapped around synchronisation primitives to
+provide explicit memory ordering semantics. However they can be used
+by themselves to provide safe lockless access by ensuring, for
+example, that a change to a signal flag will only be visible once the
+changes to the payload are.
+
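+As a minimal standalone illustration of that flag/payload pattern
+(written with C11 atomics, which provide the same release/acquire
+semantics the barriers are used for):
+
+    #include <stdatomic.h>
+
+    static int payload;                 /* plain data */
+    static atomic_bool ready;           /* signal flag */
+
+    void producer(void)
+    {
+        payload = 42;
+        /* release: the payload store is visible before the flag is */
+        atomic_store_explicit(&ready, true, memory_order_release);
+    }
+
+    int consumer(void)
+    {
+        /* acquire: once the flag reads true, the payload is visible */
+        while (!atomic_load_explicit(&ready, memory_order_acquire)) {
+            /* spin */
+        }
+        return payload;
+    }
+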
+DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
+
+This would enforce a strong load/store ordering so all loads/stores
+complete at the memory barrier. On single-core non-SMP strongly
+ordered backends this could become a NOP.
+
+Aside from explicit standalone memory barrier instructions there are
+also implicit memory ordering semantics which come with each guest
+memory access instruction. For example all x86 load/stores come with
+fairly strong guarantees of sequential consistency whereas ARM has
+special variants of load/store instructions that imply acquire/release
+semantics.
+
+In the case of a strongly ordered guest architecture being emulated on
+a weakly ordered host the scope for a heavy performance impact is
+quite high.
+
+DESIGN REQUIREMENTS: Be efficient with use of memory barriers
+ - host systems with stronger implied guarantees can skip some barriers
+ - merge consecutive barriers to the strongest one
+
+(Current solution)
+
+The system currently has a tcg_gen_mb() which will add memory barrier
+operations if code generation is being done in a parallel context. The
+tcg_optimize() function attempts to merge barriers up to their
+strongest form before any load/store operations. The solution was
+originally developed and tested for linux-user based systems. All
+backends have been converted to emit fences when required. So far the
+following front-ends have been updated to emit fences (a sketch of how
+a front-end does this follows the list):
+
+ - target-i386
+ - target-arm
+ - target-aarch64
+ - target-alpha
+ - target-mips
+
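+As a sketch of the front-end side, translating a guest full-barrier
+instruction boils down to a single call; the exact TCG_MO_*/TCG_BAR_*
+combination depends on which guest instruction is being translated.
+
+    /* Sketch: translating a guest full-barrier instruction (e.g. an
+     * ARM DMB or x86 MFENCE) in a front-end decoder. */
+    static void gen_full_barrier(void)
+    {
+        /* Order all earlier loads/stores against all later ones. The
+         * resulting op can be merged with neighbouring barriers by
+         * tcg_optimize() and elided where the host ordering already
+         * provides the guarantee. */
+        tcg_gen_mb(TCG_MO_ALL | TCG_BAR_SC);
+    }
+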
+Memory Control and Maintenance
+------------------------------
+
+This includes a class of instructions for controlling system cache
+behaviour. While QEMU doesn't model cache behaviour these instructions
+are often seen when code modification has taken place to ensure the
+changes take effect.
+
+Synchronisation Primitives
+--------------------------
+
+There are two broad types of synchronisation primitives found in
+modern ISAs: atomic instructions and exclusive regions.
+
+The first type offer a simple atomic instruction which will guarantee
+some sort of test and conditional store will be truly atomic w.r.t.
+other cores sharing access to the memory. The classic example is the
+x86 cmpxchg instruction.
+
+The second type offer a pair of load/store instructions which offer a
+guarantee that a region of memory has not been touched between the
+load and store instructions. An example of this is ARM's ldrex/strex
+pair, where the strex instruction will return a flag indicating a
+successful store only if no other CPU has accessed the memory region
+since the ldrex.
+
+Traditionally TCG has generated a series of operations that work
+because they are within the context of a single translation block so
+will have completed before another CPU is scheduled. However with
+the ability to have multiple threads running to emulate multiple CPUs
+we will need to explicitly expose these semantics.
+
+DESIGN REQUIREMENTS:
+ - Support classic atomic instructions
+ - Support load/store exclusive (or load link/store conditional) pairs
+ - Generic enough infrastructure to support all guest architectures
+
+CURRENT OPEN QUESTIONS:
+ - How problematic is the ABA problem in general?
+
+(Current solution)
+
+The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
+can be used directly or combined to emulate other instructions like
+ARM's ldrex/strex instructions. While they are susceptible to the ABA
+problem, so far common guests have not implemented patterns where this
+may be a problem - they typically present a locking ABI which assumes
+cmpxchg-like semantics.
+
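+A minimal standalone sketch of emulating an exclusive pair with a
+cmpxchg-style operation is shown below. It mirrors the approach rather
+than the actual QEMU code, and makes the ABA exposure explicit.
+
+    #include <stdatomic.h>
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    /* Per-vCPU record of the last exclusive load. */
+    struct excl_state { _Atomic uint32_t *addr; uint32_t val; };
+
+    uint32_t load_exclusive(struct excl_state *s, _Atomic uint32_t *addr)
+    {
+        s->addr = addr;
+        s->val = atomic_load(addr);    /* remember address and value */
+        return s->val;
+    }
+
+    /* Returns true if the store "succeeded". Note the ABA exposure:
+     * if another thread changes *addr and then changes it back, the
+     * compare still matches even though the region was touched. */
+    bool store_exclusive(struct excl_state *s, _Atomic uint32_t *addr,
+                         uint32_t newval)
+    {
+        uint32_t expected = s->val;
+        return s->addr == addr &&
+               atomic_compare_exchange_strong(addr, &expected, newval);
+    }
+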
+The code also includes a fall-back for cases where multi-threaded TCG
+ops can't work (e.g. guest atomic width > host atomic width). In this
+case an EXCP_ATOMIC exit occurs and the instruction is emulated with
+an exclusive lock which ensures all emulation is serialised.
+
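+The shape of that fall-back is roughly the following;
+start_exclusive() and end_exclusive() are the existing primitives for
+bringing the other vCPUs to a halt, and the re-translation and
+exception plumbing are reduced to a comment.
+
+    /* Sketch: serialise one instruction's worth of emulation after an
+     * EXCP_ATOMIC exit from the main run loop. */
+    start_exclusive();          /* bring all other vCPUs to a halt */
+    /* ...re-translate and execute the offending instruction with the
+     * parallel flag cleared, so plain host operations are safe... */
+    end_exclusive();            /* let the other vCPUs resume */
+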
+While the atomic helpers look good enough for now there may be a need
+to look at solutions that can more closely model the guest
+architecture's semantics.
+
+==========
+
+[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt