multi-thread-tcg.rst 14 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369
  1. ..
  2. Copyright (c) 2015-2020 Linaro Ltd.
  3. This work is licensed under the terms of the GNU GPL, version 2 or
  4. later. See the COPYING file in the top-level directory.
  5. .. _mttcg:
  6. ==================
  7. Multi-threaded TCG
  8. ==================
  9. This document outlines the design for multi-threaded TCG (a.k.a MTTCG)
  10. system-mode emulation. user-mode emulation has always mirrored the
  11. thread structure of the translated executable although some of the
  12. changes done for MTTCG system emulation have improved the stability of
  13. linux-user emulation.
  14. The original system-mode TCG implementation was single threaded and
  15. dealt with multiple CPUs with simple round-robin scheduling. This
  16. simplified a lot of things but became increasingly limited as systems
  17. being emulated gained additional cores and per-core performance gains
  18. for host systems started to level off.
  19. vCPU Scheduling
  20. ===============
  21. We introduce a new running mode where each vCPU will run on its own
  22. user-space thread. This is enabled by default for all FE/BE
  23. combinations where the host memory model is able to accommodate the
  24. guest (TCG_GUEST_DEFAULT_MO & ~TCG_TARGET_DEFAULT_MO is zero) and the
  25. guest has had the required work done to support this safely
  26. (TARGET_SUPPORTS_MTTCG).
  27. System emulation will fall back to the original round robin approach
  28. if:
  29. * forced by --accel tcg,thread=single
  30. * enabling --icount mode
  31. In the general case of running translated code there should be no
  32. inter-vCPU dependencies and all vCPUs should be able to run at full
  33. speed. Synchronisation will only be required while accessing internal
  34. shared data structures or when the emulated architecture requires a
  35. coherent representation of the emulated machine state.
  36. Shared Data Structures
  37. ======================
  38. Main Run Loop
  39. -------------
  40. Even when there is no code being generated there are a number of
  41. structures associated with the hot-path through the main run-loop.
  42. These are associated with looking up the next translation block to
  43. execute. These include:
  44. tb_jmp_cache (per-vCPU, cache of recent jumps)
  45. tb_ctx.htable (global hash table, phys address->tb lookup)
  46. As TB linking only occurs when blocks are in the same page this code
  47. is critical to performance as looking up the next TB to execute is the
  48. most common reason to exit the generated code.
  49. DESIGN REQUIREMENT: Make access to lookup structures safe with
  50. multiple reader/writer threads. Minimise any lock contention to do it.
  51. The hot-path avoids using locks where possible. The tb_jmp_cache is
  52. updated with atomic accesses to ensure consistent results. The fall
  53. back QHT based hash table is also designed for lockless lookups. Locks
  54. are only taken when code generation is required or TranslationBlocks
  55. have their block-to-block jumps patched.
  56. Global TCG State
  57. ----------------
  58. User-mode emulation
  59. ~~~~~~~~~~~~~~~~~~~
  60. We need to protect the entire code generation cycle including any post
  61. generation patching of the translated code. This also implies a shared
  62. translation buffer which contains code running on all cores. Any
  63. execution path that comes to the main run loop will need to hold a
  64. mutex for code generation. This also includes times when we need flush
  65. code or entries from any shared lookups/caches. Structures held on a
  66. per-vCPU basis won't need locking unless other vCPUs will need to
  67. modify them.
  68. DESIGN REQUIREMENT: Add locking around all code generation and TB
  69. patching.
  70. (Current solution)
  71. Code generation is serialised with mmap_lock().
  72. !User-mode emulation
  73. ~~~~~~~~~~~~~~~~~~~~
  74. Each vCPU has its own TCG context and associated TCG region, thereby
  75. requiring no locking during translation.
  76. Translation Blocks
  77. ------------------
  78. Currently the whole system shares a single code generation buffer
  79. which when full will force a flush of all translations and start from
  80. scratch again. Some operations also force a full flush of translations
  81. including:
  82. - debugging operations (breakpoint insertion/removal)
  83. - some CPU helper functions
  84. - linux-user spawning its first thread
  85. - operations related to TCG Plugins
  86. This is done with the async_safe_run_on_cpu() mechanism to ensure all
  87. vCPUs are quiescent when changes are being made to shared global
  88. structures.
  89. More granular translation invalidation events are typically due
  90. to a change of the state of a physical page:
  91. - code modification (self modify code, patching code)
  92. - page changes (new page mapping in linux-user mode)
  93. While setting the invalid flag in a TranslationBlock will stop it
  94. being used when looked up in the hot-path there are a number of other
  95. book-keeping structures that need to be safely cleared.
  96. Any TranslationBlocks which have been patched to jump directly to the
  97. now invalid blocks need the jump patches reversing so they will return
  98. to the C code.
  99. There are a number of look-up caches that need to be properly updated
  100. including the:
  101. - jump lookup cache
  102. - the physical-to-tb lookup hash table
  103. - the global page table
  104. The global page table (l1_map) which provides a multi-level look-up
  105. for PageDesc structures which contain pointers to the start of a
  106. linked list of all Translation Blocks in that page (see page_next).
  107. Both the jump patching and the page cache involve linked lists that
  108. the invalidated TranslationBlock needs to be removed from.
  109. DESIGN REQUIREMENT: Safely handle invalidation of TBs
  110. - safely patch/revert direct jumps
  111. - remove central PageDesc lookup entries
  112. - ensure lookup caches/hashes are safely updated
  113. (Current solution)
  114. The direct jump themselves are updated atomically by the TCG
  115. tb_set_jmp_target() code. Modification to the linked lists that allow
  116. searching for linked pages are done under the protection of tb->jmp_lock,
  117. where tb is the destination block of a jump. Each origin block keeps a
  118. pointer to its destinations so that the appropriate lock can be acquired before
  119. iterating over a jump list.
  120. The global page table is a lockless radix tree; cmpxchg is used
  121. to atomically insert new elements.
  122. The lookup caches are updated atomically and the lookup hash uses QHT
  123. which is designed for concurrent safe lookup.
  124. Parallel code generation is supported. QHT is used at insertion time
  125. as the synchronization point across threads, thereby ensuring that we only
  126. keep track of a single TranslationBlock for each guest code block.
  127. Memory maps and TLBs
  128. --------------------
  129. The memory handling code is fairly critical to the speed of memory
  130. access in the emulated system. The SoftMMU code is designed so the
  131. hot-path can be handled entirely within translated code. This is
  132. handled with a per-vCPU TLB structure which once populated will allow
  133. a series of accesses to the page to occur without exiting the
  134. translated code. It is possible to set flags in the TLB address which
  135. will ensure the slow-path is taken for each access. This can be done
  136. to support:
  137. - Memory regions (dividing up access to PIO, MMIO and RAM)
  138. - Dirty page tracking (for code gen, SMC detection, migration and display)
  139. - Virtual TLB (for translating guest address->real address)
  140. When the TLB tables are updated by a vCPU thread other than their own
  141. we need to ensure it is done in a safe way so no inconsistent state is
  142. seen by the vCPU thread.
  143. Some operations require updating a number of vCPUs TLBs at the same
  144. time in a synchronised manner.
  145. DESIGN REQUIREMENTS:
  146. - TLB Flush All/Page
  147. - can be across-vCPUs
  148. - cross vCPU TLB flush may need other vCPU brought to halt
  149. - change may need to be visible to the calling vCPU immediately
  150. - TLB Flag Update
  151. - usually cross-vCPU
  152. - want change to be visible as soon as possible
  153. - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
  154. - This is a per-vCPU table - by definition can't race
  155. - updated by its own thread when the slow-path is forced
  156. (Current solution)
  157. A new set of tlb flush operations (tlb_flush_*_all_cpus_synced) force
  158. synchronisation by setting the source vCPUs work as "safe work" and
  159. exiting the cpu run loop. This ensures that by the time execution
  160. restarts all flush operations have completed.
  161. TLB flag updates are all done atomically and are also protected by the
  162. corresponding page lock.
  163. (Known limitation)
  164. Not really a limitation but the wait mechanism is overly strict for
  165. some architectures which only need flushes completed by a barrier
  166. instruction. This could be a future optimisation.
  167. Emulated hardware state
  168. -----------------------
  169. Currently thanks to KVM work any access to IO memory is automatically protected
  170. by the BQL (Big QEMU Lock). Any IO region that doesn't use the BQL is expected
  171. to do its own locking.
  172. However IO memory isn't the only way emulated hardware state can be
  173. modified. Some architectures have model specific registers that
  174. trigger hardware emulation features. Generally any translation helper
  175. that needs to update more than a single vCPUs of state should take the
  176. BQL.
  177. As the BQL, or global iothread mutex is shared across the system we
  178. push the use of the lock as far down into the TCG code as possible to
  179. minimise contention.
  180. (Current solution)
  181. MMIO access automatically serialises hardware emulation by way of the
  182. BQL. Currently Arm targets serialise all ARM_CP_IO register accesses
  183. and also defer the reset/startup of vCPUs to the vCPU context by way
  184. of async_run_on_cpu().
  185. Updates to interrupt state are also protected by the BQL as they can
  186. often be cross vCPU.
  187. Memory Consistency
  188. ==================
  189. Between emulated guests and host systems there are a range of memory
  190. consistency models. Even emulating weakly ordered systems on strongly
  191. ordered hosts needs to ensure things like store-after-load re-ordering
  192. can be prevented when the guest wants to.
  193. Memory Barriers
  194. ---------------
  195. Barriers (sometimes known as fences) provide a mechanism for software
  196. to enforce a particular ordering of memory operations from the point
  197. of view of external observers (e.g. another processor core). They can
  198. apply to any memory operations as well as just loads or stores.
  199. The Linux kernel has an excellent `write-up
  200. <https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt>`_
  201. on the various forms of memory barrier and the guarantees they can
  202. provide.
  203. Barriers are often wrapped around synchronisation primitives to
  204. provide explicit memory ordering semantics. However they can be used
  205. by themselves to provide safe lockless access by ensuring for example
  206. a change to a signal flag will only be visible once the changes to
  207. payload are.
  208. DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
  209. This would enforce a strong load/store ordering so all loads/stores
  210. complete at the memory barrier. On single-core non-SMP strongly
  211. ordered backends this could become a NOP.
  212. Aside from explicit standalone memory barrier instructions there are
  213. also implicit memory ordering semantics which comes with each guest
  214. memory access instruction. For example all x86 load/stores come with
  215. fairly strong guarantees of sequential consistency whereas Arm has
  216. special variants of load/store instructions that imply acquire/release
  217. semantics.
  218. In the case of a strongly ordered guest architecture being emulated on
  219. a weakly ordered host the scope for a heavy performance impact is
  220. quite high.
  221. DESIGN REQUIREMENTS: Be efficient with use of memory barriers
  222. - host systems with stronger implied guarantees can skip some barriers
  223. - merge consecutive barriers to the strongest one
  224. (Current solution)
  225. The system currently has a tcg_gen_mb() which will add memory barrier
  226. operations if code generation is being done in a parallel context. The
  227. tcg_optimize() function attempts to merge barriers up to their
  228. strongest form before any load/store operations. The solution was
  229. originally developed and tested for linux-user based systems. All
  230. backends have been converted to emit fences when required. So far the
  231. following front-ends have been updated to emit fences when required:
  232. - target-i386
  233. - target-arm
  234. - target-aarch64
  235. - target-alpha
  236. - target-mips
  237. Memory Control and Maintenance
  238. ------------------------------
  239. This includes a class of instructions for controlling system cache
  240. behaviour. While QEMU doesn't model cache behaviour these instructions
  241. are often seen when code modification has taken place to ensure the
  242. changes take effect.
  243. Synchronisation Primitives
  244. --------------------------
  245. There are two broad types of synchronisation primitives found in
  246. modern ISAs: atomic instructions and exclusive regions.
  247. The first type offer a simple atomic instruction which will guarantee
  248. some sort of test and conditional store will be truly atomic w.r.t.
  249. other cores sharing access to the memory. The classic example is the
  250. x86 cmpxchg instruction.
  251. The second type offer a pair of load/store instructions which offer a
  252. guarantee that a region of memory has not been touched between the
  253. load and store instructions. An example of this is Arm's ldrex/strex
  254. pair where the strex instruction will return a flag indicating a
  255. successful store only if no other CPU has accessed the memory region
  256. since the ldrex.
  257. Traditionally TCG has generated a series of operations that work
  258. because they are within the context of a single translation block so
  259. will have completed before another CPU is scheduled. However with
  260. the ability to have multiple threads running to emulate multiple CPUs
  261. we will need to explicitly expose these semantics.
  262. DESIGN REQUIREMENTS:
  263. - Support classic atomic instructions
  264. - Support load/store exclusive (or load link/store conditional) pairs
  265. - Generic enough infrastructure to support all guest architectures
  266. CURRENT OPEN QUESTIONS:
  267. - How problematic is the ABA problem in general?
  268. (Current solution)
  269. The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
  270. can be used directly or combined to emulate other instructions like
  271. Arm's ldrex/strex instructions. While they are susceptible to the ABA
  272. problem so far common guests have not implemented patterns where
  273. this may be a problem - typically presenting a locking ABI which
  274. assumes cmpxchg like semantics.
  275. The code also includes a fall-back for cases where multi-threaded TCG
  276. ops can't work (e.g. guest atomic width > host atomic width). In this
  277. case an EXCP_ATOMIC exit occurs and the instruction is emulated with
  278. an exclusive lock which ensures all emulation is serialised.
  279. While the atomic helpers look good enough for now there may be a need
  280. to look at solutions that can more closely model the guest
  281. architectures semantics.