replay.rst 14 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309
  1. ..
  2. Copyright (c) 2022, ISP RAS
  3. Written by Pavel Dovgalyuk and Alex Bennée
  4. =======================
  5. Execution Record/Replay
  6. =======================
  7. Core concepts
  8. =============
  9. Record/replay functions are used for the deterministic replay of qemu
  10. execution. Execution recording writes a non-deterministic events log, which
  11. can be later used for replaying the execution anywhere and for unlimited
  12. number of times. Execution replaying reads the log and replays all
  13. non-deterministic events including external input, hardware clocks,
  14. and interrupts.
  15. Several parts of QEMU include function calls to make event log recording
  16. and replaying.
  17. Devices' models that have non-deterministic input from external devices were
  18. changed to write every external event into the execution log immediately.
  19. E.g. network packets are written into the log when they arrive into the virtual
  20. network adapter.
  21. All non-deterministic events are coming from these devices. But to
  22. replay them we need to know at which moments they occur. We specify
  23. these moments by counting the number of instructions executed between
  24. every pair of consecutive events.
  25. Academic papers with description of deterministic replay implementation:
  26. * `Deterministic Replay of System's Execution with Multi-target QEMU Simulator for Dynamic Analysis and Reverse Debugging <https://www.computer.org/csdl/proceedings/csmr/2012/4666/00/4666a553-abs.html>`_
  27. * `Don't panic: reverse debugging of kernel drivers <https://dl.acm.org/citation.cfm?id=2786805.2803179>`_
  28. Modifications of qemu include:
  29. * wrappers for clock and time functions to save their return values in the log
  30. * saving different asynchronous events (e.g. system shutdown) into the log
  31. * synchronization of the bottom halves execution
  32. * synchronization of the threads from thread pool
  33. * recording/replaying user input (mouse, keyboard, and microphone)
  34. * adding internal checkpoints for cpu and io synchronization
  35. * network filter for recording and replaying the packets
  36. * block driver for making block layer deterministic
  37. * serial port input record and replay
  38. * recording of random numbers obtained from the external sources
  39. Instruction counting
  40. --------------------
  41. QEMU should work in icount mode to use record/replay feature. icount was
  42. designed to allow deterministic execution in absence of external inputs
  43. of the virtual machine. We also use icount to control the occurrence of the
  44. non-deterministic events. The number of instructions elapsed from the last event
  45. is written to the log while recording the execution. In replay mode we
  46. can predict when to inject that event using the instruction counter.
  47. Locking and thread synchronisation
  48. ----------------------------------
  49. Previously the synchronisation of the main thread and the vCPU thread
  50. was ensured by the holding of the BQL. However the trend has been to
  51. reduce the time the BQL was held across the system including under TCG
  52. system emulation. As it is important that batches of events are kept
  53. in sequence (e.g. expiring timers and checkpoints in the main thread
  54. while instruction checkpoints are written by the vCPU thread) we need
  55. another lock to keep things in lock-step. This role is now handled by
  56. the replay_mutex_lock. It used to be held only for each event being
  57. written but now it is held for a whole execution period. This results
  58. in a deterministic ping-pong between the two main threads.
  59. As the BQL is now a finer grained lock than the replay_lock it is almost
  60. certainly a bug, and a source of deadlocks, to take the
  61. replay_mutex_lock while the BQL is held. This is enforced by an assert.
  62. While the unlocks are usually in the reverse order, this is not
  63. necessary; you can drop the replay_lock while holding the BQL, without
  64. doing a more complicated unlock_iothread/replay_unlock/lock_iothread
  65. sequence.
  66. Checkpoints
  67. -----------
  68. Replaying the execution of virtual machine is bound by sources of
  69. non-determinism. These are inputs from clock and peripheral devices,
  70. and QEMU thread scheduling. Thread scheduling affect on processing events
  71. from timers, asynchronous input-output, and bottom halves.
  72. Invocations of timers are coupled with clock reads and changing the state
  73. of the virtual machine. Reads produce non-deterministic data taken from
  74. host clock. And VM state changes should preserve their order. Their relative
  75. order in replay mode must replicate the order of callbacks in record mode.
  76. To preserve this order we use checkpoints. When a specific clock is processed
  77. in record mode we save to the log special "checkpoint" event.
  78. Checkpoints here do not refer to virtual machine snapshots. They are just
  79. record/replay events used for synchronization.
  80. QEMU in replay mode will try to invoke timers processing in random moment
  81. of time. That's why we do not process a group of timers until the checkpoint
  82. event will be read from the log. Such an event allows synchronizing CPU
  83. execution and timer events.
  84. Two other checkpoints govern the "warping" of the virtual clock.
  85. While the virtual machine is idle, the virtual clock increments at
  86. 1 ns per *real time* nanosecond. This is done by setting up a timer
  87. (called the warp timer) on the virtual real time clock, so that the
  88. timer fires at the next deadline of the virtual clock; the virtual clock
  89. is then incremented (which is called "warping" the virtual clock) as
  90. soon as the timer fires or the CPUs need to go out of the idle state.
  91. Two functions are used for this purpose; because these actions change
  92. virtual machine state and must be deterministic, each of them creates a
  93. checkpoint. ``icount_start_warp_timer`` checks if the CPUs are idle and if so
  94. starts accounting real time to virtual clock. ``icount_account_warp_timer``
  95. is called when the CPUs get an interrupt or when the warp timer fires,
  96. and it warps the virtual clock by the amount of real time that has passed
  97. since ``icount_start_warp_timer``.
  98. Virtual devices
  99. ===============
  100. Record/replay mechanism, that could be enabled through icount mode, expects
  101. the virtual devices to satisfy the following requirement:
  102. everything that affects
  103. the guest state during execution in icount mode should be deterministic.
  104. Timers
  105. ------
  106. Timers are used to execute callbacks from different subsystems of QEMU
  107. at the specified moments of time. There are several kinds of timers:
  108. * Real time clock. Based on host time and used only for callbacks that
  109. do not change the virtual machine state. For this reason real time
  110. clock and timers does not affect deterministic replay at all.
  111. * Virtual clock. These timers run only during the emulation. In icount
  112. mode virtual clock value is calculated using executed instructions counter.
  113. That is why it is completely deterministic and does not have to be recorded.
  114. * Host clock. This clock is used by device models that simulate real time
  115. sources (e.g. real time clock chip). Host clock is the one of the sources
  116. of non-determinism. Host clock read operations should be logged to
  117. make the execution deterministic.
  118. * Virtual real time clock. This clock is similar to real time clock but
  119. it is used only for increasing virtual clock while virtual machine is
  120. sleeping. Due to its nature it is also non-deterministic as the host clock
  121. and has to be logged too.
  122. All virtual devices should use virtual clock for timers that change the guest
  123. state. Virtual clock is deterministic, therefore such timers are deterministic
  124. too.
  125. Virtual devices can also use realtime clock for the events that do not change
  126. the guest state directly. When the clock ticking should depend on VM execution
  127. speed, use virtual clock with EXTERNAL attribute. It is not deterministic,
  128. but its speed depends on the guest execution. This clock is used by
  129. the virtual devices (e.g., slirp routing device) that lie outside the
  130. replayed guest.
  131. Block devices
  132. -------------
  133. Block devices record/replay module (``blkreplay``) intercepts calls of
  134. bdrv coroutine functions at the top of block drivers stack.
  135. All block completion operations are added to the queue in the coroutines.
  136. When the queue is flushed the information about processed requests
  137. is recorded to the log. In replay phase the queue is matched with
  138. events read from the log. Therefore block devices requests are processed
  139. deterministically.
  140. Bottom halves
  141. -------------
  142. Bottom half callbacks, that affect the guest state, should be invoked through
  143. ``replay_bh_schedule_event`` or ``replay_bh_schedule_oneshot_event`` functions.
  144. Their invocations are saved in record mode and synchronized with the existing
  145. log in replay mode.
  146. Disk I/O events are completely deterministic in our model, because
  147. in both record and replay modes we start virtual machine from the same
  148. disk state. But callbacks that virtual disk controller uses for reading and
  149. writing the disk may occur at different moments of time in record and replay
  150. modes.
  151. Reading and writing requests are created by CPU thread of QEMU. Later these
  152. requests proceed to block layer which creates "bottom halves". Bottom
  153. halves consist of callback and its parameters. They are processed when
  154. main loop locks the BQL. These locks are not synchronized with
  155. replaying process because main loop also processes the events that do not
  156. affect the virtual machine state (like user interaction with monitor).
  157. That is why we had to implement saving and replaying bottom halves callbacks
  158. synchronously to the CPU execution. When the callback is about to execute
  159. it is added to the queue in the replay module. This queue is written to the
  160. log when its callbacks are executed. In replay mode callbacks are not processed
  161. until the corresponding event is read from the events log file.
  162. Sometimes the block layer uses asynchronous callbacks for its internal purposes
  163. (like reading or writing VM snapshots or disk image cluster tables). In this
  164. case bottom halves are not marked as "replayable" and do not saved
  165. into the log.
  166. Saving/restoring the VM state
  167. -----------------------------
  168. Record/replay relies on VM state save and restore being complete and
  169. deterministic.
  170. All fields in the device state structure (including virtual timers)
  171. should be restored by loadvm to the same values they had before savevm.
  172. Avoid accessing other devices' state, because the order of saving/restoring
  173. is not defined. It means that you should not call functions like
  174. ``update_irq`` in ``post_load`` callback. Save everything explicitly to avoid
  175. the dependencies that may make restoring the VM state non-deterministic.
  176. Stopping the VM
  177. ---------------
  178. Stopping the guest should not interfere with its state (with the exception
  179. of the network connections, that could be broken by the remote timeouts).
  180. VM can be stopped at any moment of replay by the user. Restarting the VM
  181. after that stop should not break the replay by the unneeded guest state change.
  182. Replay log format
  183. =================
  184. Record/replay log consists of the header and the sequence of execution
  185. events. The header includes 4-byte replay version id and 8-byte reserved
  186. field. Version is updated every time replay log format changes to prevent
  187. using replay log created by another build of qemu.
  188. The sequence of the events describes virtual machine state changes.
  189. It includes all non-deterministic inputs of VM, synchronization marks and
  190. instruction counts used to correctly inject inputs at replay.
  191. Synchronization marks (checkpoints) are used for synchronizing qemu threads
  192. that perform operations with virtual hardware. These operations may change
  193. system's state (e.g., change some register or generate interrupt) and
  194. therefore should execute synchronously with CPU thread.
  195. Every event in the log includes 1-byte event id and optional arguments.
  196. When argument is an array, it is stored as 4-byte array length
  197. and corresponding number of bytes with data.
  198. Here is the list of events that are written into the log:
  199. - EVENT_INSTRUCTION. Instructions executed since last event. Followed by:
  200. - 4-byte number of executed instructions.
  201. - EVENT_INTERRUPT. Used to synchronize interrupt processing.
  202. - EVENT_EXCEPTION. Used to synchronize exception handling.
  203. - EVENT_ASYNC. This is a group of events. When such an event is generated,
  204. it is stored in the queue and processed in icount_account_warp_timer().
  205. Every such event has it's own id from the following list:
  206. - REPLAY_ASYNC_EVENT_BH. Bottom-half callback. This event synchronizes
  207. callbacks that affect virtual machine state, but normally called
  208. asynchronously. Followed by:
  209. - 8-byte operation id.
  210. - REPLAY_ASYNC_EVENT_INPUT. Input device event. Contains
  211. parameters of keyboard and mouse input operations
  212. (key press/release, mouse pointer movement). Followed by:
  213. - 9-16 bytes depending of input event.
  214. - REPLAY_ASYNC_EVENT_INPUT_SYNC. Internal input synchronization event.
  215. - REPLAY_ASYNC_EVENT_CHAR_READ. Character (e.g., serial port) device input
  216. initiated by the sender. Followed by:
  217. - 1-byte character device id.
  218. - Array with bytes were read.
  219. - REPLAY_ASYNC_EVENT_BLOCK. Block device operation. Used to synchronize
  220. operations with disk and flash drives with CPU. Followed by:
  221. - 8-byte operation id.
  222. - REPLAY_ASYNC_EVENT_NET. Incoming network packet. Followed by:
  223. - 1-byte network adapter id.
  224. - 4-byte packet flags.
  225. - Array with packet bytes.
  226. - EVENT_SHUTDOWN. Occurs when user sends shutdown event to qemu,
  227. e.g., by closing the window.
  228. - EVENT_CHAR_WRITE. Used to synchronize character output operations. Followed by:
  229. - 4-byte output function return value.
  230. - 4-byte offset in the output array.
  231. - EVENT_CHAR_READ_ALL. Used to synchronize character input operations,
  232. initiated by qemu. Followed by:
  233. - Array with bytes that were read.
  234. - EVENT_CHAR_READ_ALL_ERROR. Unsuccessful character input operation,
  235. initiated by qemu. Followed by:
  236. - 4-byte error code.
  237. - EVENT_CLOCK + clock_id. Group of events for host clock read operations. Followed by:
  238. - 8-byte clock value.
  239. - EVENT_CHECKPOINT + checkpoint_id. Checkpoint for synchronization of
  240. CPU, internal threads, and asynchronous input events.
  241. - EVENT_END. Last event in the log.