
========
Postcopy
========

.. contents::

'Postcopy' migration is a way to deal with migrations that refuse to converge
(or take too long to converge). Its plus side is that there is an upper bound on
the amount of migration traffic and time it takes; the down side is that during
the postcopy phase, a failure of *either* side causes the guest to be lost.

In postcopy the destination CPUs are started before all the memory has been
transferred, and accesses to pages that are yet to be transferred cause
a fault that's translated by QEMU into a request to the source QEMU.

Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
doesn't finish in a given time the switch is made to postcopy.

Enabling postcopy
=================

To enable postcopy, issue this command on the monitor (both source and
destination) prior to the start of migration:

``migrate_set_capability postcopy-ram on``

The normal commands are then used to start a migration, which is still
started in precopy mode. Issuing:

``migrate_start_postcopy``

will now cause the transition from precopy to postcopy.
It can be issued immediately after migration is started or any
time later on. Issuing it after the end of a migration is harmless.
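
The same sequence in QMP might look like this (a minimal sketch; the
destination address is illustrative):

::

  -> { "execute": "migrate-set-capabilities",
       "arguments": { "capabilities": [
           { "capability": "postcopy-ram", "state": true } ] } }
  -> { "execute": "migrate",
       "arguments": { "uri": "tcp:destination.example:4444" } }
  -> { "execute": "migrate-start-postcopy" }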

Blocktime is a postcopy live migration metric, intended to show how
long a vCPU was in a state of interruptible sleep due to a page fault.
This metric is calculated both across all vCPUs as an overlapped value
and separately for each vCPU. These values are calculated on the
destination side. To enable postcopy blocktime calculation, enter the
following command on the destination monitor:

``migrate_set_capability postcopy-blocktime on``

Postcopy blocktime can be retrieved with the ``query-migrate`` QMP
command: the ``postcopy-blocktime`` field shows the overlapped blocking
time for all vCPUs, and ``postcopy-vcpu-blocktime`` shows a list of
blocking times per vCPU.
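
For example (a sketch of the relevant fields only; the values are
illustrative):

::

  -> { "execute": "query-migrate" }
  <- { "return": { "status": "postcopy-active",
                   "postcopy-blocktime": 3,
                   "postcopy-vcpu-blocktime": [ 3, 1 ],
                   ... } }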

.. note::
  During the postcopy phase, the bandwidth limits set using
  ``migrate_set_parameter`` are ignored (to avoid delaying requested pages that
  the destination is waiting for).

Postcopy internals
==================

State machine
-------------

Postcopy moves through a series of states (see postcopy_state) from
ADVISE->DISCARD->LISTEN->RUNNING->END

- Advise

  Set at the start of migration if postcopy is enabled, even
  if it hasn't had the start command; here the destination
  checks that its OS has the support needed for postcopy, and performs
  setup to ensure the RAM mappings are suitable for later postcopy.
  The destination will fail early in migration at this point if the
  required OS support is not present.
  (Triggered by reception of POSTCOPY_ADVISE command)

- Discard

  Entered on receipt of the first 'discard' command; prior to
  the first Discard being performed, hugepages are switched off
  (using madvise) to ensure that no new huge pages are created
  during the postcopy phase, and to cause any huge pages that
  have discards on them to be broken.

- Listen

  The first command in the package, POSTCOPY_LISTEN, switches
  the destination state to Listen, and starts a new thread
  (the 'listen thread') which takes over the job of receiving
  pages off the migration stream, while the main thread carries
  on processing the blob. With this thread able to process page
  reception, the destination now 'sensitises' the RAM to detect
  any access to missing pages (on Linux using the 'userfault'
  system; a sketch of that step follows this list).

- Running

  POSTCOPY_RUN causes the destination to synchronise all
  state and start the CPUs and IO devices running. The main
  thread now finishes processing the migration package and
  carries on as it would for normal precopy migration
  (although it can't do the cleanup it would do as it
  finishes a normal migration).

- End

  The listen thread can now quit, and perform the cleanup of migration
  state; the migration is now complete.
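
On Linux the 'sensitise' step is built on ``userfaultfd``. The following is
a minimal sketch of that mechanism only, not QEMU's actual code (the real
implementation lives in ``migration/postcopy-ram.c``):

::

  /* Illustrative only: register a RAM range so that any access to a
   * not-yet-populated page generates a userfault event. */
  #include <fcntl.h>
  #include <linux/userfaultfd.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static int sensitise_ram(void *host_addr, unsigned long len)
  {
      int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
      if (uffd < 0) {
          return -1;
      }

      /* Handshake with the kernel, requesting the base API */
      struct uffdio_api api = { .api = UFFD_API, .features = 0 };
      if (ioctl(uffd, UFFDIO_API, &api) < 0) {
          close(uffd);
          return -1;
      }

      /* Ask for fault events on accesses to missing pages in the range */
      struct uffdio_register reg = {
          .range = { .start = (unsigned long)host_addr, .len = len },
          .mode  = UFFDIO_REGISTER_MODE_MISSING,
      };
      if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0) {
          close(uffd);
          return -1;
      }

      return uffd; /* poll this fd; faults arrive as struct uffd_msg */
  }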

Device transfer
---------------

Loading of device data may cause the device emulation to access guest RAM
that may trigger faults that have to be resolved by the source; as such
the migration stream has to be able to respond with page data *during* the
device load, and hence the device data has to be read from the stream completely
before the device load begins to free the stream up. This is achieved by
'packaging' the device data into a blob that's read in one go.

Source behaviour
----------------

Until postcopy is entered the migration stream is identical to normal
precopy, except for the addition of a 'postcopy advise' command at
the beginning, to tell the destination that postcopy might happen.
When postcopy starts the source sends the page discard data and then
forms the 'package' containing:

- Command: 'postcopy listen'
- The device state

  A series of sections, identical to the precopy stream's device state
  stream, containing everything except postcopiable devices (i.e. RAM)
- Command: 'postcopy run'

The 'package' is sent as the data part of a ``CMD_PACKAGED`` command, and the
contents are formatted in the same way as the main migration stream.

During postcopy the source scans the list of dirty pages and sends them
to the destination without being requested (in much the same way as precopy);
however, when a page request is received from the destination, the dirty page
scanning restarts from the requested location. This causes requested pages
to be sent quickly, and also causes pages directly after the requested page
to be sent quickly, in the hope that those pages are likely to be used
by the destination soon.
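
The scan-restart behaviour can be pictured with a short sketch. This is not
QEMU's code; ``pop_page_request`` and ``send_page`` are hypothetical
stand-ins for the real stream plumbing:

::

  #include <stdbool.h>
  #include <stddef.h>

  bool pop_page_request(size_t *page); /* hypothetical */
  void send_page(size_t page);         /* hypothetical */

  void postcopy_send_loop(unsigned char *dirty_bitmap, size_t npages)
  {
      size_t cur = 0;

      for (;;) {
          size_t req, scanned;

          /* A request from the destination restarts the scan there, so
           * the faulting page and its neighbours are sent first. */
          if (pop_page_request(&req)) {
              cur = req;
          }

          /* Find the next dirty page at or after 'cur', wrapping around */
          for (scanned = 0; scanned < npages; scanned++) {
              if (dirty_bitmap[cur / 8] & (1u << (cur % 8))) {
                  break;
              }
              cur = (cur + 1) % npages;
          }
          if (scanned == npages) {
              return; /* no dirty pages left */
          }

          /* Clear the dirty bit as the page is sent (see 'Source side
           * page bitmap' below) */
          dirty_bitmap[cur / 8] &= ~(1u << (cur % 8));
          send_page(cur);
      }
  }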

Destination behaviour
---------------------

Initially the destination looks the same as precopy, with a single thread
reading the migration stream; the 'postcopy advise' and 'discard' commands
are processed to change the way RAM is managed, but don't affect the stream
processing.

::

  ------------------------------------------------------------------------------
                             1      2   3      4      5           6   7
  main  -----DISCARD-CMD_PACKAGED ( LISTEN DEVICE DEVICE DEVICE RUN )
  thread                             |       |
                                     |     (page request)
                                     |        \___
                                     v            \
  listen thread:                     --- page -- page -- page -- page -- page --
                                           a   b        c
  ------------------------------------------------------------------------------

- On receipt of ``CMD_PACKAGED`` (1)

  All the data associated with the package - the ( ... ) section in the
  diagram - is read into memory, and the main thread recurses into
  qemu_loadvm_state_main to process the contents of the package (2),
  which contains commands (3,6) and devices (4...)

- On receipt of 'postcopy listen' - 3 - (i.e. the 1st command in the package)

  a new thread (a) is started that takes over servicing the migration stream,
  while the main thread carries on loading the package. It loads normal
  background page data (b), but if a fault happens during a device load (5),
  the returned page (c) is loaded by the listen thread, allowing the main
  thread's device load to carry on.

- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)

  letting the destination CPUs start running. At the end of the
  ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
  is no longer used by migration, while the listen thread carries on servicing
  page data until the end of migration.

Source side page bitmap
-----------------------

The 'migration bitmap' in postcopy is basically the same as in precopy,
where each bit indicates that a page is 'dirty' - i.e. needs
sending. During the precopy phase this is updated as the CPU dirties
pages; however, during postcopy the CPUs are stopped and nothing should
dirty anything any more. Instead, dirty bits are cleared when the relevant
pages are sent during postcopy.

Postcopy features
=================

Postcopy recovery
-----------------

Compared to precopy, postcopy is special in its error handling. When any
error happens (in this case, mostly network errors), QEMU cannot easily
fail the migration because the VM data resides in both the source and the
destination QEMU instances. Instead, when an issue happens, QEMU on both
sides will go into a paused state, and a recovery phase is needed to
continue the paused postcopy migration.

The recovery phase normally contains a few steps:

- When a network issue occurs, both QEMU instances will go into the
  **POSTCOPY_PAUSED** migration state.

- When the network is recovered (or a new network is provided), the admin
  can set up the new channel for migration using the QMP command
  'migrate-recover' on the destination node, preparing for a resume.

- On the source host, the admin can continue the interrupted postcopy
  migration using the QMP command 'migrate' with the resume=true flag set
  (see the example after this list). The source QEMU will go into the
  **POSTCOPY_RECOVER_SETUP** state trying to re-establish the channels.

- When both QEMU instances successfully reconnect using the new or fixed-up
  channel, they will go into the **POSTCOPY_RECOVER** state, where a
  handshake procedure synchronizes the VM state between the two QEMUs so
  that the postcopy migration can continue. For example, pages may have
  been sent right in the window when the network was interrupted; the
  handshake guarantees that pages lost in flight will be resent.

- After a proper handshake synchronization, QEMU will continue the
  postcopy migration on both sides and go back to the **POSTCOPY_ACTIVE**
  state. The postcopy migration will continue.
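
In QMP, a recovery might look like this (a sketch; the URIs are
illustrative):

::

  On the destination, prepare a new channel for the resume:
  -> { "execute": "migrate-recover",
       "arguments": { "uri": "tcp:0:4446" } }

  On the source, resume the paused migration over the new channel:
  -> { "execute": "migrate",
       "arguments": { "uri": "tcp:destination.example:4446",
                      "resume": true } }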

During a paused postcopy migration, the VM can logically still continue
running, and it will not be impacted by accesses to pages that were
already migrated to the destination VM before the interruption happened.
However, if any of the missing pages is accessed on the destination VM,
the VM thread will be halted waiting for the page to be migrated, which
means it can stay halted until the recovery is complete.

The impact of accessing missing pages depends on the configuration of the
guest. For example, with async page fault enabled, the guest can
proactively schedule out the threads accessing missing pages.

Postcopy with hugepages
-----------------------

Postcopy now works with hugetlbfs backed memory:

a) The Linux kernel on the destination must support userfault on hugepages.
b) The huge-page configuration on the source and destination VMs must be
   identical; i.e. RAMBlocks on both sides must use the same page size.
c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
   RAM if it doesn't have enough hugepages, triggering (b) to fail.
   Using ``-mem-prealloc`` enforces the allocation using hugepages
   (see the example after this list).
d) Care should be taken with the size of hugepage used; postcopy with 2MB
   hugepages works well, however 1GB hugepages are likely to be problematic
   since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
   and until the full page is transferred the destination thread is blocked.
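
For example, both sides might be started with preallocated hugepage-backed
RAM like this (a sketch; the memory size and remaining arguments are
illustrative):

::

  qemu-system-x86_64 -m 4G \
      -mem-path /dev/hugepages \
      -mem-prealloc \
      ...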

Postcopy with shared memory
---------------------------

Postcopy migration with shared memory needs explicit support from the other
processes that share memory and from QEMU. There are restrictions on the type of
memory that userfault can support shared.

The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs``
(although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)``
for hugetlbfs which may be a problem in some configurations).

The vhost-user code in QEMU supports clients that have Postcopy support,
and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes
to support postcopy.

The client needs to open a userfaultfd and register the areas
of memory that it maps with userfault. The client must then pass the
userfaultfd back to QEMU together with a mapping table that allows
fault addresses in the client's address space to be converted back to
RAMBlock/offsets. The client's userfaultfd is added to the postcopy
fault-thread and page requests are made on behalf of the client by QEMU.
QEMU performs 'wake' operations on the client's userfaultfd to allow it
to continue after a page has arrived.
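
The 'wake' operation boils down to a ``UFFDIO_WAKE`` ioctl on the client's
userfaultfd. A minimal sketch (illustrative only; the fd and range come
from the registration described above):

::

  #include <linux/userfaultfd.h>
  #include <sys/ioctl.h>

  /* Allow any thread blocked on a fault in [start, start + len) to
   * continue, now that the page contents have been placed. */
  static int wake_faulted_range(int client_uffd,
                                unsigned long start, unsigned long len)
  {
      struct uffdio_range range = { .start = start, .len = len };

      return ioctl(client_uffd, UFFDIO_WAKE, &range);
  }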

.. note::
  There are two future improvements that would be nice:

  a) Some way to make QEMU ignorant of the addresses in the client's
     address space
  b) Avoiding the need for QEMU to perform ufd-wake calls after the
     pages have arrived

Retro-fitting postcopy to existing clients is possible:

a) A mechanism is needed for the registration with userfault as above,
   and the registration needs to be coordinated with the phases of
   postcopy. In vhost-user extra messages are added to the existing
   control channel.
b) Any thread that can block due to guest memory accesses must be
   identified and the implication understood; for example if the
   guest memory access is made while holding a lock then all other
   threads waiting for that lock will also be blocked.

Postcopy preemption mode
------------------------

Postcopy preempt is a capability introduced in the QEMU 8.0 release. It
allows urgent pages (those whose page faults were explicitly requested by
the destination QEMU) to be sent in a separate preempt channel, rather
than queued in the background migration channel. Anyone who cares about
latencies of page faults during a postcopy migration should enable this
feature. By default, it is not enabled.
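
To enable it, set the capability before migration starts (on both source
and destination, as with ``postcopy-ram``), for example in QMP:

::

  -> { "execute": "migrate-set-capabilities",
       "arguments": { "capabilities": [
           { "capability": "postcopy-preempt", "state": true } ] } }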