|
@@ -644,308 +644,3 @@ algorithm will restrict virtual CPUs as needed to keep their dirty page
|
|
rate inside the limit. This leads to more steady reading performance during
|
|
rate inside the limit. This leads to more steady reading performance during
|
|
live migration and can aid in improving large guest responsiveness.
|
|
live migration and can aid in improving large guest responsiveness.
|
|
|
|
|
|
-Postcopy
|
|
|
|
-========
|
|
|
|
-
|
|
|
|
-'Postcopy' migration is a way to deal with migrations that refuse to converge
|
|
|
|
-(or take too long to converge) its plus side is that there is an upper bound on
|
|
|
|
-the amount of migration traffic and time it takes, the down side is that during
|
|
|
|
-the postcopy phase, a failure of *either* side causes the guest to be lost.
|
|
|
|
-
|
|
|
|
-In postcopy the destination CPUs are started before all the memory has been
|
|
|
|
-transferred, and accesses to pages that are yet to be transferred cause
|
|
|
|
-a fault that's translated by QEMU into a request to the source QEMU.
|
|
|
|
-
|
|
|
|
-Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
|
|
|
|
-doesn't finish in a given time the switch is made to postcopy.
|
|
|
|
-
|
|
|
|
-Enabling postcopy
|
|
|
|
------------------
|
|
|
|
-
|
|
|
|
-To enable postcopy, issue this command on the monitor (both source and
|
|
|
|
-destination) prior to the start of migration:
|
|
|
|
-
|
|
|
|
-``migrate_set_capability postcopy-ram on``
|
|
|
|
-
|
|
|
|
-The normal commands are then used to start a migration, which is still
|
|
|
|
-started in precopy mode. Issuing:
|
|
|
|
-
|
|
|
|
-``migrate_start_postcopy``
|
|
|
|
-
|
|
|
|
-will now cause the transition from precopy to postcopy.
|
|
|
|
-It can be issued immediately after migration is started or any
|
|
|
|
-time later on. Issuing it after the end of a migration is harmless.
|
|
|
|
-
|
|
|
|
-Blocktime is a postcopy live migration metric, intended to show how
|
|
|
|
-long the vCPU was in state of interruptible sleep due to pagefault.
|
|
|
|
-That metric is calculated both for all vCPUs as overlapped value, and
|
|
|
|
-separately for each vCPU. These values are calculated on destination
|
|
|
|
-side. To enable postcopy blocktime calculation, enter following
|
|
|
|
-command on destination monitor:
|
|
|
|
-
|
|
|
|
-``migrate_set_capability postcopy-blocktime on``
|
|
|
|
-
|
|
|
|
-Postcopy blocktime can be retrieved by query-migrate qmp command.
|
|
|
|
-postcopy-blocktime value of qmp command will show overlapped blocking
|
|
|
|
-time for all vCPU, postcopy-vcpu-blocktime will show list of blocking
|
|
|
|
-time per vCPU.
|
|
|
|
-
|
|
|
|
-.. note::
|
|
|
|
- During the postcopy phase, the bandwidth limits set using
|
|
|
|
- ``migrate_set_parameter`` is ignored (to avoid delaying requested pages that
|
|
|
|
- the destination is waiting for).
|
|
|
|
-
|
|
|
|
-Postcopy device transfer
|
|
|
|
-------------------------
|
|
|
|
-
|
|
|
|
-Loading of device data may cause the device emulation to access guest RAM
|
|
|
|
-that may trigger faults that have to be resolved by the source, as such
|
|
|
|
-the migration stream has to be able to respond with page data *during* the
|
|
|
|
-device load, and hence the device data has to be read from the stream completely
|
|
|
|
-before the device load begins to free the stream up. This is achieved by
|
|
|
|
-'packaging' the device data into a blob that's read in one go.
|
|
|
|
-
|
|
|
|
-Source behaviour
|
|
|
|
-----------------
|
|
|
|
-
|
|
|
|
-Until postcopy is entered the migration stream is identical to normal
|
|
|
|
-precopy, except for the addition of a 'postcopy advise' command at
|
|
|
|
-the beginning, to tell the destination that postcopy might happen.
|
|
|
|
-When postcopy starts the source sends the page discard data and then
|
|
|
|
-forms the 'package' containing:
|
|
|
|
-
|
|
|
|
- - Command: 'postcopy listen'
|
|
|
|
- - The device state
|
|
|
|
-
|
|
|
|
- A series of sections, identical to the precopy streams device state stream
|
|
|
|
- containing everything except postcopiable devices (i.e. RAM)
|
|
|
|
- - Command: 'postcopy run'
|
|
|
|
-
|
|
|
|
-The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
|
|
|
|
-contents are formatted in the same way as the main migration stream.
|
|
|
|
-
|
|
|
|
-During postcopy the source scans the list of dirty pages and sends them
|
|
|
|
-to the destination without being requested (in much the same way as precopy),
|
|
|
|
-however when a page request is received from the destination, the dirty page
|
|
|
|
-scanning restarts from the requested location. This causes requested pages
|
|
|
|
-to be sent quickly, and also causes pages directly after the requested page
|
|
|
|
-to be sent quickly in the hope that those pages are likely to be used
|
|
|
|
-by the destination soon.
|
|
|
|
-
|
|
|
|
-Destination behaviour
|
|
|
|
----------------------
|
|
|
|
-
|
|
|
|
-Initially the destination looks the same as precopy, with a single thread
|
|
|
|
-reading the migration stream; the 'postcopy advise' and 'discard' commands
|
|
|
|
-are processed to change the way RAM is managed, but don't affect the stream
|
|
|
|
-processing.
|
|
|
|
-
|
|
|
|
-::
|
|
|
|
-
|
|
|
|
- ------------------------------------------------------------------------------
|
|
|
|
- 1 2 3 4 5 6 7
|
|
|
|
- main -----DISCARD-CMD_PACKAGED ( LISTEN DEVICE DEVICE DEVICE RUN )
|
|
|
|
- thread | |
|
|
|
|
- | (page request)
|
|
|
|
- | \___
|
|
|
|
- v \
|
|
|
|
- listen thread: --- page -- page -- page -- page -- page --
|
|
|
|
-
|
|
|
|
- a b c
|
|
|
|
- ------------------------------------------------------------------------------
|
|
|
|
-
|
|
|
|
-- On receipt of ``CMD_PACKAGED`` (1)
|
|
|
|
-
|
|
|
|
- All the data associated with the package - the ( ... ) section in the diagram -
|
|
|
|
- is read into memory, and the main thread recurses into qemu_loadvm_state_main
|
|
|
|
- to process the contents of the package (2) which contains commands (3,6) and
|
|
|
|
- devices (4...)
|
|
|
|
-
|
|
|
|
-- On receipt of 'postcopy listen' - 3 -(i.e. the 1st command in the package)
|
|
|
|
-
|
|
|
|
- a new thread (a) is started that takes over servicing the migration stream,
|
|
|
|
- while the main thread carries on loading the package. It loads normal
|
|
|
|
- background page data (b) but if during a device load a fault happens (5)
|
|
|
|
- the returned page (c) is loaded by the listen thread allowing the main
|
|
|
|
- threads device load to carry on.
|
|
|
|
-
|
|
|
|
-- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)
|
|
|
|
-
|
|
|
|
- letting the destination CPUs start running. At the end of the
|
|
|
|
- ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
|
|
|
|
- is no longer used by migration, while the listen thread carries on servicing
|
|
|
|
- page data until the end of migration.
|
|
|
|
-
|
|
|
|
-Postcopy Recovery
|
|
|
|
------------------
|
|
|
|
-
|
|
|
|
-Comparing to precopy, postcopy is special on error handlings. When any
|
|
|
|
-error happens (in this case, mostly network errors), QEMU cannot easily
|
|
|
|
-fail a migration because VM data resides in both source and destination
|
|
|
|
-QEMU instances. On the other hand, when issue happens QEMU on both sides
|
|
|
|
-will go into a paused state. It'll need a recovery phase to continue a
|
|
|
|
-paused postcopy migration.
|
|
|
|
-
|
|
|
|
-The recovery phase normally contains a few steps:
|
|
|
|
-
|
|
|
|
- - When network issue occurs, both QEMU will go into PAUSED state
|
|
|
|
-
|
|
|
|
- - When the network is recovered (or a new network is provided), the admin
|
|
|
|
- can setup the new channel for migration using QMP command
|
|
|
|
- 'migrate-recover' on destination node, preparing for a resume.
|
|
|
|
-
|
|
|
|
- - On source host, the admin can continue the interrupted postcopy
|
|
|
|
- migration using QMP command 'migrate' with resume=true flag set.
|
|
|
|
-
|
|
|
|
- - After the connection is re-established, QEMU will continue the postcopy
|
|
|
|
- migration on both sides.
|
|
|
|
-
|
|
|
|
-During a paused postcopy migration, the VM can logically still continue
|
|
|
|
-running, and it will not be impacted from any page access to pages that
|
|
|
|
-were already migrated to destination VM before the interruption happens.
|
|
|
|
-However, if any of the missing pages got accessed on destination VM, the VM
|
|
|
|
-thread will be halted waiting for the page to be migrated, it means it can
|
|
|
|
-be halted until the recovery is complete.
|
|
|
|
-
|
|
|
|
-The impact of accessing missing pages can be relevant to different
|
|
|
|
-configurations of the guest. For example, when with async page fault
|
|
|
|
-enabled, logically the guest can proactively schedule out the threads
|
|
|
|
-accessing missing pages.
|
|
|
|
-
|
|
|
|
-Postcopy states
|
|
|
|
----------------
|
|
|
|
-
|
|
|
|
-Postcopy moves through a series of states (see postcopy_state) from
|
|
|
|
-ADVISE->DISCARD->LISTEN->RUNNING->END
|
|
|
|
-
|
|
|
|
- - Advise
|
|
|
|
-
|
|
|
|
- Set at the start of migration if postcopy is enabled, even
|
|
|
|
- if it hasn't had the start command; here the destination
|
|
|
|
- checks that its OS has the support needed for postcopy, and performs
|
|
|
|
- setup to ensure the RAM mappings are suitable for later postcopy.
|
|
|
|
- The destination will fail early in migration at this point if the
|
|
|
|
- required OS support is not present.
|
|
|
|
- (Triggered by reception of POSTCOPY_ADVISE command)
|
|
|
|
-
|
|
|
|
- - Discard
|
|
|
|
-
|
|
|
|
- Entered on receipt of the first 'discard' command; prior to
|
|
|
|
- the first Discard being performed, hugepages are switched off
|
|
|
|
- (using madvise) to ensure that no new huge pages are created
|
|
|
|
- during the postcopy phase, and to cause any huge pages that
|
|
|
|
- have discards on them to be broken.
|
|
|
|
-
|
|
|
|
- - Listen
|
|
|
|
-
|
|
|
|
- The first command in the package, POSTCOPY_LISTEN, switches
|
|
|
|
- the destination state to Listen, and starts a new thread
|
|
|
|
- (the 'listen thread') which takes over the job of receiving
|
|
|
|
- pages off the migration stream, while the main thread carries
|
|
|
|
- on processing the blob. With this thread able to process page
|
|
|
|
- reception, the destination now 'sensitises' the RAM to detect
|
|
|
|
- any access to missing pages (on Linux using the 'userfault'
|
|
|
|
- system).
|
|
|
|
-
|
|
|
|
- - Running
|
|
|
|
-
|
|
|
|
- POSTCOPY_RUN causes the destination to synchronise all
|
|
|
|
- state and start the CPUs and IO devices running. The main
|
|
|
|
- thread now finishes processing the migration package and
|
|
|
|
- now carries on as it would for normal precopy migration
|
|
|
|
- (although it can't do the cleanup it would do as it
|
|
|
|
- finishes a normal migration).
|
|
|
|
-
|
|
|
|
- - Paused
|
|
|
|
-
|
|
|
|
- Postcopy can run into a paused state (normally on both sides when
|
|
|
|
- happens), where all threads will be temporarily halted mostly due to
|
|
|
|
- network errors. When reaching paused state, migration will make sure
|
|
|
|
- the qemu binary on both sides maintain the data without corrupting
|
|
|
|
- the VM. To continue the migration, the admin needs to fix the
|
|
|
|
- migration channel using the QMP command 'migrate-recover' on the
|
|
|
|
- destination node, then resume the migration using QMP command 'migrate'
|
|
|
|
- again on source node, with resume=true flag set.
|
|
|
|
-
|
|
|
|
- - End
|
|
|
|
-
|
|
|
|
- The listen thread can now quit, and perform the cleanup of migration
|
|
|
|
- state, the migration is now complete.
|
|
|
|
-
|
|
|
|
-Source side page map
|
|
|
|
---------------------
|
|
|
|
-
|
|
|
|
-The 'migration bitmap' in postcopy is basically the same as in the precopy,
|
|
|
|
-where each of the bit to indicate that page is 'dirty' - i.e. needs
|
|
|
|
-sending. During the precopy phase this is updated as the CPU dirties
|
|
|
|
-pages, however during postcopy the CPUs are stopped and nothing should
|
|
|
|
-dirty anything any more. Instead, dirty bits are cleared when the relevant
|
|
|
|
-pages are sent during postcopy.
|
|
|
|
-
|
|
|
|
-Postcopy with hugepages
|
|
|
|
------------------------
|
|
|
|
-
|
|
|
|
-Postcopy now works with hugetlbfs backed memory:
|
|
|
|
-
|
|
|
|
- a) The linux kernel on the destination must support userfault on hugepages.
|
|
|
|
- b) The huge-page configuration on the source and destination VMs must be
|
|
|
|
- identical; i.e. RAMBlocks on both sides must use the same page size.
|
|
|
|
- c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
|
|
|
|
- RAM if it doesn't have enough hugepages, triggering (b) to fail.
|
|
|
|
- Using ``-mem-prealloc`` enforces the allocation using hugepages.
|
|
|
|
- d) Care should be taken with the size of hugepage used; postcopy with 2MB
|
|
|
|
- hugepages works well, however 1GB hugepages are likely to be problematic
|
|
|
|
- since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
|
|
|
|
- and until the full page is transferred the destination thread is blocked.
|
|
|
|
-
|
|
|
|
-Postcopy with shared memory
|
|
|
|
----------------------------
|
|
|
|
-
|
|
|
|
-Postcopy migration with shared memory needs explicit support from the other
|
|
|
|
-processes that share memory and from QEMU. There are restrictions on the type of
|
|
|
|
-memory that userfault can support shared.
|
|
|
|
-
|
|
|
|
-The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs``
|
|
|
|
-(although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)``
|
|
|
|
-for hugetlbfs which may be a problem in some configurations).
|
|
|
|
-
|
|
|
|
-The vhost-user code in QEMU supports clients that have Postcopy support,
|
|
|
|
-and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes
|
|
|
|
-to support postcopy.
|
|
|
|
-
|
|
|
|
-The client needs to open a userfaultfd and register the areas
|
|
|
|
-of memory that it maps with userfault. The client must then pass the
|
|
|
|
-userfaultfd back to QEMU together with a mapping table that allows
|
|
|
|
-fault addresses in the clients address space to be converted back to
|
|
|
|
-RAMBlock/offsets. The client's userfaultfd is added to the postcopy
|
|
|
|
-fault-thread and page requests are made on behalf of the client by QEMU.
|
|
|
|
-QEMU performs 'wake' operations on the client's userfaultfd to allow it
|
|
|
|
-to continue after a page has arrived.
|
|
|
|
-
|
|
|
|
-.. note::
|
|
|
|
- There are two future improvements that would be nice:
|
|
|
|
- a) Some way to make QEMU ignorant of the addresses in the clients
|
|
|
|
- address space
|
|
|
|
- b) Avoiding the need for QEMU to perform ufd-wake calls after the
|
|
|
|
- pages have arrived
|
|
|
|
-
|
|
|
|
-Retro-fitting postcopy to existing clients is possible:
|
|
|
|
- a) A mechanism is needed for the registration with userfault as above,
|
|
|
|
- and the registration needs to be coordinated with the phases of
|
|
|
|
- postcopy. In vhost-user extra messages are added to the existing
|
|
|
|
- control channel.
|
|
|
|
- b) Any thread that can block due to guest memory accesses must be
|
|
|
|
- identified and the implication understood; for example if the
|
|
|
|
- guest memory access is made while holding a lock then all other
|
|
|
|
- threads waiting for that lock will also be blocked.
|
|
|
|
-
|
|
|
|
-Postcopy Preemption Mode
|
|
|
|
-------------------------
|
|
|
|
-
|
|
|
|
-Postcopy preempt is a new capability introduced in 8.0 QEMU release, it
|
|
|
|
-allows urgent pages (those got page fault requested from destination QEMU
|
|
|
|
-explicitly) to be sent in a separate preempt channel, rather than queued in
|
|
|
|
-the background migration channel. Anyone who cares about latencies of page
|
|
|
|
-faults during a postcopy migration should enable this feature. By default,
|
|
|
|
-it's not enabled.
|
|
|
|
-
|
|
|