========
Postcopy
========

.. contents::

'Postcopy' migration is a way to deal with migrations that refuse to converge
(or take too long to converge). Its plus side is that there is an upper bound on
the amount of migration traffic and the time it takes; the down side is that during
the postcopy phase, a failure of *either* side causes the guest to be lost.

In postcopy the destination CPUs are started before all the memory has been
transferred, and accesses to pages that are yet to be transferred cause
a fault that's translated by QEMU into a request to the source QEMU.

Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
doesn't finish in a given time the switch is made to postcopy.

Enabling postcopy
=================

To enable postcopy, issue this command on the monitor (both source and
destination) prior to the start of migration:

``migrate_set_capability postcopy-ram on``

The normal commands are then used to start a migration, which is still
started in precopy mode. Issuing:

``migrate_start_postcopy``

will now cause the transition from precopy to postcopy.
It can be issued immediately after migration is started or any
time later on. Issuing it after the end of a migration is harmless.
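
As a concrete illustration (the destination URI is only an example), the HMP
sequence on the source monitor could look like::

  (qemu) migrate_set_capability postcopy-ram on
  (qemu) migrate -d tcp:destination.example.com:4444
  (qemu) migrate_start_postcopy

with the destination QEMU started with a matching ``-incoming`` option and
``migrate_set_capability postcopy-ram on`` issued on its monitor before the
migration is started.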

Blocktime is a postcopy live migration metric, intended to show how
long the vCPU was in a state of interruptible sleep due to a page fault.
That metric is calculated both for all vCPUs as an overlapped value, and
separately for each vCPU. These values are calculated on the destination
side. To enable postcopy blocktime calculation, enter the following
command on the destination monitor:

``migrate_set_capability postcopy-blocktime on``

Postcopy blocktime can be retrieved with the ``query-migrate`` QMP command.
Its ``postcopy-blocktime`` value shows the overlapped blocking time for all
vCPUs, and ``postcopy-vcpu-blocktime`` shows a list of blocking times per vCPU.
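
When the capability is enabled on the destination, the values show up in the
``query-migrate`` reply there. A trimmed, illustrative exchange (the values are
made up and other fields are omitted)::

  -> { "execute": "query-migrate" }
  <- { "return": {
         "status": "completed",
         "postcopy-blocktime": 223,
         "postcopy-vcpu-blocktime": [ 120, 57, 0, 46 ] } }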

.. note::
  During the postcopy phase, the bandwidth limits set using
  ``migrate_set_parameter`` are ignored (to avoid delaying requested pages that
  the destination is waiting for).

Postcopy internals
==================

State machine
-------------

Postcopy moves through a series of states (see postcopy_state) from
ADVISE->DISCARD->LISTEN->RUNNING->END

 - Advise

    Set at the start of migration if postcopy is enabled, even
    if it hasn't had the start command; here the destination
    checks that its OS has the support needed for postcopy, and performs
    setup to ensure the RAM mappings are suitable for later postcopy.
    The destination will fail early in migration at this point if the
    required OS support is not present.

    (Triggered by reception of POSTCOPY_ADVISE command)

 - Discard

    Entered on receipt of the first 'discard' command; prior to
    the first Discard being performed, hugepages are switched off
    (using madvise) to ensure that no new huge pages are created
    during the postcopy phase, and to cause any huge pages that
    have discards on them to be broken.

 - Listen

    The first command in the package, POSTCOPY_LISTEN, switches
    the destination state to Listen, and starts a new thread
    (the 'listen thread') which takes over the job of receiving
    pages off the migration stream, while the main thread carries
    on processing the blob. With this thread able to process page
    reception, the destination now 'sensitises' the RAM to detect
    any access to missing pages (on Linux using the 'userfault'
    system).

 - Running

    POSTCOPY_RUN causes the destination to synchronise all
    state and start the CPUs and IO devices running. The main
    thread now finishes processing the migration package and
    then carries on as it would for normal precopy migration
    (although it can't do the cleanup it would do as it
    finishes a normal migration).

 - End

    The listen thread can now quit, and perform the cleanup of migration
    state; the migration is now complete.
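
As a rough illustration of the linear progression above (this is not the
actual QEMU definition; the real state variable lives in the migration code
and its identifiers differ), the destination-side states could be modelled
as::

  /* Illustrative only: a linear state machine mirroring the description
   * above.  The names are invented for this sketch. */
  typedef enum {
      POSTCOPY_STATE_NONE,
      POSTCOPY_STATE_ADVISE,
      POSTCOPY_STATE_DISCARD,
      POSTCOPY_STATE_LISTEN,
      POSTCOPY_STATE_RUNNING,
      POSTCOPY_STATE_END,
  } PostcopyStateSketch;

  /* Each state may only advance to the next one in the list. */
  static inline int postcopy_state_can_advance(PostcopyStateSketch from,
                                               PostcopyStateSketch to)
  {
      return to == from + 1;
  }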

Device transfer
---------------

Loading of device data may cause the device emulation to access guest RAM,
which may trigger faults that have to be resolved by the source. As such,
the migration stream has to be able to respond with page data *during* the
device load, and hence the device data has to be read from the stream completely
before the device load begins to free the stream up. This is achieved by
'packaging' the device data into a blob that's read in one go.

Source behaviour
----------------

Until postcopy is entered the migration stream is identical to normal
precopy, except for the addition of a 'postcopy advise' command at
the beginning, to tell the destination that postcopy might happen.
When postcopy starts the source sends the page discard data and then
forms the 'package' containing:

   - Command: 'postcopy listen'

   - The device state

     A series of sections, identical to the precopy stream's device state stream,
     containing everything except postcopiable devices (i.e. RAM)

   - Command: 'postcopy run'

The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
contents are formatted in the same way as the main migration stream.

During postcopy the source scans the list of dirty pages and sends them
to the destination without being requested (in much the same way as precopy);
however, when a page request is received from the destination, the dirty page
scanning restarts from the requested location. This causes requested pages
to be sent quickly, and also causes pages directly after the requested page
to be sent quickly in the hope that those pages are likely to be used
by the destination soon.
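
The effect of a page request on the scan can be shown with a small,
self-contained sketch (this is not QEMU code; the data structures and
function names are invented for the illustration)::

  #include <stddef.h>
  #include <stdint.h>

  #define NPAGES 1024

  static uint8_t dirty[NPAGES];   /* 1 = page still needs sending          */
  static size_t  cursor;          /* next page the background scan checks  */

  /* Called when the destination faults on 'page': restart scanning there,
   * so that page and the ones right after it are sent next. */
  void on_page_request(size_t page)
  {
      cursor = page;
  }

  /* Pick the next dirty page to send, scanning forward from the cursor and
   * wrapping around; returns NPAGES when nothing is left to send. */
  size_t next_page_to_send(void)
  {
      for (size_t i = 0; i < NPAGES; i++) {
          size_t p = (cursor + i) % NPAGES;
          if (dirty[p]) {
              dirty[p] = 0;       /* postcopy: clear as soon as it is sent */
              cursor = p + 1;
              return p;
          }
      }
      return NPAGES;
  }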

Destination behaviour
---------------------

Initially the destination looks the same as precopy, with a single thread
reading the migration stream; the 'postcopy advise' and 'discard' commands
are processed to change the way RAM is managed, but don't affect the stream
processing.

::

  ------------------------------------------------------------------------------
                          1            2 3       4 5                    6    7
  main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE   RUN  )
  thread                             |       |
                                     |     (page request)
                                     |        \___
                                     v            \
  listen thread:                     --- page -- page -- page -- page -- page --
                                                           a   b        c
  ------------------------------------------------------------------------------

- On receipt of ``CMD_PACKAGED`` (1)

  All the data associated with the package - the ( ... ) section in the diagram -
  is read into memory, and the main thread recurses into qemu_loadvm_state_main
  to process the contents of the package (2) which contains commands (3,6) and
  devices (4...)

- On receipt of 'postcopy listen' - 3 - (i.e. the 1st command in the package)

  a new thread (a) is started that takes over servicing the migration stream,
  while the main thread carries on loading the package. It loads normal
  background page data (b) but if during a device load a fault happens (5)
  the returned page (c) is loaded by the listen thread allowing the main
  thread's device load to carry on.

- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)

  letting the destination CPUs start running. At the end of the
  ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
  is no longer used by migration, while the listen thread carries on servicing
  page data until the end of migration.

Source side page bitmap
-----------------------

The 'migration bitmap' in postcopy is basically the same as in precopy,
where each bit indicates that a page is 'dirty' - i.e. needs
sending. During the precopy phase this is updated as the CPU dirties
pages; however, during postcopy the CPUs are stopped and nothing should
dirty anything any more. Instead, dirty bits are cleared when the relevant
pages are sent during postcopy.

Postcopy features
=================

Postcopy recovery
-----------------

Compared to precopy, postcopy is special in its error handling. When an
error happens (in this case, mostly network errors), QEMU cannot easily
fail the migration because VM data resides in both the source and destination
QEMU instances. Instead, when an issue happens, QEMU on both sides goes
into a paused state, and a recovery phase is needed to continue the
paused postcopy migration.

The recovery phase normally contains a few steps:

  - When a network issue occurs, both QEMU instances go into the
    **POSTCOPY_PAUSED** migration state.

  - When the network is recovered (or a new network is provided), the admin
    can set up the new channel for migration using the QMP command
    'migrate-recover' on the destination node, preparing for a resume.

  - On the source host, the admin can continue the interrupted postcopy
    migration using the QMP command 'migrate' with the resume=true flag set
    (see the example after this list). The source QEMU will go into the
    **POSTCOPY_RECOVER_SETUP** state trying to re-establish the channels.

  - When both sides of QEMU successfully reconnect using a new or fixed up
    channel, they will go into the **POSTCOPY_RECOVER** state, and a handshake
    procedure is needed to properly synchronize the VM states between
    the two QEMUs before continuing the postcopy migration. For example, there
    can be pages sent right during the window when the network is
    interrupted; the handshake guarantees that pages lost in-flight
    will be resent.

  - After a proper handshake synchronization, QEMU will continue the
    postcopy migration on both sides and go back to the **POSTCOPY_ACTIVE**
    state. Postcopy migration will continue.
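
As an illustration (URIs and port numbers are made up), the recovery could
look like this over QMP. On the destination::

  -> { "execute": "migrate-recover",
       "arguments": { "uri": "tcp:0:5556" } }
  <- { "return": {} }

and then on the source::

  -> { "execute": "migrate",
       "arguments": { "uri": "tcp:192.168.1.2:5556", "resume": true } }
  <- { "return": {} }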

During a paused postcopy migration, the VM can logically still continue
running, and it will not be impacted by accesses to pages that
were already migrated to the destination VM before the interruption happened.
However, if any of the missing pages is accessed on the destination VM, the VM
thread will be halted waiting for the page to be migrated, which means it can
remain halted until the recovery is complete.

The impact of accessing missing pages can vary with the configuration of the
guest. For example, with async page fault enabled, the guest can logically
schedule out the threads accessing missing pages.

Postcopy with hugepages
-----------------------

Postcopy now works with hugetlbfs backed memory:

  a) The Linux kernel on the destination must support userfault on hugepages.
  b) The huge-page configuration on the source and destination VMs must be
     identical; i.e. RAMBlocks on both sides must use the same page size.
  c) Note that ``-mem-path /dev/hugepages``  will fall back to allocating normal
     RAM if it doesn't have enough hugepages, triggering (b) to fail.
     Using ``-mem-prealloc`` enforces the allocation using hugepages
     (see the example after this list).
  d) Care should be taken with the size of hugepage used; postcopy with 2MB
     hugepages works well, however 1GB hugepages are likely to be problematic
     since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
     and until the full page is transferred the destination thread is blocked.
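
For example (an illustrative command line with other options omitted), both
source and destination could back guest RAM with hugepages like this::

  qemu-system-x86_64 -m 4G -mem-path /dev/hugepages -mem-prealloc ...

With ``-mem-prealloc`` the guest fails to start if the pool does not contain
enough hugepages, rather than silently falling back to normal RAM.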

Postcopy with shared memory
---------------------------

Postcopy migration with shared memory needs explicit support from the other
processes that share memory and from QEMU. There are restrictions on the types
of shared memory that userfault can support.

The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs``
(although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)``
for hugetlbfs, which may be a problem in some configurations).

The vhost-user code in QEMU supports clients that have Postcopy support,
and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes
to support postcopy.

The client needs to open a userfaultfd and register the areas
of memory that it maps with userfault. The client must then pass the
userfaultfd back to QEMU together with a mapping table that allows
fault addresses in the client's address space to be converted back to
RAMBlock/offsets. The client's userfaultfd is added to the postcopy
fault-thread and page requests are made on behalf of the client by QEMU.
QEMU performs 'wake' operations on the client's userfaultfd to allow it
to continue after a page has arrived.
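
A minimal sketch of the client-side registration step, assuming a Linux host
(the hand-over of the file descriptor and the mapping table over the
vhost-user channel is not shown, and the function name is invented for the
example)::

  #include <fcntl.h>
  #include <linux/userfaultfd.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Register one shared memory region with userfault and return the
   * userfaultfd, or -1 on error. */
  int register_region_with_userfault(void *addr, size_t len)
  {
      int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
      if (uffd < 0) {
          return -1;
      }

      struct uffdio_api api = { .api = UFFD_API };
      if (ioctl(uffd, UFFDIO_API, &api) < 0) {
          close(uffd);
          return -1;
      }

      /* Ask the kernel to notify us about accesses to not-yet-present
       * pages in the mapping that backs the guest RAM region. */
      struct uffdio_register reg = {
          .range = { .start = (unsigned long)addr, .len = len },
          .mode  = UFFDIO_REGISTER_MODE_MISSING,
      };
      if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0) {
          close(uffd);
          return -1;
      }

      /* 'uffd' would now be passed to QEMU, which adds it to its postcopy
       * fault thread and later wakes it when pages arrive. */
      return uffd;
  }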

.. note::
  There are two future improvements that would be nice:

  a) Some way to make QEMU ignorant of the addresses in the client's
     address space
  b) Avoiding the need for QEMU to perform ufd-wake calls after the
     pages have arrived

Retro-fitting postcopy to existing clients is possible:

  a) A mechanism is needed for the registration with userfault as above,
     and the registration needs to be coordinated with the phases of
     postcopy. In vhost-user extra messages are added to the existing
     control channel.
  b) Any thread that can block due to guest memory accesses must be
     identified and the implication understood; for example if the
     guest memory access is made while holding a lock then all other
     threads waiting for that lock will also be blocked.

Postcopy preemption mode
------------------------

Postcopy preempt is a capability introduced in the QEMU 8.0 release. It
allows urgent pages (those whose page faults were explicitly requested by the
destination QEMU) to be sent on a separate preempt channel, rather than queued
behind the background migration channel. Anyone who cares about the latency of
page faults during a postcopy migration should enable this feature. By default,
it is not enabled.
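
Like the other postcopy capabilities, it has to be enabled on both sides
before the migration starts, e.g. on the HMP monitor::

  (qemu) migrate_set_capability postcopy-preempt on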