  1. Multi-process QEMU
  2. ===================
  3. .. note::
  4. This is the design document for multi-process QEMU. It does not
  5. necessarily reflect the status of the current implementation, which
  6. may lack features or be considerably different from what is described
  7. in this document. This document is still useful as a description of
  8. the goals and general direction of this feature.
  9. Please refer to the following wiki for latest details:
  10. https://wiki.qemu.org/Features/MultiProcessQEMU
  11. QEMU is often used as the hypervisor for virtual machines running in the
  12. Oracle cloud. Since one of the advantages of cloud computing is the
  13. ability to run many VMs from different tenants in the same cloud
  14. infrastructure, a guest that compromised its hypervisor could
  15. potentially use the hypervisor's access privileges to access data it is
  16. not authorized for.
  17. QEMU can be susceptible to security attacks because it is a large,
  18. monolithic program that provides many features to the VMs it services.
  19. Many of these features can be configured out of QEMU, but even a reduced
  20. configuration QEMU has a large amount of code a guest can potentially
  21. attack. Separating QEMU reduces the attack surface by helping to
  22. limit each component in the system to accessing only the resources that
  23. it needs to perform its job.
  24. QEMU services
  25. -------------
  26. QEMU can be broadly described as providing three main services. One is a
  27. VM control point, where VMs can be created, migrated, re-configured, and
  28. destroyed. A second is to emulate the CPU instructions within the VM,
  29. often accelerated by HW virtualization features such as Intel's VT
  30. extensions. Finally, it provides IO services to the VM by emulating HW
  31. IO devices, such as disk and network devices.
  32. A multi-process QEMU
  33. ~~~~~~~~~~~~~~~~~~~~
  34. A multi-process QEMU involves separating QEMU services into separate
  35. host processes. Each of these processes can be given only the privileges
  36. it needs to provide its service, e.g., a disk service could be given
  37. access only to the disk images it provides, and not be allowed to
  38. access other files, or any network devices. An attacker who compromised
  39. this service would not be able to use this exploit to access files or
  40. devices beyond what the disk service was given access to.
  41. A QEMU control process would remain, but in multi-process mode, would
  42. have no direct interfaces to the VM. During VM execution, it would still
  43. provide the user interface to hot-plug devices or live migrate the VM.
  44. A first step in creating a multi-process QEMU is to separate IO services
  45. from the main QEMU program, which would continue to provide CPU
  46. emulation; i.e., the control process would also be the CPU emulation
  47. process. In a later phase, CPU emulation could be separated from the
  48. control process.
  49. Separating IO services
  50. ----------------------
  51. Separating IO services into individual host processes is a good place to
  52. begin for a couple of reasons. One is that the sheer number of IO devices
  53. QEMU can emulate provides a large surface of interfaces which could potentially
  54. be exploited and which, indeed, have been a source of exploits in the past.
  55. Another is that the modular nature of QEMU device emulation code provides
  56. interface points where the QEMU functions that perform device emulation
  57. can be separated from the QEMU functions that manage the emulation of
  58. guest CPU instructions. The devices emulated in the separate process are
  59. referred to as remote devices.
  60. QEMU device emulation
  61. ~~~~~~~~~~~~~~~~~~~~~
  62. QEMU uses an object-oriented SW architecture for device emulation code.
  63. Configured objects are all compiled into the QEMU binary, then objects
  64. are instantiated by name when used by the guest VM. For example, the
  65. code to emulate a device named "foo" is always present in QEMU, but its
  66. instantiation code is only run when the device is included in the target
  67. VM (e.g., via the QEMU command line as *-device foo*).
  68. The object model is hierarchical, so device emulation code names its
  69. parent object (such as "pci-device" for a PCI device) and QEMU will
  70. instantiate a parent object before calling the device's instantiation
  71. code.
  72. Current separation models
  73. ~~~~~~~~~~~~~~~~~~~~~~~~~
  74. In order to separate the device emulation code from the CPU emulation
  75. code, the device object code must run in a different process. There are
  76. a couple of existing QEMU features that can run emulation code
  77. separately from the main QEMU process. These are examined below.
  78. vhost user model
  79. ^^^^^^^^^^^^^^^^
  80. Virtio guest device drivers can be connected to vhost user applications
  81. in order to perform their IO operations. This model uses special virtio
  82. device drivers in the guest and vhost user device objects in QEMU, but
  83. once the QEMU vhost user code has configured the vhost user application,
  84. mission-mode IO is performed by the application. The vhost user
  85. application is a daemon process that can be contacted via a known UNIX
  86. domain socket.
  87. vhost socket
  88. ''''''''''''
  89. As mentioned above, one of the tasks of the vhost device object within
  90. QEMU is to contact the vhost application and send it configuration
  91. information about this device instance. As part of the configuration
  92. process, the application can also be sent other file descriptors over
  93. the socket, which then can be used by the vhost user application in
  94. various ways, some of which are described below.
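The descriptor passing described above relies on ``SCM_RIGHTS`` ancillary messages on a UNIX domain socket. The following is a minimal sketch of that mechanism only; the helper names ``send_fd()`` and ``recv_fd()`` are hypothetical and are not taken from QEMU or from any vhost user library.

```c
#define _DEFAULT_SOURCE          /* expose CMSG_* macros under strict -std= modes (glibc) */
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer large enough for exactly one descriptor. The union
 * guarantees the alignment that struct cmsghdr requires. */
union fd_ctrl {
    struct cmsghdr hdr;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send one file descriptor over a connected UNIX domain socket.
 * Returns 0 on success, -1 on error. (Hypothetical helper.) */
static int send_fd(int sock, int fd)
{
    char byte = 0;               /* at least one data byte must accompany the fd */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor. Returns the new descriptor, or -1 on error.
 * (Hypothetical helper.) */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The received descriptor is a new number in the receiving process that refers to the same open file, which is what allows an eventfd or a guest memory file to be shared across processes.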
  95. vhost MMIO store acceleration
  96. '''''''''''''''''''''''''''''
  97. VMs are often run using HW virtualization features via the KVM kernel
  98. driver. This driver allows QEMU to accelerate the emulation of guest CPU
  99. instructions by running the guest in a virtual HW mode. When the guest
  100. executes instructions that cannot be executed by virtual HW mode,
  101. execution returns to the KVM driver so it can inform QEMU to emulate the
  102. instructions in SW.
  103. One of the events that can cause a return to QEMU is when a guest device
  104. driver accesses an IO location. QEMU then dispatches the memory
  105. operation to the corresponding QEMU device object. In the case of a
  106. vhost user device, the memory operation would need to be sent over a
  107. socket to the vhost application. This path is accelerated by the QEMU
  108. virtio code by setting up an eventfd file descriptor through which the vhost
  109. application can directly receive MMIO store notifications from the KVM
  110. driver, instead of needing them to be sent to the QEMU process first.
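As a rough illustration of the notification primitive involved, the sketch below uses a plain Linux eventfd from user space. It does not involve KVM at all: ``notify_store()`` merely stands in for the kernel signalling a guest store, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Signal the eventfd once. In the real path KVM does this when the
 * guest stores to the registered address; here it is a stand-in. */
static int notify_store(int efd)
{
    uint64_t one = 1;

    return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/* Consume pending notifications. Returns how many signals were
 * accumulated since the last read, or -1 on error (including EAGAIN
 * on a non-blocking eventfd with nothing pending). */
static int64_t wait_for_stores(int efd)
{
    uint64_t count;

    if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count)) {
        return -1;
    }
    return (int64_t)count;
}
```

Note that the reader only learns how many times the eventfd was signalled. No address or store data travels with the event, so the receiver must already know what a signal on a given descriptor means.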
  111. vhost interrupt acceleration
  112. ''''''''''''''''''''''''''''
  113. Another optimization used by the vhost application is the ability to
  114. directly inject interrupts into the VM via the KVM driver, again,
  115. bypassing the need to send the interrupt back to the QEMU process first.
  116. The QEMU virtio setup code configures the KVM driver with an eventfd
  117. that triggers the device interrupt in the guest when the eventfd is
  118. written. This irqfd file descriptor is then passed to the vhost user
  119. application program.
  120. vhost access to guest memory
  121. ''''''''''''''''''''''''''''
  122. The vhost application is also allowed to directly access guest memory,
  123. instead of needing to send the data as messages to QEMU. This is also
  124. done with file descriptors sent to the vhost user application by QEMU.
  125. These descriptors can be passed to ``mmap()`` by the vhost application
  126. to map the guest address space into the vhost application.
  127. IOMMUs introduce another level of complexity, since the address given to
  128. the guest virtio device to DMA to or from is not a guest physical
  129. address. This case is handled by having vhost code within QEMU register
  130. as a listener for IOMMU mapping changes. The vhost application maintains
  131. a cache of IOMMU translations: sending translation requests back to
  132. QEMU on cache misses, and in turn receiving flush requests from QEMU
  133. when mappings are purged.
  134. applicability to device separation
  135. ''''''''''''''''''''''''''''''''''
  136. Much of the vhost model can be re-used by separated device emulation. In
  137. particular, the ideas of using a socket between QEMU and the device
  138. emulation application, using a file descriptor to inject interrupts into
  139. the VM via KVM, and allowing the application to ``mmap()`` guest
  140. memory should be reused.
  141. There are, however, some notable differences between how a vhost
  142. application works and the needs of separated device emulation. The most
  143. basic is that vhost uses custom virtio device drivers which always
  144. trigger IO with MMIO stores. A separated device emulation model must
  145. work with existing IO device models and guest device drivers. MMIO loads
  146. break vhost store acceleration since they are synchronous: guest
  147. progress cannot continue until the load has been emulated. By contrast,
  148. stores are asynchronous: the guest can continue after the store event
  149. has been sent to the vhost application.
  150. Another difference is that in the vhost user model, a single daemon can
  151. support multiple QEMU instances. This is contrary to the security regime
  152. desired, in which the emulation application should only be allowed to
  153. access the files or devices the VM it's running on behalf of can access.
  154. qemu-io model
  ^^^^^^^^^^^^^
  155. ``qemu-io`` is a test harness used to test changes to the QEMU block backend
  156. object code (e.g., the code that implements disk images for disk driver
  157. emulation). ``qemu-io`` is not a device emulation application per se, but it
  158. does compile the QEMU block objects into a separate binary from the main
  159. QEMU one. This could be useful for disk device emulation, since its
  160. emulation applications will need to include the QEMU block objects.
  161. New separation model based on proxy objects
  162. -------------------------------------------
  163. A different model based on proxy objects in the QEMU program
  164. communicating with remote emulation programs could provide separation
  165. while minimizing the changes needed to the device emulation code. The
  166. rest of this section is a discussion of how a proxy object model would
  167. work.
  168. Remote emulation processes
  169. ~~~~~~~~~~~~~~~~~~~~~~~~~~
  170. The remote emulation process will run the QEMU object hierarchy without
  171. modification. The device emulation objects will also be based on the
  172. QEMU code, because for anything but the simplest device, it would not be
  173. tractable to re-implement both the object model and the many device
  174. backends that QEMU has.
  175. The processes will communicate with the QEMU process over UNIX domain
  176. sockets. The processes can be executed either as standalone processes,
  177. or be executed by QEMU. In both cases, the host backends an emulation
  178. process will provide are specified on its command line, as they would
  179. be for QEMU. For example:
  180. ::
  181. disk-proc -blockdev driver=file,node-name=file0,filename=disk-file0 \
  182. -blockdev driver=qcow2,node-name=drive0,file=file0
  183. would indicate process *disk-proc* uses a qcow2 emulated disk named
  184. *drive0*, stored in the file node *file0*, as its backend.
  185. Emulation processes may emulate more than one guest controller. A common
  186. configuration might be to put all controllers of the same device class
  187. (e.g., disk, network, etc.) in a single process, so that all backends of
  188. the same type can be managed by a single QMP monitor.
  189. communication with QEMU
  190. ^^^^^^^^^^^^^^^^^^^^^^^
  191. The first argument to the remote emulation process will be a Unix domain
  192. socket that connects with the Proxy object. This is a required argument.
  193. ::
  194. disk-proc <socket number> <backend list>
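A minimal sketch of how an emulation process might validate the ``<socket number>`` argument follows. It assumes the argument is the decimal number of a descriptor inherited from the parent process; the function name ``parse_socket_fd()`` is hypothetical and the document does not specify any particular validation.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Parse a decimal descriptor number and check that it refers to an
 * open socket. Returns the descriptor, or -1 on any error.
 * (Hypothetical helper, not part of QEMU.) */
static int parse_socket_fd(const char *arg)
{
    char *end;
    long val;
    int type;
    socklen_t len = sizeof(type);

    errno = 0;
    val = strtol(arg, &end, 10);
    if (errno != 0 || end == arg || *end != '\0' ||
        val < 0 || val > INT_MAX) {
        fprintf(stderr, "invalid socket number '%s'\n", arg);
        return -1;
    }

    /* SO_TYPE succeeds only on an open descriptor that is a socket. */
    if (getsockopt((int)val, SOL_SOCKET, SO_TYPE, &type, &len) != 0) {
        fprintf(stderr, "descriptor %ld is not an open socket\n", val);
        return -1;
    }
    return (int)val;
}
```

Rejecting trailing characters and non-socket descriptors early keeps a misconfigured launch from failing later in a less obvious place.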
  195. remote process QMP monitor
  196. ^^^^^^^^^^^^^^^^^^^^^^^^^^
  197. Remote emulation processes can be monitored via QMP, similar to QEMU
  198. itself. The QMP monitor socket is specified the same way as for a QEMU
  199. process:
  200. ::
  201. disk-proc -qmp unix:/tmp/disk-mon,server
  202. can be monitored over the UNIX socket path */tmp/disk-mon*.
  203. QEMU command line
  204. ~~~~~~~~~~~~~~~~~
  205. Each remote device emulated in a remote process on the host is
  206. represented as a *-device* of type *pci-proxy-dev*. A socket
  207. sub-option to this option specifies the Unix socket that connects
  208. to the remote process. An *id* sub-option is required, and it should
  209. be the same id as used in the remote process.
  210. ::
  211. qemu-system-x86_64 ... -device pci-proxy-dev,id=lsi0,socket=3
  212. can be used to add a device emulated in a remote process
  213. QEMU management of remote processes
  214. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  215. QEMU is not aware of the type of the remote PCI device. It is
  216. a pass-through device as far as QEMU is concerned.
  217. communication with emulation process
  218. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  219. primary channel
  220. '''''''''''''''
  221. The primary channel (referred to as com in the code) is used to bootstrap
  222. the remote process. It is also used to pass on device-agnostic commands
  223. like reset.
  224. per-device channels
  225. '''''''''''''''''''
  226. Each remote device communicates with QEMU using a dedicated communication
  227. channel. The proxy object sets up this channel using the primary
  228. channel during its initialization.
  229. QEMU device proxy objects
  230. ~~~~~~~~~~~~~~~~~~~~~~~~~
  231. QEMU has an object model based on sub-classes inherited from the
  232. "object" super-class. The sub-classes that are of interest here are the
  233. "device" and "bus" sub-classes whose child sub-classes make up the
  234. device tree of a QEMU emulated system.
  235. The proxy object model will use device proxy objects to replace the
  236. device emulation code within the QEMU process. These objects will live
  237. in the same place in the object and bus hierarchies as the objects they
  238. replace. i.e., the proxy object for an LSI SCSI controller will be a
  239. sub-class of the "pci-device" class, and will have the same PCI bus
  240. parent and the same SCSI bus child objects as the LSI controller object
  241. it replaces.
  242. It is worth noting that the same proxy object is used to mediate with
  243. all types of remote PCI devices.
  244. object initialization
  245. ^^^^^^^^^^^^^^^^^^^^^
  246. The Proxy device objects are initialized in the exact same manner in
  247. which any other QEMU device would be initialized.
  248. In addition, the Proxy objects perform the following two tasks:
  249. - Parses the "socket" sub option and connects to the remote process
  250. using this channel
  251. - Uses the "id" sub-option to connect to the emulated device on the
  252. separate process
  253. class\_init
  254. '''''''''''
  255. The ``class_init()`` method of a proxy object will, in general, behave
  256. similarly to the object it replaces, including setting any static
  257. properties and methods needed by the proxy.
  258. instance\_init / realize
  259. ''''''''''''''''''''''''
  260. The ``instance_init()`` and ``realize()`` functions would only need to
  261. perform tasks related to being a proxy, such as registering its own
  262. MMIO handlers, or creating a child bus that other proxy devices can be
  263. attached to later.
  264. Other tasks will be device-specific. For example, PCI device objects
  265. will initialize the PCI config space in order to make a valid PCI device
  266. tree within the QEMU process.
  267. address space registration
  268. ^^^^^^^^^^^^^^^^^^^^^^^^^^
  269. Most devices are driven by guest device driver accesses to IO addresses
  270. or ports. The QEMU device emulation code uses QEMU's memory region
  271. function calls (such as ``memory_region_init_io()``) to add callback
  272. functions that QEMU will invoke when the guest accesses the device's
  273. areas of the IO address space. When a guest driver does access the
  274. device, the VM will exit HW virtualization mode and return to QEMU,
  275. which will then look up and execute the corresponding callback function.
  276. A proxy object would need to mirror the memory region calls the actual
  277. device emulator would perform in its initialization code, but with its
  278. own callbacks. When invoked by QEMU as a result of a guest IO operation,
  279. they will forward the operation to the device emulation process.
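The forwarding step can be sketched as a small request/reply exchange over the device's socket. Everything in this example is an assumption made for illustration: the document does not define a wire format, and ``struct mmio_msg``, the ``proxy_*`` functions and ``remote_service_one()`` are hypothetical names. The sketch also assumes each small message arrives in a single ``read()``; a real implementation would loop on short reads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>   /* callers create the channel with socketpair() */
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wire format; not defined by the design document. */
enum mmio_op { MMIO_LOAD = 1, MMIO_STORE = 2, MMIO_LOAD_REPLY = 3 };

struct mmio_msg {
    uint32_t op;      /* enum mmio_op */
    uint32_t size;    /* access width in bytes */
    uint64_t offset;  /* offset within the device's region */
    uint64_t value;   /* store data, or load result in a reply */
};

static int msg_send(int sock, const struct mmio_msg *m)
{
    return write(sock, m, sizeof(*m)) == (ssize_t)sizeof(*m) ? 0 : -1;
}

static int msg_recv(int sock, struct mmio_msg *m)
{
    return read(sock, m, sizeof(*m)) == (ssize_t)sizeof(*m) ? 0 : -1;
}

/* Proxy side: what a store callback would do. No reply is awaited. */
static int proxy_store(int sock, uint64_t offset, uint64_t value, uint32_t size)
{
    struct mmio_msg m = { MMIO_STORE, size, offset, value };

    return msg_send(sock, &m);
}

/* Proxy side: a load is split into request and reply so the two halves
 * can be shown separately; a load callback would call both in sequence
 * and block in between. */
static int proxy_load_request(int sock, uint64_t offset, uint32_t size)
{
    struct mmio_msg m = { MMIO_LOAD, size, offset, 0 };

    return msg_send(sock, &m);
}

static int proxy_load_reply(int sock, uint64_t *value)
{
    struct mmio_msg m;

    if (msg_recv(sock, &m) != 0 || m.op != MMIO_LOAD_REPLY) {
        return -1;
    }
    *value = m.value;
    return 0;
}

/* Emulation side: service one request against a toy register file of
 * 8-byte registers, standing in for the real device model. */
static int remote_service_one(int sock, uint64_t *regs, size_t nregs)
{
    struct mmio_msg m;

    if (msg_recv(sock, &m) != 0 ||
        m.offset % 8 != 0 || m.offset / 8 >= nregs) {
        return -1;
    }
    if (m.op == MMIO_STORE) {
        regs[m.offset / 8] = m.value;
        return 0;
    }
    if (m.op == MMIO_LOAD) {
        m.op = MMIO_LOAD_REPLY;
        m.value = regs[m.offset / 8];
        return msg_send(sock, &m);
    }
    return -1;
}
```

The asymmetry between the store and load paths mirrors the point made earlier about vhost: a store can be sent and forgotten, while a load cannot complete until the emulation process has answered.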
  280. PCI config space
  281. ^^^^^^^^^^^^^^^^
  282. PCI devices also have a configuration space that can be accessed by the
  283. guest driver. Guest accesses to this space are not handled by the device
  284. emulation object, but by its PCI parent object. Much of this space is
  285. read-only, but certain registers (especially BAR and MSI-related ones)
  286. need to be propagated to the emulation process.
  287. PCI parent proxy
  288. ''''''''''''''''
  289. One way to propagate guest PCI config accesses is to create a
  290. "pci-device-proxy" class that can serve as the parent of a PCI device
  291. proxy object. This class's parent would be "pci-device" and it would
  292. override the PCI parent's ``config_read()`` and ``config_write()``
  293. methods with ones that forward these operations to the emulation
  294. program.
  295. interrupt receipt
  296. ^^^^^^^^^^^^^^^^^
  297. A proxy for a device that generates interrupts will need to create a
  298. socket to receive interrupt indications from the emulation process. An
  299. incoming interrupt indication would then be sent up to its bus parent to
  300. be injected into the guest. For example, a PCI device object may use
  301. ``pci_set_irq()``.
  302. live migration
  303. ^^^^^^^^^^^^^^
  304. The proxy will register to save and restore any *vmstate* it needs over
  305. a live migration event. The device proxy does not need to manage the
  306. remote device's *vmstate*; that will be handled by the remote process
  307. proxy (see below).
  308. QEMU remote device operation
  309. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  310. Generic device operations, such as DMA, will be performed by the remote
  311. process proxy by sending messages to the remote process.
  312. DMA operations
  313. ^^^^^^^^^^^^^^
  314. DMA operations would be handled much like vhost applications do. One of
  315. the initial messages sent to the emulation process is a guest memory
  316. table. Each entry in this table consists of a file descriptor and size
  317. that the emulation process can ``mmap()`` to directly access guest
  318. memory, similar to ``vhost_user_set_mem_table()``. Note guest memory
  319. must be backed by shared file-backed memory, for example, using
  320. *-object memory-backend-file,share=on* and setting that memory backend
  321. as RAM for the machine.
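A minimal sketch of how an emulation process might consume such a table is shown below. The document only says that each entry carries a file descriptor and a size; the ``gpa`` field, the structure name and all function names here are assumptions added so that the lookup can be demonstrated. ``make_backing_fd()`` creates an unlinked temporary file as a stand-in for the shared memory backend.

```c
#define _DEFAULT_SOURCE          /* expose mkstemp()/ftruncate() under strict -std= modes (glibc) */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* One guest memory table entry (hypothetical layout). */
struct guest_mem_entry {
    int fd;          /* descriptor received from QEMU */
    uint64_t size;   /* length of the region in bytes */
    uint64_t gpa;    /* assumed: guest physical base address of the region */
    void *host;      /* local mapping, filled in by guest_mem_map() */
};

/* Stand-in for a shared, file-backed guest RAM region. */
static int make_backing_fd(size_t size)
{
    char path[] = "/tmp/guest-ram-XXXXXX";
    int fd = mkstemp(path);

    if (fd < 0) {
        return -1;
    }
    unlink(path);                /* the open descriptor keeps the file alive */
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Map one entry. MAP_SHARED is essential so that writes are visible to
 * every process that maps the same file. */
static int guest_mem_map(struct guest_mem_entry *e)
{
    void *p = mmap(NULL, (size_t)e->size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, e->fd, 0);

    if (p == MAP_FAILED) {
        return -1;
    }
    e->host = p;
    return 0;
}

/* Translate a guest physical range to a local pointer, or NULL if the
 * range is not fully contained in one entry. */
static void *gpa_to_host(const struct guest_mem_entry *tab, size_t n,
                         uint64_t gpa, uint64_t len)
{
    for (size_t i = 0; i < n; i++) {
        const struct guest_mem_entry *e = &tab[i];

        if (gpa >= e->gpa && len <= e->size &&
            gpa - e->gpa <= e->size - len) {
            return (char *)e->host + (gpa - e->gpa);
        }
    }
    return NULL;
}
```

Mapping the same descriptor twice in this sketch plays the role of the two processes: a write through one mapping is immediately visible through the other, which is the property DMA into guest memory depends on.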
  322. IOMMU operations
  323. ^^^^^^^^^^^^^^^^
  324. When the emulated system includes an IOMMU, the remote process proxy in
  325. QEMU will need to create a socket for IOMMU requests from the emulation
  326. process. It will handle those requests with an
  327. ``address_space_get_iotlb_entry()`` call. In order to handle IOMMU
  328. unmaps, the remote process proxy will also register as a listener on the
  329. device's DMA address space. When an IOMMU memory region is created
  330. within the DMA address space, an IOMMU notifier for unmaps will be added
  331. to the memory region that will forward unmaps to the emulation process
  332. over the IOMMU socket.
  333. device hot-plug via QMP
  334. ^^^^^^^^^^^^^^^^^^^^^^^
  335. A QMP "device\_add" command can add a device emulated by a remote
  336. process. The command will also have a "rid" option, just as the
  337. *-device* command line option does. The remote process may either be one
  338. started at QEMU startup, or be one added by the "add-process" QMP
  339. command described above. In either case, the remote process proxy will
  340. forward the new device's JSON description to the corresponding emulation
  341. process.
  342. live migration
  343. ^^^^^^^^^^^^^^
  344. The remote process proxy will also register for live migration
  345. notifications with ``vmstate_register()``. When called to save state,
  346. the proxy will send the remote process a secondary socket file
  347. descriptor to save the remote process's device *vmstate* over. The
  348. incoming byte stream length and data will be saved as the proxy's
  349. *vmstate*. When the proxy is resumed on its new host, this *vmstate*
  350. will be extracted, and a secondary socket file descriptor will be sent
  351. to the new remote process through which it receives the *vmstate* in
  352. order to restore the devices there.
  353. device emulation in remote process
  354. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  355. The parts of QEMU that the emulation program will need include the
  356. object model; the memory emulation objects; the device emulation objects
  357. of the targeted device, and any dependent devices; and, the device's
  358. backends. It will also need code to set up the machine environment,
  359. handle requests from the QEMU process, and route machine-level requests
  360. (such as interrupts or IOMMU mappings) back to the QEMU process.
  361. initialization
  362. ^^^^^^^^^^^^^^
  363. The process initialization sequence will follow the same sequence
  364. followed by QEMU. It will first initialize the backend objects, then
  365. device emulation objects. The JSON descriptions sent by the QEMU process
  366. will drive which objects need to be created.
  367. - address spaces
  368. Before the device objects are created, the initial address spaces and
  369. memory regions must be configured with ``memory_map_init()``. This
  370. creates a RAM memory region object (*system\_memory*) and an IO memory
  371. region object (*system\_io*).
  372. - RAM
  373. RAM memory region creation will follow how ``pc_memory_init()`` creates
  374. them, but must use ``memory_region_init_ram_from_fd()`` instead of
  375. ``memory_region_allocate_system_memory()``. The file descriptors needed
  376. will be supplied by the guest memory table from above. Those RAM regions
  377. would then be added to the *system\_memory* memory region with
  378. ``memory_region_add_subregion()``.
  379. - PCI
  380. IO initialization will be driven by the JSON descriptions sent from the
  381. QEMU process. For a PCI device, a PCI bus will need to be created with
  382. ``pci_root_bus_new()``, and a PCI memory region will need to be created
  383. and added to the *system\_memory* memory region with
  384. ``memory_region_add_subregion_overlap()``. The overlap version is
  385. required for architectures where PCI memory overlaps with RAM memory.
  386. MMIO handling
  387. ^^^^^^^^^^^^^
  388. The device emulation objects will use ``memory_region_init_io()`` to
  389. install their MMIO handlers, and ``pci_register_bar()`` to associate
  390. those handlers with a PCI BAR, as they do within QEMU currently.
  391. In order to use ``address_space_rw()`` in the emulation process to
  392. handle MMIO requests from QEMU, the PCI physical addresses must be the
  393. same in the QEMU process and the device emulation process. In order to
  394. accomplish that, guest BAR programming must also be forwarded from QEMU
  395. to the emulation process.
  396. interrupt injection
  397. ^^^^^^^^^^^^^^^^^^^
  398. When device emulation wants to inject an interrupt into the VM, the
  399. request climbs the device's bus object hierarchy until the point where a
  400. bus object knows how to signal the interrupt to the guest. The details
  401. depend on the type of interrupt being raised.
  402. - PCI pin interrupts
  403. On x86 systems, there is an emulated IOAPIC object attached to the root
  404. PCI bus object, and the root PCI object forwards interrupt requests to
  405. it. The IOAPIC object, in turn, calls the KVM driver to inject the
  406. corresponding interrupt into the VM. The simplest way to handle this in
  407. an emulation process would be to set up the root PCI bus driver (via
  408. ``pci_bus_irqs()``) to send an interrupt request back to the QEMU
  409. process, and have the device proxy object reflect it up the PCI tree
  410. there.
  411. - PCI MSI/X interrupts
  412. PCI MSI/X interrupts are implemented in HW as DMA writes to a
  413. CPU-specific PCI address. In QEMU on x86, a KVM APIC object receives
  414. these DMA writes, then calls into the KVM driver to inject the interrupt
  415. into the VM. A simple emulation process implementation would be to send
  416. the MSI DMA address from QEMU as a message at initialization, then
  417. install an address space handler at that address which forwards the MSI
  418. message back to QEMU.
  419. DMA operations
  420. ^^^^^^^^^^^^^^
  421. When an emulation object wants to DMA into or out of guest memory, it
  422. first must use ``dma_memory_map()`` to convert the DMA address to a local
  423. virtual address. The emulation process memory region objects set up above
  424. will be used to translate the DMA address to a local virtual address the
  425. device emulation code can access.
  426. IOMMU
  427. ^^^^^
  428. When an IOMMU is in use in QEMU, DMA translation uses IOMMU memory
  429. regions to translate the DMA address to a guest physical address before
  430. that physical address can be translated to a local virtual address. The
  431. emulation process will need similar functionality.
  432. - IOTLB cache
  433. The emulation process will maintain a cache of recent IOMMU translations
  434. (the IOTLB). When the ``translate()`` callback of an IOMMU memory region is
  435. invoked, the IOTLB cache will be searched for an entry that will map the
  436. DMA address to a guest PA. On a cache miss, a message will be sent back
  437. to QEMU requesting the corresponding translation entry, which will both be
  438. used to return a guest address and be added to the cache.
  439. - IOTLB purge
  440. The IOMMU emulation will also need to act on unmap requests from QEMU.
  441. These happen when the guest IOMMU driver purges an entry from the
  442. guest's translation table.
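The lookup, miss and purge behaviour described above can be sketched as a small direct-mapped cache. This is only an illustration under stated assumptions: a fixed 4 KiB page size, a callback standing in for the message sent to QEMU on a miss, and hypothetical names throughout (``struct iotlb``, ``iotlb_translate()``, ``iotlb_unmap()``, ``demo_miss()``).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IOTLB_SLOTS 64
#define IOTLB_PAGE_SHIFT 12      /* assumed 4 KiB translation granule */

struct iotlb_entry {
    bool valid;
    uint64_t iova_page;          /* DMA (IO virtual) page number */
    uint64_t gpa_page;           /* guest physical page number */
};

struct iotlb {
    struct iotlb_entry slot[IOTLB_SLOTS];
};

/* Called on a cache miss; stands in for the request sent back to QEMU. */
typedef bool (*iotlb_miss_fn)(uint64_t iova_page, uint64_t *gpa_page,
                              void *opaque);

/* Translate a DMA address to a guest physical address, filling the
 * cache from the miss handler when needed. */
static bool iotlb_translate(struct iotlb *tlb, uint64_t iova, uint64_t *gpa,
                            iotlb_miss_fn miss, void *opaque)
{
    uint64_t page = iova >> IOTLB_PAGE_SHIFT;
    struct iotlb_entry *e = &tlb->slot[page % IOTLB_SLOTS];

    if (!e->valid || e->iova_page != page) {
        uint64_t gpa_page;

        if (!miss(page, &gpa_page, opaque)) {
            return false;        /* no mapping exists for this address */
        }
        e->valid = true;
        e->iova_page = page;
        e->gpa_page = gpa_page;
    }
    *gpa = (e->gpa_page << IOTLB_PAGE_SHIFT) |
           (iova & ((1ULL << IOTLB_PAGE_SHIFT) - 1));
    return true;
}

/* Drop every cached translation that overlaps [iova, iova + len). */
static void iotlb_unmap(struct iotlb *tlb, uint64_t iova, uint64_t len)
{
    uint64_t first, last;

    if (len == 0) {
        return;
    }
    first = iova >> IOTLB_PAGE_SHIFT;
    last = (iova + len - 1) >> IOTLB_PAGE_SHIFT;
    for (int i = 0; i < IOTLB_SLOTS; i++) {
        struct iotlb_entry *e = &tlb->slot[i];

        if (e->valid && e->iova_page >= first && e->iova_page <= last) {
            e->valid = false;
        }
    }
}

/* Demo miss handler: counts calls through opaque and maps each page to
 * page + 0x100. A real handler would exchange messages with QEMU. */
static bool demo_miss(uint64_t iova_page, uint64_t *gpa_page, void *opaque)
{
    (*(int *)opaque)++;
    *gpa_page = iova_page + 0x100;
    return true;
}
```

The purge path matters for correctness rather than speed: a stale entry left in the cache after the guest unmaps a page would let the device keep accessing memory the guest has revoked.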
  443. live migration
  444. ^^^^^^^^^^^^^^
  445. When a remote process receives a live migration indication from QEMU, it
  446. will set up a channel using the received file descriptor with
  447. ``qio_channel_socket_new_fd()``. This channel will be used to create a
  448. *QEMUfile* that can be passed to ``qemu_save_device_state()`` to send
  449. the process's device state back to QEMU. This method will be reversed on
  450. restore - the channel will be passed to ``qemu_loadvm_state()`` to
  451. restore the device state.
  452. Accelerating device emulation
  453. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  454. The messages that are required to be sent between QEMU and the emulation
  455. process can add considerable latency to IO operations. The optimizations
  456. described below attempt to ameliorate this effect by allowing the
  457. emulation process to communicate directly with the kernel KVM driver.
  458. The KVM file descriptors created would be passed to the emulation process
  459. via initialization messages, much as the guest memory table is.
  460. MMIO acceleration
  ^^^^^^^^^^^^^^^^^
  461. Vhost user applications can receive guest virtio driver stores directly
  462. from KVM. The issue with the eventfd mechanism used by vhost user is
  463. that it does not pass any data with the event indication, so it cannot
  464. handle guest loads or guest stores that carry store data. This concept
  465. could, however, be expanded to cover more cases.
  466. The expanded idea would require a new type of KVM device:
  467. *KVM\_DEV\_TYPE\_USER*. This device has two file descriptors: a master
  468. descriptor that QEMU can use for configuration, and a slave descriptor
  469. that the emulation process can use to receive MMIO notifications. QEMU
  470. would create both descriptors using the KVM driver, and pass the slave
  471. descriptor to the emulation process via an initialization message.
  472. data structures
  473. ^^^^^^^^^^^^^^^
  474. - guest physical range
  475. The guest physical range structure describes the address range that a
  476. device will respond to. It includes the base and length of the range, as
  477. well as which bus the range resides on (e.g., on an x86 machine, it can
  478. specify whether the range refers to memory or IO addresses).
  479. A device can have multiple physical address ranges it responds to (e.g.,
  480. a PCI device can have multiple BARs), so the structure will also include
  481. an enumerated identifier to specify which of the device's ranges is
  482. being referred to.
  483. +--------+----------------------------+
  484. | Name | Description |
  485. +========+============================+
  486. | addr | range base address |
  487. +--------+----------------------------+
  488. | len | range length |
  489. +--------+----------------------------+
  490. | bus | addr type (memory or IO) |
  491. +--------+----------------------------+
  492. | id | range ID (e.g., PCI BAR) |
  493. +--------+----------------------------+
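The table above could map onto a C structure along the following lines.
This is only a sketch: the names ``user_pa_range``, ``user_bus`` and
``user_pa_range_match()`` are hypothetical and are not part of any
existing KVM interface. The helper shows how an access could be matched
against a range and converted to an offset within it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bus identifiers; the text only says "memory or IO". */
enum user_bus { USER_BUS_MEMORY, USER_BUS_IO };

/* One address range a device responds to (fields follow the table). */
struct user_pa_range {
    uint64_t addr;      /* range base address */
    uint64_t len;       /* range length in bytes */
    enum user_bus bus;  /* address type: memory or IO */
    uint32_t id;        /* range ID, e.g. the PCI BAR number */
};

/*
 * Return true and set *offset if [gpa, gpa + size) lies entirely inside
 * the range and on the same bus. Written to avoid overflow in addr + len.
 */
static bool user_pa_range_match(const struct user_pa_range *r,
                                enum user_bus bus, uint64_t gpa,
                                uint64_t size, uint64_t *offset)
{
    assert(r != NULL && offset != NULL);
    if (bus != r->bus || gpa < r->addr) {
        return false;
    }
    uint64_t off = gpa - r->addr;
    if (off >= r->len || size > r->len - off) {
        return false;
    }
    *offset = off;
    return true;
}
```

The offset produced here corresponds to the *offset* field of the MMIO
request structure, and the range's *id* to its *rid* field.
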
- MMIO request structure

This structure describes an MMIO operation. It includes which guest
physical range the MMIO was within, the offset within that range, the
MMIO type (e.g., load or store), and its length and data. It also
includes a sequence number that can be used to reply to the MMIO, and
the CPU that issued the MMIO.

+----------+------------------------+
| Name     | Description            |
+==========+========================+
| rid      | range MMIO is within   |
+----------+------------------------+
| offset   | offset within *rid*    |
+----------+------------------------+
| type     | e.g., load or store    |
+----------+------------------------+
| len      | MMIO length            |
+----------+------------------------+
| data     | store data             |
+----------+------------------------+
| seq      | sequence ID            |
+----------+------------------------+

- MMIO request queues

MMIO request queues are FIFO arrays of MMIO request structures. There
are two queues: the pending queue is for MMIOs that haven't been read by
the emulation program, and the sent queue is for MMIOs that haven't been
acknowledged. The main use of the second queue is to validate MMIO
replies from the emulation program.

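A minimal sketch of the two queues follows, assuming a fixed-size ring
buffer and matching replies by sequence ID. The type and function names
(``mmio_request``, ``mmio_queue``, ``mmio_send_next()``,
``mmio_accept_reply()``) and the queue size are illustrative assumptions;
a kernel implementation would also need locking and wait queues, which
are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum mmio_type { MMIO_LOAD, MMIO_STORE };

/* Fields follow the MMIO request structure table. */
struct mmio_request {
    uint32_t rid;          /* range the MMIO is within */
    uint64_t offset;       /* offset within rid */
    enum mmio_type type;   /* load or store */
    uint32_t len;          /* MMIO length in bytes */
    uint64_t data;         /* store data */
    uint64_t seq;          /* sequence ID */
};

#define MMIO_QUEUE_SLOTS 8 /* arbitrary size for this sketch */

struct mmio_queue {
    struct mmio_request slot[MMIO_QUEUE_SLOTS];
    size_t head;           /* index of the oldest entry */
    size_t count;          /* number of valid entries */
};

static bool mmio_queue_push(struct mmio_queue *q,
                            const struct mmio_request *req)
{
    assert(q->count <= MMIO_QUEUE_SLOTS);
    if (q->count == MMIO_QUEUE_SLOTS) {
        return false;      /* full: the caller would have to wait */
    }
    q->slot[(q->head + q->count) % MMIO_QUEUE_SLOTS] = *req;
    q->count++;
    return true;
}

static bool mmio_queue_pop(struct mmio_queue *q, struct mmio_request *out)
{
    if (q->count == 0) {
        return false;
    }
    *out = q->slot[q->head];
    q->head = (q->head + 1) % MMIO_QUEUE_SLOTS;
    q->count--;
    return true;
}

/* Hand the oldest pending request out and remember it in the sent queue. */
static bool mmio_send_next(struct mmio_queue *pending,
                           struct mmio_queue *sent,
                           struct mmio_request *out)
{
    if (sent->count == MMIO_QUEUE_SLOTS || !mmio_queue_pop(pending, out)) {
        return false;
    }
    return mmio_queue_push(sent, out);
}

/* Accept a reply only if its sequence ID matches a sent request. */
static bool mmio_accept_reply(struct mmio_queue *sent, uint64_t seq)
{
    for (size_t i = 0; i < sent->count; i++) {
        if (sent->slot[(sent->head + i) % MMIO_QUEUE_SLOTS].seq != seq) {
            continue;
        }
        /* Close the gap by shifting the younger entries down one slot. */
        for (size_t j = i; j + 1 < sent->count; j++) {
            size_t dst = (sent->head + j) % MMIO_QUEUE_SLOTS;
            size_t src = (sent->head + j + 1) % MMIO_QUEUE_SLOTS;
            sent->slot[dst] = sent->slot[src];
        }
        sent->count--;
        return true;
    }
    return false;          /* unknown sequence ID: reject the reply */
}
```
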
- scoreboard

Each CPU in the VM is emulated in QEMU by a separate thread, so multiple
MMIOs may be waiting to be consumed by an emulation program and multiple
threads may be waiting for MMIO replies. The scoreboard would contain a
wait queue and sequence number for the per-CPU threads, allowing them to
be individually woken when the MMIO reply is received from the emulation
program. It also tracks the number of posted MMIO stores to the device
that haven't been replied to, in order to satisfy the PCI constraint
that a load to a device will not complete until all previous stores to
that device have been completed.

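The bookkeeping described above can be sketched as follows. The
``cpu_scoreboard`` structure and its helpers are hypothetical names, and
a plain flag stands in for the kernel wait queue so that the example
stays self-contained; it illustrates the counting rule only, not the
sleeping and waking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Per-CPU state; "waiting" stands in for a real wait queue. */
struct cpu_scoreboard {
    uint64_t wait_seq;       /* sequence ID this CPU thread waits on */
    bool     waiting;        /* true while the thread awaits a reply */
    uint32_t posted_stores;  /* stores sent but not yet replied to */
};

/* A store was queued for the emulation program. */
static void scoreboard_store_posted(struct cpu_scoreboard *sb)
{
    sb->posted_stores++;
}

/* The emulation program replied to one posted store. */
static void scoreboard_store_replied(struct cpu_scoreboard *sb)
{
    assert(sb->posted_stores > 0);
    sb->posted_stores--;
}

/* PCI ordering: a load may complete only after all older stores. */
static bool scoreboard_load_may_complete(const struct cpu_scoreboard *sb)
{
    return sb->posted_stores == 0;
}

/* Record that this CPU thread is about to wait for reply "seq". */
static void scoreboard_wait_for(struct cpu_scoreboard *sb, uint64_t seq)
{
    sb->wait_seq = seq;
    sb->waiting = true;
}

/* Returns true if the reply with this sequence ID wakes the thread. */
static bool scoreboard_reply(struct cpu_scoreboard *sb, uint64_t seq)
{
    if (!sb->waiting || sb->wait_seq != seq) {
        return false;
    }
    sb->waiting = false;
    return true;
}
```
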
- device shadow memory

Some MMIO loads do not have device side-effects. These MMIOs can be
completed without sending a MMIO request to the emulation program if the
emulation program shares a shadow image of the device's memory image
with the KVM driver.

The emulation program will ask the KVM driver to allocate memory for the
shadow image, and will then use ``mmap()`` to directly access it. The
emulation program can control KVM access to the shadow image by sending
KVM an access map telling it which areas of the image have no
side-effects (and can be completed immediately), and which require a
MMIO request to the emulation program. The access map can also inform
the KVM driver which size accesses are allowed to the image.

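The text does not define the layout of the access map. One possible
shape, shown purely as an assumption, is a list of regions that each
carry a side-effect flag and a mask of permitted access sizes. The names
``shadow_region`` and ``shadow_load_allowed()`` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical access map entry covering part of the shadow image. */
struct shadow_region {
    uint64_t offset;           /* start within the shadow image */
    uint64_t len;              /* length of the region in bytes */
    bool     no_side_effects;  /* loads may complete from the shadow */
    uint8_t  size_mask;        /* OR of allowed access sizes: 1, 2, 4, 8 */
};

/*
 * Decide whether a guest load can be satisfied from the shadow image.
 * Anything not covered by the map falls back to a MMIO request.
 */
static bool shadow_load_allowed(const struct shadow_region *map,
                                size_t nregions,
                                uint64_t offset, uint32_t size)
{
    assert(map != NULL || nregions == 0);
    /* Only power-of-two sizes from 1 to 8 bytes are considered. */
    if (size == 0 || size > 8 || (size & (size - 1)) != 0) {
        return false;
    }
    for (size_t i = 0; i < nregions; i++) {
        const struct shadow_region *r = &map[i];
        if (offset < r->offset) {
            continue;
        }
        uint64_t off = offset - r->offset;
        if (off >= r->len || size > r->len - off) {
            continue;
        }
        return r->no_side_effects && (r->size_mask & size) != 0;
    }
    return false;
}
```
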
master descriptor
^^^^^^^^^^^^^^^^^

The master descriptor is used by QEMU to configure the new KVM device.
The descriptor would be returned by the KVM driver when QEMU issues a
*KVM\_CREATE\_DEVICE* ``ioctl()`` with a *KVM\_DEV\_TYPE\_USER* type.

KVM\_DEV\_TYPE\_USER device ops

The *KVM\_DEV\_TYPE\_USER* operations vector will be registered by a
``kvm_register_device_ops()`` call when the KVM system is initialized by
``kvm_init()``. These device ops are called by the KVM driver when QEMU
executes certain ``ioctl()`` operations on its KVM file descriptor. They
include:

- create

This routine is called when QEMU issues a *KVM\_CREATE\_DEVICE*
``ioctl()`` on its per-VM file descriptor. It will allocate and
initialize a KVM user device specific data structure, and assign the
*kvm\_device* private field to it.

- ioctl

This routine is invoked when QEMU issues an ``ioctl()`` on the master
descriptor. The ``ioctl()`` commands supported are defined by the KVM
device type. The *KVM\_DEV\_TYPE\_USER* type will need several commands:

*KVM\_DEV\_USER\_SLAVE\_FD* creates the slave file descriptor that will
be passed to the device emulation program. Only one slave can be created
by each master descriptor. The file operations performed by this
descriptor are described below.

The *KVM\_DEV\_USER\_PA\_RANGE* command configures a guest physical
address range that the slave descriptor will receive MMIO notifications
for. The range is specified by a guest physical range structure
argument. For buses that assign addresses to devices dynamically, this
command can be executed while the guest is running, such as the case
when a guest changes a device's PCI BAR registers.

*KVM\_DEV\_USER\_PA\_RANGE* will use ``kvm_io_bus_register_dev()`` to
register *kvm\_io\_device\_ops* callbacks to be invoked when the guest
performs a MMIO operation within the range. When a range is changed,
``kvm_io_bus_unregister_dev()`` is used to remove the previous
instantiation.

*KVM\_DEV\_USER\_TIMEOUT* will configure a timeout value that specifies
how long KVM will wait for the emulation process to respond to a MMIO
indication.

- destroy

This routine is called when the VM instance is destroyed. It will need
to destroy the slave descriptor, and free any memory allocated by the
driver, as well as the *kvm\_device* structure itself.

slave descriptor
^^^^^^^^^^^^^^^^

The slave descriptor will have its own file operations vector, which
responds to system calls on the descriptor performed by the device
emulation program.

- read

A read returns any pending MMIO requests from the KVM driver as MMIO
request structures. Multiple structures can be returned if there are
multiple MMIO operations pending. The MMIO requests are moved from the
pending queue to the sent queue, and if there are threads waiting for
space in the pending queue to add new MMIO operations, they will be
woken here.

- write

A write also consists of a set of MMIO requests. They are compared to
the MMIO requests in the sent queue. Matches are removed from the sent
queue, and any threads waiting for the reply are woken. If a store is
removed, then the number of posted stores in the per-CPU scoreboard is
decremented. When the number reaches zero and a load without
side-effects was waiting for posted stores to complete, the load is
allowed to continue.

- ioctl

There are several ``ioctl()``\ s that can be performed on the slave
descriptor.

A *KVM\_DEV\_USER\_SHADOW\_SIZE* ``ioctl()`` causes the KVM driver to
allocate memory for the shadow image. This memory can later be
``mmap()``\ ed by the emulation process to share the emulation's view of
device memory with the KVM driver.

A *KVM\_DEV\_USER\_SHADOW\_CTRL* ``ioctl()`` controls access to the
shadow image. It will send the KVM driver a shadow control map, which
specifies which areas of the image can complete guest loads without
sending the load request to the emulation program. It will also specify
the size of load operations that are allowed.

- poll

An emulation program will use the ``poll()`` call with a *POLLIN* flag
to determine if there are MMIO requests waiting to be read. It will
return if the pending MMIO request queue is not empty.

- mmap

This call allows the emulation program to directly access the shadow
image allocated by the KVM driver. As device emulation updates device
memory, changes with no side-effects will be reflected in the shadow,
and the KVM driver can satisfy guest loads from the shadow image without
needing to wait for the emulation program.

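From the emulation program's side, the read, write and poll operations
above might be used roughly as follows. This is a sketch under stated
assumptions: the binary layout of ``struct mmio_request`` and the helper
names are invented for illustration, and since *KVM\_DEV\_TYPE\_USER* is
only a proposal, any readable and writable descriptor (a pipe, for
instance) has to stand in for the slave descriptor when trying it out.
Only the standard POSIX ``poll()``, ``read()`` and ``write()`` calls are
used.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical wire layout of one MMIO request. */
struct mmio_request {
    uint32_t rid;     /* range the MMIO is within */
    uint32_t type;    /* 0 = load, 1 = store (assumed encoding) */
    uint32_t len;     /* MMIO length in bytes */
    uint32_t pad;     /* keeps the 64-bit fields aligned */
    uint64_t offset;  /* offset within rid */
    uint64_t data;    /* store data, or load result in a reply */
    uint64_t seq;     /* sequence ID echoed back in the reply */
};

/*
 * Wait up to timeout_ms for pending requests, then read as many whole
 * structures as fit in reqs[]. Returns the number read, 0 on timeout,
 * or -1 on error.
 */
static int mmio_read_requests(int fd, struct mmio_request *reqs,
                              int max, int timeout_ms)
{
    assert(reqs != NULL && max > 0);
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready = poll(&pfd, 1, timeout_ms);
    if (ready <= 0) {
        return ready;
    }
    ssize_t n = read(fd, reqs, (size_t)max * sizeof(*reqs));
    if (n < 0) {
        return -1;
    }
    return (int)(n / (ssize_t)sizeof(*reqs));
}

/* Acknowledge one request by writing it back to the descriptor. */
static int mmio_reply(int fd, const struct mmio_request *req)
{
    ssize_t n = write(fd, req, sizeof(*req));
    return n == (ssize_t)sizeof(*req) ? 0 : -1;
}
```

An emulation program would call ``mmio_read_requests()`` in its main
loop, emulate each returned operation, and call ``mmio_reply()`` with
the same *seq* so the driver can match the reply against its sent queue.
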
kvm\_io\_device ops
^^^^^^^^^^^^^^^^^^^

Each KVM per-CPU thread can handle MMIO operations on behalf of the
guest VM. KVM will use the MMIO's guest physical address to search for a
matching *kvm\_io\_device* to see if the MMIO can be handled by the KVM
driver instead of exiting back to QEMU. If a match is found, the
corresponding callback will be invoked.

- read

This callback is invoked when the guest performs a load to the device.
Loads with side-effects must be handled synchronously, with the KVM
driver putting the QEMU thread to sleep waiting for the emulation
process reply before re-starting the guest. Loads that do not have
side-effects may be optimized by satisfying them from the shadow image,
if there are no outstanding stores to the device by this CPU. PCI memory
ordering demands that a load cannot complete before all older stores to
the same device have been completed.

- write

Stores can be handled asynchronously unless the pending MMIO request
queue is full. In this case, the QEMU thread must sleep waiting for
space in the queue. Stores will increment the number of posted stores in
the per-CPU scoreboard, in order to implement the PCI ordering
constraint above.

interrupt acceleration
^^^^^^^^^^^^^^^^^^^^^^

This performance optimization would work much like a vhost user
application does, where the QEMU process sets up *eventfds* that cause
the device's corresponding interrupt to be triggered by the KVM driver.
These irq file descriptors are sent to the emulation process at
initialization, and are used when the emulation code raises a device
interrupt.

intx acceleration
'''''''''''''''''

Traditional PCI pin interrupts are level based, so, in addition to an
irq file descriptor, a re-sampling file descriptor needs to be sent to
the emulation program. This second file descriptor allows multiple
devices sharing an irq to be notified when the interrupt has been
acknowledged by the guest, so they can re-trigger the interrupt if their
device has not de-asserted its interrupt.

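On the emulation side, using the two descriptors could look like the
sketch below. It relies only on standard Linux *eventfd* semantics (an
8-byte write adds to the counter, an 8-byte read returns and clears it);
the helper names ``irq_raise()`` and ``intx_resample()`` are
hypothetical, and the binding of the descriptors to a guest interrupt is
assumed to have been done by QEMU already.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Raise the interrupt bound to irq_fd by signalling the eventfd. */
static int irq_raise(int irq_fd)
{
    assert(irq_fd >= 0);
    uint64_t one = 1;
    ssize_t n = write(irq_fd, &one, sizeof(one));
    return n == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Handle a re-sample notification: consume the event that signals the
 * guest's acknowledgement, then raise the interrupt again if the device
 * model still has its level asserted. Returns -1 if no event could be
 * read from resample_fd.
 */
static int intx_resample(int resample_fd, int irq_fd, bool still_asserted)
{
    uint64_t count;
    ssize_t n = read(resample_fd, &count, sizeof(count));
    if (n != (ssize_t)sizeof(count)) {
        return -1;
    }
    return still_asserted ? irq_raise(irq_fd) : 0;
}
```

In practice the emulation program would ``poll()`` the re-sampling
descriptor alongside its other event sources and call
``intx_resample()`` when it becomes readable.
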
intx irq descriptor

The irq descriptors are created by the proxy object using
``event_notifier_init()`` to create the irq and re-sampling
*eventfds*, and ``kvm_vm_ioctl(KVM_IRQFD)`` to bind them to an interrupt.
The interrupt route can be found with
``pci_device_route_intx_to_irq()``.

intx routing changes

Intx routing can be changed when the guest programs the APIC the device
pin is connected to. The proxy object in QEMU will use
``pci_device_set_intx_routing_notifier()`` to be informed of any guest
changes to the route. This handler will broadly follow the VFIO
interrupt logic to change the route: de-assigning the existing irq
descriptor from its route, then assigning it the new route (see
``vfio_intx_update()``).

MSI/X acceleration
''''''''''''''''''

MSI/X interrupts are sent as DMA transactions to the host. The interrupt
data contains a vector that is programmed by the guest. A device may have
multiple MSI interrupts associated with it, so multiple irq descriptors
may need to be sent to the emulation program.

MSI/X irq descriptor

This case will also follow the VFIO example. For each MSI/X interrupt,
an *eventfd* is created, a virtual interrupt is allocated by
``kvm_irqchip_add_msi_route()``, and the virtual interrupt is bound to
the eventfd with ``kvm_irqchip_add_irqfd_notifier()``.

MSI/X config space changes

The guest may dynamically update several MSI-related tables in the
device's PCI config space. These include per-MSI interrupt enables and
vector data. Additionally, MSI-X tables exist in device memory space, not
config space. Much like the BAR case above, the proxy object must look
at guest config space programming to keep the MSI interrupt state
consistent between QEMU and the emulation program.

--------------

Disaggregated CPU emulation
---------------------------

After IO services have been disaggregated, a second phase would be to
split CPU instruction emulation out of the main QEMU control function
into a separate process. There are no object separation points for this
code, so the first task would be to create one.

Host access controls
--------------------

Separating QEMU relies on the host OS's access restriction mechanisms to
ensure that the different processes can only access the objects they
are entitled to. There are a couple of types of mechanisms usually
provided by general purpose OSs.

Discretionary access control
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Discretionary access control allows each user to control who can access
their files. In Linux, this type of control is usually too coarse for
QEMU separation, since it only provides three separate access controls:
one for the same user ID, the second for user IDs with the same group
ID, and the third for all other user IDs. Each device instance would
need a separate user ID to provide access control, which is likely to be
unwieldy for dynamically created VMs.

Mandatory access control
~~~~~~~~~~~~~~~~~~~~~~~~

Mandatory access control allows the OS to impose an additional set of
controls on top of discretionary access control. It also adds other
attributes to processes and files such as types, roles, and categories,
and can establish rules for how processes and files can interact.

Type enforcement
^^^^^^^^^^^^^^^^

Type enforcement assigns a *type* attribute to processes and files, and
allows rules to be written on what operations a process with a given
type can perform on a file with a given type. QEMU separation could take
advantage of type enforcement by running the emulation processes with
different types, both from the main QEMU process, and from the emulation
processes of different classes of devices.

For example, guest disk images and disk emulation processes could have
types separate from the main QEMU process and non-disk emulation
processes, and the type rules could prevent processes other than disk
emulation ones from accessing guest disk images. Similarly, network
emulation processes can have a type separate from the main QEMU process
and non-network emulation processes, and only that type can access the
host tun/tap device used to provide guest networking.

Category enforcement
^^^^^^^^^^^^^^^^^^^^

Category enforcement assigns a set of numbers within a given range to
the process or file. The process is granted access to the file if the
process's set is a superset of the file's set. This enforcement can be
used to separate multiple instances of devices in the same class.

For example, if there are multiple disk devices provided to a guest,
each device emulation process could be provisioned with a separate
category. The different device emulation processes would not be able to
access each other's backing disk images.

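The superset rule can be illustrated with a bitmask, where bit *n*
represents category *n*. This is only a model of the check described
above, limited to 64 categories for simplicity; it is not how any
particular host OS stores or evaluates categories, and the names
``category_set``, ``category()`` and ``category_access_allowed()`` are
made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit n set means category n is in the set (at most 64 categories). */
typedef uint64_t category_set;

/* Build a set containing the single category n. */
static category_set category(unsigned n)
{
    assert(n < 64);
    return (category_set)1 << n;
}

/* Access is granted if the process's set is a superset of the file's. */
static bool category_access_allowed(category_set process, category_set file)
{
    return (file & ~process) == 0;
}
```

With one category per disk, a process holding only ``category(1)`` is
granted access to an image labelled ``category(1)`` and refused access
to one labelled ``category(2)``.
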
Alternatively, categories could be used in lieu of the type enforcement
scheme described above. In this scenario, different categories would be
used to prevent device emulation processes in different classes from
accessing resources assigned to other classes.