123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372 |
- ==============
- The memory API
- ==============
- The memory API models the memory and I/O buses and controllers of a QEMU
- machine. It attempts to allow modelling of:
- - ordinary RAM
- - memory-mapped I/O (MMIO)
- - memory controllers that can dynamically reroute physical memory regions
- to different destinations
- The memory model provides support for
- - tracking RAM changes by the guest
- - setting up coalesced memory for kvm
- - setting up ioeventfd regions for kvm
- Memory is modelled as an acyclic graph of MemoryRegion objects. Sinks
- (leaves) are RAM and MMIO regions, while other nodes represent
- buses, memory controllers, and memory regions that have been rerouted.
- In addition to MemoryRegion objects, the memory API provides AddressSpace
- objects for every root and possibly for intermediate MemoryRegions too.
- These represent memory as seen from the CPU or a device's viewpoint.
- Types of regions
- ----------------
- There are multiple types of memory regions (all represented by a single C type
- MemoryRegion):
- - RAM: a RAM region is simply a range of host memory that can be made available
- to the guest.
- You typically initialize these with memory_region_init_ram(). Some special
- purposes require the variants memory_region_init_resizeable_ram(),
- memory_region_init_ram_from_file(), or memory_region_init_ram_ptr().
- - MMIO: a range of guest memory that is implemented by host callbacks;
- each read or write causes a callback to be called on the host.
- You initialize these with memory_region_init_io(), passing it a
- MemoryRegionOps structure describing the callbacks.
- - ROM: a ROM memory region works like RAM for reads (directly accessing
- a region of host memory), and forbids writes. You initialize these with
- memory_region_init_rom().
- - ROM device: a ROM device memory region works like RAM for reads
- (directly accessing a region of host memory), but like MMIO for
- writes (invoking a callback). You initialize these with
- memory_region_init_rom_device().
- - IOMMU region: an IOMMU region translates addresses of accesses made to it
- and forwards them to some other target memory region. As the name suggests,
- these are only needed for modelling an IOMMU, not for simple devices.
- You initialize these with memory_region_init_iommu().
- - container: a container simply includes other memory regions, each at
- a different offset. Containers are useful for grouping several regions
- into one unit. For example, a PCI BAR may be composed of a RAM region
- and an MMIO region.
- A container's subregions are usually non-overlapping. In some cases it is
- useful to have overlapping regions; for example a memory controller that
- can overlay a subregion of RAM with MMIO or ROM, or a PCI controller
- that does not prevent card from claiming overlapping BARs.
- You initialize a pure container with memory_region_init().
- - alias: a subsection of another region. Aliases allow a region to be
- split apart into discontiguous regions. Examples of uses are memory
- banks used when the guest address space is smaller than the amount
- of RAM addressed, or a memory controller that splits main memory to
- expose a "PCI hole". You can also create aliases to avoid trying to
- add the original region to multiple parents via
- `memory_region_add_subregion`.
- Aliases may point to any type of region, including other aliases,
- but an alias may not point back to itself, directly or indirectly.
- You initialize these with memory_region_init_alias().
- - reservation region: a reservation region is primarily for debugging.
- It claims I/O space that is not supposed to be handled by QEMU itself.
- The typical use is to track parts of the address space which will be
- handled by the host kernel when KVM is enabled. You initialize these
- by passing a NULL callback parameter to memory_region_init_io().
- It is valid to add subregions to a region which is not a pure container
- (that is, to an MMIO, RAM or ROM region). This means that the region
- will act like a container, except that any addresses within the container's
- region which are not claimed by any subregion are handled by the
- container itself (ie by its MMIO callbacks or RAM backing). However
- it is generally possible to achieve the same effect with a pure container
- one of whose subregions is a low priority "background" region covering
- the whole address range; this is often clearer and is preferred.
- Subregions cannot be added to an alias region.
- Migration
- ---------
- Where the memory region is backed by host memory (RAM, ROM and
- ROM device memory region types), this host memory needs to be
- copied to the destination on migration. These APIs which allocate
- the host memory for you will also register the memory so it is
- migrated:
- - memory_region_init_ram()
- - memory_region_init_rom()
- - memory_region_init_rom_device()
- For most devices and boards this is the correct thing. If you
- have a special case where you need to manage the migration of
- the backing memory yourself, you can call the functions:
- - memory_region_init_ram_nomigrate()
- - memory_region_init_rom_nomigrate()
- - memory_region_init_rom_device_nomigrate()
- which only initialize the MemoryRegion and leave handling
- migration to the caller.
- The functions:
- - memory_region_init_resizeable_ram()
- - memory_region_init_ram_from_file()
- - memory_region_init_ram_from_fd()
- - memory_region_init_ram_ptr()
- - memory_region_init_ram_device_ptr()
- are for special cases only, and so they do not automatically
- register the backing memory for migration; the caller must
- manage migration if necessary.
- Region names
- ------------
- Regions are assigned names by the constructor. For most regions these are
- only used for debugging purposes, but RAM regions also use the name to identify
- live migration sections. This means that RAM region names need to have ABI
- stability.
- Region lifecycle
- ----------------
- A region is created by one of the memory_region_init*() functions and
- attached to an object, which acts as its owner or parent. QEMU ensures
- that the owner object remains alive as long as the region is visible to
- the guest, or as long as the region is in use by a virtual CPU or another
- device. For example, the owner object will not die between an
- address_space_map operation and the corresponding address_space_unmap.
- After creation, a region can be added to an address space or a
- container with memory_region_add_subregion(), and removed using
- memory_region_del_subregion().
- Various region attributes (read-only, dirty logging, coalesced mmio,
- ioeventfd) can be changed during the region lifecycle. They take effect
- as soon as the region is made visible. This can be immediately, later,
- or never.
- Destruction of a memory region happens automatically when the owner
- object dies.
- If however the memory region is part of a dynamically allocated data
- structure, you should call object_unparent() to destroy the memory region
- before the data structure is freed. For an example see VFIOMSIXInfo
- and VFIOQuirk in hw/vfio/pci.c.
- You must not destroy a memory region as long as it may be in use by a
- device or CPU. In order to do this, as a general rule do not create or
- destroy memory regions dynamically during a device's lifetime, and only
- call object_unparent() in the memory region owner's instance_finalize
- callback. The dynamically allocated data structure that contains the
- memory region then should obviously be freed in the instance_finalize
- callback as well.
- If you break this rule, the following situation can happen:
- - the memory region's owner had a reference taken via memory_region_ref
- (for example by address_space_map)
- - the region is unparented, and has no owner anymore
- - when address_space_unmap is called, the reference to the memory region's
- owner is leaked.
- There is an exception to the above rule: it is okay to call
- object_unparent at any time for an alias or a container region. It is
- therefore also okay to create or destroy alias and container regions
- dynamically during a device's lifetime.
- This exceptional usage is valid because aliases and containers only help
- QEMU building the guest's memory map; they are never accessed directly.
- memory_region_ref and memory_region_unref are never called on aliases
- or containers, and the above situation then cannot happen. Exploiting
- this exception is rarely necessary, and therefore it is discouraged,
- but nevertheless it is used in a few places.
- For regions that "have no owner" (NULL is passed at creation time), the
- machine object is actually used as the owner. Since instance_finalize is
- never called for the machine object, you must never call object_unparent
- on regions that have no owner, unless they are aliases or containers.
- Overlapping regions and priority
- --------------------------------
- Usually, regions may not overlap each other; a memory address decodes into
- exactly one target. In some cases it is useful to allow regions to overlap,
- and sometimes to control which of an overlapping regions is visible to the
- guest. This is done with memory_region_add_subregion_overlap(), which
- allows the region to overlap any other region in the same container, and
- specifies a priority that allows the core to decide which of two regions at
- the same address are visible (highest wins).
- Priority values are signed, and the default value is zero. This means that
- you can use memory_region_add_subregion_overlap() both to specify a region
- that must sit 'above' any others (with a positive priority) and also a
- background region that sits 'below' others (with a negative priority).
- If the higher priority region in an overlap is a container or alias, then
- the lower priority region will appear in any "holes" that the higher priority
- region has left by not mapping subregions to that area of its address range.
- (This applies recursively -- if the subregions are themselves containers or
- aliases that leave holes then the lower priority region will appear in these
- holes too.)
- For example, suppose we have a container A of size 0x8000 with two subregions
- B and C. B is a container mapped at 0x2000, size 0x4000, priority 2; C is
- an MMIO region mapped at 0x0, size 0x6000, priority 1. B currently has two
- of its own subregions: D of size 0x1000 at offset 0 and E of size 0x1000 at
- offset 0x2000. As a diagram::
- 0 1000 2000 3000 4000 5000 6000 7000 8000
- |------|------|------|------|------|------|------|------|
- A: [ ]
- C: [CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC]
- B: [ ]
- D: [DDDDD]
- E: [EEEEE]
- The regions that will be seen within this address range then are::
- [CCCCCCCCCCCC][DDDDD][CCCCC][EEEEE][CCCCC]
- Since B has higher priority than C, its subregions appear in the flat map
- even where they overlap with C. In ranges where B has not mapped anything
- C's region appears.
- If B had provided its own MMIO operations (ie it was not a pure container)
- then these would be used for any addresses in its range not handled by
- D or E, and the result would be::
- [CCCCCCCCCCCC][DDDDD][BBBBB][EEEEE][BBBBB]
- Priority values are local to a container, because the priorities of two
- regions are only compared when they are both children of the same container.
- This means that the device in charge of the container (typically modelling
- a bus or a memory controller) can use them to manage the interaction of
- its child regions without any side effects on other parts of the system.
- In the example above, the priorities of D and E are unimportant because
- they do not overlap each other. It is the relative priority of B and C
- that causes D and E to appear on top of C: D and E's priorities are never
- compared against the priority of C.
- Visibility
- ----------
- The memory core uses the following rules to select a memory region when the
- guest accesses an address:
- - all direct subregions of the root region are matched against the address, in
- descending priority order
- - if the address lies outside the region offset/size, the subregion is
- discarded
- - if the subregion is a leaf (RAM or MMIO), the search terminates, returning
- this leaf region
- - if the subregion is a container, the same algorithm is used within the
- subregion (after the address is adjusted by the subregion offset)
- - if the subregion is an alias, the search is continued at the alias target
- (after the address is adjusted by the subregion offset and alias offset)
- - if a recursive search within a container or alias subregion does not
- find a match (because of a "hole" in the container's coverage of its
- address range), then if this is a container with its own MMIO or RAM
- backing the search terminates, returning the container itself. Otherwise
- we continue with the next subregion in priority order
- - if none of the subregions match the address then the search terminates
- with no match found
- Example memory map
- ------------------
- ::
- system_memory: container@0-2^48-1
- |
- +---- lomem: alias@0-0xdfffffff ---> #ram (0-0xdfffffff)
- |
- +---- himem: alias@0x100000000-0x11fffffff ---> #ram (0xe0000000-0xffffffff)
- |
- +---- vga-window: alias@0xa0000-0xbffff ---> #pci (0xa0000-0xbffff)
- | (prio 1)
- |
- +---- pci-hole: alias@0xe0000000-0xffffffff ---> #pci (0xe0000000-0xffffffff)
- pci (0-2^32-1)
- |
- +--- vga-area: container@0xa0000-0xbffff
- | |
- | +--- alias@0x00000-0x7fff ---> #vram (0x010000-0x017fff)
- | |
- | +--- alias@0x08000-0xffff ---> #vram (0x020000-0x027fff)
- |
- +---- vram: ram@0xe1000000-0xe1ffffff
- |
- +---- vga-mmio: mmio@0xe2000000-0xe200ffff
- ram: ram@0x00000000-0xffffffff
- This is a (simplified) PC memory map. The 4GB RAM block is mapped into the
- system address space via two aliases: "lomem" is a 1:1 mapping of the first
- 3.5GB; "himem" maps the last 0.5GB at address 4GB. This leaves 0.5GB for the
- so-called PCI hole, that allows a 32-bit PCI bus to exist in a system with
- 4GB of memory.
- The memory controller diverts addresses in the range 640K-768K to the PCI
- address space. This is modelled using the "vga-window" alias, mapped at a
- higher priority so it obscures the RAM at the same addresses. The vga window
- can be removed by programming the memory controller; this is modelled by
- removing the alias and exposing the RAM underneath.
- The pci address space is not a direct child of the system address space, since
- we only want parts of it to be visible (we accomplish this using aliases).
- It has two subregions: vga-area models the legacy vga window and is occupied
- by two 32K memory banks pointing at two sections of the framebuffer.
- In addition the vram is mapped as a BAR at address e1000000, and an additional
- BAR containing MMIO registers is mapped after it.
- Note that if the guest maps a BAR outside the PCI hole, it would not be
- visible as the pci-hole alias clips it to a 0.5GB range.
- MMIO Operations
- ---------------
- MMIO regions are provided with ->read() and ->write() callbacks,
- which are sufficient for most devices. Some devices change behaviour
- based on the attributes used for the memory transaction, or need
- to be able to respond that the access should provoke a bus error
- rather than completing successfully; those devices can use the
- ->read_with_attrs() and ->write_with_attrs() callbacks instead.
- In addition various constraints can be supplied to control how these
- callbacks are called:
- - .valid.min_access_size, .valid.max_access_size define the access sizes
- (in bytes) which the device accepts; accesses outside this range will
- have device and bus specific behaviour (ignored, or machine check)
- - .valid.unaligned specifies that the *device being modelled* supports
- unaligned accesses; if false, unaligned accesses will invoke the
- appropriate bus or CPU specific behaviour.
- - .impl.min_access_size, .impl.max_access_size define the access sizes
- (in bytes) supported by the *implementation*; other access sizes will be
- emulated using the ones available. For example a 4-byte write will be
- emulated using four 1-byte writes, if .impl.max_access_size = 1.
- - .impl.unaligned specifies that the *implementation* supports unaligned
- accesses; if false, unaligned accesses will be emulated by two aligned
- accesses.
- API Reference
- -------------
- .. kernel-doc:: include/exec/memory.h
|