- .. _tcg_internals:
- ====================
- Translator Internals
- ====================
- QEMU is a dynamic translator. When it first encounters a piece of code,
- it converts it to the host instruction set. Usually dynamic translators
- are very complicated and highly CPU dependent. QEMU uses some tricks
- which make it relatively portable and simple while still achieving good
- performance.
- QEMU's dynamic translation backend is called TCG, for "Tiny Code
- Generator". For more information, please take a look at :ref:`tcg-ops-ref`.
- The following sections outline some notable features and implementation
- details of QEMU's dynamic translator.
- CPU state optimisations
- -----------------------
- The target CPUs have many internal states which change the way they
- evaluate instructions. To achieve good speed, the
- translation phase assumes that some state information of the virtual
- CPU cannot change within a block. This state is recorded in the Translation
- Block (TB). If the state changes (e.g. privilege level), a new TB will
- be generated and the previous TB won't be used anymore until the state
- matches the state recorded in the previous TB. The same idea can be applied
- to other aspects of the CPU state. For example, on x86, if the SS,
- DS and ES segments have a zero base, then the translator does not even
- generate an addition for the segment base.
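As an illustration of the idea above, the following Python sketch models a translation cache whose lookup key includes both the guest PC and the state assumed constant at translation time. ``TBKey``, ``TBCache`` and the string standing in for generated code are hypothetical names for this example, not QEMU's actual data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TBKey:
    pc: int     # guest PC of the first instruction in the block
    flags: int  # state assumed constant during translation (e.g. privilege level)


class TBCache:
    """Toy translation cache: one TB per (pc, flags) combination."""

    def __init__(self):
        self.blocks = {}
        self.translations = 0

    def lookup_or_translate(self, pc, flags):
        key = TBKey(pc, flags)
        tb = self.blocks.get(key)
        if tb is None:
            # State differs from every recorded TB: generate a new one.
            self.translations += 1
            tb = f"tb@{pc:#x}/flags={flags}"  # stand-in for host code
            self.blocks[key] = tb
        return tb
```

In this model, a change of ``flags`` at the same PC produces a second TB, and the first one is picked up again as soon as the state matches its recorded key.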
- Direct block chaining
- ---------------------
- After each translated basic block is executed, QEMU uses the simulated
- Program Counter (PC) and other CPU state information (such as the CS
- segment base value) to find the next basic block.
- In its simplest, less optimized form, this is done by exiting from the
- current TB, going through the TB epilogue, and then back to the
- main loop. That’s where QEMU looks for the next TB to execute,
- translating it from the guest architecture if it isn’t already available
- in memory. Then QEMU proceeds to execute this next TB, starting at the
- prologue and then moving on to the translated instructions.
- Exiting from the TB this way will cause the ``cpu_exec_interrupt()``
- callback to be re-evaluated before executing additional instructions.
- It is mandatory to exit this way after any CPU state changes that may
- unmask interrupts.
- In order to accelerate the cases where the TB for the new
- simulated PC is already available, QEMU has mechanisms that allow
- multiple TBs to be chained directly, without having to go back to the
- main loop as described above. These mechanisms are:
- ``lookup_and_goto_ptr``
- ^^^^^^^^^^^^^^^^^^^^^^^
- Calling ``tcg_gen_lookup_and_goto_ptr()`` will emit a call to
- ``helper_lookup_tb_ptr``. This helper will look for an existing TB that
- matches the current CPU state. If the destination TB is available, its
- code address is returned; otherwise, the address of the JIT epilogue is
- returned. The call to the helper is always followed by the TCG ``goto_ptr``
- opcode, which branches to the returned address. In this way, we either
- branch to the next TB or return to the main loop.
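The control flow described here can be sketched in Python as a lookup that returns either the destination TB or an epilogue marker. This is only a model: ``EPILOGUE``, ``lookup_tb_ptr`` and ``run_until_epilogue`` are hypothetical names, and TBs are represented as plain callables that update a dictionary holding the CPU state.

```python
EPILOGUE = object()  # hypothetical marker standing in for the JIT epilogue address


def lookup_tb_ptr(tb_table, pc, flags):
    """Return the destination TB if it is already translated, else the epilogue."""
    return tb_table.get((pc, flags), EPILOGUE)


def run_until_epilogue(tb_table, cpu):
    """Follow goto_ptr-style branches until the lookup falls back to the epilogue.

    Returns the guest PCs of the TBs that were executed without
    returning to the main loop.
    """
    executed = []
    target = lookup_tb_ptr(tb_table, cpu["pc"], cpu["flags"])
    while target is not EPILOGUE:
        executed.append(cpu["pc"])
        target(cpu)  # "execute" the TB; it updates cpu["pc"]
        target = lookup_tb_ptr(tb_table, cpu["pc"], cpu["flags"])
    return executed
```

Unlike ``goto_tb``, nothing is patched in this scheme: every branch pays for a lookup, which is why it also works for indirect branches.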
- ``goto_tb + exit_tb``
- ^^^^^^^^^^^^^^^^^^^^^
- The translation code usually implements branching by performing the
- following steps:
- 1. Call ``tcg_gen_goto_tb()`` passing a jump slot index (either 0 or 1)
- as a parameter.
- 2. Emit TCG instructions to update the CPU state with any information
- that has been assumed constant and is required by the main loop to
- correctly locate and execute the next TB. For most guests, this is
- just the PC of the branch destination, but others may store additional
- data. The information updated in this step must be inferable from both
- ``cpu_get_tb_cpu_state()`` and ``cpu_restore_state()``.
- 3. Call ``tcg_gen_exit_tb()`` passing the address of the current TB and
- the jump slot index again.
- Step 1, ``tcg_gen_goto_tb()``, will emit a ``goto_tb`` TCG
- instruction that later on gets translated to a jump to an address
- associated with the specified jump slot. Initially, this is the address
- of step 2's instructions, which update the CPU state information. Step 3,
- ``tcg_gen_exit_tb()``, exits from the current TB returning a tagged
- pointer composed of the last executed TB’s address and the jump slot
- index.
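A tagged pointer of this kind can be modelled by storing the jump slot index in the low bits of an aligned address. The Python sketch below assumes that TB addresses are at least 4-byte aligned; ``SLOT_MASK``, ``pack_exit`` and ``unpack_exit`` are hypothetical names, and the exact encoding QEMU uses may differ.

```python
SLOT_MASK = 0x3  # assumption: TB addresses are at least 4-byte aligned


def pack_exit(tb_addr, slot):
    """Combine a TB address and a jump slot index into one integer."""
    if tb_addr & SLOT_MASK:
        raise ValueError("TB address must be aligned")
    if slot not in (0, 1):
        raise ValueError("jump slot index must be 0 or 1")
    return tb_addr | slot


def unpack_exit(value):
    """Split a tagged value back into (tb_addr, slot)."""
    return value & ~SLOT_MASK, value & SLOT_MASK
```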
- The first time this whole sequence is executed, step 1 simply jumps
- to step 2. Then the CPU state information gets updated and we exit from
- the current TB. As a result, the behavior is very similar to the less
- optimized form described earlier in this section.
- Next, the main loop looks for the next TB to execute using the
- current CPU state information (creating the TB if it wasn’t already
- available) and, before starting to execute the new TB’s instructions,
- patches the previously executed TB by associating one of its jump
- slots (the one specified in the call to ``tcg_gen_exit_tb()``) with the
- address of the new TB.
- The next time this previous TB is executed and we get to that same
- ``goto_tb`` step, it will already be patched (assuming the destination TB
- is still in memory) and will jump directly to the first instruction of
- the destination TB, without going back to the main loop.
- For the ``goto_tb + exit_tb`` mechanism to be used, the following
- conditions need to be satisfied:
- * The change in CPU state must be constant, e.g., a direct branch and
- not an indirect branch.
- * The direct branch cannot cross a page boundary. Memory mappings
- may change, causing the code at the destination address to change.
- Note that, in step 3 (``tcg_gen_exit_tb()``), in addition to the
- jump slot index, the address of the TB just executed is also returned.
- This address corresponds to the TB that will be patched; it may be
- different from the one that was directly executed from the main loop
- if the latter had already been chained to other TBs.
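The patching sequence can be simulated with a small Python model. It is a sketch under simplifying assumptions: every TB ends in a single direct branch that uses jump slot 0, the guest code contains no loops, and ``TB``, ``execute`` and ``main_loop`` are hypothetical names rather than QEMU functions.

```python
class TB:
    def __init__(self, pc, next_pc):
        self.pc = pc
        self.next_pc = next_pc          # constant branch destination
        self.jump_slots = [None, None]  # None means "not patched yet"


def execute(tb, cpu):
    """Run tb and every TB chained to it; return (last_tb, slot) like exit_tb."""
    while True:
        cpu["trace"].append(tb.pc)
        chained = tb.jump_slots[0]      # step 1: goto_tb through slot 0
        if chained is None:
            cpu["pc"] = tb.next_pc      # step 2: update the CPU state
            return tb, 0                # step 3: exit_tb(tb, slot)
        tb = chained                    # patched: jump straight to the next TB


def main_loop(tbs, cpu):
    """Execute until no TB exists for the current PC; return the exit count."""
    exits = 0
    last = None
    while cpu["pc"] in tbs:
        tb = tbs[cpu["pc"]]
        if last is not None:
            last_tb, slot = last
            last_tb.jump_slots[slot] = tb  # patch the TB that exited
        last = execute(tb, cpu)
        exits += 1
    return exits
```

Running three TBs in sequence returns to the main loop three times on the first pass and only once on the second, because the first two have been chained. On that second pass ``execute`` returns the last TB in the chain rather than the one the main loop entered, which is why the exit value has to carry the TB address.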
- Self-modifying code and translated code invalidation
- ----------------------------------------------------
- Self-modifying code is a special challenge in x86 emulation because no
- instruction cache invalidation is signaled by the application when code
- is modified.
- User-mode emulation marks a host page as write-protected (if it is
- not already read-only) every time translated code is generated for a
- basic block. Then, if the page is written to, Linux raises
- a SIGSEGV signal. QEMU then invalidates all the translated code in the page
- and re-enables write accesses to the page. For system emulation, write
- protection is achieved through the software MMU.
- Correct translated code invalidation is done efficiently by maintaining
- a linked list of every translated block contained in a given page. Other
- linked lists are also maintained to undo direct block chaining.
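The per-page bookkeeping can be illustrated with a small Python model. Note the substitutions: it uses dictionaries and sets where the text describes linked lists, assumes a 4 KiB page size, and ``PageIndex`` with its methods is a hypothetical name chosen for this example.

```python
from collections import defaultdict

PAGE_SIZE = 4096  # assumption for illustration


class PageIndex:
    """Toy index from guest page number to the TBs translated from that page."""

    def __init__(self):
        self.tbs_by_page = defaultdict(set)
        self.valid = set()

    def add_tb(self, tb_id, start, size):
        """Register a TB covering guest addresses [start, start + size)."""
        self.valid.add(tb_id)
        first = start // PAGE_SIZE
        last = (start + size - 1) // PAGE_SIZE
        for page in range(first, last + 1):
            self.tbs_by_page[page].add(tb_id)

    def invalidate_write(self, addr):
        """Invalidate every TB that has code on the page containing addr."""
        victims = self.tbs_by_page.pop(addr // PAGE_SIZE, set())
        self.valid -= victims
        for tbs in self.tbs_by_page.values():
            tbs -= victims  # a TB spanning two pages is dropped from both
        return victims
```

A write only has to consult the entry for its own page, so the cost of invalidation scales with the number of TBs on that page rather than with the total amount of translated code.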
- On RISC targets, correctly written software uses memory barriers and
- cache flushes, so some of the protection above would not be
- necessary. However, QEMU still requires that the generated code always
- matches the target instructions in memory in order to handle
- exceptions correctly.
- Exception support
- -----------------
- ``longjmp()`` is used when an exception such as division by zero is
- encountered.
- The host SIGSEGV and SIGBUS signal handlers are used to catch invalid
- memory accesses. QEMU keeps a map from host program counter to
- target program counter, and looks up where the exception happened
- based on the host program counter at the exception point.
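One way to picture this lookup is a sorted table searched by bisection, as in the Python sketch below. The table layout is an assumption made for illustration (one entry per guest instruction, recording where its host code starts); ``PcMap`` and ``guest_pc_for`` are hypothetical names, and QEMU's real encoding of this information may be quite different.

```python
import bisect


class PcMap:
    """Sorted (host_pc, guest_pc) pairs, one per translated guest instruction."""

    def __init__(self, entries):
        self.entries = sorted(entries)
        self.host_starts = [host for host, _ in self.entries]

    def guest_pc_for(self, host_pc):
        """Return the guest PC of the instruction whose host code covers host_pc."""
        # Find the last entry whose host start address is <= host_pc.
        i = bisect.bisect_right(self.host_starts, host_pc) - 1
        if i < 0:
            raise LookupError("host PC precedes all translated code")
        return self.entries[i][1]
```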
- On some targets, some bits of the virtual CPU's state are not flushed to
- memory until the end of the translation block. This is done for internal
- emulation state that is rarely accessed directly by the program and/or changes
- very often throughout the execution of a translation block---this includes
- condition codes on x86, delay slots on SPARC, conditional execution on
- Arm, and so on. This state is stored for each target instruction, and
- looked up on exceptions.
- MMU emulation
- -------------
- For system emulation, QEMU uses a software MMU. In that mode, the MMU
- virtual-to-physical address translation is done at every memory
- access.
- QEMU uses an address translation cache (TLB) to speed up the translation.
- In order to avoid flushing the translated code each time the MMU
- mappings change, all caches in QEMU are physically indexed. This
- means that each basic block is indexed with its physical address.
- In order to avoid invalidating the basic block chain when MMU mappings
- change, chaining is only performed when the destination of the jump
- shares a page with the basic block that is performing the jump.
- The MMU can also distinguish RAM and ROM memory areas from MMIO memory
- areas. Access is faster for RAM and ROM because the translation cache also
- hosts the offset between guest address and host memory. Accessing MMIO
- memory areas instead calls out to C code for device emulation.
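The difference between the two access paths can be sketched with a direct-mapped software TLB in Python. All names here (``SoftTLB``, ``page_table``, the ``"ram"`` and ``"mmio"`` tags) and the sizes are assumptions for illustration; the model only shows the idea of caching an offset for RAM pages and calling a handler for MMIO pages.

```python
PAGE_BITS = 12  # assumption: 4 KiB guest pages
TLB_SIZE = 256  # assumption: number of direct-mapped TLB entries


class SoftTLB:
    def __init__(self, page_table):
        # guest page number -> ("ram", addend) or ("mmio", handler)
        self.page_table = page_table
        self.entries = [None] * TLB_SIZE
        self.misses = 0

    def load(self, ram, vaddr):
        """Read one byte from guest address vaddr."""
        page = vaddr >> PAGE_BITS
        slot = page % TLB_SIZE
        entry = self.entries[slot]
        if entry is None or entry[0] != page:
            # TLB miss: walk the (toy) page table and refill the entry.
            self.misses += 1
            kind, value = self.page_table[page]
            entry = (page, kind, value)
            self.entries[slot] = entry
        _, kind, value = entry
        if kind == "ram":
            return ram[vaddr + value]  # fast path: cached guest-to-host offset
        return value(vaddr)            # slow path: device emulation callback
```

For a RAM page, ``value`` plays the role of the cached offset between guest address and host memory, so a hit costs one addition and one memory access.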
- Finally, the MMU helps track dirty pages and pages pointed to by
- translation blocks.
- Profiling JITted code
- ---------------------
- The Linux ``perf`` tool treats all JITted code as a single block because,
- unlike for the main code, it cannot use debug information to link individual
- program counter samples with larger functions. To overcome this
- limitation, use the ``-perfmap`` or the ``-jitdump`` option to generate
- map files. ``-perfmap`` is lightweight and produces only guest-host mappings.
- ``-jitdump`` additionally saves JITted code and guest debug information (if
- available); its output needs to be integrated with the ``perf.data`` file
- before the final report can be viewed.
- .. code::
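-   # -perfmap: lightweight, guest-host mappings only
-   perf record $QEMU -perfmap $REMAINING_ARGS
-   perf report
-   # -jitdump: merge the dump into perf.data before viewing the report
-   perf record -k 1 $QEMU -jitdump $REMAINING_ARGS
-   DEBUGINFOD_URLS= perf inject -j -i perf.data -o perf.data.jitted
-   perf report -i perf.data.jitted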
- Note that qemu-system generates mappings only for ``-kernel`` files in ELF
- format.