===========
iOS Support
===========

To run qemu on the iOS platform, some modifications were required. Most of the
modifications are conditioned on the ``CONFIG_IOS`` and ``CONFIG_NO_RWX``
configuration variables.

Build support
-------------

For the code to compile, certain changes in the block driver and the slirp
driver had to be made. There is no ``system()`` call, so code requiring it had
to be disabled.

``ucontext`` support is broken on iOS. The implementation from ``libucontext``
is used instead.

Because ``fork()`` is not allowed in iOS apps, the option to build qemu and the
utilities as shared libraries is added. Note that because qemu does not perform
resource cleanup in most cases (open files, allocated memory, etc.), it is
advisable that the user implement a proxy layer for syscalls so that resources
can be tracked by the app that uses qemu as a shared library.
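
As an illustration of that suggestion, the sketch below shows a hypothetical
proxy layer for ``open()`` and ``close()`` that lets the embedding app track
file descriptors left open by qemu. None of these names exist in qemu; a real
integration would cover many more syscalls and resource types.

.. code-block:: c

    /* Hypothetical proxy layer, not part of qemu. The embedding app routes
     * qemu's file accesses through these wrappers so it knows which
     * descriptors are still open when the shared library is torn down. */
    #include <fcntl.h>
    #include <stdbool.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define MAX_TRACKED_FDS 1024
    static bool tracked_fds[MAX_TRACKED_FDS];

    int proxy_open(const char *path, int flags, mode_t mode)
    {
        int fd = open(path, flags, mode);
        if (fd >= 0 && fd < MAX_TRACKED_FDS) {
            tracked_fds[fd] = true;
        }
        return fd;
    }

    int proxy_close(int fd)
    {
        if (fd >= 0 && fd < MAX_TRACKED_FDS) {
            tracked_fds[fd] = false;
        }
        return close(fd);
    }

    /* Called by the app after a qemu "run" finishes to release leaked fds. */
    void proxy_cleanup(void)
    {
        for (int fd = 0; fd < MAX_TRACKED_FDS; fd++) {
            if (tracked_fds[fd]) {
                close(fd);
                tracked_fds[fd] = false;
            }
        }
    }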

Executable memory locking
-------------------------

The iOS kernel does not permit ``mmap()`` pages with
``PROT_READ | PROT_WRITE | PROT_EXEC``. However, it does allow allocating pages
with only ``PROT_READ | PROT_WRITE`` and then later calling ``mprotect()`` with
``PROT_READ | PROT_EXEC``. A page can never be both writable and executable.

In this document, we will refer to a page that is read-writable as "unlocked"
and a page that is read-executable as "locked." Because ``mprotect()`` is an
expensive call, we try to defer calling it for as long as possible and avoid
calling it unless it is absolutely necessary.
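
A minimal sketch of these primitives, using hypothetical helper names (qemu's
actual helpers and buffer setup differ):

.. code-block:: c

    #include <stddef.h>
    #include <sys/mman.h>

    /* The code buffer is mapped read-write; requesting RWX would be
     * rejected by the iOS kernel. */
    static void *alloc_code_buffer(size_t len)
    {
        return mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANON, -1, 0);
    }

    /* "Unlock": make the page(s) read-writable so code can be emitted. */
    static int unlock_pages(void *page_aligned_start, size_t len)
    {
        return mprotect(page_aligned_start, len, PROT_READ | PROT_WRITE);
    }

    /* "Lock": make the page(s) read-executable so code can be run. */
    static int lock_pages(void *page_aligned_start, size_t len)
    {
        return mprotect(page_aligned_start, len, PROT_READ | PROT_EXEC);
    }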

One approach would be to unlock the entire TCG region when a TB translation is
being done and then lock the entire region when a TB is about to be executed.
This would require thousands of pages to be locked and unlocked all the time.
Additionally, it means that different vCPU threads cannot share the same TB
cache.

TB allocation changes
---------------------

To improve performance, we first notice that ``tcg_tb_alloc()`` returns a
chunk of memory that must be unlocked. A recent change in qemu places the TB
structure close to the code buffer in order to improve cache locality and
reduce code size and memory usage. Unfortunately, we have to regress this
improvement, as any benefit from it is negated by the need to unlock the
memory whenever we need to mutate the TB structure.

We go back to the old method of statically allocating a large buffer for all
TBs in a region. However, a few improvements are made. First, we try to respect
locality by placing this buffer close to the code. Second, whenever we flush
the TB cache, we use the average size of the code blocks to divide the TCG
region into space for TB structures and space for code blocks.
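
The division itself is simple arithmetic. The sketch below uses hypothetical
names and ignores alignment concerns: the average code block size observed
before the flush determines how many TB structures the region is sized for.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    struct region_split {
        void *tb_start;     /* array of TB structures             */
        size_t max_tbs;     /* how many TB structures fit         */
        void *code_start;   /* start of the code block area       */
        size_t code_size;   /* bytes available for generated code */
    };

    static struct region_split split_region(void *region, size_t region_size,
                                            size_t sizeof_tb,
                                            size_t avg_code_block_size)
    {
        struct region_split s;

        /* Each translation needs one TB structure plus, on average,
         * avg_code_block_size bytes of generated code. */
        s.max_tbs = region_size / (sizeof_tb + avg_code_block_size);
        s.tb_start = region;
        s.code_start = (uint8_t *)region + s.max_tbs * sizeof_tb;
        s.code_size = region_size - s.max_tbs * sizeof_tb;
        return s;
    }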

Locked memory water level
-------------------------

By moving the TB allocation, we made it such that the memory only needs to be
unlocked in the context of ``tb_gen_code()``. Because the code buffer pointer
only grows in one direction (we do not ever "free" code blocks or leave holes),
we only ever need to unlock at most one page.

We can think of the entire TCG region as divided into two sections: the locked
section and the unlocked section. At the start, the entire region is unlocked.
As more and more code blocks are generated, the allocation pointer moves
upwards. We can then lock any memory below the allocation pointer, as the code
generated there is immutable. Therefore, we keep a second pointer to the
highest page boundary the allocation pointer has passed, and keep all the
memory below that pointer (all the way to the start of the region) locked and
all the memory above it unlocked. This pointer is our locked water level.

That way, assuming all pages are unlocked at the start, we will progressively
lock more pages as more code is generated. The only page we ever need to unlock
would be the page pointed to by our locked water level pointer.

In ``tb_gen_code()``, we will call ``mprotect()`` on at most one page in order
to unlock the page at the top of the water level (if it is currently locked).
In ``cpu_tb_exec()``, we will call ``mprotect()`` on all pages below the water
level that are currently unlocked. This will, in most cases, be one or zero
pages, with the exception being when multiple pages of code were generated
without being executed.
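
The following simplified sketch puts the two call sites together. All names
are hypothetical, ``start`` and ``locked_until`` are assumed to be page
aligned, and error handling is omitted; qemu's actual bookkeeping differs.

.. code-block:: c

    #include <stdint.h>
    #include <sys/mman.h>

    #define WX_PAGE_SIZE 16384u /* iOS uses 16 KiB pages */
    #define PAGE_FLOOR(p) ((uintptr_t)(p) & ~(uintptr_t)(WX_PAGE_SIZE - 1))
    #define PAGE_CEIL(p)  PAGE_FLOOR((uintptr_t)(p) + WX_PAGE_SIZE - 1)

    struct tcg_region_wx {
        uint8_t *start;        /* start of the code area (page aligned)     */
        uint8_t *alloc_ptr;    /* next free byte; only ever moves forward   */
        uint8_t *locked_until; /* everything in [start, locked_until) is RX */
    };

    /* Generation path (conceptually in tb_gen_code()): unlock at most one
     * page so new code can be emitted at the allocation pointer. */
    static void region_unlock_for_codegen(struct tcg_region_wx *r)
    {
        uint8_t *top_page = (uint8_t *)PAGE_FLOOR(r->alloc_ptr);

        if (r->locked_until > top_page) {
            mprotect(top_page, r->locked_until - top_page,
                     PROT_READ | PROT_WRITE);
            r->locked_until = top_page;
        }
    }

    /* Execution path (conceptually in cpu_tb_exec()): lock every page below
     * the water level that is still writable. Usually zero or one page,
     * unless several pages of code were generated without being executed. */
    static void region_lock_for_exec(struct tcg_region_wx *r)
    {
        uint8_t *water_level = (uint8_t *)PAGE_CEIL(r->alloc_ptr);

        if (r->locked_until < water_level) {
            mprotect(r->locked_until, water_level - r->locked_until,
                     PROT_READ | PROT_EXEC);
            r->locked_until = water_level;
        }
    }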

Multiple threads
----------------

Additional consideration is needed to handle multiple threads. We do not permit
one vCPU to execute code generated by another vCPU if the end of that code is
located at the other vCPU's TCG region's locked water level. The reason is that
without synchronization between threads, we cannot guarantee whether the page
at the water level is locked or unlocked.

There are multiple places where this may happen: when a TB is being looked up
in the main loop, when a TB is being looked up as part of ``goto_tb``, and in
the TB chain caches (where, after lookup, we encode a jump so a future call to
the first TB will immediately jump into the second TB without a lookup).

Since adding synchronization is expensive (holding one thread idle while
another one generates code defeats the purpose of parallel TCG contexts), we
implement a lock-less solution. In each TB, we store a pointer to the water
level pointer. Whenever a TB is looked up, we check that either 1) the TB
belongs to the current thread, and therefore we can ensure the memory is locked
during execution, or 2) the water level of the TCG context that the TB belongs
to is beyond the end of the TB's code block. This does mean that there might be
redundant code generation done by multiple TCG contexts if multiple vCPUs all
decide to execute the same block of code at the same time. This should not
happen too often.
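
A sketch of that check, with hypothetical field names (qemu's actual structures
differ): ``code_end`` is the end of the TB's code block and ``water_level_ptr``
points at the water level pointer of the TCG context that generated the TB.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    struct tb_wx_info {
        const uint8_t *code_end;         /* end of this TB's code block  */
        uint8_t *const *water_level_ptr; /* &owning_context->water_level */
    };

    static bool tb_safe_to_execute(const struct tb_wx_info *tb,
                                   uint8_t *const *my_water_level_ptr)
    {
        /* 1) The TB belongs to the current thread's TCG context, which
         *    locks its own memory before executing. */
        if (tb->water_level_ptr == my_water_level_ptr) {
            return true;
        }
        /* 2) The owning context's water level has risen beyond the end of
         *    this TB's code block, so its pages stay locked. */
        return *tb->water_level_ptr > tb->code_end;
    }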

Similarly, for the TB chain cache, we will only chain a TB if either 1) both
TBs' code buffer end pointers reside in the same page, and therefore if the
memory is locked to execute the first TB, we can jump to the second TB without
issue, or 2) the second TB's code block fully resides below the locked water
level of its TCG context. This means that in some cases (such as when two newly
minted TBs from two threads happen to be chained), we will not chain the TBs
when we first see them but will only chain them after a few subsequent
executions, once the locked water level has risen.
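
A matching sketch of the chaining condition, again with hypothetical names and
the 16 KiB iOS page size assumed; ``struct tb_wx_info`` is the structure from
the previous sketch.

.. code-block:: c

    #define WX_PAGE_SIZE 16384u

    static bool tb_safe_to_chain(const struct tb_wx_info *from,
                                 const struct tb_wx_info *to)
    {
        /* 1) Both code blocks end in the same page: if that page is locked
         *    in order to execute the first TB, jumping into the second TB
         *    is safe as well. */
        if ((uintptr_t)from->code_end / WX_PAGE_SIZE ==
            (uintptr_t)to->code_end / WX_PAGE_SIZE) {
            return true;
        }
        /* 2) The second TB's code block lies entirely below the locked
         *    water level of its own TCG context, so it is already locked. */
        return *to->water_level_ptr > to->code_end;
    }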