===========
iOS Support
===========

To run qemu on the iOS platform, some modifications were required. Most of the
modifications are conditioned on the ``CONFIG_IOS`` and ``CONFIG_NO_RWX``
configuration variables.

Build support
-------------

For the code to compile, certain changes in the block driver and the slirp
driver had to be made. There is no ``system()`` call, so code requiring it had
to be disabled.

``ucontext`` support is broken on iOS. The implementation from ``libucontext``
is used instead.

Because ``fork()`` is not allowed in iOS apps, the option to build qemu and the
utilities as shared libraries is added. Note that because qemu does not perform
resource cleanup in most cases (open files, allocated memory, etc.), it is
advisable that the user implement a proxy layer for syscalls so resources can
be tracked by the app that uses qemu as a shared library.

Executable memory locking
-------------------------

The iOS kernel does not permit ``mmap()`` with
``PROT_READ | PROT_WRITE | PROT_EXEC``. However, it does allow allocating pages
with only ``PROT_READ | PROT_WRITE`` and then later calling ``mprotect()`` with
``PROT_READ | PROT_EXEC``. A page can never be both writable and executable.

In this document, we will refer to a page that is read-writable as "unlocked"
and a page that is read-executable as "locked." Because ``mprotect()`` is an
expensive call, we defer calling it for as long as possible and avoid calling
it unless it is absolutely needed.

One approach would be to unlock the entire TCG region while a TB translation is
being done and then lock the entire region when a TB is about to be executed.
This would require thousands of pages to be locked and unlocked all the time.
Additionally, it means that different vCPU threads cannot share the same TB
cache.

TB allocation changes
---------------------

To improve performance, we first notice that ``tcg_tb_alloc()`` returns a
chunk of memory that must be unlocked. A recent change in qemu places the TB
structure close to the code buffer in order to improve cache locality and
reduce code size and memory usage. Unfortunately, we have to regress this
improvement, as any benefit from it is negated by the need to unlock the
memory whenever we need to mutate the TB structure.

We go back to the old method of statically allocating a large buffer for all
TBs in a region. However, a few improvements are made. First, we try to respect
locality by placing this buffer close to the code. Second, whenever we flush
the TB cache, we use the average size of code blocks to divide up the TCG
region into space for TB structures and space for code blocks.

Locked memory water level
-------------------------

By moving the TB allocation, we made it such that memory only needs to be
unlocked in the context of ``tb_gen_code()``. Because the code buffer pointer
only grows upwards (we never "free" code blocks, so there are no holes), we
only ever need to unlock at most one page.

We can think of the entire TCG region as divided into two sections: the locked
section and the unlocked section. At the start, the entire region is unlocked.
As more and more code blocks are generated, the allocation pointer moves
upwards. We can then lock any memory below the allocation pointer, as the
generated code is immutable. Therefore, we keep a second pointer to the
highest page boundary the allocation pointer has passed and keep all the
memory below that pointer (all the way to the start of the region) locked and
all the memory above it unlocked. This pointer is our locked water level.

That way, assuming all pages are unlocked at the start, we progressively lock
more pages as more code is generated. The only page we ever need to unlock is
the page pointed to by our locked water level pointer.

In ``tb_gen_code()`` we call ``mprotect()`` on at most one page in order to
unlock the top of the water level (if it is currently locked). In
``cpu_tb_exec()`` we call ``mprotect()`` on all pages below the water level
that are currently unlocked. This will, in most cases, be one or zero pages,
the exception being when multiple pages of code were generated without being
executed.

Multiple threads
----------------

Additional consideration is needed to handle multiple threads. We do not
permit one vCPU to execute code generated by another vCPU if the end of that
code is located at the owning thread's TCG region's locked water level. The
reason is that without synchronization between threads, we cannot guarantee
whether the page at the water level is locked or unlocked.

There are multiple places this may happen: when a TB is being looked up in the
main loop, when a TB is being looked up as part of ``goto_tb``, and in the TB
chain caches (where after lookup, we encode a jump so a future call to the
first TB will immediately jump into the second TB without a lookup).

Since adding synchronization is expensive (holding one thread idle while
another is generating code defeats the purpose of parallel TCG contexts), we
implement a lock-less solution. In each TB, we store a pointer to the water
level pointer. Whenever a TB is looked up, we check that either 1) the TB
belongs to the current thread, and therefore we can ensure the memory is
locked during execution, or 2) the water level of the TCG context that the TB
belongs to is beyond the end of the TB's code block. This does mean that
multiple TCG contexts might redundantly generate the same code if multiple
vCPUs all decide to execute the same block of code at the same time. This
should not happen too often.

Similarly, for the TB chain cache, we will only chain a TB if either 1) both
TBs' code buffer end pointers reside in the same page, and therefore if the
memory is locked to execute the first TB, we can jump to the second TB without
issue, or 2) the second TB's code block fully resides below the locked water
level of its TCG context. This means that in some cases (such as when two
newly minted TBs from two threads happen to be chained), we will not chain the
TBs when we first see them, but only after a few subsequent executions, once
the locked water level has risen.