# Speculative Load Hardening

### A Spectre Variant #1 Mitigation Technique

Author: Chandler Carruth - [chandlerc@google.com](mailto:chandlerc@google.com)

## Problem Statement

Recently, Google Project Zero and other researchers have found information leak
vulnerabilities by exploiting speculative execution in modern CPUs. These
exploits are currently broken down into three variants:
* GPZ Variant #1 (a.k.a. Spectre Variant #1): Bounds check (or predicate) bypass
* GPZ Variant #2 (a.k.a. Spectre Variant #2): Branch target injection
* GPZ Variant #3 (a.k.a. Meltdown): Rogue data cache load

For more details, see the Google Project Zero blog post and the Spectre research
paper:
* https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html
* https://spectreattack.com/spectre.pdf

The core problem of GPZ Variant #1 is that speculative execution uses branch
prediction to select the path of instructions speculatively executed. This path
is speculatively executed with the available data, and may load from memory and
leak the loaded values through various side channels that survive even when the
speculative execution is unwound due to being incorrect. Mispredicted paths can
cause code to be executed with data inputs that never occur in correct
executions, making checks against malicious inputs ineffective and allowing
attackers to use malicious data inputs to leak secret data. Here is an example,
extracted and simplified from the Project Zero paper:
```
struct array {
  unsigned long length;
  unsigned char data[];
};
struct array *arr1 = ...; // small array
struct array *arr2 = ...; // array of size 0x400
unsigned long untrusted_offset_from_caller = ...;
if (untrusted_offset_from_caller < arr1->length) {
  unsigned char value = arr1->data[untrusted_offset_from_caller];
  unsigned long index2 = ((value&1)*0x100)+0x200;
  unsigned char value2 = arr2->data[index2];
}
```

The key to the attack is to call this code with an
`untrusted_offset_from_caller` that is far out of bounds while the branch
predictor predicts that it will be in bounds. In that case, the body of the
`if` will be executed speculatively, and may read secret data into `value` and
leak it via a cache-timing side channel when a dependent access is made to
populate `value2`.

## High Level Mitigation Approach

While several approaches are being actively pursued to mitigate specific
branches and/or loads inside especially risky software (most notably various OS
kernels), these approaches require manual and/or static analysis aided auditing
of code and explicit source changes to apply the mitigation. They are unlikely
to scale well to large applications. We are proposing a comprehensive
mitigation approach that would apply automatically across an entire program
rather than through manual changes to the code. While this is likely to have a
high performance cost, some applications may be in a good position to take this
performance / security tradeoff.

The specific technique we propose is to cause loads to be checked using
branchless code to ensure that they are executing along a valid control flow
path. Consider the following C-pseudo-code representing the core idea of a
predicate guarding potentially invalid loads:
```
void leak(int data);
void example(int* pointer1, int* pointer2) {
  if (condition) {
    // ... lots of code ...
    leak(*pointer1);
  } else {
    // ... more code ...
    leak(*pointer2);
  }
}
```

This would get transformed into something resembling the following:
```
uintptr_t all_ones_mask = std::numeric_limits<uintptr_t>::max();
uintptr_t all_zeros_mask = 0;
void leak(int data);
void example(int* pointer1, int* pointer2) {
  uintptr_t predicate_state = all_ones_mask;
  if (condition) {
    // Assuming ?: is implemented using branchless logic...
    predicate_state = !condition ? all_zeros_mask : predicate_state;
    // ... lots of code ...
    //
    // Harden the pointer so it can't be loaded
    pointer1 &= predicate_state;
    leak(*pointer1);
  } else {
    predicate_state = condition ? all_zeros_mask : predicate_state;
    // ... more code ...
    //
    // Alternative: Harden the loaded value
    int value2 = *pointer2 & predicate_state;
    leak(value2);
  }
}
```
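The pseudo-code above is illustrative rather than valid C++ (for instance,
`pointer1 &= predicate_state;` does not type-check on a pointer). A compilable
sketch of the pointer-masking step, with the casts made explicit (the function
name is ours, not part of the proposal):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

const uintptr_t all_ones_mask = std::numeric_limits<uintptr_t>::max();
const uintptr_t all_zeros_mask = 0;

// Destructively mask a pointer with the accumulated predicate state: a no-op
// on correctly predicted paths (state == all-ones), and a collapse to the
// (unmapped) zero address on misspeculated paths (state == all-zeros).
int *harden_pointer(int *pointer, uintptr_t predicate_state) {
  return reinterpret_cast<int *>(
      reinterpret_cast<uintptr_t>(pointer) & predicate_state);
}
```

On a correct path the load through the hardened pointer proceeds unchanged; on
a misspeculated path it targets address zero, which must not be mapped (see the
limitations below).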

The result should be that if the `if (condition) {` branch is mispredicted,
there is a *data* dependency on the condition used to zero out any pointers
prior to loading through them or to zero out all of the loaded bits. Even
though this code pattern may still execute speculatively, *invalid* speculative
executions are prevented from leaking secret data from memory (but note that
this data might still be loaded in safe ways, and some regions of memory are
required to not hold secrets, see below for detailed limitations). This
approach only requires that the underlying hardware have a way to implement a
branchless and unpredicted conditional update of a register's value. All modern
architectures have support for this, and in fact such support is necessary to
correctly implement constant time cryptographic primitives.
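As a concrete illustration of such a branchless conditional update, here is a
sketch (our names, not part of the proposal) of the kind of constant-time
select used in constant-time cryptographic code; compilers typically lower the
predicate-state update to a conditional move on x86:

```cpp
#include <cassert>
#include <cstdint>

// Branchless select: returns `a` when mask is all-ones and `b` when mask is
// all-zeros, with no conditional branch for a predictor to misspeculate.
uint64_t ct_select(uint64_t mask, uint64_t a, uint64_t b) {
  return (a & mask) | (b & ~mask);
}

// Branchless predicate-state update in the convention of the pseudo-code
// above (all-ones == valid): force the state to all-zeros if the guarding
// condition did not actually hold.
uint64_t update_state(uint64_t state, bool condition_holds) {
  // (uint64_t)condition_holds - 1 is 0 when true and all-ones when false.
  uint64_t false_mask = static_cast<uint64_t>(condition_holds) - 1;
  return state & ~false_mask;
}
```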

Crucial properties of this approach:
* It is not preventing any particular side-channel from working. This is
  important as there are an unknown number of potential side channels and we
  expect to continue discovering more. Instead, it prevents the observation of
  secret data in the first place.
* It accumulates the predicate state, protecting even in the face of nested
  *correctly* predicted control flows.
* It passes this predicate state across function boundaries to provide
  [interprocedural protection](#interprocedural-checking).
* When hardening the address of a load, it uses a *destructive* or
  *non-reversible* modification of the address to prevent an attacker from
  reversing the check using attacker-controlled inputs.
* It does not completely block speculative execution, and merely prevents
  *mis*-speculated paths from leaking secrets from memory (and stalls
  speculation until this can be determined).
* It is completely general and makes no fundamental assumptions about the
  underlying architecture other than the ability to do branchless conditional
  data updates and a lack of value prediction.
* It does not require programmers to identify all possible secret data using
  static source code annotations or to identify code vulnerable to a variant
  #1 style attack.

Limitations of this approach:
* It requires re-compiling source code to insert hardening instruction
  sequences. Only software compiled in this mode is protected.
* The performance is heavily dependent on a particular architecture's
  implementation strategy. We outline a potential x86 implementation below and
  characterize its performance.
* It does not defend against secret data already loaded from memory and
  residing in registers or leaked through other side-channels in
  non-speculative execution. Code dealing with this, e.g. cryptographic
  routines, already uses constant-time algorithms and code to prevent
  side-channels. Such code should also scrub registers of secret data following
  [these
  guidelines](https://github.com/HACS-workshop/spectre-mitigations/blob/master/crypto_guidelines.md).
* To achieve reasonable performance, many loads may not be checked, such as
  those with compile-time fixed addresses. This primarily consists of accesses
  at compile-time constant offsets of global and local variables. Code which
  needs this protection and intentionally stores secret data must ensure the
  memory regions used for secret data are necessarily dynamic mappings or heap
  allocations. This is an area which can be tuned to provide more comprehensive
  protection at the cost of performance.
* [Hardened loads](#hardening-the-address-of-the-load) may still load data from
  _valid_ addresses if not _attacker-controlled_ addresses. To prevent these
  from reading secret data, the low 2GB of the address space and 2GB above and
  below any executable pages should be protected.

Credit:
* The core idea of tracing misspeculation through data and marking pointers to
  block misspeculated loads was developed as part of a HACS 2018 discussion
  between Chandler Carruth, Paul Kocher, Thomas Pornin, and several other
  individuals.
* Core idea of masking out loaded bits was part of the original mitigation
  suggested by Jann Horn when these attacks were reported.


### Indirect Branches, Calls, and Returns

It is possible to attack control flow other than conditional branches with
variant #1 style mispredictions.
* A prediction towards a hot call target of a virtual method can lead to it
  being speculatively executed when an expected type is used (often called
  "type confusion").
* A hot case may be speculatively executed due to prediction instead of the
  correct case for a switch statement implemented as a jump table.
* A hot common return address may be predicted incorrectly when returning from
  a function.

These code patterns are also vulnerable to Spectre variant #2, and as such are
best mitigated with a
[retpoline](https://support.google.com/faqs/answer/7625886) on x86 platforms.
When a mitigation technique like retpoline is used, speculation simply cannot
proceed through an indirect control flow edge (or it cannot be mispredicted in
the case of a filled RSB) and so it is also protected from variant #1 style
attacks. However, some architectures, micro-architectures, or vendors do not
employ the retpoline mitigation, and on future x86 hardware (both Intel and
AMD) it is expected to become unnecessary due to hardware-based mitigation.

When not using a retpoline, these edges will need independent protection from
variant #1 style attacks. The analogous approach to that used for conditional
control flow should work:
```
uintptr_t all_ones_mask = std::numeric_limits<uintptr_t>::max();
uintptr_t all_zeros_mask = 0;
void leak(int data);
void example(int* pointer1, int* pointer2) {
  uintptr_t predicate_state = all_ones_mask;
  switch (condition) {
  case 0:
    // Assuming ?: is implemented using branchless logic...
    predicate_state = (condition != 0) ? all_zeros_mask : predicate_state;
    // ... lots of code ...
    //
    // Harden the pointer so it can't be loaded
    pointer1 &= predicate_state;
    leak(*pointer1);
    break;

  case 1:
    predicate_state = (condition != 1) ? all_zeros_mask : predicate_state;
    // ... more code ...
    //
    // Alternative: Harden the loaded value
    int value2 = *pointer2 & predicate_state;
    leak(value2);
    break;

    // ...
  }
}
```

The core idea remains the same: validate the control flow using data-flow and
use that validation to check that loads cannot leak information along
misspeculated paths. Typically this involves passing the desired target of such
control flow across the edge and checking that it is correct afterwards. Note
that while it is tempting to think that this mitigates variant #2 attacks, it
does not. Those attacks go to arbitrary gadgets that don't include the checks.
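The "pass the intended target across the edge and re-check it" idea can be
sketched in C++ as follows (names are ours; this is a model, not the generated
code):

```cpp
#include <cassert>
#include <cstdint>

// Each destination of an indirect transfer re-derives the predicate state
// from the value that selected it. If speculation delivered control to the
// wrong destination, `actual != expected` forces the state to all-zeros,
// poisoning any subsequent hardened loads.
uint64_t check_arrival(uint64_t state, uint64_t expected, uint64_t actual) {
  // Branchless form of: (expected == actual) ? state : 0. The comparison
  // result is stretched into an all-ones or all-zeros mask.
  uint64_t equal_mask = 0ULL - static_cast<uint64_t>(expected == actual);
  return state & equal_mask;
}
```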


### Variant #1.1 and #1.2 attacks: "Bounds Check Bypass Store"

Beyond the core variant #1 attack, there are techniques to extend this attack.
The primary technique is known as "Bounds Check Bypass Store" and is discussed
in this research paper: https://people.csail.mit.edu/vlk/spectre11.pdf

We will analyze these two variants independently. First, variant #1.1 works by
speculatively storing over the return address after a bounds check bypass. This
speculative store then ends up being used by the CPU during speculative
execution of the return, potentially directing speculative execution to
arbitrary gadgets in the binary. Let's look at an example.
```
unsigned char local_buffer[4];
unsigned char *untrusted_data_from_caller = ...;
unsigned long untrusted_size_from_caller = ...;
if (untrusted_size_from_caller < sizeof(local_buffer)) {
  // Speculative execution enters here with a too-large size.
  memcpy(local_buffer, untrusted_data_from_caller,
         untrusted_size_from_caller);
  // The stack has now been smashed, writing an attacker-controlled
  // address over the return address.
  minor_processing(local_buffer);
  return;
  // Control will speculate to the attacker-written address.
}
```

However, this can be mitigated by hardening the load of the return address just
like any other load. This is sometimes complicated because x86 for example
*implicitly* loads the return address off the stack. However, the
implementation technique below is specifically designed to mitigate this
implicit load by using the stack pointer to communicate misspeculation between
functions. This additionally causes a misspeculation to have an invalid stack
pointer and never be able to read the speculatively stored return address. See
the detailed discussion below.

For variant #1.2, the attacker speculatively stores into the vtable or jump
table used to implement an indirect call or indirect jump. Because this is
speculative, this will often be possible even when these are stored in
read-only pages. For example:
```
class FancyObject : public BaseObject {
public:
  void DoSomething() override;
};
void f(unsigned long attacker_offset, unsigned long attacker_data) {
  FancyObject object = getMyObject();
  unsigned long *arr[4] = getFourDataPointers();
  if (attacker_offset < 4) {
    // We have bypassed the bounds check speculatively.
    unsigned long *data = arr[attacker_offset];
    // Now we have computed a pointer inside of `object`, the vptr.
    *data = attacker_data;
    // The vptr points to the virtual table and we speculatively clobber that.
    g(object); // Hand the object to some other routine.
  }
}
// In another file, we call a method on the object.
void g(BaseObject &object) {
  object.DoSomething();
  // This speculatively calls the address stored over the vtable.
}
```

Mitigating this requires hardening loads from these locations, or mitigating
the indirect call or indirect jump. Any of these is sufficient to block the
call or jump from using a speculatively stored value that has been read back.

For both of these, using retpolines would be equally sufficient. One possible
hybrid approach is to use retpolines for indirect calls and jumps, while
relying on SLH to mitigate returns.

Another approach that is sufficient for both of these is to harden all of the
speculative stores. However, as most stores aren't interesting and don't
inherently leak data, this is expected to be prohibitively expensive given the
attack it is defending against.


## Implementation Details

There are a number of complex details impacting the implementation of this
technique, both on a particular architecture and within a particular compiler.
We discuss proposed implementation techniques for the x86 architecture and the
LLVM compiler. These are primarily to serve as an example; other
implementation techniques are very possible.


### x86 Implementation Details

On the x86 platform we break down the implementation into three core
components: accumulating the predicate state through the control flow graph,
checking the loads, and checking control transfers between procedures.


#### Accumulating Predicate State

Consider baseline x86 instructions like the following, which test three
conditions and, if all pass, load data from memory and potentially leak it
through some side channel:
```
# %bb.0:                                # %entry
        pushq   %rax
        testl   %edi, %edi
        jne     .LBB0_4
# %bb.1:                                # %then1
        testl   %esi, %esi
        jne     .LBB0_4
# %bb.2:                                # %then2
        testl   %edx, %edx
        je      .LBB0_3
.LBB0_4:                                # %exit
        popq    %rax
        retq
.LBB0_3:                                # %danger
        movl    (%rcx), %edi
        callq   leak
        popq    %rax
        retq
```

When we go to speculatively execute the load, we want to know whether any of
the dynamically executed predicates have been misspeculated. To track that,
along each conditional edge, we need to track the data which would allow that
edge to be taken. On x86, this data is stored in the flags register used by the
conditional jump instruction. Along both edges after this fork in control flow,
the flags register remains alive and contains data that we can use to build up
our accumulated predicate state. We accumulate it using the x86 conditional
move instruction which also reads the flag registers where the state resides.
These conditional move instructions are known to not be predicted on any x86
processors, making them immune to misprediction that could reintroduce the
vulnerability. When we insert the conditional moves, the code ends up looking
like the following:
```
# %bb.0:                                # %entry
        pushq   %rax
        xorl    %eax, %eax              # Zero out initial predicate state.
        movq    $-1, %r8                # Put all-ones mask into a register.
        testl   %edi, %edi
        jne     .LBB0_1
# %bb.2:                                # %then1
        cmovneq %r8, %rax               # Conditionally update predicate state.
        testl   %esi, %esi
        jne     .LBB0_1
# %bb.3:                                # %then2
        cmovneq %r8, %rax               # Conditionally update predicate state.
        testl   %edx, %edx
        je      .LBB0_4
.LBB0_1:
        cmoveq  %r8, %rax               # Conditionally update predicate state.
        popq    %rax
        retq
.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        ...
```

Here we create the "empty" or "correct execution" predicate state by zeroing
`%rax`, and we create a constant "incorrect execution" predicate value by
putting `-1` into `%r8`. Then, along each edge coming out of a conditional
branch, we do a conditional move that in a correct execution will be a no-op,
but if misspeculated, will replace `%rax` with the value of `%r8`.
Misspeculating any one of the three predicates will cause `%rax` to hold the
"incorrect execution" value from `%r8`, as we preserve incoming values when
execution is correct rather than overwriting them.

We now have a value in `%rax` in each basic block that indicates if at some
point previously a predicate was mispredicted. And we have arranged for that
value to be particularly effective when used below to harden loads.
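The accumulation scheme can be modeled in C++ (a sketch with our names; the
real implementation works on the flags register and `cmov`s, not on booleans):

```cpp
#include <cassert>
#include <cstdint>

// 0 models "correct execution" and all-ones models "misspeculated", matching
// the %rax / %r8 values in the assembly above.
uint64_t accumulate(uint64_t state, bool edge_condition_holds) {
  // The cmov: keep the incoming state when the edge's condition actually
  // holds, otherwise latch the all-ones "incorrect execution" value. Once
  // poisoned, the state stays poisoned for the rest of the path.
  return edge_condition_holds ? state : ~0ULL;
}
```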


##### Indirect Call, Branch, and Return Predicates

(Not yet implemented.)

There is no analogous flag to use when tracing indirect calls, branches, and
returns. The predicate state must be accumulated through some other means.
Fundamentally, this is the reverse of the problem posed in CFI: we need to
check where we came from rather than where we are going. For function-local
jump tables, this is easily arranged by testing the input to the jump table
within each destination:
```
        pushq   %rax
        xorl    %eax, %eax              # Zero out initial predicate state.
        movq    $-1, %r8                # Put all-ones mask into a register.
        jmpq    *.LJTI0_0(,%rdi,8)      # Indirect jump through table.
.LBB0_2:                                # %sw.bb
        cmpq    $0, %rdi                # Validate index used for jump table.
        cmovneq %r8, %rax               # Conditionally update predicate state.
        ...
        jmp     _Z4leaki                # TAILCALL

.LBB0_3:                                # %sw.bb1
        cmpq    $1, %rdi                # Validate index used for jump table.
        cmovneq %r8, %rax               # Conditionally update predicate state.
        ...
        jmp     _Z4leaki                # TAILCALL

.LBB0_5:                                # %sw.bb10
        cmpq    $2, %rdi                # Validate index used for jump table.
        cmovneq %r8, %rax               # Conditionally update predicate state.
        ...
        jmp     _Z4leaki                # TAILCALL
        ...

        .section        .rodata,"a",@progbits
        .p2align        3
.LJTI0_0:
        .quad   .LBB0_2
        .quad   .LBB0_3
        .quad   .LBB0_5
        ...
```

Returns have a simple mitigation technique on x86-64 (or other ABIs which have
what is called a "red zone" region beyond the end of the stack). This region is
guaranteed to be preserved across interrupts and context switches, making the
return address used in returning to the current code remain on the stack and
valid to read. We can emit code in the caller to verify that a return edge was
not mispredicted:
```
        callq   other_function
return_addr:
        cmpq    $return_addr, -8(%rsp)  # Validate return address.
        cmovneq %r8, %rax               # Update predicate state.
```
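The same check can be modeled in C++ (a sketch, names ours): the caller knows
the address it expects the return to arrive at and compares it with the return
address slot still readable in the red zone.

```cpp
#include <cassert>
#include <cstdint>

// Model of the post-call check: `expected_return_addr` is the label
// immediately after the call; `stack_slot` is the return address still
// sitting below %rsp in the red zone. Here 0 means "correct execution" and
// all-ones means "misspeculated", matching the assembly's convention; a
// mismatch ORs all-ones into the state, poisoning it irreversibly.
uint64_t check_return(uint64_t state, uint64_t expected_return_addr,
                      uint64_t stack_slot) {
  uint64_t mismatch_mask =
      0ULL - static_cast<uint64_t>(stack_slot != expected_return_addr);
  return state | mismatch_mask;
}
```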

For an ABI without a "red zone" (and thus unable to read the return address
from the stack), mitigating returns faces similar problems to those of calls
below.

Indirect calls (and returns in the absence of a red zone ABI) pose the most
significant challenge to propagate. The simplest technique would be to define a
new ABI such that the intended call target is passed into the called function
and checked in the entry. Unfortunately, new ABIs are quite expensive to deploy
in C and C++. While the target function could be passed in TLS, we would still
require complex logic to handle a mixture of functions compiled with and
without this extra logic (essentially, making the ABI backwards compatible).
Currently, we suggest using retpolines here and will continue to investigate
ways of mitigating this.


##### Optimizations, Alternatives, and Tradeoffs

Merely accumulating predicate state involves significant cost. There are
several key optimizations we employ to minimize this and various alternatives
that present different tradeoffs in the generated code.

First, we work to reduce the number of instructions used to track the state:
* Rather than inserting a `cmovCC` instruction along every conditional edge in
  the original program, we track each set of condition flags we need to capture
  prior to entering each basic block and reuse a common `cmovCC` sequence for
  those.
  * We could further reuse suffixes when there are multiple `cmovCC`
    instructions required to capture the set of flags. Currently this is
    believed to not be worth the cost as paired flags are relatively rare and
    suffixes of them are exceedingly rare.
* A common pattern in x86 is to have multiple conditional jump instructions
  that use the same flags but handle different conditions. Naively, we could
  consider each fallthrough between them an "edge" but this causes a much more
  complex control flow graph. Instead, we accumulate the set of conditions
  necessary for fallthrough and use a sequence of `cmovCC` instructions in a
  single fallthrough edge to track it.

Second, we trade register pressure for simpler `cmovCC` instructions by
allocating a register for the "bad" state. We could read that value from memory
as part of the conditional move instruction, however, this creates more
micro-ops and requires the load-store unit to be involved. Currently, we place
the value into a virtual register and allow the register allocator to decide
when the register pressure is sufficient to make it worth spilling to memory
and reloading.


#### Hardening Loads

Once we have the predicate accumulated into a special value for correct vs.
misspeculated, we need to apply this to loads in a way that ensures they do not
leak secret data. There are two primary techniques for this: we can either
harden the loaded value to prevent observation, or we can harden the address
itself to prevent the load from occurring. These have significantly different
performance tradeoffs.


##### Hardening loaded values

The most appealing way to harden loads is to mask out all of the bits loaded.
The key requirement is that for each bit loaded, along the misspeculated path
that bit is always fixed at either 0 or 1 regardless of the value of the bit
loaded. The most obvious implementation uses either an `and` instruction with
an all-zero mask along misspeculated paths and an all-one mask along correct
paths, or an `or` instruction with an all-one mask along misspeculated paths
and an all-zero mask along correct paths. Other options become less appealing,
such as multiplying by zero or using multiple shift instructions. For reasons
we elaborate on below, we end up suggesting you use `or` with an all-ones
mask, making the x86 instruction sequence look like the following:
```
        ...

.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        movl    (%rsi), %edi            # Load potentially secret data from %rsi.
        orl     %eax, %edi
```

Other useful patterns may be to fold the load into the `or` instruction itself
at the cost of a register-to-register copy.
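In C++ terms (a sketch, name ours), the `or`-based value hardening amounts to:

```cpp
#include <cassert>
#include <cstdint>

// Harden a loaded value with the predicate state from the assembly above:
// the state is 0 on correct paths (so the OR is a no-op) and all-ones on
// misspeculated paths (so every loaded bit is forced to 1 and nothing secret
// can be observed through the result).
uint32_t harden_value(uint32_t loaded, uint32_t predicate_state) {
  return loaded | predicate_state;
}
```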

There are some challenges with deploying this approach:
1. Many loads on x86 are folded into other instructions. Separating them would
   add very significant and costly register pressure with prohibitive
   performance cost.
1. Loads may not target a general purpose register, requiring extra
   instructions to map the state value into the correct register class, and
   potentially more expensive instructions to mask the value in some way.
1. The flags registers on x86 are very likely to be live, and challenging to
   preserve cheaply.
1. There are many more values loaded than pointers & indices used for loads. As
   a consequence, hardening the result of a load requires substantially more
   instructions than hardening the address of the load (see below).

Despite these challenges, hardening the result of the load critically allows
the load to proceed and thus has dramatically less impact on the total
speculative / out-of-order potential of the execution. There are also several
interesting techniques to try and mitigate these challenges and make hardening
the results of loads viable in at least some cases. However, we generally
expect to fall back when unprofitable from hardening the loaded value to the
next approach of hardening the address itself.


###### Loads folded into data-invariant operations can be hardened after the operation

The first key to making this feasible is to recognize that many operations on
x86 are "data-invariant". That is, they have no (known) observable behavior
differences due to the particular input data. These instructions are often used
when implementing cryptographic primitives dealing with private key data
because they are not believed to provide any side-channels. Similarly, we can
defer hardening until after them as they will not in-and-of-themselves
introduce a speculative execution side-channel. This results in code sequences
that look like:
```
        ...

.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        addl    (%rsi), %edi            # Load and accumulate without leaking.
        orl     %eax, %edi
```

While an addition happens to the loaded (potentially secret) value, that
doesn't leak any data and we then immediately harden it.
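The deferral can be sketched in C++ (our names): because the addition is
data-invariant, hardening the sum once is equivalent, for leak-prevention
purposes, to hardening the loaded value before the add.

```cpp
#include <cassert>
#include <cstdint>

// Defer hardening past a data-invariant operation: accumulate first, then
// apply the predicate state once. On a misspeculated path (state == all-ones)
// every bit of the result is forced to 1 regardless of the loaded value.
uint32_t load_add_hardened(uint32_t acc, uint32_t loaded, uint32_t state) {
  return (acc + loaded) | state;
}
```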


###### Hardening of loaded values deferred down the data-invariant expression graph

We can generalize the previous idea and sink the hardening down the expression
graph across as many data-invariant operations as desirable. This can use very
conservative rules for whether something is data-invariant. The primary goal
should be to handle multiple loads with a single hardening instruction:
```
        ...

.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        addl    (%rsi), %edi            # Load and accumulate without leaking.
        addl    4(%rsi), %edi           # Continue without leaking.
        addl    8(%rsi), %edi
        orl     %eax, %edi              # Mask out bits from all three loads.
```


###### Preserving the flags while hardening loaded values on Haswell, Zen, and newer processors

Sadly, there are no useful instructions on x86 that apply a mask to all 64 bits
without touching the flag registers. However, we can harden loaded values that
are narrower than a word (fewer than 32 bits on 32-bit systems and fewer than
64 bits on 64-bit systems) by zero-extending the value to the full word size
and then shifting right by at least the number of original bits using the BMI2
`shrx` instruction:
```
        ...

.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        addl    (%rsi), %edi            # Load and accumulate 32 bits of data.
        shrxq   %rax, %rdi, %rdi        # Shift out all 32 bits loaded.
```

Because on x86 the zero-extend is free, this can efficiently harden the loaded
value.
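The effect can be modeled in C++ (a sketch, name ours). `shrx` uses only the
low 6 bits of the 64-bit shift count, so the all-ones state shifts by 63,
which clears a zero-extended 32-bit value without touching the flags:

```cpp
#include <cassert>
#include <cstdint>

// Model of the shrxq-based hardening: `loaded` is a 32-bit value
// zero-extended into a 64-bit register, and `state` is the predicate state
// (0 for correct execution, all-ones for misspeculation).
uint64_t harden_narrow(uint32_t loaded, uint64_t state) {
  uint64_t widened = loaded;       // Zero-extend (free on x86).
  return widened >> (state & 63);  // shrx masks the shift count to 6 bits.
}
```

A shift of 63 leaves only bit 63, which is always zero after the zero-extend;
this is why the trick only works for values narrower than the word size.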


##### Hardening the address of the load

When hardening the loaded value is inapplicable, most often because the
instruction directly leaks information (like `cmp` or `jmpq`), we switch to
hardening the _address_ of the load instead of the loaded value. This avoids
increasing register pressure by unfolding the load or paying some other high
cost.

To understand how this works in practice, we need to examine the exact
semantics of the x86 addressing modes which, in their fully general form, look
like `offset(%base,%index,scale)`. Here `%base` and `%index` are 64-bit
registers that can potentially be any value, and may be attacker controlled,
and `scale` and `offset` are fixed immediate values. `scale` must be `1`, `2`,
`4`, or `8`, and `offset` can be any 32-bit sign extended value. The exact
computation performed to find the address is then: `%base + (scale * %index) +
offset` under 64-bit 2's complement modular arithmetic.
|
|
|
+
|
|
|
+One issue with this approach is that, after hardening, the `%base + (scale *
|
|
|
+%index)` subexpression will compute a value near zero (`-1 + (scale * -1)`) and
|
|
|
+then a large, positive `offset` will index into memory within the first two
|
|
|
+gigabytes of address space. While these offsets are not attacker controlled,
|
|
|
the attacker could choose to attack a load which happens to have the desired
|
|
|
+offset and then successfully read memory in that region. This significantly
|
|
|
+raises the burden on the attacker and limits the scope of attack but does not
|
|
|
+eliminate it. To fully close the attack we must work with the operating system
|
|
|
+to preclude mapping memory in the low two gigabytes of address space.
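To make the wraparound arithmetic concrete, here is a hypothetical C sketch of the effective-address computation and of what it produces once `%base` and `%index` have both been hardened to `-1`:

```c
#include <stdint.h>

/* Hypothetical sketch: the x86 effective-address computation under 64-bit
   2's complement modular arithmetic. `offset` is a 32-bit value that is
   sign-extended to 64 bits before the addition. */
static uint64_t effective_address(uint64_t base, uint64_t index,
                                  uint64_t scale, int32_t offset) {
    return base + scale * index + (uint64_t)(int64_t)offset;
}
```

With `base` and `index` both hardened to `-1` and `scale == 8`, the address computed is `offset - 9`; a large, positive `offset` therefore still lands within the low two gigabytes, which is why the OS must keep that region unmapped to fully close the attack.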
|
|
|
+
|
|
|
+
|
|
|
+###### 64-bit load checking instructions
|
|
|
+
|
|
|
+We can use the following instruction sequences to check loads. We set up `%r8`
|
|
|
+in these examples to hold the special value of `-1` which will be `cmov`ed over
|
|
|
+`%rax` in misspeculated paths.
|
|
|
+
|
|
|
+Single register addressing mode:
|
|
|
+```
|
|
|
+ ...
|
|
|
+
|
|
|
+.LBB0_4: # %danger
|
|
|
+ cmovneq %r8, %rax # Conditionally update predicate state.
|
|
|
+ orq %rax, %rsi # Mask the pointer if misspeculating.
|
|
|
+ movl (%rsi), %edi
|
|
|
+```
|
|
|
+
|
|
|
+Two register addressing mode:
|
|
|
+```
|
|
|
+ ...
|
|
|
+
|
|
|
+.LBB0_4: # %danger
|
|
|
+ cmovneq %r8, %rax # Conditionally update predicate state.
|
|
|
+ orq %rax, %rsi # Mask the pointer if misspeculating.
|
|
|
+ orq %rax, %rcx # Mask the index if misspeculating.
|
|
|
+ movl (%rsi,%rcx), %edi
|
|
|
+```
|
|
|
+
|
|
|
+This will result in a negative address near zero or in `offset` wrapping the
|
|
|
+address space back to a small positive address. Small, negative addresses will
|
|
|
+fault in user-mode for most operating systems, but targets which need the high
|
|
|
+address space to be user accessible may need to adjust the exact sequence used
|
|
|
+above. Additionally, the low addresses will need to be marked unreadable by the
|
|
|
+OS to fully harden the load.
|
|
|
+
|
|
|
+
|
|
|
+###### RIP-relative addressing is even easier to break
|
|
|
+
|
|
|
+There is a common addressing mode idiom that is substantially harder to check:
|
|
|
+addressing relative to the instruction pointer. We cannot change the value of
|
|
|
+the instruction pointer register and so we have the harder problem of forcing
|
|
|
+`%base + scale * %index + offset` to be an invalid address, by *only* changing
|
|
|
+`%index`. The only advantage we have is that the attacker also cannot modify
|
|
|
+`%base`. If we use the fast instruction sequence above, but only apply it to
|
|
|
+the index, we will always access `%rip + (scale * -1) + offset`. If the
|
|
|
attacker can find a load for which this address happens to point to secret
|
|
|
+data, then they can reach it. However, the loader and base libraries can also
|
|
|
+simply refuse to map the heap, data segments, or stack within 2gb of any of the
|
|
|
text in the program, much as they can reserve the low 2gb of address space.
|
|
|
+
|
|
|
+
|
|
|
+###### The flag registers again make everything hard
|
|
|
+
|
|
|
+Unfortunately, the technique of using `orq`-instructions has a serious flaw on
|
|
|
+x86. The very thing that makes it easy to accumulate state, the flag registers
|
|
|
containing predicates, causes serious problems here because they may be live
|
|
|
+and used by the loading instruction or subsequent instructions. On x86, the
|
|
|
+`orq` instruction **sets** the flags and will override anything already there.
|
|
|
+This makes inserting them into the instruction stream very hazardous.
|
|
|
+Unfortunately, unlike when hardening the loaded value, we have no fallback here
|
|
|
+and so we must have a fully general approach available.
|
|
|
+
|
|
|
+The first thing we must do when generating these sequences is try to analyze
|
|
|
the surrounding code to prove that the flags are not in fact live or being
|
|
|
used. Typically, the flags have been set by some other instruction which just
happens to write the flags register (much like ours!) with no actual
dependency. In those
|
|
|
cases, it is safe to directly insert these instructions. Alternatively, we may
|
|
|
+be able to move them earlier to avoid clobbering the used value.
|
|
|
+
|
|
|
+However, this may ultimately be impossible. In that case, we need to preserve
|
|
|
+the flags around these instructions:
|
|
|
+```
|
|
|
+ ...
|
|
|
+
|
|
|
+.LBB0_4: # %danger
|
|
|
+ cmovneq %r8, %rax # Conditionally update predicate state.
|
|
|
+ pushfq
|
|
|
+ orq %rax, %rcx # Mask the pointer if misspeculating.
|
|
|
+ orq %rax, %rdx # Mask the index if misspeculating.
|
|
|
+ popfq
|
|
|
+ movl (%rcx,%rdx), %edi
|
|
|
+```
|
|
|
+
|
|
|
+Using the `pushf` and `popf` instructions saves the flags register around our
|
|
|
+inserted code, but comes at a high cost. First, we must store the flags to the
|
|
|
+stack and reload them. Second, this causes the stack pointer to be adjusted
|
|
|
dynamically, requiring that a frame pointer be used for referring to temporaries
|
|
|
+spilled to the stack, etc.
|
|
|
+
|
|
|
+On newer x86 processors we can use the `lahf` and `sahf` instructions to save
|
|
|
+all of the flags besides the overflow flag in a register rather than on the
|
|
|
+stack. We can then use `seto` and `add` to save and restore the overflow flag
|
|
|
+in a register. Combined, this will save and restore flags in the same manner as
|
|
|
above but using two registers rather than the stack. That is still very
expensive, although slightly less so than `pushf` and `popf` in most cases.
|
|
|
+
|
|
|
+
|
|
|
+###### A flag-less alternative on Haswell, Zen and newer processors
|
|
|
+
|
|
|
+Starting with the BMI2 x86 instruction set extensions available on Haswell and
|
|
|
+Zen processors, there is an instruction for shifting that does not set any
|
|
|
+flags: `shrx`. We can use this and the `lea` instruction to implement analogous
|
|
|
code sequences to the above ones. However, these are marginally slower, as
most modern x86 processors have fewer ports able to dispatch shift
instructions than ports able to dispatch `or` instructions.
|
|
|
+
|
|
|
+Fast, single register addressing mode:
|
|
|
+```
|
|
|
+ ...
|
|
|
+
|
|
|
+.LBB0_4: # %danger
|
|
|
+ cmovneq %r8, %rax # Conditionally update predicate state.
|
|
|
+ shrxq %rax, %rsi, %rsi # Shift away bits if misspeculating.
|
|
|
+ movl (%rsi), %edi
|
|
|
+```
|
|
|
+
|
|
|
This will collapse the register to zero or one, causing everything but the
offset in the addressing mode to be less than or equal to 9. This means the full
|
|
|
+address can only be guaranteed to be less than `(1 << 31) + 9`. The OS may wish
|
|
|
to protect an extra page of the low address space to account for this.
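The collapse performed by this flag-free address hardening can be sketched in C (a hypothetical sketch with a made-up name; like `shrx`, the shift uses only the low 6 bits of its count, so a state of `-1` shifts by 63):

```c
#include <stdint.h>

/* Hypothetical sketch: with state == -1 the shift count is 63, collapsing
   any 64-bit register to its top bit (0 or 1); with state == 0 the
   register passes through unchanged. After hardening, base + scale * index
   is at most 1 + 8 * 1 = 9, bounding the full address by (1 << 31) + 9. */
static uint64_t harden_addr_reg(uint64_t reg, uint64_t state) {
    return reg >> (state & 63);
}
```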
|
|
|
+
|
|
|
+
|
|
|
+##### Optimizations
|
|
|
+
|
|
|
+A very large portion of the cost for this approach comes from checking loads in
|
|
|
+this way, so it is important to work to optimize this. However, beyond making
|
|
|
+the instruction sequences to *apply* the checks efficient (for example by
|
|
|
+avoiding `pushfq` and `popfq` sequences), the only significant optimization is
|
|
|
+to check fewer loads without introducing a vulnerability. We apply several
|
|
|
+techniques to accomplish that.
|
|
|
+
|
|
|
+
|
|
|
+###### Don't check loads from compile-time constant stack offsets
|
|
|
+
|
|
|
+We implement this optimization on x86 by skipping the checking of loads which
|
|
|
+use a fixed frame pointer offset.
|
|
|
+
|
|
|
+The result of this optimization is that patterns like reloading a spilled
|
|
|
+register or accessing a global field don't get checked. This is a very
|
|
|
+significant performance win.
|
|
|
+
|
|
|
+
|
|
|
+###### Don't check dependent loads
|
|
|
+
|
|
|
+A core part of why this mitigation strategy works is that it establishes a
|
|
|
+data-flow check on the loaded address. However, this means that if the address
|
|
|
+itself was already loaded using a checked load, there is no need to check a
|
|
|
+dependent load provided it is within the same basic block as the checked load,
|
|
|
+and therefore has no additional predicates guarding it. Consider code like the
|
|
|
+following:
|
|
|
+```
|
|
|
+ ...
|
|
|
+
|
|
|
+.LBB0_4: # %danger
|
|
|
+ movq (%rcx), %rdi
|
|
|
+ movl (%rdi), %edx
|
|
|
+```
|
|
|
+
|
|
|
+This will get transformed into:
|
|
|
+```
|
|
|
+ ...
|
|
|
+
|
|
|
+.LBB0_4: # %danger
|
|
|
+ cmovneq %r8, %rax # Conditionally update predicate state.
|
|
|
+ orq %rax, %rcx # Mask the pointer if misspeculating.
|
|
|
+ movq (%rcx), %rdi # Hardened load.
|
|
|
+ movl (%rdi), %edx # Unhardened load due to dependent addr.
|
|
|
+```
|
|
|
+
|
|
|
+This doesn't check the load through `%rdi` as that pointer is dependent on a
|
|
|
+checked load already.
|
|
|
+
|
|
|
+
|
|
|
+###### Protect large, load-heavy blocks with a single lfence
|
|
|
+
|
|
|
+It may be worth using a single `lfence` instruction at the start of a block
|
|
|
+which begins with a (very) large number of loads that require independent
|
|
|
+protection *and* which require hardening the address of the load. However, this
|
|
|
+is unlikely to be profitable in practice. The latency hit of the hardening
|
|
|
+would need to exceed that of an `lfence` when *correctly* speculatively
|
|
|
+executed. But in that case, the `lfence` cost is a complete loss of speculative
|
|
|
+execution (at a minimum). So far, the evidence we have of the performance cost
|
|
|
+of using `lfence` indicates few if any hot code patterns where this trade off
|
|
|
+would make sense.
|
|
|
+
|
|
|
+
|
|
|
+###### Tempting optimizations that break the security model
|
|
|
+
|
|
|
+Several optimizations were considered which didn't pan out due to failure to
|
|
|
+uphold the security model. One in particular is worth discussing as many others
|
|
|
+will reduce to it.
|
|
|
+
|
|
|
+We wondered whether only the *first* load in a basic block could be checked. If
|
|
|
+the check works as intended, it forms an invalid pointer that doesn't even
|
|
|
+virtual-address translate in the hardware. It should fault very early on in its
|
|
|
+processing. Maybe that would stop things in time for the misspeculated path to
|
|
|
+fail to leak any secrets. This doesn't end up working because the processor is
|
|
|
+fundamentally out-of-order, even in its speculative domain. As a consequence,
|
|
|
+the attacker could cause the initial address computation itself to stall and
|
|
|
+allow an arbitrary number of unrelated loads (including attacked loads of
|
|
|
+secret data) to pass through.
|
|
|
+
|
|
|
+
|
|
|
+#### Interprocedural Checking
|
|
|
+
|
|
|
+Modern x86 processors may speculate into called functions and out of functions
|
|
|
+to their return address. As a consequence, we need a way to check loads that
|
|
|
+occur after a misspeculated predicate but where the load and the misspeculated
|
|
|
+predicate are in different functions. In essence, we need some interprocedural
|
|
|
+generalization of the predicate state tracking. A primary challenge to passing
|
|
|
+the predicate state between functions is that we would like to not require a
|
|
|
+change to the ABI or calling convention in order to make this mitigation more
|
|
|
+deployable, and further would like code mitigated in this way to be easily
|
|
|
+mixed with code not mitigated in this way and without completely losing the
|
|
|
+value of the mitigation.
|
|
|
+
|
|
|
+
|
|
|
+##### Embed the predicate state into the high bit(s) of the stack pointer
|
|
|
+
|
|
|
+We can use the same technique that allows hardening pointers to pass the
|
|
|
+predicate state into and out of functions. The stack pointer is trivially
|
|
|
+passed between functions and we can test for it having the high bits set to
|
|
|
+detect when it has been marked due to misspeculation. The callsite instruction
|
|
|
+sequence looks like (assuming a misspeculated state value of `-1`):
|
|
|
+```
|
|
|
+ ...
|
|
|
+
|
|
|
+.LBB0_4: # %danger
|
|
|
+ cmovneq %r8, %rax # Conditionally update predicate state.
|
|
|
+ shlq $47, %rax
|
|
|
+ orq %rax, %rsp
|
|
|
+ callq other_function
|
|
|
+ movq %rsp, %rax
|
|
|
        sarq $63, %rax          # Sign extend the high bit to all bits.
|
|
|
+```
|
|
|
+
|
|
|
+This first puts the predicate state into the high bits of `%rsp` before calling
|
|
|
the function and then reads it back out of the high bits of `%rsp` afterward. When
|
|
|
+correctly executing (speculatively or not), these are all no-ops. When
|
|
|
+misspeculating, the stack pointer will end up negative. We arrange for it to
|
|
|
+remain a canonical address, but otherwise leave the low bits alone to allow
|
|
|
+stack adjustments to proceed normally without disrupting this. Within the
|
|
|
+called function, we can extract this predicate state and then reset it on
|
|
|
+return:
|
|
|
+```
|
|
|
+other_function:
|
|
|
+ # prolog
|
|
|
|
|
|
+ movq %rsp, %rax
|
|
|
        sarq $63, %rax          # Sign extend the high bit to all bits.
|
|
|
+ # ...
|
|
|
+
|
|
|
+.LBB0_N:
|
|
|
+ cmovneq %r8, %rax # Conditionally update predicate state.
|
|
|
+ shlq $47, %rax
|
|
|
+ orq %rax, %rsp
|
|
|
+ retq
|
|
|
+```
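The round-trip arithmetic can be sketched in C (hypothetical helper names): merging shifts the 0-or-`-1` state left by 47, so a poisoned state sets all the high bits of `%rsp`, keeping it a canonical but invalid negative address, and extraction sign-extends the top bit back to a full-width predicate state:

```c
#include <stdint.h>

/* Hypothetical sketch of the %rsp embedding. `state` is 0 or all-ones. */
static uint64_t embed_state(uint64_t rsp, uint64_t state) {
    return rsp | (state << 47);             /* shlq $47 ; orq into %rsp */
}

static uint64_t extract_state(uint64_t rsp) {
    /* Arithmetic right shift, as performed by sarq $63. */
    return (uint64_t)((int64_t)rsp >> 63);
}
```

Because only the high bits are touched, ordinary stack adjustments in the low bits proceed normally while the state survives the call and return.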
|
|
|
+
|
|
|
+This approach is effective when all code is mitigated in this fashion, and can
|
|
|
+even survive very limited reaches into unmitigated code (the state will
|
|
|
+round-trip in and back out of an unmitigated function, it just won't be
|
|
|
+updated). But it does have some limitations. There is a cost to merging the
|
|
|
+state into `%rsp` and it doesn't insulate mitigated code from misspeculation in
|
|
|
+an unmitigated caller.
|
|
|
+
|
|
|
+There is also an advantage to using this form of interprocedural mitigation: by
|
|
|
+forming these invalid stack pointer addresses we can prevent speculative
|
|
|
returns from successfully reading values speculatively written to the actual
stack. This works first by forming a data-dependency between computing the
|
|
|
+address of the return address on the stack and our predicate state. And even
|
|
|
+when satisfied, if a misprediction causes the state to be poisoned the
|
|
|
+resulting stack pointer will be invalid.
|
|
|
+
|
|
|
+
|
|
|
+##### Rewrite API of internal functions to directly propagate predicate state
|
|
|
+
|
|
|
+(Not yet implemented.)
|
|
|
+
|
|
|
+We have the option with internal functions to directly adjust their API to
|
|
|
+accept the predicate as an argument and return it. This is likely to be
|
|
|
+marginally cheaper than embedding into `%rsp` for entering functions.
|
|
|
+
|
|
|
+
|
|
|
+##### Use `lfence` to guard function transitions
|
|
|
+
|
|
|
+An `lfence` instruction can be used to prevent subsequent loads from
|
|
|
+speculatively executing until all prior mispredicted predicates have resolved.
|
|
|
We can use this as a broader barrier against speculative loads executing between
|
|
|
+functions. We emit it in the entry block to handle calls, and prior to each
|
|
|
+return. This approach also has the advantage of providing the strongest degree
|
|
|
+of mitigation when mixed with unmitigated code by halting all misspeculation
|
|
|
entering a function which is mitigated, regardless of what occurred in the
|
|
|
+caller. However, such a mixture is inherently more risky. Whether this kind of
|
|
|
+mixture is a sufficient mitigation requires careful analysis.
|
|
|
+
|
|
|
+Unfortunately, experimental results indicate that the performance overhead of
|
|
|
+this approach is very high for certain patterns of code. A classic example is
|
|
|
+any form of recursive evaluation engine. The hot, rapid call and return
|
|
|
+sequences exhibit dramatic performance loss when mitigated with `lfence`. This
|
|
|
+component alone can regress performance by 2x or more, making it an unpleasant
|
|
|
+tradeoff even when only used in a mixture of code.
|
|
|
+
|
|
|
+
|
|
|
+##### Use an internal TLS location to pass predicate state
|
|
|
+
|
|
|
+We can define a special thread-local value to hold the predicate state between
|
|
|
+functions. This avoids direct ABI implications by using a side channel between
|
|
|
+callers and callees to communicate the predicate state. It also allows implicit
|
|
|
+zero-initialization of the state, which allows non-checked code to be the first
|
|
|
+code executed.
|
|
|
+
|
|
|
+However, this requires a load from TLS in the entry block, a store to TLS
|
|
|
+before every call and every ret, and a load from TLS after every call. As a
|
|
|
+consequence it is expected to be substantially more expensive even than using
|
|
|
+`%rsp` and potentially `lfence` within the function entry block.
|
|
|
+
|
|
|
+
|
|
|
+##### Define a new ABI and/or calling convention
|
|
|
+
|
|
|
+We could define a new ABI and/or calling convention to explicitly pass the
|
|
|
+predicate state in and out of functions. This may be interesting if none of the
|
|
|
+alternatives have adequate performance, but it makes deployment and adoption
|
|
|
+dramatically more complex, and potentially infeasible.
|
|
|
+
|
|
|
+
|
|
|
+## High-Level Alternative Mitigation Strategies
|
|
|
+
|
|
|
+There are completely different alternative approaches to mitigating variant 1
|
|
|
+attacks. [Most](https://lwn.net/Articles/743265/)
|
|
|
+[discussion](https://lwn.net/Articles/744287/) so far focuses on mitigating
|
|
|
+specific known attackable components in the Linux kernel (or other kernels) by
|
|
|
+manually rewriting the code to contain an instruction sequence that is not
|
|
|
+vulnerable. For x86 systems this is done by either injecting an `lfence`
|
|
|
+instruction along the code path which would leak data if executed speculatively
|
|
|
+or by rewriting memory accesses to have branch-less masking to a known safe
|
|
|
+region. On Intel systems, `lfence` [will prevent the speculative load of secret
|
|
|
+data](https://newsroom.intel.com/wp-content/uploads/sites/11/2018/01/Intel-Analysis-of-Speculative-Execution-Side-Channels.pdf).
|
|
|
+On AMD systems `lfence` is currently a no-op, but can be made
|
|
|
+dispatch-serializing by setting an MSR, and thus preclude misspeculation of the
|
|
|
+code path ([mitigation G-2 +
|
|
|
+V1-1](https://developer.amd.com/wp-content/resources/Managing-Speculation-on-AMD-Processors.pdf)).
|
|
|
+
|
|
|
+However, this relies on finding and enumerating all possible points in code
|
|
|
+which could be attacked to leak information. While in some cases static
|
|
|
+analysis is effective at doing this at scale, in many cases it still relies on
|
|
|
+human judgement to evaluate whether code might be vulnerable. Especially for
|
|
|
+software systems which receive less detailed scrutiny but remain sensitive to
|
|
|
+these attacks, this seems like an impractical security model. We need an
|
|
|
+automatic and systematic mitigation strategy.
|
|
|
+
|
|
|
+
|
|
|
+### Automatic `lfence` on Conditional Edges
|
|
|
+
|
|
|
+A natural way to scale up the existing hand-coded mitigations is simply to
|
|
|
+inject an `lfence` instruction into both the target and fallthrough
|
|
|
+destinations of every conditional branch. This ensures that no predicate or
|
|
|
+bounds check can be bypassed speculatively. However, the performance overhead
|
|
|
+of this approach is, simply put, catastrophic. Yet it remains the only truly
|
|
|
+"secure by default" approach known prior to this effort and serves as the
|
|
|
+baseline for performance.
|
|
|
+
|
|
|
+One attempt to address the performance overhead of this and make it more
|
|
|
+realistic to deploy is [MSVC's /Qspectre
|
|
|
+switch](https://blogs.msdn.microsoft.com/vcblog/2018/01/15/spectre-mitigations-in-msvc/).
|
|
|
+Their technique is to use static analysis within the compiler to only insert
|
|
|
+`lfence` instructions into conditional edges at risk of attack. However,
|
|
|
+[initial](https://arstechnica.com/gadgets/2018/02/microsofts-compiler-level-spectre-fix-shows-how-hard-this-problem-will-be-to-solve/)
|
|
|
+[analysis](https://www.paulkocher.com/doc/MicrosoftCompilerSpectreMitigation.html)
|
|
|
+has shown that this approach is incomplete and only catches a small and limited
|
|
|
+subset of attackable patterns which happen to resemble very closely the initial
|
|
|
+proofs of concept. As such, while its performance is acceptable, it does not
|
|
|
+appear to be an adequate systematic mitigation.
|
|
|
+
|
|
|
+
|
|
|
+## Performance Overhead
|
|
|
+
|
|
|
+The performance overhead of this style of comprehensive mitigation is very
|
|
|
+high. However, it compares very favorably with previously recommended
|
|
|
+approaches such as the `lfence` instruction. Just as users can restrict the
|
|
|
+scope of `lfence` to control its performance impact, this mitigation technique
|
|
|
+could be restricted in scope as well.
|
|
|
+
|
|
|
+However, it is important to understand what it would cost to get a fully
|
|
|
+mitigated baseline. Here we assume targeting a Haswell (or newer) processor and
|
|
|
using all of the tricks to improve performance (which leaves the low 2gb of
the address space unprotected, as well as +/- 2gb around any PC in the
program). We ran both
|
|
|
+Google's microbenchmark suite and a large highly-tuned server built using
|
|
|
+ThinLTO and PGO. All were built with `-march=haswell` to give access to BMI2
|
|
|
+instructions, and benchmarks were run on large Haswell servers. We collected
|
|
|
+data both with an `lfence`-based mitigation and load hardening as presented
|
|
|
+here. The summary is that mitigating with load hardening is 1.77x faster than
|
|
|
+mitigating with `lfence`, and the overhead of load hardening compared to a
|
|
|
+normal program is likely between a 10% overhead and a 50% overhead with most
|
|
|
+large applications seeing a 30% overhead or less.
|
|
|
+
|
|
|
+| Benchmark | `lfence` | Load Hardening | Mitigated Speedup |
|
|
|
+| -------------------------------------- | -------: | -------------: | ----------------: |
|
|
|
+| Google microbenchmark suite | -74.8% | -36.4% | **2.5x** |
|
|
|
+| Large server QPS (using ThinLTO & PGO) | -62% | -29% | **1.8x** |
|
|
|
+
|
|
|
+Below is a visualization of the microbenchmark suite results which helps show
|
|
|
+the distribution of results that is somewhat lost in the summary. The y-axis is
|
|
|
+a log-scale speedup ratio of load hardening relative to `lfence` (up -> faster
|
|
|
+-> better). Each box-and-whiskers represents one microbenchmark which may have
|
|
|
+many different metrics measured. The red line marks the median, the box marks
|
|
|
+the first and third quartiles, and the whiskers mark the min and max.
|
|
|
+
|
|
|
+
|
|
|
+
|
|
|
+We don't yet have benchmark data on SPEC or the LLVM test suite, but we can
|
|
|
+work on getting that. Still, the above should give a pretty clear
|
|
|
+characterization of the performance, and specific benchmarks are unlikely to
|
|
|
+reveal especially interesting properties.
|
|
|
+
|
|
|
+
|
|
|
+### Future Work: Fine Grained Control and API-Integration
|
|
|
+
|
|
|
+The performance overhead of this technique is likely to be very significant and
|
|
|
+something users wish to control or reduce. There are interesting options here
|
|
|
+that impact the implementation strategy used.
|
|
|
+
|
|
|
+One particularly appealing option is to allow both opt-in and opt-out of this
|
|
|
+mitigation at reasonably fine granularity such as on a per-function basis,
|
|
|
+including intelligent handling of inlining decisions -- protected code can be
|
|
|
+prevented from inlining into unprotected code, and unprotected code will become
|
|
|
+protected when inlined into protected code. For systems where only a limited
|
|
|
+set of code is reachable by externally controlled inputs, it may be possible to
|
|
|
+limit the scope of mitigation through such mechanisms without compromising the
|
|
|
+application's overall security. The performance impact may also be focused in a
|
|
|
+few key functions that can be hand-mitigated in ways that have lower
|
|
|
+performance overhead while the remainder of the application receives automatic
|
|
|
+protection.
|
|
|
+
|
|
|
+For both limiting the scope of mitigation or manually mitigating hot functions,
|
|
|
+there needs to be some support for mixing mitigated and unmitigated code
|
|
|
+without completely defeating the mitigation. For the first use case, it would
|
|
|
+be particularly desirable that mitigated code remains safe when being called
|
|
|
+during misspeculation from unmitigated code.
|
|
|
+
|
|
|
+For the second use case, it may be important to connect the automatic
|
|
|
+mitigation technique to explicit mitigation APIs such as what is described in
|
|
|
+http://wg21.link/p0928 (or any other eventual API) so that there is a clean way
|
|
|
+to switch from automatic to manual mitigation without immediately exposing a
|
|
|
+hole. However, the design for how to do this is hard to come up with until the
|
|
|
+APIs are better established. We will revisit this as those APIs mature.
|