# Speculative Load Hardening

### A Spectre Variant #1 Mitigation Technique

Author: Chandler Carruth - [chandlerc@google.com](mailto:chandlerc@google.com)

## Problem Statement

Recently, Google Project Zero and other researchers have found information leak
vulnerabilities by exploiting speculative execution in modern CPUs. These
exploits are currently broken down into three variants:
* GPZ Variant #1 (a.k.a. Spectre Variant #1): Bounds check (or predicate) bypass
* GPZ Variant #2 (a.k.a. Spectre Variant #2): Branch target injection
* GPZ Variant #3 (a.k.a. Meltdown): Rogue data cache load

For more details, see the Google Project Zero blog post and the Spectre research
paper:
* https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html
* https://spectreattack.com/spectre.pdf

The core problem of GPZ Variant #1 is that speculative execution uses branch
prediction to select the path of instructions speculatively executed. This path
is speculatively executed with the available data, and may load from memory and
leak the loaded values through various side channels that survive even when the
speculative execution is unwound due to being incorrect. Mispredicted paths can
cause code to be executed with data inputs that never occur in correct
executions, making checks against malicious inputs ineffective and allowing
attackers to use malicious data inputs to leak secret data. Here is an example,
extracted and simplified from the Project Zero paper:
```
struct array {
  unsigned long length;
  unsigned char data[];
};
struct array *arr1 = ...; // small array
struct array *arr2 = ...; // array of size 0x400
unsigned long untrusted_offset_from_caller = ...;
if (untrusted_offset_from_caller < arr1->length) {
  unsigned char value = arr1->data[untrusted_offset_from_caller];
  unsigned long index2 = ((value&1)*0x100)+0x200;
  unsigned char value2 = arr2->data[index2];
}
```

The key to the attack is to call this code with an
`untrusted_offset_from_caller` that is far out of bounds while the branch
predictor predicts that it will be in bounds. In that case, the body of the
`if` will be executed speculatively, and may read secret data into `value` and
leak it via a cache-timing side channel when a dependent access is made to
populate `value2`.

## High Level Mitigation Approach

While several approaches are being actively pursued to mitigate specific
branches and/or loads inside especially risky software (most notably various OS
kernels), these approaches require manual and/or static analysis aided auditing
of code and explicit source changes to apply the mitigation. They are unlikely
to scale well to large applications. We are proposing a comprehensive
mitigation approach that would apply automatically across an entire program
rather than through manual changes to the code. While this is likely to have a
high performance cost, some applications may be in a good position to take this
performance / security tradeoff.

The specific technique we propose is to cause loads to be checked using
branchless code to ensure that they are executing along a valid control flow
path. Consider the following C-pseudo-code representing the core idea of a
predicate guarding potentially invalid loads:
```
void leak(int data);
void example(int* pointer1, int* pointer2) {
  if (condition) {
    // ... lots of code ...
    leak(*pointer1);
  } else {
    // ... more code ...
    leak(*pointer2);
  }
}
```

This would get transformed into something resembling the following:
```
uintptr_t all_ones_mask = std::numeric_limits<uintptr_t>::max();
uintptr_t all_zeros_mask = 0;
void leak(int data);
void example(int* pointer1, int* pointer2) {
  uintptr_t predicate_state = all_ones_mask;
  if (condition) {
    // Assuming ?: is implemented using branchless logic...
    predicate_state = !condition ? all_zeros_mask : predicate_state;
    // ... lots of code ...
    //
    // Harden the pointer so it can't be loaded
    pointer1 &= predicate_state;
    leak(*pointer1);
  } else {
    predicate_state = condition ? all_zeros_mask : predicate_state;
    // ... more code ...
    //
    // Alternative: Harden the loaded value
    int value2 = *pointer2 & predicate_state;
    leak(value2);
  }
}
```
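The pseudo-code above is illustrative rather than valid C++ (for instance,
`pointer1 &= predicate_state;` does not type-check on a pointer). A compilable
sketch of the pointer-masking step, with the casts made explicit (the function
name is ours, not part of the proposal):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

const uintptr_t all_ones_mask = std::numeric_limits<uintptr_t>::max();
const uintptr_t all_zeros_mask = 0;

// Destructively mask a pointer with the accumulated predicate state: a no-op
// on correctly predicted paths (state == all-ones), and a collapse to the
// (unmapped) zero address on misspeculated paths (state == all-zeros).
int *harden_pointer(int *pointer, uintptr_t predicate_state) {
  return reinterpret_cast<int *>(
      reinterpret_cast<uintptr_t>(pointer) & predicate_state);
}
```

On a correct path the load through the hardened pointer proceeds unchanged; on
a misspeculated path it targets address zero, which must not be mapped (see the
limitations below).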

The result should be that if the `if (condition) {` branch is mispredicted,
there is a *data* dependency on the condition used to zero out any pointers
prior to loading through them or to zero out all of the loaded bits. Even
though this code pattern may still execute speculatively, *invalid* speculative
executions are prevented from leaking secret data from memory (but note that
this data might still be loaded in safe ways, and some regions of memory are
required to not hold secrets, see below for detailed limitations). This
approach only requires that the underlying hardware have a way to implement a
branchless and unpredicted conditional update of a register's value. All modern
architectures have support for this, and in fact such support is necessary to
correctly implement constant time cryptographic primitives.
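As a concrete illustration of such a branchless conditional update, here is a
sketch (our names, not part of the proposal) of the kind of constant-time
select used in constant-time cryptographic code; compilers typically lower the
predicate-state update to a conditional move on x86:

```cpp
#include <cassert>
#include <cstdint>

// Branchless select: returns `a` when mask is all-ones and `b` when mask is
// all-zeros, with no conditional branch for a predictor to misspeculate.
uint64_t ct_select(uint64_t mask, uint64_t a, uint64_t b) {
  return (a & mask) | (b & ~mask);
}

// Branchless predicate-state update in the convention of the pseudo-code
// above (all-ones == valid): force the state to all-zeros if the guarding
// condition did not actually hold.
uint64_t update_state(uint64_t state, bool condition_holds) {
  // (uint64_t)condition_holds - 1 is 0 when true and all-ones when false.
  uint64_t false_mask = static_cast<uint64_t>(condition_holds) - 1;
  return state & ~false_mask;
}
```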

Crucial properties of this approach:
* It is not preventing any particular side-channel from working. This is
  important as there are an unknown number of potential side channels and we
  expect to continue discovering more. Instead, it prevents the observation of
  secret data in the first place.
* It accumulates the predicate state, protecting even in the face of nested
  *correctly* predicted control flows.
* It passes this predicate state across function boundaries to provide
  [interprocedural protection](#interprocedural-checking).
* When hardening the address of a load, it uses a *destructive* or
  *non-reversible* modification of the address to prevent an attacker from
  reversing the check using attacker-controlled inputs.
* It does not completely block speculative execution, and merely prevents
  *mis*-speculated paths from leaking secrets from memory (and stalls
  speculation until this can be determined).
* It is completely general and makes no fundamental assumptions about the
  underlying architecture other than the ability to do branchless conditional
  data updates and a lack of value prediction.
* It does not require programmers to identify all possible secret data using
  static source code annotations or to identify code vulnerable to a variant
  #1 style attack.

Limitations of this approach:
* It requires re-compiling source code to insert hardening instruction
  sequences. Only software compiled in this mode is protected.
* The performance is heavily dependent on a particular architecture's
  implementation strategy. We outline a potential x86 implementation below and
  characterize its performance.
* It does not defend against secret data already loaded from memory and
  residing in registers or leaked through other side-channels in
  non-speculative execution. Code dealing with this, e.g. cryptographic
  routines, already uses constant-time algorithms and code to prevent
  side-channels. Such code should also scrub registers of secret data following
  [these
  guidelines](https://github.com/HACS-workshop/spectre-mitigations/blob/master/crypto_guidelines.md).
* To achieve reasonable performance, many loads may not be checked, such as
  those with compile-time fixed addresses. This primarily consists of accesses
  at compile-time constant offsets of global and local variables. Code which
  needs this protection and intentionally stores secret data must ensure the
  memory regions used for secret data are necessarily dynamic mappings or heap
  allocations. This is an area which can be tuned to provide more comprehensive
  protection at the cost of performance.
* [Hardened loads](#hardening-the-address-of-the-load) may still load data from
  _valid_ addresses if not _attacker-controlled_ addresses. To prevent these
  from reading secret data, the low 2GB of the address space and 2GB above and
  below any executable pages should be protected.

Credit:
* The core idea of tracing misspeculation through data and marking pointers to
  block misspeculated loads was developed as part of a HACS 2018 discussion
  between Chandler Carruth, Paul Kocher, Thomas Pornin, and several other
  individuals.
* Core idea of masking out loaded bits was part of the original mitigation
  suggested by Jann Horn when these attacks were reported.


### Indirect Branches, Calls, and Returns

It is possible to attack control flow other than conditional branches with
variant #1 style mispredictions.
* A prediction towards a hot call target of a virtual method can lead to it
  being speculatively executed when an expected type is used (often called
  "type confusion").
* A hot case may be speculatively executed due to prediction instead of the
  correct case for a switch statement implemented as a jump table.
* A hot common return address may be predicted incorrectly when returning from
  a function.

These code patterns are also vulnerable to Spectre variant #2, and as such are
best mitigated with a
[retpoline](https://support.google.com/faqs/answer/7625886) on x86 platforms.
When a mitigation technique like retpoline is used, speculation simply cannot
proceed through an indirect control flow edge (or it cannot be mispredicted in
the case of a filled RSB) and so it is also protected from variant #1 style
attacks. However, some architectures, micro-architectures, or vendors do not
employ the retpoline mitigation, and on future x86 hardware (both Intel and
AMD) it is expected to become unnecessary due to hardware-based mitigation.

When not using a retpoline, these edges will need independent protection from
variant #1 style attacks. The analogous approach to that used for conditional
control flow should work:
```
uintptr_t all_ones_mask = std::numeric_limits<uintptr_t>::max();
uintptr_t all_zeros_mask = 0;
void leak(int data);
void example(int* pointer1, int* pointer2) {
  uintptr_t predicate_state = all_ones_mask;
  switch (condition) {
  case 0:
    // Assuming ?: is implemented using branchless logic...
    predicate_state = (condition != 0) ? all_zeros_mask : predicate_state;
    // ... lots of code ...
    //
    // Harden the pointer so it can't be loaded
    pointer1 &= predicate_state;
    leak(*pointer1);
    break;

  case 1:
    predicate_state = (condition != 1) ? all_zeros_mask : predicate_state;
    // ... more code ...
    //
    // Alternative: Harden the loaded value
    int value2 = *pointer2 & predicate_state;
    leak(value2);
    break;

    // ...
  }
}
```

The core idea remains the same: validate the control flow using data-flow and
use that validation to check that loads cannot leak information along
misspeculated paths. Typically this involves passing the desired target of such
control flow across the edge and checking that it is correct afterwards. Note
that while it is tempting to think that this mitigates variant #2 attacks, it
does not. Those attacks go to arbitrary gadgets that don't include the checks.
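The "pass the intended target across the edge and re-check it" idea can be
sketched in C++ as follows (names are ours; this is a model, not the generated
code):

```cpp
#include <cassert>
#include <cstdint>

// Each destination of an indirect transfer re-derives the predicate state
// from the value that selected it. If speculation delivered control to the
// wrong destination, `actual != expected` forces the state to all-zeros,
// poisoning any subsequent hardened loads.
uint64_t check_arrival(uint64_t state, uint64_t expected, uint64_t actual) {
  // Branchless form of: (expected == actual) ? state : 0. The comparison
  // result is stretched into an all-ones or all-zeros mask.
  uint64_t equal_mask = 0ULL - static_cast<uint64_t>(expected == actual);
  return state & equal_mask;
}
```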


### Variant #1.1 and #1.2 attacks: "Bounds Check Bypass Store"

Beyond the core variant #1 attack, there are techniques to extend this attack.
The primary technique is known as "Bounds Check Bypass Store" and is discussed
in this research paper: https://people.csail.mit.edu/vlk/spectre11.pdf

We will analyze these two variants independently. First, variant #1.1 works by
speculatively storing over the return address after a bounds check bypass. This
speculative store then ends up being used by the CPU during speculative
execution of the return, potentially directing speculative execution to
arbitrary gadgets in the binary. Let's look at an example.
```
unsigned char local_buffer[4];
unsigned char *untrusted_data_from_caller = ...;
unsigned long untrusted_size_from_caller = ...;
if (untrusted_size_from_caller < sizeof(local_buffer)) {
  // Speculative execution enters here with a too-large size.
  memcpy(local_buffer, untrusted_data_from_caller,
         untrusted_size_from_caller);
  // The stack has now been smashed, writing an attacker-controlled
  // address over the return address.
  minor_processing(local_buffer);
  return;
  // Control will speculate to the attacker-written address.
}
```

However, this can be mitigated by hardening the load of the return address just
like any other load. This is sometimes complicated because x86 for example
*implicitly* loads the return address off the stack. However, the
implementation technique below is specifically designed to mitigate this
implicit load by using the stack pointer to communicate misspeculation between
functions. This additionally causes a misspeculation to have an invalid stack
pointer and never be able to read the speculatively stored return address. See
the detailed discussion below.

For variant #1.2, the attacker speculatively stores into the vtable or jump
table used to implement an indirect call or indirect jump. Because this is
speculative, this will often be possible even when these are stored in
read-only pages. For example:
```
class FancyObject : public BaseObject {
public:
  void DoSomething() override;
};
void f(unsigned long attacker_offset, unsigned long attacker_data) {
  FancyObject object = getMyObject();
  unsigned long *arr[4] = getFourDataPointers();
  if (attacker_offset < 4) {
    // We have bypassed the bounds check speculatively.
    unsigned long *data = arr[attacker_offset];
    // Now we have computed a pointer inside of `object`, the vptr.
    *data = attacker_data;
    // The vptr points to the virtual table and we speculatively clobber that.
    g(object); // Hand the object to some other routine.
  }
}
// In another file, we call a method on the object.
void g(BaseObject &object) {
  object.DoSomething();
  // This speculatively calls the address stored over the vtable.
}
```

Mitigating this requires hardening loads from these locations, or mitigating
the indirect call or indirect jump. Any of these is sufficient to block the
call or jump from using a speculatively stored value that has been read back.

For both of these, using retpolines would be equally sufficient. One possible
hybrid approach is to use retpolines for indirect calls and jumps, while
relying on SLH to mitigate returns.

Another approach that is sufficient for both of these is to harden all of the
speculative stores. However, as most stores aren't interesting and don't
inherently leak data, this is expected to be prohibitively expensive given the
attack it is defending against.


## Implementation Details

There are a number of complex details impacting the implementation of this
technique, both on a particular architecture and within a particular compiler.
We discuss proposed implementation techniques for the x86 architecture and the
LLVM compiler. These are primarily to serve as an example; other
implementation techniques are very possible.


### x86 Implementation Details

On the x86 platform we break down the implementation into three core
components: accumulating the predicate state through the control flow graph,
checking the loads, and checking control transfers between procedures.


#### Accumulating Predicate State

Consider baseline x86 instructions like the following, which test three
conditions and, if all pass, load data from memory and potentially leak it
through some side channel:
```
# %bb.0:                                # %entry
        pushq   %rax
        testl   %edi, %edi
        jne     .LBB0_4
# %bb.1:                                # %then1
        testl   %esi, %esi
        jne     .LBB0_4
# %bb.2:                                # %then2
        testl   %edx, %edx
        je      .LBB0_3
.LBB0_4:                                # %exit
        popq    %rax
        retq
.LBB0_3:                                # %danger
        movl    (%rcx), %edi
        callq   leak
        popq    %rax
        retq
```

When we go to speculatively execute the load, we want to know whether any of
the dynamically executed predicates have been misspeculated. To track that,
along each conditional edge, we need to track the data which would allow that
edge to be taken. On x86, this data is stored in the flags register used by the
conditional jump instruction. Along both edges after this fork in control flow,
the flags register remains alive and contains data that we can use to build up
our accumulated predicate state. We accumulate it using the x86 conditional
move instruction which also reads the flag registers where the state resides.
These conditional move instructions are known to not be predicted on any x86
processors, making them immune to misprediction that could reintroduce the
vulnerability. When we insert the conditional moves, the code ends up looking
like the following:
```
# %bb.0:                                # %entry
        pushq   %rax
        xorl    %eax, %eax              # Zero out initial predicate state.
        movq    $-1, %r8                # Put all-ones mask into a register.
        testl   %edi, %edi
        jne     .LBB0_1
# %bb.2:                                # %then1
        cmovneq %r8, %rax               # Conditionally update predicate state.
        testl   %esi, %esi
        jne     .LBB0_1
# %bb.3:                                # %then2
        cmovneq %r8, %rax               # Conditionally update predicate state.
        testl   %edx, %edx
        je      .LBB0_4
.LBB0_1:
        cmoveq  %r8, %rax               # Conditionally update predicate state.
        popq    %rax
        retq
.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        ...
```

Here we create the "empty" or "correct execution" predicate state by zeroing
`%rax`, and we create a constant "incorrect execution" predicate value by
putting `-1` into `%r8`. Then, along each edge coming out of a conditional
branch, we do a conditional move that in a correct execution will be a no-op,
but if misspeculated, will replace `%rax` with the value of `%r8`.
Misspeculating any one of the three predicates will cause `%rax` to hold the
"incorrect execution" value from `%r8`, as we preserve incoming values when
execution is correct rather than overwriting them.

We now have a value in `%rax` in each basic block that indicates if at some
point previously a predicate was mispredicted. And we have arranged for that
value to be particularly effective when used below to harden loads.
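The accumulation scheme can be modeled in C++ (a sketch with our names; the
real implementation works on the flags register and `cmov`s, not on booleans):

```cpp
#include <cassert>
#include <cstdint>

// 0 models "correct execution" and all-ones models "misspeculated", matching
// the %rax / %r8 values in the assembly above.
uint64_t accumulate(uint64_t state, bool edge_condition_holds) {
  // The cmov: keep the incoming state when the edge's condition actually
  // holds, otherwise latch the all-ones "incorrect execution" value. Once
  // poisoned, the state stays poisoned for the rest of the path.
  return edge_condition_holds ? state : ~0ULL;
}
```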


##### Indirect Call, Branch, and Return Predicates

(Not yet implemented.)

There is no analogous flag to use when tracing indirect calls, branches, and
returns. The predicate state must be accumulated through some other means.
Fundamentally, this is the reverse of the problem posed in CFI: we need to
check where we came from rather than where we are going. For function-local
jump tables, this is easily arranged by testing the input to the jump table
within each destination:
```
        pushq   %rax
        xorl    %eax, %eax              # Zero out initial predicate state.
        movq    $-1, %r8                # Put all-ones mask into a register.
        jmpq    *.LJTI0_0(,%rdi,8)      # Indirect jump through table.
.LBB0_2:                                # %sw.bb
        cmpq    $0, %rdi                # Validate index used for jump table.
        cmovneq %r8, %rax               # Conditionally update predicate state.
        ...
        jmp     _Z4leaki                # TAILCALL

.LBB0_3:                                # %sw.bb1
        cmpq    $1, %rdi                # Validate index used for jump table.
        cmovneq %r8, %rax               # Conditionally update predicate state.
        ...
        jmp     _Z4leaki                # TAILCALL

.LBB0_5:                                # %sw.bb10
        cmpq    $2, %rdi                # Validate index used for jump table.
        cmovneq %r8, %rax               # Conditionally update predicate state.
        ...
        jmp     _Z4leaki                # TAILCALL
        ...

        .section        .rodata,"a",@progbits
        .p2align        3
.LJTI0_0:
        .quad   .LBB0_2
        .quad   .LBB0_3
        .quad   .LBB0_5
        ...
```

Returns have a simple mitigation technique on x86-64 (or other ABIs which have
what is called a "red zone" region beyond the end of the stack). This region is
guaranteed to be preserved across interrupts and context switches, making the
return address used in returning to the current code remain on the stack and
valid to read. We can emit code in the caller to verify that a return edge was
not mispredicted:
```
        callq   other_function
return_addr:
        cmpq    $return_addr, -8(%rsp)  # Validate return address.
        cmovneq %r8, %rax               # Update predicate state.
```
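The same check can be modeled in C++ (a sketch, names ours): the caller knows
the address it expects the return to arrive at and compares it with the return
address slot still readable in the red zone.

```cpp
#include <cassert>
#include <cstdint>

// Model of the post-call check: `expected_return_addr` is the label
// immediately after the call; `stack_slot` is the return address still
// sitting below %rsp in the red zone. Here 0 means "correct execution" and
// all-ones means "misspeculated", matching the assembly's convention; a
// mismatch ORs all-ones into the state, poisoning it irreversibly.
uint64_t check_return(uint64_t state, uint64_t expected_return_addr,
                      uint64_t stack_slot) {
  uint64_t mismatch_mask =
      0ULL - static_cast<uint64_t>(stack_slot != expected_return_addr);
  return state | mismatch_mask;
}
```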

For an ABI without a "red zone" (and thus unable to read the return address
from the stack), mitigating returns faces similar problems to those of calls
below.

Indirect calls (and returns in the absence of a red zone ABI) pose the most
significant challenge to propagate. The simplest technique would be to define a
new ABI such that the intended call target is passed into the called function
and checked in the entry. Unfortunately, new ABIs are quite expensive to deploy
in C and C++. While the target function could be passed in TLS, we would still
require complex logic to handle a mixture of functions compiled with and
without this extra logic (essentially, making the ABI backwards compatible).
Currently, we suggest using retpolines here and will continue to investigate
ways of mitigating this.


##### Optimizations, Alternatives, and Tradeoffs

Merely accumulating predicate state involves significant cost. There are
several key optimizations we employ to minimize this and various alternatives
that present different tradeoffs in the generated code.

First, we work to reduce the number of instructions used to track the state:
* Rather than inserting a `cmovCC` instruction along every conditional edge in
  the original program, we track each set of condition flags we need to capture
  prior to entering each basic block and reuse a common `cmovCC` sequence for
  those.
  * We could further reuse suffixes when there are multiple `cmovCC`
    instructions required to capture the set of flags. Currently this is
    believed to not be worth the cost as paired flags are relatively rare and
    suffixes of them are exceedingly rare.
* A common pattern in x86 is to have multiple conditional jump instructions
  that use the same flags but handle different conditions. Naively, we could
  consider each fallthrough between them an "edge" but this causes a much more
  complex control flow graph. Instead, we accumulate the set of conditions
  necessary for fallthrough and use a sequence of `cmovCC` instructions in a
  single fallthrough edge to track it.

Second, we trade register pressure for simpler `cmovCC` instructions by
allocating a register for the "bad" state. We could read that value from memory
as part of the conditional move instruction, however, this creates more
micro-ops and requires the load-store unit to be involved. Currently, we place
the value into a virtual register and allow the register allocator to decide
when the register pressure is sufficient to make it worth spilling to memory
and reloading.


#### Hardening Loads

Once we have the predicate accumulated into a special value for correct vs.
misspeculated, we need to apply this to loads in a way that ensures they do not
leak secret data. There are two primary techniques for this: we can either
harden the loaded value to prevent observation, or we can harden the address
itself to prevent the load from occurring. These have significantly different
performance tradeoffs.


##### Hardening loaded values

The most appealing way to harden loads is to mask out all of the bits loaded.
The key requirement is that for each bit loaded, along the misspeculated path
that bit is always fixed at either 0 or 1 regardless of the value of the bit
loaded. The most obvious implementation uses either an `and` instruction with
an all-zero mask along misspeculated paths and an all-one mask along correct
paths, or an `or` instruction with an all-one mask along misspeculated paths
and an all-zero mask along correct paths. Other options become less appealing,
such as multiplying by zero or using multiple shift instructions. For reasons
we elaborate on below, we end up suggesting you use `or` with an all-ones
mask, making the x86 instruction sequence look like the following:
```
        ...

.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        movl    (%rsi), %edi            # Load potentially secret data from %rsi.
        orl     %eax, %edi
```

Other useful patterns may be to fold the load into the `or` instruction itself
at the cost of a register-to-register copy.
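In C++ terms (a sketch, name ours), the `or`-based value hardening amounts to:

```cpp
#include <cassert>
#include <cstdint>

// Harden a loaded value with the predicate state from the assembly above:
// the state is 0 on correct paths (so the OR is a no-op) and all-ones on
// misspeculated paths (so every loaded bit is forced to 1 and nothing secret
// can be observed through the result).
uint32_t harden_value(uint32_t loaded, uint32_t predicate_state) {
  return loaded | predicate_state;
}
```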

There are some challenges with deploying this approach:
1. Many loads on x86 are folded into other instructions. Separating them would
   add very significant and costly register pressure with prohibitive
   performance cost.
1. Loads may not target a general purpose register, requiring extra
   instructions to map the state value into the correct register class, and
   potentially more expensive instructions to mask the value in some way.
1. The flags registers on x86 are very likely to be live, and challenging to
   preserve cheaply.
1. There are many more values loaded than pointers & indices used for loads. As
   a consequence, hardening the result of a load requires substantially more
   instructions than hardening the address of the load (see below).

Despite these challenges, hardening the result of the load critically allows
the load to proceed and thus has dramatically less impact on the total
speculative / out-of-order potential of the execution. There are also several
interesting techniques to try and mitigate these challenges and make hardening
the results of loads viable in at least some cases. However, we generally
expect to fall back when unprofitable from hardening the loaded value to the
next approach of hardening the address itself.


###### Loads folded into data-invariant operations can be hardened after the operation

The first key to making this feasible is to recognize that many operations on
x86 are "data-invariant". That is, they have no (known) observable behavior
differences due to the particular input data. These instructions are often used
when implementing cryptographic primitives dealing with private key data
because they are not believed to provide any side-channels. Similarly, we can
defer hardening until after them as they will not in-and-of-themselves
introduce a speculative execution side-channel. This results in code sequences
that look like:
```
        ...

.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        addl    (%rsi), %edi            # Load and accumulate without leaking.
        orl     %eax, %edi
```

While an addition happens to the loaded (potentially secret) value, that
doesn't leak any data and we then immediately harden it.
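The deferral can be sketched in C++ (our names): because the addition is
data-invariant, hardening the sum once is equivalent, for leak-prevention
purposes, to hardening the loaded value before the add.

```cpp
#include <cassert>
#include <cstdint>

// Defer hardening past a data-invariant operation: accumulate first, then
// apply the predicate state once. On a misspeculated path (state == all-ones)
// every bit of the result is forced to 1 regardless of the loaded value.
uint32_t load_add_hardened(uint32_t acc, uint32_t loaded, uint32_t state) {
  return (acc + loaded) | state;
}
```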


###### Hardening of loaded values deferred down the data-invariant expression graph

We can generalize the previous idea and sink the hardening down the expression
graph across as many data-invariant operations as desirable. This can use very
conservative rules for whether something is data-invariant. The primary goal
should be to handle multiple loads with a single hardening instruction:
```
        ...

.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        addl    (%rsi), %edi            # Load and accumulate without leaking.
        addl    4(%rsi), %edi           # Continue without leaking.
        addl    8(%rsi), %edi
        orl     %eax, %edi              # Mask out bits from all three loads.
```


###### Preserving the flags while hardening loaded values on Haswell, Zen, and newer processors

Sadly, there are no useful instructions on x86 that apply a mask to all 64 bits
without touching the flag registers. However, we can harden loaded values that
are narrower than a word (fewer than 32 bits on 32-bit systems and fewer than
64 bits on 64-bit systems) by zero-extending the value to the full word size
and then shifting right by at least the number of original bits using the BMI2
`shrx` instruction:
```
        ...

.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        addl    (%rsi), %edi            # Load and accumulate 32 bits of data.
        shrxq   %rax, %rdi, %rdi        # Shift out all 32 bits loaded.
```

Because on x86 the zero-extend is free, this can efficiently harden the loaded
value.
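The effect can be modeled in C++ (a sketch, name ours). `shrx` uses only the
low 6 bits of the 64-bit shift count, so the all-ones state shifts by 63,
which clears a zero-extended 32-bit value without touching the flags:

```cpp
#include <cassert>
#include <cstdint>

// Model of the shrxq-based hardening: `loaded` is a 32-bit value
// zero-extended into a 64-bit register, and `state` is the predicate state
// (0 for correct execution, all-ones for misspeculation).
uint64_t harden_narrow(uint32_t loaded, uint64_t state) {
  uint64_t widened = loaded;       // Zero-extend (free on x86).
  return widened >> (state & 63);  // shrx masks the shift count to 6 bits.
}
```

A shift of 63 leaves only bit 63, which is always zero after the zero-extend;
this is why the trick only works for values narrower than the word size.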


##### Hardening the address of the load

When hardening the loaded value is inapplicable, most often because the
instruction directly leaks information (like `cmp` or `jmpq`), we switch to
hardening the _address_ of the load instead of the loaded value. This avoids
increasing register pressure by unfolding the load or paying some other high
cost.

To understand how this works in practice, we need to examine the exact
semantics of the x86 addressing modes which, in their fully general form, look
like `offset(%base,%index,scale)`. Here `%base` and `%index` are 64-bit
registers that can potentially be any value, and may be attacker controlled,
and `scale` and `offset` are fixed immediate values. `scale` must be `1`, `2`,
`4`, or `8`, and `offset` can be any 32-bit sign extended value. The exact
computation performed to find the address is then: `%base + (scale * %index) +
offset` under 64-bit 2's complement modular arithmetic.
|
|
|
+
|
|
|
+One issue with this approach is that, after hardening, the `%base + (scale *
|
|
|
+%index)` subexpression will compute a value near zero (`-1 + (scale * -1)`) and
|
|
|
+then a large, positive `offset` will index into memory within the first two
|
|
|
+gigabytes of address space. While these offsets are not attacker controlled,
|
|
|
the attacker could choose to attack a load which happens to have the desired
|
|
|
+offset and then successfully read memory in that region. This significantly
|
|
|
+raises the burden on the attacker and limits the scope of attack but does not
|
|
|
+eliminate it. To fully close the attack we must work with the operating system
|
|
|
+to preclude mapping memory in the low two gigabytes of address space.
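To make the wraparound arithmetic concrete, here is a hypothetical C sketch of the effective-address computation and of what it produces once `%base` and `%index` have both been hardened to `-1`:

```c
#include <stdint.h>

/* Hypothetical sketch: the x86 effective-address computation under 64-bit
   2's complement modular arithmetic. `offset` is a 32-bit value that is
   sign-extended to 64 bits before the addition. */
static uint64_t effective_address(uint64_t base, uint64_t index,
                                  uint64_t scale, int32_t offset) {
    return base + scale * index + (uint64_t)(int64_t)offset;
}
```

With `base` and `index` both hardened to `-1` and `scale == 8`, the address computed is `offset - 9`; a large, positive `offset` therefore still lands within the low two gigabytes, which is why the OS must keep that region unmapped to fully close the attack.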
|
|
|
+
|
|
|
+
|
|
|
+###### 64-bit load checking instructions
|
|
|
+
|
|
|
+We can use the following instruction sequences to check loads. We set up `%r8`
|
|
|
+in these examples to hold the special value of `-1` which will be `cmov`ed over
|
|
|
+`%rax` in misspeculated paths.
|
|
|
+
|
|
|
+Single register addressing mode:
|
|
|
+```
|
|
|
+ ...
|
|
|
+
|
|
|
+.LBB0_4: # %danger
|
|
|
+ cmovneq %r8, %rax # Conditionally update predicate state.
|
|
|
+ orq %rax, %rsi # Mask the pointer if misspeculating.
|
|
|
+ movl (%rsi), %edi
|
|
|
+```
|
|
|
+
|
|
|
+Two register addressing mode:
|
|
|
+```
|
|
|
+ ...
|
|
|
+
|
|
|
+.LBB0_4: # %danger
|
|
|
+ cmovneq %r8, %rax # Conditionally update predicate state.
|
|
|
+ orq %rax, %rsi # Mask the pointer if misspeculating.
|
|
|
+ orq %rax, %rcx # Mask the index if misspeculating.
|
|
|
+ movl (%rsi,%rcx), %edi
|
|
|
+```
|
|
|
+
|
|
|
+This will result in a negative address near zero or in `offset` wrapping the
|
|
|
+address space back to a small positive address. Small, negative addresses will
|
|
|
+fault in user-mode for most operating systems, but targets which need the high
|
|
|
+address space to be user accessible may need to adjust the exact sequence used
|
|
|
+above. Additionally, the low addresses will need to be marked unreadable by the
|
|
|
+OS to fully harden the load.
|
|
|
+
|
|
|
+
|
|
|
+###### RIP-relative addressing is even easier to break
|
|
|
+
|
|
|
+There is a common addressing mode idiom that is substantially harder to check:
|
|
|
+addressing relative to the instruction pointer. We cannot change the value of
|
|
|
+the instruction pointer register and so we have the harder problem of forcing
|
|
|
+`%base + scale * %index + offset` to be an invalid address, by *only* changing
|
|
|
+`%index`. The only advantage we have is that the attacker also cannot modify
|
|
|
+`%base`. If we use the fast instruction sequence above, but only apply it to
|
|
|
+the index, we will always access `%rip + (scale * -1) + offset`. If the
|
|
|
attacker can find a load for which this address happens to point to secret
|
|
|
+data, then they can reach it. However, the loader and base libraries can also
|
|
|
+simply refuse to map the heap, data segments, or stack within 2gb of any of the
|
|
|
text in the program, much as they can reserve the low 2gb of address space.
|
|
|
+
|
|
|
+
|
|
|
+###### The flag registers again make everything hard
|
|
|
+
|
|
|
+Unfortunately, the technique of using `orq`-instructions has a serious flaw on
|
|
|
+x86. The very thing that makes it easy to accumulate state, the flag registers
|
|
|
containing predicates, causes serious problems here because they may be live
|
|
|
+and used by the loading instruction or subsequent instructions. On x86, the
|
|
|
+`orq` instruction **sets** the flags and will override anything already there.
|
|
|
+This makes inserting them into the instruction stream very hazardous.
|
|
|
+Unfortunately, unlike when hardening the loaded value, we have no fallback here
|
|
|
+and so we must have a fully general approach available.
|
|
|
+
|
|
|
+The first thing we must do when generating these sequences is try to analyze
|
|
|
the surrounding code to prove that the flags are not in fact live or being
|
|
|
used. Typically, the flags have been set by some other instruction which just
happens to write the flags register (much like ours!) with no actual
dependency. In those
|
|
|
cases, it is safe to directly insert these instructions. Alternatively, we may
|
|
|
+be able to move them earlier to avoid clobbering the used value.
|
|
|
+
|
|
|
+However, this may ultimately be impossible. In that case, we need to preserve
|
|
|
+the flags around these instructions:
|
|
|
+```
|
|
|
+ ...
|
|
|
+
|
|
|
+.LBB0_4: # %danger
|
|
|
+ cmovneq %r8, %rax # Conditionally update predicate state.
|
|
|
+ pushfq
|
|
|
+ orq %rax, %rcx # Mask the pointer if misspeculating.
|
|
|
+ orq %rax, %rdx # Mask the index if misspeculating.
|
|
|
+ popfq
|
|
|
+ movl (%rcx,%rdx), %edi
|
|
|
+```
|
|
|
+
|
|
|
+Using the `pushf` and `popf` instructions saves the flags register around our
|
|
|
+inserted code, but comes at a high cost. First, we must store the flags to the
|
|
|
+stack and reload them. Second, this causes the stack pointer to be adjusted
|
|
|
dynamically, requiring that a frame pointer be used for referring to temporaries
|
|
|
+spilled to the stack, etc.
|
|
|
+
|
|
|
+On newer x86 processors we can use the `lahf` and `sahf` instructions to save
|
|
|
+all of the flags besides the overflow flag in a register rather than on the
|
|
|
+stack. We can then use `seto` and `add` to save and restore the overflow flag
|
|
|
+in a register. Combined, this will save and restore flags in the same manner as
|
|
|
above but using two registers rather than the stack. That is still very
expensive, although slightly less so than `pushf` and `popf` in most cases.
|
|
|
+
|
|
|
+
|
|
|
+###### A flag-less alternative on Haswell, Zen and newer processors
|
|
|
+
|
|
|
+Starting with the BMI2 x86 instruction set extensions available on Haswell and
|
|
|
+Zen processors, there is an instruction for shifting that does not set any
|
|
|
+flags: `shrx`. We can use this and the `lea` instruction to implement analogous
|
|
|
code sequences to the above ones. However, these are marginally slower, as
most modern x86 processors have fewer ports able to dispatch shift
instructions than ports able to dispatch `or` instructions.
|
|
|
+
|
|
|
+Fast, single register addressing mode:
|
|
|
+```
|
|
|
+ ...
|
|
|
+
|
|
|
+.LBB0_4: # %danger
|
|
|
+ cmovneq %r8, %rax # Conditionally update predicate state.
|
|
|
+ shrxq %rax, %rsi, %rsi # Shift away bits if misspeculating.
|
|
|
+ movl (%rsi), %edi
|
|
|
+```
|
|
|
+
|
|
|
This will collapse the register to zero or one, causing everything but the
offset in the addressing mode to be less than or equal to 9. This means the full
|
|
|
+address can only be guaranteed to be less than `(1 << 31) + 9`. The OS may wish
|
|
|
to protect an extra page of the low address space to account for this.
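The collapse performed by this flag-free address hardening can be sketched in C (a hypothetical sketch with a made-up name; like `shrx`, the shift uses only the low 6 bits of its count, so a state of `-1` shifts by 63):

```c
#include <stdint.h>

/* Hypothetical sketch: with state == -1 the shift count is 63, collapsing
   any 64-bit register to its top bit (0 or 1); with state == 0 the
   register passes through unchanged. After hardening, base + scale * index
   is at most 1 + 8 * 1 = 9, bounding the full address by (1 << 31) + 9. */
static uint64_t harden_addr_reg(uint64_t reg, uint64_t state) {
    return reg >> (state & 63);
}
```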
|
|
|
+
|
|
|
+
|
|
|
+##### Optimizations
|
|
|
+
|
|
|
+A very large portion of the cost for this approach comes from checking loads in
|
|
|
+this way, so it is important to work to optimize this. However, beyond making
|
|
|
+the instruction sequences to *apply* the checks efficient (for example by
|
|
|
+avoiding `pushfq` and `popfq` sequences), the only significant optimization is
|
|
|
+to check fewer loads without introducing a vulnerability. We apply several
|
|
|
+techniques to accomplish that.
|
|
|
+
|
|
|
+
|
|
|
+###### Don't check loads from compile-time constant stack offsets
|
|
|
+
|
|
|
+We implement this optimization on x86 by skipping the checking of loads which
|
|
|
+use a fixed frame pointer offset.
|
|
|
+
|
|
|
+The result of this optimization is that patterns like reloading a spilled
|
|
|
+register or accessing a global field don't get checked. This is a very
|
|
|
+significant performance win.
|
|
|
+
|
|
|
+
|
|
|
+###### Don't check dependent loads
|
|
|
+
|
|
|
+A core part of why this mitigation strategy works is that it establishes a
|
|
|
+data-flow check on the loaded address. However, this means that if the address
|
|
|
+itself was already loaded using a checked load, there is no need to check a
|
|
|
+dependent load provided it is within the same basic block as the checked load,
|
|
|
+and therefore has no additional predicates guarding it. Consider code like the
|
|
|
+following:
|
|
|
+```
|
|
|
+ ...
|
|
|
+
|
|
|
+.LBB0_4: # %danger
|
|
|
+ movq (%rcx), %rdi
|
|
|
+ movl (%rdi), %edx
|
|
|
+```
|
|
|
+
|
|
|
+This will get transformed into:
|
|
|
+```
|
|
|
+ ...
|
|
|
+
|
|
|
+.LBB0_4: # %danger
|
|
|
+ cmovneq %r8, %rax # Conditionally update predicate state.
|
|
|
+ orq %rax, %rcx # Mask the pointer if misspeculating.
|
|
|
+ movq (%rcx), %rdi # Hardened load.
|
|
|
+ movl (%rdi), %edx # Unhardened load due to dependent addr.
|
|
|
+```
|
|
|
+
|
|
|
+This doesn't check the load through `%rdi` as that pointer is dependent on a
|
|
|
+checked load already.
|
|
|
+
|
|
|
+
|
|
|
+###### Protect large, load-heavy blocks with a single lfence
|
|
|
+
|
|
|
+It may be worth using a single `lfence` instruction at the start of a block
|
|
|
+which begins with a (very) large number of loads that require independent
|
|
|
+protection *and* which require hardening the address of the load. However, this
|
|
|
+is unlikely to be profitable in practice. The latency hit of the hardening
|
|
|
+would need to exceed that of an `lfence` when *correctly* speculatively
|
|
|
+executed. But in that case, the `lfence` cost is a complete loss of speculative
|
|
|
+execution (at a minimum). So far, the evidence we have of the performance cost
|
|
|
+of using `lfence` indicates few if any hot code patterns where this trade off
|
|
|
+would make sense.
|
|
|
+
|
|
|
+
|
|
|
+###### Tempting optimizations that break the security model
|
|
|
+
|
|
|
+Several optimizations were considered which didn't pan out due to failure to
|
|
|
+uphold the security model. One in particular is worth discussing as many others
|
|
|
+will reduce to it.
|
|
|
+
|
|
|
+We wondered whether only the *first* load in a basic block could be checked. If
|
|
|
+the check works as intended, it forms an invalid pointer that doesn't even
|
|
|
+virtual-address translate in the hardware. It should fault very early on in its
|
|
|
+processing. Maybe that would stop things in time for the misspeculated path to
|
|
|
+fail to leak any secrets. This doesn't end up working because the processor is
|
|
|
+fundamentally out-of-order, even in its speculative domain. As a consequence,
|
|
|
+the attacker could cause the initial address computation itself to stall and
|
|
|
+allow an arbitrary number of unrelated loads (including attacked loads of
|
|
|
+secret data) to pass through.
|
|
|
+
|
|
|
+
|
|
|
+#### Interprocedural Checking
|
|
|
+
|
|
|
+Modern x86 processors may speculate into called functions and out of functions
|
|
|
+to their return address. As a consequence, we need a way to check loads that
|
|
|
+occur after a misspeculated predicate but where the load and the misspeculated
|
|
|
+predicate are in different functions. In essence, we need some interprocedural
|
|
|
+generalization of the predicate state tracking. A primary challenge to passing
|
|
|
+the predicate state between functions is that we would like to not require a
|
|
|
+change to the ABI or calling convention in order to make this mitigation more
|
|
|
+deployable, and further would like code mitigated in this way to be easily
|
|
|
+mixed with code not mitigated in this way and without completely losing the
|
|
|
+value of the mitigation.
|
|
|
+
|
|
|
+
|
|
|
+##### Embed the predicate state into the high bit(s) of the stack pointer
|
|
|
+
|
|
|
+We can use the same technique that allows hardening pointers to pass the
|
|
|
+predicate state into and out of functions. The stack pointer is trivially
|
|
|
+passed between functions and we can test for it having the high bits set to
|
|
|
+detect when it has been marked due to misspeculation. The callsite instruction
|
|
|
+sequence looks like (assuming a misspeculated state value of `-1`):
|
|
|
+```
|
|
|
+ ...
|
|
|
+
|
|
|
+.LBB0_4: # %danger
|
|
|
+ cmovneq %r8, %rax # Conditionally update predicate state.
|
|
|
+ shlq $47, %rax
|
|
|
+ orq %rax, %rsp
|
|
|
+ callq other_function
|
|
|
+ movq %rsp, %rax
|
|
|
        sarq $63, %rax          # Sign extend the high bit to all bits.
|
|
|
+```
|
|
|
+
|
|
|
+This first puts the predicate state into the high bits of `%rsp` before calling
|
|
|
the function and then reads it back out of the high bits of `%rsp` afterward. When
|
|
|
+correctly executing (speculatively or not), these are all no-ops. When
|
|
|
+misspeculating, the stack pointer will end up negative. We arrange for it to
|
|
|
+remain a canonical address, but otherwise leave the low bits alone to allow
|
|
|
+stack adjustments to proceed normally without disrupting this. Within the
|
|
|
+called function, we can extract this predicate state and then reset it on
|
|
|
+return:
|
|
|
+```
|
|
|
+other_function:
|
|
|
+ # prolog
|
|
|
|
|
|
+ movq %rsp, %rax
|
|
|
        sarq $63, %rax          # Sign extend the high bit to all bits.
|
|
|
+ # ...
|
|
|
+
|
|
|
+.LBB0_N:
|
|
|
+ cmovneq %r8, %rax # Conditionally update predicate state.
|
|
|
+ shlq $47, %rax
|
|
|
+ orq %rax, %rsp
|
|
|
+ retq
|
|
|
+```
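The round-trip arithmetic can be sketched in C (hypothetical helper names): merging shifts the 0-or-`-1` state left by 47, so a poisoned state sets all the high bits of `%rsp`, keeping it a canonical but invalid negative address, and extraction sign-extends the top bit back to a full-width predicate state:

```c
#include <stdint.h>

/* Hypothetical sketch of the %rsp embedding. `state` is 0 or all-ones. */
static uint64_t embed_state(uint64_t rsp, uint64_t state) {
    return rsp | (state << 47);             /* shlq $47 ; orq into %rsp */
}

static uint64_t extract_state(uint64_t rsp) {
    /* Arithmetic right shift, as performed by sarq $63. */
    return (uint64_t)((int64_t)rsp >> 63);
}
```

Because only the high bits are touched, ordinary stack adjustments in the low bits proceed normally while the state survives the call and return.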
|
|
|
+
|
|
|
+This approach is effective when all code is mitigated in this fashion, and can
|
|
|
+even survive very limited reaches into unmitigated code (the state will
|
|
|
+round-trip in and back out of an unmitigated function, it just won't be
|
|
|
+updated). But it does have some limitations. There is a cost to merging the
|
|
|
+state into `%rsp` and it doesn't insulate mitigated code from misspeculation in
|
|
|
+an unmitigated caller.
|
|
|
+
|
|
|
+There is also an advantage to using this form of interprocedural mitigation: by
|
|
|
+forming these invalid stack pointer addresses we can prevent speculative
|
|
|
returns from successfully reading values speculatively written to the actual
stack. This works first by forming a data-dependency between computing the
|
|
|
+address of the return address on the stack and our predicate state. And even
|
|
|
+when satisfied, if a misprediction causes the state to be poisoned the
|
|
|
+resulting stack pointer will be invalid.
|
|
|
+
|
|
|
+
|
|
|
+##### Rewrite API of internal functions to directly propagate predicate state
|
|
|
+
|
|
|
+(Not yet implemented.)
|
|
|
+
|
|
|
+We have the option with internal functions to directly adjust their API to
|
|
|
+accept the predicate as an argument and return it. This is likely to be
|
|
|
+marginally cheaper than embedding into `%rsp` for entering functions.
|
|
|
+
|
|
|
+
|
|
|
+##### Use `lfence` to guard function transitions
|
|
|
+
|
|
|
+An `lfence` instruction can be used to prevent subsequent loads from
|
|
|
+speculatively executing until all prior mispredicted predicates have resolved.
|
|
|
We can use this as a broader barrier against speculative loads executing between
|
|
|
+functions. We emit it in the entry block to handle calls, and prior to each
|
|
|
+return. This approach also has the advantage of providing the strongest degree
|
|
|
+of mitigation when mixed with unmitigated code by halting all misspeculation
|
|
|
entering a function which is mitigated, regardless of what occurred in the
|
|
|
+caller. However, such a mixture is inherently more risky. Whether this kind of
|
|
|
+mixture is a sufficient mitigation requires careful analysis.
|
|
|
+
|
|
|
+Unfortunately, experimental results indicate that the performance overhead of
|
|
|
+this approach is very high for certain patterns of code. A classic example is
|
|
|
+any form of recursive evaluation engine. The hot, rapid call and return
|
|
|
+sequences exhibit dramatic performance loss when mitigated with `lfence`. This
|
|
|
+component alone can regress performance by 2x or more, making it an unpleasant
|
|
|
+tradeoff even when only used in a mixture of code.
|
|
|
+
|
|
|
+
|
|
|
+##### Use an internal TLS location to pass predicate state
|
|
|
+
|
|
|
+We can define a special thread-local value to hold the predicate state between
|
|
|
+functions. This avoids direct ABI implications by using a side channel between
|
|
|
+callers and callees to communicate the predicate state. It also allows implicit
|
|
|
+zero-initialization of the state, which allows non-checked code to be the first
|
|
|
+code executed.
|
|
|
+
|
|
|
+However, this requires a load from TLS in the entry block, a store to TLS
|
|
|
+before every call and every ret, and a load from TLS after every call. As a
|
|
|
+consequence it is expected to be substantially more expensive even than using
|
|
|
+`%rsp` and potentially `lfence` within the function entry block.
|
|
|
+
|
|
|
+
|
|
|
+##### Define a new ABI and/or calling convention
|
|
|
+
|
|
|
+We could define a new ABI and/or calling convention to explicitly pass the
|
|
|
+predicate state in and out of functions. This may be interesting if none of the
|
|
|
+alternatives have adequate performance, but it makes deployment and adoption
|
|
|
+dramatically more complex, and potentially infeasible.
|
|
|
+
|
|
|
+
|
|
|
+## High-Level Alternative Mitigation Strategies
|
|
|
+
|
|
|
+There are completely different alternative approaches to mitigating variant 1
|
|
|
+attacks. [Most](https://lwn.net/Articles/743265/)
|
|
|
+[discussion](https://lwn.net/Articles/744287/) so far focuses on mitigating
|
|
|
+specific known attackable components in the Linux kernel (or other kernels) by
|
|
|
+manually rewriting the code to contain an instruction sequence that is not
|
|
|
+vulnerable. For x86 systems this is done by either injecting an `lfence`
|
|
|
+instruction along the code path which would leak data if executed speculatively
|
|
|
+or by rewriting memory accesses to have branch-less masking to a known safe
|
|
|
+region. On Intel systems, `lfence` [will prevent the speculative load of secret
|
|
|
+data](https://newsroom.intel.com/wp-content/uploads/sites/11/2018/01/Intel-Analysis-of-Speculative-Execution-Side-Channels.pdf).
|
|
|
+On AMD systems `lfence` is currently a no-op, but can be made
|
|
|
+dispatch-serializing by setting an MSR, and thus preclude misspeculation of the
|
|
|
+code path ([mitigation G-2 +
|
|
|
+V1-1](https://developer.amd.com/wp-content/resources/Managing-Speculation-on-AMD-Processors.pdf)).
|
|
|
+
|
|
|
+However, this relies on finding and enumerating all possible points in code
|
|
|
+which could be attacked to leak information. While in some cases static
|
|
|
+analysis is effective at doing this at scale, in many cases it still relies on
|
|
|
+human judgement to evaluate whether code might be vulnerable. Especially for
|
|
|
+software systems which receive less detailed scrutiny but remain sensitive to
|
|
|
+these attacks, this seems like an impractical security model. We need an
|
|
|
+automatic and systematic mitigation strategy.
|
|
|
+
|
|
|
+
|
|
|
+### Automatic `lfence` on Conditional Edges
|
|
|
+
|
|
|
+A natural way to scale up the existing hand-coded mitigations is simply to
|
|
|
+inject an `lfence` instruction into both the target and fallthrough
|
|
|
+destinations of every conditional branch. This ensures that no predicate or
|
|
|
+bounds check can be bypassed speculatively. However, the performance overhead
|
|
|
+of this approach is, simply put, catastrophic. Yet it remains the only truly
|
|
|
+"secure by default" approach known prior to this effort and serves as the
|
|
|
+baseline for performance.
|
|
|
+
|
|
|
+One attempt to address the performance overhead of this and make it more
|
|
|
+realistic to deploy is [MSVC's /Qspectre
|
|
|
+switch](https://blogs.msdn.microsoft.com/vcblog/2018/01/15/spectre-mitigations-in-msvc/).
|
|
|
+Their technique is to use static analysis within the compiler to only insert
|
|
|
+`lfence` instructions into conditional edges at risk of attack. However,
|
|
|
+[initial](https://arstechnica.com/gadgets/2018/02/microsofts-compiler-level-spectre-fix-shows-how-hard-this-problem-will-be-to-solve/)
|
|
|
+[analysis](https://www.paulkocher.com/doc/MicrosoftCompilerSpectreMitigation.html)
|
|
|
+has shown that this approach is incomplete and only catches a small and limited
|
|
|
+subset of attackable patterns which happen to resemble very closely the initial
|
|
|
+proofs of concept. As such, while its performance is acceptable, it does not
|
|
|
+appear to be an adequate systematic mitigation.
|
|
|
+
|
|
|
+
|
|
|
+## Performance Overhead
|
|
|
+
|
|
|
+The performance overhead of this style of comprehensive mitigation is very
|
|
|
+high. However, it compares very favorably with previously recommended
|
|
|
+approaches such as the `lfence` instruction. Just as users can restrict the
|
|
|
+scope of `lfence` to control its performance impact, this mitigation technique
|
|
|
+could be restricted in scope as well.
|
|
|
+
|
|
|
+However, it is important to understand what it would cost to get a fully
|
|
|
+mitigated baseline. Here we assume targeting a Haswell (or newer) processor and
|
|
|
using all of the tricks to improve performance (which leaves the low 2gb of
the address space unprotected, as well as +/- 2gb around any PC in the
program). We ran both
|
|
|
+Google's microbenchmark suite and a large highly-tuned server built using
|
|
|
+ThinLTO and PGO. All were built with `-march=haswell` to give access to BMI2
|
|
|
+instructions, and benchmarks were run on large Haswell servers. We collected
|
|
|
+data both with an `lfence`-based mitigation and load hardening as presented
|
|
|
+here. The summary is that mitigating with load hardening is 1.77x faster than
|
|
|
+mitigating with `lfence`, and the overhead of load hardening compared to a
|
|
|
+normal program is likely between a 10% overhead and a 50% overhead with most
|
|
|
+large applications seeing a 30% overhead or less.
|
|
|
+
|
|
|
+| Benchmark | `lfence` | Load Hardening | Mitigated Speedup |
|
|
|
+| -------------------------------------- | -------: | -------------: | ----------------: |
|
|
|
+| Google microbenchmark suite | -74.8% | -36.4% | **2.5x** |
|
|
|
+| Large server QPS (using ThinLTO & PGO) | -62% | -29% | **1.8x** |
|
|
|
+
|
|
|
+Below is a visualization of the microbenchmark suite results which helps show
|
|
|
+the distribution of results that is somewhat lost in the summary. The y-axis is
|
|
|
+a log-scale speedup ratio of load hardening relative to `lfence` (up -> faster
|
|
|
+-> better). Each box-and-whiskers represents one microbenchmark which may have
|
|
|
+many different metrics measured. The red line marks the median, the box marks
|
|
|
+the first and third quartiles, and the whiskers mark the min and max.
|
|
|
+
|
|
|
+
|
|
|
+
|
|
|
+We don't yet have benchmark data on SPEC or the LLVM test suite, but we can
|
|
|
+work on getting that. Still, the above should give a pretty clear
|
|
|
+characterization of the performance, and specific benchmarks are unlikely to
|
|
|
+reveal especially interesting properties.
|
|
|
+
|
|
|
+
|
|
|
+### Future Work: Fine Grained Control and API-Integration
|
|
|
+
|
|
|
+The performance overhead of this technique is likely to be very significant and
|
|
|
+something users wish to control or reduce. There are interesting options here
|
|
|
+that impact the implementation strategy used.
|
|
|
+
|
|
|
+One particularly appealing option is to allow both opt-in and opt-out of this
|
|
|
+mitigation at reasonably fine granularity such as on a per-function basis,
|
|
|
+including intelligent handling of inlining decisions -- protected code can be
|
|
|
+prevented from inlining into unprotected code, and unprotected code will become
|
|
|
+protected when inlined into protected code. For systems where only a limited
|
|
|
+set of code is reachable by externally controlled inputs, it may be possible to
|
|
|
+limit the scope of mitigation through such mechanisms without compromising the
|
|
|
+application's overall security. The performance impact may also be focused in a
|
|
|
+few key functions that can be hand-mitigated in ways that have lower
|
|
|
+performance overhead while the remainder of the application receives automatic
|
|
|
+protection.
|
|
|
+
|
|
|
+For both limiting the scope of mitigation or manually mitigating hot functions,
|
|
|
+there needs to be some support for mixing mitigated and unmitigated code
|
|
|
+without completely defeating the mitigation. For the first use case, it would
|
|
|
+be particularly desirable that mitigated code remains safe when being called
|
|
|
+during misspeculation from unmitigated code.
|
|
|
+
|
|
|
+For the second use case, it may be important to connect the automatic
|
|
|
+mitigation technique to explicit mitigation APIs such as what is described in
|
|
|
+http://wg21.link/p0928 (or any other eventual API) so that there is a clean way
|
|
|
+to switch from automatic to manual mitigation without immediately exposing a
|
|
|
+hole. However, the design for how to do this is hard to come up with until the
|
|
|
+APIs are better established. We will revisit this as those APIs mature.
|