Friday, January 5, 2018

More about Spectre and the PowerPC (or why you may want to dust that G3 off)

UPDATE: IBM is releasing firmware patches for at least the POWER7+ and forward, including the POWER9 expected to be used in the Talos II. My belief is that these patches disable speculative execution through indirect branches, making the attack much more difficult though with an unclear performance cost. See below for why this matters.

UPDATE the 2nd: The G3 and 7400 survived Spectre!


(my personal favourite Blofeld)

Most of the reports on the Spectre speculative execution exploit have concentrated on the two dominant architectures, x86 (in both its AMD and Meltdown-afflicted Intel forms) and ARM. In our last blog entry I said that PowerPC is vulnerable to the Spectre attack, and in broad strokes it is. However, I also still think that the attack is generally impractical on Power Macs due to the time needed to meaningfully exfiltrate information on machines that are now over a decade old, especially with JavaScript-based attacks even with the TenFourFox PowerPC JIT (to say nothing of various complicating microarchitectural details). But let's say that those practical issues are irrelevant or handwaved away. Is PowerPC unusually vulnerable, or on the flip side unusually resistant, to Spectre-based attacks compared to x86 or ARM?

For the purposes of this discussion and the majority of our audience, I will limit this initial foray to processors used in Power Macintoshes of recent vintage, i.e., the G3, G4 and G5, though the G5's POWER4-derived design also has a fair bit in common with later Power ISA CPUs like the Talos II's POWER9, and ramifications for future Power ISA CPUs can be implied from it. I'm also not going to discuss embedded PowerPC CPUs here such as the PowerPC 4xx since I know rather less about their internal implementational details.

First, let's review the Spectre white paper. Speculative execution, as the name implies, allows the CPU to speculate on the results of an upcoming conditional branch instruction that has not yet completed. It predicts future program flow will go a particular way and executes that code upon that assumption; if it guesses right, and most CPUs do most of the time, it has already done the work and time is saved. If it guesses wrong, then the outcome is no worse than idling during that time save the additional power usage and the need to restore the previous state. To do this execution requires that code be loaded into the processor cache to be run, however, and the cache is not restored to its previous state; previously no one thought that would be necessary. The Spectre attack proves that this seemingly benign oversight is in fact not so.

To determine the PowerPC's vulnerability requires looking at how it does branch prediction and indirect branching. Indirect branching, where the target is determined at time of execution and run from a register rather than coding it directly in the branch instruction, is particularly valuable for forcing the processor to speculatively execute code it wouldn't ordinarily run because there are more than two possible execution paths (often many, many more, and some directly controllable by the attacker).

The G3 and G4 have very similar branch prediction hardware. If there is no hinting information and the instruction has never been executed before (or is no longer in the branch history table, read on), the CPU assumes that forward branches are not taken and backwards branches are, since the latter are usually found in loops. The programmer can add a flag to the branch instruction to tell the CPU that this initial assumption is probably incorrect (a static hint); we use this in a few places in TenFourFox explicitly, and compilers can also set hints like this. All PowerPC CPUs, including the original 601 and the G5 as described below, offer this level of branch prediction at minimum. Additionally, in the G3 and G4, branches that have been executed then get an entry in the BHT, or branch history table, which over multiple executions records if the branch is not taken, probably not taken, probably taken or taken (in Dan Luu's taxonomy of branch predictors, this would be two-level adaptive, local). On top of this the G3 and G4 have a BTIC, or branch target instruction cache, which handles the situation of where the branch gets taken: if the branch is not taken, the following instructions are probably in the regular instruction cache, but if the branch is taken, the BTIC allows execution to continue while the instruction queue continues fetching from the new program counter location. The G3 and 7400-series G4 implement a 512-entry BHT and 64-entry, two-instruction BTIC; the 7450-series G4 implements a 2048-entry BHT and a 128-entry, four-instruction BTIC, though the actual number of instructions in the BTIC depends on where the fetch is relative to the cache block boundary. The G3 and 7400 G4 support speculatively executing through up to two unresolved branches; the 7450 G4e allows up to three, but also pays a penalty of about one cycle if the BTIC is used that the others do not.

The G5 (and the POWER4, and most succeeding POWER implementations) starts with the baseline above, though it uses a different two-bit encoding to statically hint branch instructions. Instead of the G3/G4 BHT scheme, however, the G5/970 uses what Luu calls a "hybrid" approach, necessary to substantially improve prediction performance in a CPU for which misprediction would be particularly harmful: a staggering 16,384-entry BHT but also an additional 16,384-entry table using an indexing scheme called gshare, and a selector table which tells the processor which table to use; later POWER designs refine this further. The G5 does not implement a BTIC probably because it would not be compatible with how dispatch groups work. The G5 can predict up to two branches per cycle, and have up to 16 unresolved branches.

The branch prediction capabilities of these PowerPC chips are not massively different from other architectures'. The G5's ability to keep a massive number of unresolved branch instructions in flight might make it actually seem a bit more subject to such an attack since there are many more opportunities to load victim process data into the cache, but the basic principles at work are much the same as everything else, so none of our chips are particularly vulnerable or resistant in that respect. Where it starts to get interesting, however, is when we talk about indirect branches. There is no way in the Power ISA to directly branch to an address in a register, an unusual absence as such instructions exist in most other contemporary architectures such as x86, ARM, MIPS and SPARC. Instead, software must load the instruction into either of two special purpose registers that allow branches (either the link register "LR" or the counter register "CTR") with a special instruction (mtctr and mtlr, both forms of the general SPR instruction mtspr) and branch to that, which can occur conditionally or unconditionally. (We looked at this in great detail, with microbenchmarks, in an earlier blog post.)

To be able to speculatively execute an indirect branch, even an unconditional one, requires that either LR or CTR be renamed so that its register state can be saved as well, but on PowerPC they are not general purpose registers that can use the regular register rename file like other platforms such as ARM. The G5, unfortunately in this case, has additional hardware to deal with this problem: to back up the 16 unresolved branches it can have in-flight, LR and CTR share a 16-entry rename mapper, which allows the G5 to speculatively execute a combination of up to 16 LR or CTR-referencing branches (i.e., b(c)lr and b(c)ctr). This could allow a lot of code to be run speculatively and change the cache in ways the attacker could observe. Substantial preparation would be required to get the G5's branch history fouled enough to make it mispredict due to its very high accuracy (over 96%), but if it does, the presence of indirect branches will not slow the processor's speculative execution down what is now the wrong path. This is at least as vulnerable as the known Spectre-afflicted architectures, though the big cost of misprediction on the G5 would make this type of blown speculation especially slow. Nevertheless, virtually all current POWER chips would fall in this hole as well.

But the G3 and G4 situation is very different. The G3 actually delays fetch and execution at a b(c)ctr until the mtctr that leads it has completed, meaning speculative execution essentially halts at any indirect branch. The same applies for the LR, and for the 7400. CTR-based indirect branching is very common in TenFourFox-generated code for JavaScript inline caches, and code such as mtlr r0:blr terminates nearly every PowerPC function call. No fetch, and therefore no speculative execution, will occur until the special purpose register is loaded, meaning the proper target must now be known and there is less opportunity for a Spectre-based attack to run. Even if the processor could continue speculation past that point, the G3 and 7400 implement only a single rename register each for LR and CTR, so they couldn't go past a second such sequence regardless.

The 7450 is a little less robust in this regard. If the instruction sequence is an unconditional mtlr blr, the 7450 (and, for that matter, the G5) implements a link stack where the expected return address comes off a stack of predicted addresses from prior LR-modifying instructions. This is enough of a hint on the 7450 G4e to possibly allow continued fetch and potential speculation. However, because the 7450 also has only a single rename register each for LR and CTR, it also cannot speculatively execute past a second such sequence. If the instruction sequence is mtlr bclr, i.e., there is a condition on the LR branch, then execution and therefore speculation must halt until either the mtlr completes or the condition information (CR or CTR) is available to the CPU. But if the special purpose register is the CTR, then there is no address cache stack available, and the G4e must delay at an mtctr b(c)ctr sequence just like its older siblings.

Bottom line? Spectre is still not a very feasible means of attack on Power Macs, as I have stated, though the possibilities are better on the G5 and later Power ISA designs which are faster and have more branch tricks that can be subverted. But the G3 and the G4, because of their limitations on indirect branching, are at least somewhat more resistant to Spectre-based attacks because it is harder to cause their speculative execution pathways to operate in an attacker-controllable fashion (particularly the G3 and the 7400, which do not have a link stack cache). So, if you're really paranoid, dust that old G3 or Sawtooth G4 off. You just might have the Yosemite that manages to survive the computing apocalypse.

11 comments:

  1. Very impressed by the depth of your analysis. And proud of my Pismo PowerBook and my little iBook G3. Both are still in active use (no dusting required), and especially the 800 MHz iBook (with an SSD upgrade) is still a viable and speedy machine in most situations.

    ReplyDelete
  2. very interesting. any comments on POWER7 machines? my guess is that it could be worse than for the G5

    ReplyDelete
  3. I've been able to make the spectre POC work on a 7450, but not on a 7400.
    I guess it has something to do with longer speculative execution

    ReplyDelete
    Replies
    1. Is your working code anywhere for download? I'd like to do some further analysis.

      Delete
    2. http://nanard.free.fr/spectre-PowerPC.tar.gz
      https://gist.github.com/miniupnp/9b701e87f14ad3e0a455cfb54ba99fed

      Delete
    3. While this does work quite well on my iMac G5 (970FX), it fails to recover even a single character on my PowerBook G4 (7447A).

      Delete
    4. My 7400-based fileserver was indeed unable to recover any characters either, no matter how hard I tried. But this Quad G5 was highly timing sensitive. Without sufficient optimization or without G5 tuning, it would occasionally drop characters. At -O3 and/or -mcpu=ppc970, however, it was character-perfect. This sounds like another blog post is in order :)

      Delete
    5. Interesting. Built with -mcpu=970 or no CPU target it fails on my 970MP running FreeBSD regardless of optimization level. However, built with -mcpu=7450 -m32, it manages to get everything (again, on my 970MP).

      Delete
  4. Came here looking exactly for that, wasn't disappointed.
    Thank you sir.

    Shiunbird

    ReplyDelete
  5. Great article! Just everything I was looking for, thank you very much!

    ReplyDelete
  6. Thanks for the info. Do you know, is the PS3 effected?

    ReplyDelete

Due to an increased frequency of spam, comments are now subject to moderation.