Sunday, February 12, 2012

10.0.2pre available, now with awesome, plus: more G5 optimization notes

After the unexpected 10.0.1 release comes now the beta that was supposed to be 10.0.1pre and is now 10.0.2pre. This is the next big leap in JM+TI that reworks branches from the simplistic always-far calls in 10.0.0 and 10.0.1 to a set of intelligent algorithms that try to favour the branch prediction of either the G3/G4 or the G5, both written by Ben. It also includes Dave's square root routine, slightly altered for efficiency. The G3/G4 square root version improves square-root heavy code significantly over the old JavaScript stub function version; V8 Raytrace improves by almost 12% (the G5 has square root in hardware, but this is a very good implementation). If I may say, it's a nice example of why POWER ISA is, well, powerful. Besides using the fast reciprocal square root estimate instruction in the 603 and up, it also makes heavy use of the built-in FPU fused multiply-add to do Newton's method in fewer instructions. I might also add that x86 only just added FMA; AMD finally implements FMA (as FMA4) in Bulldozer, but Intel won't have what PowerPC had in 1992 until Haswell in 2013 (and even then only as FMA3). So there.
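For the curious, here is a rough Python sketch of the general technique (a low-precision reciprocal square root estimate refined by Newton's method, with each step shaped as a multiply-add). It is not Dave's routine. The helper names, the simulated five-bit estimate and the iteration count are all assumptions made for illustration, and Python's `a * b + c` rounds twice where a real fused multiply-add rounds once.

```python
import math

def rsqrt_estimate(x):
    """Crude 1/sqrt(x), kept to about 5 bits (a stand-in for a hardware estimate)."""
    mantissa, exponent = math.frexp(1.0 / math.sqrt(x))
    return math.ldexp(round(mantissa * 32) / 32, exponent)

def fmadd(a, b, c):
    """a*b + c; hypothetical helper, not truly fused in plain Python."""
    return a * b + c

def newton_sqrt(x, iterations=4):
    """Square root of a positive finite x via Newton's method on 1/sqrt(x)."""
    if x == 0.0:
        return 0.0
    y = rsqrt_estimate(x)
    half_x = 0.5 * x
    for _ in range(iterations):
        # Newton step y <- y * (1.5 - 0.5*x*y*y), written as two multiply-adds.
        r = fmadd(-half_x * y, y, 0.5)   # r = 0.5 - 0.5*x*y*y
        y = fmadd(y, r, y)               # y = y + y*r
    return x * y                         # sqrt(x) = x * (1/sqrt(x))
```

Each step roughly squares the relative error, so in this sketch four steps take a five-bit estimate to about double precision.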

Ben's branchwork is the centrepiece of this release, however, and while it improves SunSpider by a modest amount, it improves loopy benchmarks like V8 by a huge degree. My quad G5 improves by about 45%, for example, and gets consistently around 900-950ms in SunSpider (down from around 1050). Our 1GHz 7450 G4 doesn't improve as much on SunSpider (2750 down to around 2600, so not quite at our AWOAFY? target), but still improves by about 40% on V8. Part of achieving this is splitting the way branches are handled into "big POWER" (G5) and "little POWER" (G3/G4) versions. Ben's original work did much what my code in our dear nearly-departed tracejit did, which was to have four-word branch stanzas padded with nops so that if a branch target was too big for a regular b[l] or bc[l] instruction (the normal relative branching instructions on PowerPC), we had enough room to turn it into lis ori mtctr b[c]ctr[l] which load the destination address into a register (usually r0), transfer it to the CTR, and then branch to the CTR (conditionally or always). This achieved 45% on G5 in V8 and about 35% on G4. SunSpider dropped down to less than 920 on the G5 as well.
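The four-word stanza can be sketched as follows, in Python that emits assembly text rather than machine words. This is only an illustration of the idea, not the JIT's actual code: the function names are hypothetical, conditional branches are left out, and placing the nops after the short branch is an assumption.

```python
NOP = "nop"

def fits_relative_branch(displacement):
    """True if a PowerPC b/bl can encode it: word-aligned, signed 26-bit range."""
    return displacement % 4 == 0 and -(1 << 25) <= displacement < (1 << 25)

def four_word_stanza(branch_address, target, link=False):
    """Return four instructions that transfer control to target."""
    suffix = "l" if link else ""
    if fits_relative_branch(target - branch_address):
        # Near enough: one relative branch, the rest of the stanza is padding.
        return [f"b{suffix} 0x{target:08x}", NOP, NOP, NOP]
    high = (target >> 16) & 0xFFFF
    low = target & 0xFFFF
    # Too far: build the 32-bit address in r0, move it to CTR, branch via CTR.
    return [
        f"lis r0, 0x{high:04x}",
        f"ori r0, r0, 0x{low:04x}",
        "mtctr r0",
        f"bctr{suffix}",
    ]
```

Because both forms occupy exactly four words, a stanza can be patched from one to the other in place without moving any surrounding code.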

So this was already a great start, but Ben's next brainwave was to free up the G3/G4's limited cache by reducing the branch stanzas further to two words, either branching to the target as usual, or if not possible, branching to a "trampoline" (not to be confused with the trampoline the JavaScript interpreter uses to enter JITted code, which is common to all and I handwrote in assembly language) in a construct called the constant pool, which has the actual far call. The constant pool is part of the JS runtime, provided by Mozilla normally for the ARM JIT where they dump constants to be referenced by the JIT code, but it doesn't have to be used for that. By doing it this way, Ben keeps more running code in cache, and as predicted, on the G4 this improved performance by another 3-4% in aggregate. (Ben later added another piece that only uses the trampoline when absolutely necessary, which fractionally improved this number further on G4.)
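A back-of-the-envelope model shows where the cache saving comes from. The Python below is only a sketch: the function names are made up, and it assumes that each trampoline is a four-word far call and that every far branch gets its own trampoline.

```python
WORD_BYTES = 4  # PowerPC instructions are one 32-bit word each

def inline_far_call_bytes(branch_count):
    """Four-word stanzas: every branch reserves four words inline."""
    return branch_count * 4 * WORD_BYTES

def trampoline_bytes(branch_count, far_branch_count, trampoline_words=4):
    """Two-word stanzas inline, plus out-of-line trampolines for far branches."""
    inline = branch_count * 2 * WORD_BYTES
    pool = far_branch_count * trampoline_words * WORD_BYTES
    return inline, pool
```

With a hypothetical 1000 branches of which 50 are far, the four-word scheme spends 16000 bytes inline, while the two-word scheme spends 8000 bytes inline plus 800 bytes in the constant pool, and the pool only enters the cache when a far branch is actually taken.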

On the G5, however, this actually hurt performance and SunSpider climbed to a poorer result, nearly 1100ms; even with the later tweak to reduce trampoline usage, it was still around 970ms. Our theory is that the G5, being (in Apple's words) "very hungry, very fast and very sequential," pays too big an aggregate penalty to branch to an out-of-line branch stanza when a far call is encountered, for two reasons. First, it appears to be a smaller penalty (possibly even near zero given the aggressive ordering of the G5 dispatch unit) to have empty nops inline that take up some small proportion of instruction cache, because when those empty instructions are patched to a far call in-place the G5 does not need to introduce bubbles in its pipeline doing a branch into the trampoline just to branch again. Second, the hypernerds amongst you will recall from our previous treatise on G5 optimization that there can only be one branch instruction in a dispatch group. The trampoline version must run in (at least) two dispatch groups, because there are two branch instructions, one to the far call in the trampoline and one in the trampoline itself, and each will introduce a pipeline bubble of variable length. The far call in-place will still introduce a bubble, but the entire branch can in the best case execute in a single dispatch group because there is only one branch (the branch-to-CTR instruction at the end), and there will be only one bubble.
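The dispatch group argument can be made concrete with a toy model. This Python sketch applies only the one rule mentioned here, that a branch ends its dispatch group; the real G5 has further grouping constraints that are ignored, and the function name and mnemonic set are assumptions for illustration.

```python
BRANCHES = {"b", "bl", "bc", "bcl", "bctr", "bctrl", "blr"}

def dispatch_groups(mnemonics):
    """Split an executed instruction sequence into groups, one branch per group."""
    groups, current = [], []
    for mnemonic in mnemonics:
        current.append(mnemonic)
        if mnemonic in BRANCHES:
            # A branch closes the group it is in.
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups

# Far call patched in place: a single branch at the end.
in_place = ["lis", "ori", "mtctr", "bctr"]
# Trampoline: branch out of line first, then the far call itself.
via_trampoline = ["b", "lis", "ori", "mtctr", "bctr"]
```

In this model `dispatch_groups(in_place)` yields one group and `dispatch_groups(via_trampoline)` yields two, which matches the one-bubble versus two-bubble reasoning.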

Because the G5 is really just a POWER4 with a deeper pipeline and AltiVec, this property is likely shared by later "big POWER" CPUs like the POWER5, POWER6 and POWER7, as well as "big POWER-like" CPUs such as the Cell PPE and Xenon. We will likely have consumers that will want this branch optimization strategy, but we don't want to lose the gains we get on "little POWER" (such as G3, G4, e500, QorIQ, Gekko/Broadway and PowerPC 4xx) with the cache-saving trampoline approach, so we do both. On the G5, the original four-word stanza branching is compiled in; everything else (G3, 7400 and 7450) uses the two-word branch stanza with the constant pool trampoline. The best of both worlds is thus achieved.

One final note on G5 optimization: I tested compiling the browser with 32-byte-aligned blocks and labels in the JIT allocator, and that slowed things down too (it is not obvious whether this can be more fine-grained). For that matter, when I tried building the browser with 32-byte-aligned loops, jumps, functions and branch targets, that too slowed the browser over the 16-byte-alignment it uses now. It appears to be all a balancing act.

10.0.2-final will come out at the same time as the ESR release. I also plan to write the debug only 11 fairly soon. Please note that I will be transferring service from the Apple Network Server to the POWER6 this coming (USA) holiday weekend, so there may be some intermittent weirdness the weekend of the 18th/19th/20th. In the meantime, please grab a beta build and give it a spin on your architecture:

12 comments:

  1. This is a very noticeable improvement. Peacekeeper is about 12% faster on my G4 PowerBook, which says a lot (more test results later). Thanks to all coders!

  2. I'm consistently getting considerably better results with 10.0.2pre on both G3 and G4.

    PowerBook G3 400 MHz

    TFF 10.0.1
    Sunspider 0.9.1: 5677.9ms +/- 1.6%
    Peacekeeper (new): 77
    Dromaeo (V8 only): 10.49runs/s

    TFF 10.0.2pre
    Sunspider 0.9.1: 5508.1ms +/- 1.1%
    Peacekeeper (new): 82
    Dromaeo (V8 only): 13.44runs/s


    PowerBook G4 1333 MHz

    TFF 10.0.1
    Sunspider 0.9.1: 2256.9ms +/- 0.8%
    Peacekeeper (new): 277
    Dromaeo (all JS tests): 51.78runs/s
    Dromaeo (V8 only): 34.25runs/s
    Kraken: 43560.5ms +/- 2.9%

    TFF 10.0.2pre
    Sunspider 0.9.1: 2086.6ms +/- 1.2%
    Peacekeeper (new): 310
    Dromaeo (all JS tests): 75.35runs/s
    Dromaeo (V8 only): 48.25runs/s
    Kraken: 37429.5ms +/- 0.8%

  3. I'm not sure this is the place to report this, but here is a bug I've seen consistently from at least v.7 to 10pre (it might appear in 4-5 too). This site is a perfect example --> http://classic.battle.net/diablo2exp/

    Scrolling down the page messes up all the text...

    I'm using the G4e build, in case this bug doesn't affect G5s.

    Replies
    1. I have noticed the same bug on quite a few sites, using the same build.

      My Sunspider results for a 1GHz G4:

      2408.6ms +/- 1.3%

      That's down from ~2700.

  4. Hm, there's a weird transparency in that website that messes up the composition. Switch to another tab and back, then you'll see it. Urgs. Looks like the old plugin smear when scrolling, but there seem to be no plugins involved at first glance.

    Nova, thanks for reporting, I'll look into it more closely and file a bug if necessary. You can use our support site if you encounter any other problems with TFF: http://tenfourfox.tenderapp.com/

  5. This is another instance of Mozilla bug 720035. It affects regular Firefox also, but Mozilla hasn't done much with it.

  6. I have whittled it down to a simple test case:

    http://pastehtml.com/view/bo14fk5r6.html

  7. Nova, you can create an Adblock filter as a quick workaround:
    http://classic.battle.net/images/battle/diablo2exp/images/blackbg.gif

    Our regression window is between TFF 6.0.1 and TFF 7.0
    Interestingly, Firefox 9.0 and 10.0.1 on Win XP display the website ok (VPC, no hardware acceleration), so I was just about to file an issue.

  8. It's Cairo. I'm trying to get some suggestions from the Mozilla gfx team. It appears on 11 on the Intel mini, too, so it is not technically our bug (but I'm looking at a fix since it affects us more than others).

  9. Works fine on FF 10.0.1 & 11 on my MacBook.

  10. Please read the Mozilla bug. It occurs with hardware acceleration off, and your MacBook probably has it enabled. My mini doesn't, precisely so I can use it for testing. Seriously, this bug is understood and there is already an (inconvenient) workaround.

  11. I am shocked by Chris' reported scores of 300+ on Peacekeeper (in MJ/TI no doubt).

    Can report a two-bounce load on the very same hardware

