Thursday, January 6, 2011

How the bench was run, and what it got us*

*apologies to Michael Stipe

First some laundry list stuff. I've heard rumours about a Fink version of 10.4Fx, which sounds great. I see fangism is following this blog and I think this is his work, so ping me in the comments and maybe we can get your info into the project wiki.

Second, beta 9. Mozilla is apparently shrinking its scope, which is a good sign in that they are finally approaching release candidate state. At this rate, we should have Firefox 4 just in time for the USA presidential election. ;) This coincides nicely with the completion of the nanojit work, so there will be a beta 9 since it is convenient to do so. Mozilla's time frame calls for a code freeze, like, tomorrow, though historically their estimates have at times been wildly optimistic. That said, if their forecasted schedule is at all accurate, our beta 9 may lag it slightly instead of being ahead as usual. I'll explain, but first, the nanojit.

I implied in my last post that the nanojit is unfortunately not the dramatic leap I was hoping for (notice that I did not say how the bench[mark] was won), and those of you who know me at 68KMLA have probably read my post on that subject. To understand why this is a mixed bag, I should explain what the nanojit actually does, which is also useful to Mozilla hackers at large since there is precious little MDN documentation on it.

When the tracer thinks it has hot code, it turns on a trace recorder and slurps up the JavaScript opcodes as they come down the "pipeline." These opcodes are not executed as-is (as the interpreter would); instead, they are translated into an intermediate language called LIR. LIR is also the intermediate language for Flash ActionScript, since both JITs come from the Tamarin project, though SpiderMonkey (Mozilla's JavaScript implementation) does not use other parts of the ActionScript interpreter.
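Conceptually, the recorder is little more than a loop that translates one opcode stream into another instead of executing it. Here is a minimal sketch in C; the JSOP_ and LIR_ names are stand-ins of my own, far smaller than SpiderMonkey's real tables:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of trace recording: interpreter opcodes go in,
 * LIR comes out, and nothing is executed along the way. These enums
 * are illustrative stand-ins, not SpiderMonkey's actual tables. */
typedef enum { JSOP_GETLOCAL, JSOP_INT2, JSOP_ADD, JSOP_SETLOCAL } JSOp;
typedef enum { LIR_ldi, LIR_immi, LIR_addi, LIR_sti } LOpcode;
typedef struct { LOpcode op; int imm; } LIns;

/* Slurp up the opcode stream, appending one LIR instruction per opcode. */
size_t record_trace(const JSOp *ops, size_t n, LIns *out) {
    size_t emitted = 0;
    for (size_t i = 0; i < n; i++) {
        switch (ops[i]) {
        case JSOP_GETLOCAL: out[emitted++] = (LIns){ LIR_ldi,  0 }; break;
        case JSOP_INT2:     out[emitted++] = (LIns){ LIR_immi, 2 }; break;
        case JSOP_ADD:      out[emitted++] = (LIns){ LIR_addi, 0 }; break;
        case JSOP_SETLOCAL: out[emitted++] = (LIns){ LIR_sti,  0 }; break;
        }
    }
    return emitted;
}
```

Recording something like j += 2 would thus yield roughly load, constant, add, store, which the backend later turns into native instructions.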

After an initial spew of LIR opcodes and/or optimization, the opcodes are then handed to the nanojit assembler (in the source tree, this is js/src/nanojit/Assembler.cpp). This section is more or less architecture-agnostic, which is important later on. It goes through each of the opcodes and, as needed to generate code, calls specific portions of the architecture-specific nanojit (in our case, NativePPC.cpp), which then writes instructions to memory, starting with a standard function prologue. This is not done using things like __asm__(), but with a collection of macros and an opcode table that generate 32-bit instructions directly. If there is some concern about the result of an operation, a guard may be inserted as an internal consistency check, and side exits are made available to abort to the interpreter if funny business is detected. Finally, at least the first time through, an epilogue is emitted. As more code is written and more labels are fixed in memory, the assembler goes back and patches the branches much the way a two-pass assembler would, eventually terminating in the epilogue that was originally emitted.
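The macro-and-table approach boils down to OR-ing bit fields into a 32-bit word and storing it into the code buffer. A sketch, assuming the standard PowerPC D-form field layout (the macro names here are mine, not the nanojit's actual ones):

```c
#include <assert.h>
#include <stdint.h>

/* PowerPC D-form instruction word: opcode(6) | RT(5) | RA(5) | SI(16).
 * Macro names are illustrative; the nanojit's real opcode table differs. */
#define DFORM(op, rt, ra, si) \
    (((uint32_t)(op) << 26) | ((uint32_t)(rt) << 21) | \
     ((uint32_t)(ra) << 16) | (uint32_t)(uint16_t)(si))
#define ADDI(rt, ra, si) DFORM(14, rt, ra, si)  /* addi rt,ra,si */
#define LWZ(rt, ra, d)   DFORM(32, rt, ra, d)   /* lwz  rt,d(ra) */

/* Store the finished 32-bit word directly into the code buffer. */
static void emit(uint32_t **pc, uint32_t ins) { *(*pc)++ = ins; }
```

Because each instruction is assembled as a plain integer like this, no system assembler or __asm__() block is ever involved at runtime.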

When it comes time to execute, the tracer makes certain statistical decisions about how useful the trace will be relative to simply falling back on the interpreter, and runs one or the other. If a trace is deemed detrimental or pathological for performance, it is blacklisted. Otherwise, once compiled, it is used as long as it remains valid or until garbage collection destroys it, the rules of which are still not finalized for Firefox 4.
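A toy version of that keep-or-abandon bookkeeping might look like the following; the fields and thresholds are invented for illustration and are not TraceMonkey's actual tuning:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy sketch of run-or-abandon bookkeeping. The real tracer keeps richer
 * statistics; these counters and thresholds are made up. */
typedef struct {
    unsigned entries;     /* times the compiled trace was entered    */
    unsigned side_exits;  /* times it bailed back to the interpreter */
    bool     blacklisted;
} Fragment;

/* After each run, decide whether the trace still pays for itself. */
bool trace_still_useful(Fragment *f, bool exited_early) {
    f->entries++;
    if (exited_early)
        f->side_exits++;
    /* A trace that mostly aborts is pathological: blacklist it. */
    if (f->entries >= 8 && f->side_exits * 2 > f->entries)
        f->blacklisted = true;
    return !f->blacklisted;
}
```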

As originally written, the PPC nanojit worked fine for Tamarin, but was not sufficient for Firefox. It had no implementation of nFragExit, which is needed to manage linkages with other code fragments and the epilogue; it could not deal with a branch with a 14-bit or 24-bit displacement needing to be patched to a larger displacement (i.e., turning simple branch instructions into sequences such as mtctr/b(c)ctr); and it did not implement the overflow arithmetic instructions, which use guards to catch math overflow and are heavily relied upon by Mozilla.
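The displacement problem comes from the PowerPC branch formats: bc carries a signed 14-bit word displacement (about +/-32KB of reach) and b a signed 24-bit one (about +/-32MB), and any target beyond that must be loaded into CTR instead. A sketch of the range check (the helper names are mine):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* PowerPC branch reach: bc encodes a signed 14-bit word displacement and
 * b a signed 24-bit one; both are shifted left 2 (word-aligned). */
static bool fits_signed(int64_t words, int bits) {
    int64_t lim = (int64_t)1 << (bits - 1);
    return words >= -lim && words < lim;
}
bool reachable_by_bc(int64_t from, int64_t to) {
    return fits_signed((to - from) / 4, 14);
}
bool reachable_by_b(int64_t from, int64_t to) {
    return fits_signed((to - from) / 4, 24);
}
/* Anything further must become lis/ori/mtctr/b(c)ctr: four
 * instructions instead of one. */
```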

My first pass, then, was simply to add those features (I am grateful to Edwin Smith, the nanojit's original author, for his help here; he gave me the critical pointers to get started). This was enough to get the browser up and running, but it didn't seem much faster, and SunSpider numbers were disastrous: on my quad G5 in Reduced power mode, the benchmark took around 7000ms on the stock interpreter in beta 8, but on "TraceMonkey PPC" it took nearly 10000ms -- almost half again as long! Dromaeo was less discouraging, and it did extremely well on the microbenchmarks, but its aggregate score was seriously hurt by the SunSpider-based portion, causing it to still be slightly slower in the final summation. (Please note that no PowerPC browser, with the possible exception of Safari 5 on Leopard, can run Kraken without UI stalls, so I didn't bother.)

The suckage was, fortunately, proportional. There was no specific penalty paid other than that of the tracer and assembler themselves, and using Apple's built-in MakeDataExecutable() function -- which is still part of Carbon, but is only available on PPC -- to flush the instruction cache was actually faster than a hand-rolled routine. Less complex routines ran in proportionately less time, so the problem was the code it was generating.

It is instructive to look at the example of a simple loop:

var i, j = 0; for(i=0;i<2999999;i++) { j += 2; } print(j);

This can be written to run extremely fast in PowerPC assembly, especially being a single loop that can easily take advantage of the CTR branch instructions, which decrement, test and branch in a single operation. Indeed, as a ceiling test I handwrote this in assembly and fed it to gcc, and the result was too fast to time accurately. However, when compiled by the nanojit, it took around 5000ms (!). Here's what it generates, edited for space and clarity. You can see this yourself by building a debug version of TenFourFox and starting it up with the environment variable TMFLAGS set to native (there are loads of other options; set it to help to see them):


/src/mozilla-central/obj-ff-dbg/dist/TenFourFoxDebug.app/Contents/MacOS/% ./js -b -d -j -p ~/empty.js
[ ... ]
0x3565ed0 [prologue]
0x3565ed0 mflr r0
0x3565ed4 stw r0,8(sp)
0x3565ed8 stw r31,4(sp)
0x3565edc mr r31,sp
0x3565ee0 stwu sp,-144(sp)
; single target patch 03565fd4 -> 03565f30 [ I'll explain this in a moment ]
0x3565ee4 stw r13,-72(r31) <= spill r13
0x3565ee8 stw r14,-68(r31) <= spill r14
[ .. rest of the prologue .. ]
0x3565f28 stw r30,-4(r31) <= spill r30
0x3565f2c stw r3,-76(r31) <= spill state
0x3565f30 [label1] ; the actual meat starts here
0x3565f30 lwz r28,8(r3)
0x3565f34 lwz r29,12(r3)
0x3565f34 ; pc now 03565f34

0x3565f38 ; swapped @ 03565f38

0x3565f38 lis r30,801 (0x3210000)
0x3565f3c ori r30,r30,57368 (0xe018)
0x3565f40 lwz r30,0(r30)
0x3565f44 cmpwi cr7,r30,0 (0x0)
0x3565f48 beq+ cr7,0x3565f58
0x3565f4c b 0x3575dec ; this goes to an exit block
0x3565f50 nop
0x3565f54 nop
0x3565f58 lwz r30,1560(r29)
0x3565f5c stw r30,0(r28)
0x3565f5c ; pc now 03565f5c

0x3565f60 ; swapped @ 03565f60

0x3565f60 li r27,2 (0x2)
0x3565f64 stw r27,8(r28)
0x3565f68 addo r30,r30,r27
0x3565f6c mcrxr cr7
0x3565f70 ble+ cr7,0x3565f84
0x3565f74 lis r0,855 (0x3570000)
0x3565f78 ori r0,r0,24140 (0x5e4c)
0x3565f7c mtctr r0
0x3565f80 bctr
0x3565f84 stw r30,1560(r29)
0x3565f88 lwz r30,1552(r29)
0x3565f88 ; pc now 03565f88

0x3565f8c ; swapped @ 03565f8c

0x3565f8c li r27,1 (0x1)
0x3565f90 addo r30,r30,r27
0x3565f94 mcrxr cr7
0x3565f98 ble+ cr7,0x3565fac
0x3565f9c lis r0,855 (0x3570000)
0x3565fa0 ori r0,r0,24236 (0x5eac)
0x3565fa4 mtctr r0
0x3565fa8 bctr ; this also goes to an exit block
0x3565fac stw r30,1552(r29)
0x3565fb0 stw r30,0(r28)
0x3565fb0 ; pc now 03565fb0

0x3565fb4 ; swapped @ 03565fb4

0x3565fb4 lis r29,45 (0x2d0000)
0x3565fb8 ori r29,r29,50879 (0xc6bf)
0x3565fbc stw r29,8(r28)
0x3565fbc ; pc now 03565fbc

0x3565fc0 ; swapped @ 03565fc0

0x3565fc0 cmpw cr7,r30,r29
0x3565fc4 blt+ cr7,0x3565fd4
0x3565fc8 b 0x3575f0c ; and this also goes to an exit block
0x3565fcc nop
0x3565fd0 nop
0x3565fd4 lis r2,0 (0x0) ; this is patched by the single target patch above
0x3565fd8 ori r2,r2,0 (0x0)
0x3565fdc mtctr r2
0x3565fe0 bctr
0x3565fe4 b 0x3575f6c
[ .. skip memory to epilogue .. ]
0x3575f6c lwz r3,-76(r31) <= restore state
0x3575f70 lwz r13,-72(r31) <= restore r13
[ ... ]
0x3575fb0 lwz r29,-8(r31) <= restore r29
0x3575fb4 lwz r30,-4(r31) <= restore r30
0x3575fb8 lis r12,1152 (0x4800000)
0x3575fbc ori r12,r12,31964 (0x7cdc)
0x3575fc0 lis r2,855 (0x3570000)
0x3575fc4 ori r2,r2,24528 (0x5fd0)
0x3575fc8 mtctr r2
0x3575fcc bctr
0x3575fd0 mr r3,r12
0x3575fd4 mr sp,r31
0x3575fd8 lwz r31,4(sp)
0x3575fdc lwz r0,8(sp)
0x3575fe0 mtlr r0
0x3575fe4 blr


Ugh!

For the sake of fairness, I have left the guards out as they have a very minimal impact on runtime, and the prologue and epilogue are fairly standard ABI stuff. But this is a lot of code for a small loop!

As originally written, the branches were not hinted, so my first attempt was to hint the branches with likely bits and to allow a patch to be optimized down to a smaller set of instructions where possible. Notice the code near the end that loads CTR with 0x0 and tries to jump to it -- that is recognized by another nanojit function called nPatchBranch which (surprise!) patches branches: the sequence acts as a long jump, and the correct target is inserted into the lis/ori by the patch routine. If the branch later had to be changed and could be turned into a 14-bit branch using fewer instructions, it was, with the remainder simply nopped out. Likewise, the nops in the code leave room in case a branch has to be "demoted" to a slower far call. This didn't help much, so I loaded the code into Shark.
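The lis/ori patch itself is simple: both instructions keep their 16-bit immediate in the low half of the word, so retargeting the following mtctr/bctr is two masked stores. A sketch of what a routine like nPatchBranch must do for the far-jump case (this is my reconstruction, not the actual code):

```c
#include <assert.h>
#include <stdint.h>

/* Rewrite the 16-bit immediates of an existing lis/ori pair so the
 * mtctr/bctr that follows jumps to the new target. Because ori is a
 * logical OR, no sign fix-up of the high half is needed, unlike
 * addi-based pairs. */
void patch_lis_ori(uint32_t *ins, uint32_t target) {
    ins[0] = (ins[0] & 0xFFFF0000u) | (target >> 16);      /* lis rN,hi    */
    ins[1] = (ins[1] & 0xFFFF0000u) | (target & 0xFFFFu);  /* ori rN,rN,lo */
}
```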

Shark showed several things of note. First, as you scan the code, you will notice lots of potential pipeline stalls (ignore the "swapped" comments; I'll get to those). While the G5 is fairly aggressive with instruction reordering, it is not perfect, and older CPUs are less effective at it. Second, the branches are really hot. Third, there are a lot of store instructions inside the loop body proper, aggravating the stall situation. Fourth, it's counting the loop the hard way, with explicit compares and adds instead of the CTR.

To help with instruction ordering, I rewrote the macro that emits instructions to see if it could dynamically reorder them with the preceding instruction (call it "peephole optimization" if you like). You can see where the swapper worked in the instruction stream (just swap in your mind the instructions immediately around the swapper's log note). If there were no register dependencies and the code did not need to be emitted in strict serialized order, the instructions were swapped, which helped to turn things like


li r0,...
ori r0,...
li r1,...
ori r1,...


into a more typical and efficient


li r0,...
li r1,...
ori r0,...
ori r1,...


and cool off the registers. Although the G5 can do some of this in hardware, it helped with the G4s. A special attempt was made with comparison instructions to turn st cmp b into cmp st b, so that the comparison is more likely to have been computed by the time the branch comes along; the store can be safely exchanged with the compare since it does not affect it. Finally, expensive instructions such as integer multiply were strength-reduced into repeated smaller operations (in mulli's case, into strategically repeated adds). These improvements yielded about 5-10% additional performance.
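For the multiply case, one common strength reduction works shift-and-add from the set bits of the constant; the in-tree rewrite of mulli may decompose differently, so treat this as an illustrative sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative strength reduction of a multiply by a constant into
 * shifts and adds: one shift/add pair per set bit of k. The actual
 * nanojit rewrite of mulli may differ. */
int32_t mul_by_const(int32_t x, uint32_t k) {
    uint32_t acc = 0, ux = (uint32_t)x;
    for (; k != 0; k >>= 1, ux <<= 1)
        if (k & 1)
            acc += ux;          /* an slwi + add on the PPC side */
    return (int32_t)acc;
}
```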

Still, the really ugly instructions such as branching could not be totally eliminated, and the memory stores could not be moved out of the loop because they stored stuff in the runtime stack that the side exits (if they were taken) might require. This was really painful for a processor like the PPC that is strongly oriented to load-store (whereas x86 is much better at memory operations) and had pipelining consequences. There is no way right now for the PPC nanojit to get a bird's eye view of what it's compiling so that it can more efficiently order larger functional blocks, such as moving the stack store operations into a side exit or the epilogue, or rewriting loops with the special PPC loop instructions if that sort of situation can be recognized. Perhaps that is more the domain of the methodjit, but it would be nice to have such meta-analysis available for the nanojit as well.

It would hurt less if the tracer were statistically able to predict accurately, for the PowerPC, which operations will be more painful at runtime, but tuning these parameters has yielded only small margins to date.

So am I throwing in the towel with the nanojit? No. For one thing, it still is faster at certain operations, and it can cache code it compiles (which is beneficial to L2-fat processors). While code would have to execute a large number of times to be worth it, things like the browser chrome do indeed execute a large number of times, as do many AJAXy applications. For another thing, I want to get more eyes on it to see if we as a group can make it better. While trying to beat a powerful optimizing compiler with a kludgey code generator is generally a losing proposition, there are probably other optimizations I haven't thought of.

The best and final solution would be to change some aspects of the nanojit to let the assembler see more of the LIR at once and make bigger-picture decisions (or hint the processor-specific portions about them), or to expand the LIR opcode set with more purpose-oriented opcodes that could tell a code generator like this one what a section of code is intended to do. Certainly for loops there is a better way of implementing this in assembly than the above. I don't know how likely that is to happen, however.

The plan for the nanojit then is this: it will come out as part of beta 9, turned on by default (you can turn it off in about:config). It has been tested mostly on G5 and G4/7450, though I also ran it on a G4/7400 and it seemed okay. I don't have a G3 running Tiger to test this with, so you guys will be on your own. See if it's faster for your workload. See if it's something you can improve upon. It's a start.

2 comments:

  1. Hi all, I've published the tenfourfox package to Fink's unstable tree (beta 8). It seems to run great on 10.4/ppc (G4). If you have fink installed, and the unstable tree enabled:

    $ fink install tenfourfox

    will grab sources, patches, build dependencies, and package it up as a .deb. The installed app can be launched from Terminal/xterm with 'tenfourfox', or by docking /Applications/Fink/TenFourFox.app.

    Enjoy!

  2. I have added this to the How to Build wiki entry. Thanks, dude!

