
Compiler benchmarking for the 21st century

I would like to propose a new way of measuring the quality of a compiler’s code generator: The highest quality compiler is one that generates identical code for all programs that produce the same output, e.g., a compiler might spot programs that calculate pi and always generate code that uses the most rapidly converging method known. This is a very different approach to the traditional methods based on using (mostly) execution time or size (usually code but sometimes data) as a measure of quality.

Why is a new measurement method needed, and why choose this one? It is relatively easy for compiler vendors to tune their products to the commonly used benchmarks, which seem to have lost their role as drivers for new optimization techniques. Different developers have different writing habits, and companies should not have to waste time and money changing those habits just to get the best quality code out of a compiler; compilers should handle differences in coding habits and not let them affect the quality of the generated code. There are major savings to be had by optimizing the effect that developers are trying to achieve rather than what they have actually written (these days new optimizations targeted at what developers have written show very low percentage improvements).

Deducing that a function calculates pi requires a level of sophistication in whole program analysis that is unlikely to be available in production compilers for some years to come (ok, detecting 4*atan(1.0) is possible today). What is needed is a collection of compilable files containing source code that aims to achieve an outcome in lots of different ways. To get the ball rolling the “3n times 2” problem is presented as the first of this new breed of benchmarks.

The “3n times 2” problem is a variant on the 3n+1 problem that has been tweaked to create more optimization opportunities. One implementation of the “3n times 2” problem is:

if (is_odd(n))
   n = 3*n+1;
else
   n = 2*n; // this is n = n / 2; in the 3n+1 problem

There are lots of ways of writing code that has the same effect, some of the statements I have seen for calculating n=3*n+1 include: n = n + n + n + 1, n = (n << 1) + n + 1 and n *= 3; n++, while some of the ways of checking if n is odd include: n & 1, (n / 2)*2 != n and n % 2.
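As a quick sanity check, the equivalence of these 3*n+1 variants can be verified mechanically. The following is my own sketch, not part of the benchmark (it assumes unsigned arithmetic, which sidesteps signed-overflow undefined behavior):

#include <assert.h>

int main(void)
{
   for (unsigned n = 0; n < 100000; n++)
      {
      unsigned t = n;

      assert(3*n + 1 == n + n + n + 1);    /* repeated addition */
      assert(3*n + 1 == (n << 1) + n + 1); /* shift and add */
      t *= 3; t++;                         /* multiply, then increment */
      assert(3*n + 1 == t);
      }
   return 0;
}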

I have created a list of different ways in which 3*n+1 might be calculated and is_odd(n) might be tested, and written a script to generate a function containing all possible combinations (to reduce their number no variants were created for the least interesting case of n=2*n, which was always generated in this form). The following is a snippet of the generated code (download everything):

if (n & 1) n=(n << 2) - n +1; else n*=2;
if (n & 1) n=3*n+1; else n*=2;
if (n & 1) n += 2*n +1; else n*=2;
if ((n / 2)*2 != n) { t=(n << 1); n=t+n+1; } else n*=2;
if ((n / 2)*2 != n) { n*=3; n++; } else n*=2;

Benchmarks need a means of summarizing the results and here I make a stab at doing that for gcc 4.6.1 and llvm 2.9, when executed using the -O3 option (output here and here). Both compilers generated a total of four different sequences for the 27 'different' statements (I'm not sure what to do about the inline function tests and have ignored them here) with none of the sequences being shared between compilers. The following lists the number of occurrences of each sequence, e.g., gcc generated one sequence 16 times, another 8 times and so on:

gcc    16   8   2   1
llvm   12   6   6   3

How might we turn these counts into a single number that enables compiler performance to be compared? One possibility is to award 1 point for each occurrence of the most common sequence, 1/2 point for each occurrence of the second most common, 1/4 for the third and so on. Using this scheme, gcc gets 20.625 and llvm gets 16.875, so gcc has greater consistency (I am loath to use the much overused phrase higher quality).
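For the record, the scoring calculation is simple enough to sketch in a few lines of C (the counts are those listed above; the function and variable names are my own scaffolding):

#include <stdio.h>

/* weighted consistency score: the most common sequence scores 1 per
   occurrence, the next 1/2, then 1/4 and so on */
double score(const int counts[], int num_seq)
{
   double total = 0.0, weight = 1.0;

   for (int i = 0; i < num_seq; i++)
      {
      total += weight*counts[i];
      weight /= 2.0;
      }
   return total;
}

int main(void)
{
   int gcc_counts[] = {16, 8, 2, 1};
   int llvm_counts[] = {12, 6, 6, 3};

   printf("gcc  %.3f\n", score(gcc_counts, 4));   /* prints 20.625 */
   printf("llvm %.3f\n", score(llvm_counts, 4));  /* prints 16.875 */
   return 0;
}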

Now for a closer look at the code generated.

Both compilers always generated code to test the least significant bit for the conditional expressions n & 1 and n % 2. For the test (n / 2)*2 != n gcc generated the not very clever right-shift/left-shift/compare sequence, while llvm and'ed out the bottom bit and then compared; so both compilers failed to reduce what is a surprisingly common check for a number being odd to a single bit test.
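The missed reduction is valid for signed types as well, given two's complement representation and C's truncating division; the following exhaustive check over the 16-bit values is my own sketch, not part of the benchmark:

#include <assert.h>
#include <limits.h>

int main(void)
{
   /* the divide/multiply/compare test agrees with a single bit test
      for every 16-bit signed value */
   for (int n = SHRT_MIN; n <= SHRT_MAX; n++)
      assert(((n / 2)*2 != n) == ((n & 1) != 0));
   return 0;
}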

The optimal code for n=3*n+1 on a modern x86 processor is (lots of register combinations are possible, let's assume rdx contains n) leal 1(%rdx,%rdx,2), %edx, which computes %rdx + 2*%rdx + 1 in a single instruction, and this is what both compilers generated a lot of the time. This locally optimal code is not always generated because:

  • gcc fails to detect that (n << 2)-n+1 is equivalent to (n << 1)+n+1 and generates the sequence leal 0(,%rax,4), %edx ; subl %eax, %edx ; addl $1, %edx (I pointed this out to a gcc maintainer some time ago, and he suggested reporting it as a bug). This 'bug' occurs three times in total.
  • For some forms of the calculation llvm generates globally better code by taking the else arm into consideration. For instance, when the calculation is written as n += (n << 1) +1, llvm deduces that (n << 1) and the 2*n in the else are equivalent, and evaluates this value into a register before performing the conditional test, thus removing the need for an unconditional jump around the 'else' code (a source-level sketch of this transformation appears after the list):
      leal	(%rax,%rax), %ecx
      testb	$1, %al
      je	.LBB0_8
    # BB#7:
      orl	$1, %ecx     # deduced ecx is even, arithmetic unit not needed!
      addl	%eax, %ecx
    .LBB0_8:

    This more efficient sequence occurs nine times in total.
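In source terms, what llvm has effectively done in the second case above is the following (my reconstruction of the transformation, not llvm's internal representation):

unsigned hoisted_step(unsigned n)
{
   unsigned t = 2*n;    /* needed by both arms, so evaluated before the test */

   if (n & 1)
      n = (t | 1) + n;  /* t is even, so t | 1 equals t + 1; n becomes 3*n+1 */
   else
      n = t;            /* n becomes 2*n, already in a register */
   return n;
}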

The best sequence seen was generated by gcc:

	testb	$1, %dl
	leal	(%rdx,%rdx), %eax
	je	.L6
	leal	1(%rdx,%rdx,2), %eax
.L6:

with llvm and pre-4.6 versions of gcc generating the more traditional form below (in the sequence above, gcc 4.6.1 assumes that the 'then' arm is the most likely to be executed and trades off executing an extra leal against a very slow jmp):

	testb	$1, %al
	je	.LBB0_5
# BB#4:
	leal	1(%rax,%rax,2), %eax
	jmp	.LBB0_6
.LBB0_5:
	addl	%eax, %eax
.LBB0_6:

There is still room for improvement, perhaps by using the conditional move instruction (which gcc actually generates within the not-very-clever code sequence for (n / 2)*2 != n) or by using the fact that eax already holds 2*n (the potential saving would come through a reduction in complexity of the internal resources needed to execute the instruction).
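At the source level the branch-free form is a conditional expression; whether a compiler turns it into a cmov depends on its cost model. A sketch (the function name is my own):

unsigned branch_free_step(unsigned n)
{
   unsigned odd_result = 3*n + 1;
   unsigned even_result = 2*n;

   /* both values computed, one selected: maps naturally onto cmov */
   return (n & 1) ? odd_result : even_result;
}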

llvm insists on storing the calculated value back into n at the end of every statement. I'm not sure if this is a bug or a feature designed to make runtime debugging easier (if so it ought to be switched off by default).

Missed optimization opportunities (not intended to be part of this benchmark and if encountered would require a restructuring of the test source) include noticing that if n is odd then 3n+1 is always even, creating the opportunity to perform the following multiply by 2 without an if test.
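For example, two consecutive steps could be collapsed as follows (my illustration, not something either compiler currently does):

unsigned two_steps(unsigned n)
{
   /* first step: whichever arm executes, the result is even
      (3n+1 is even when n is odd, and 2n is always even) */
   if (n & 1)
      n = 3*n + 1;
   else
      n = 2*n;

   /* second step: n is known to be even, so the odd test is dead
      and only the multiply by 2 can execute */
   n = 2*n;
   return n;
}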

Perhaps one day, compilers will figure out when a program is calculating pi and generate code that uses the best known algorithm. In the meantime, I am interested in hearing suggestions for additional different-algorithm-same-code benchmarks.

  1. Russ Magee
    July 25, 2011 19:32 | #1

    I fully admit to not reading your entire article — just the first few paragraphs, and the all-important concluding one, but that was enough for me to have serious doubts.

    Are you seriously proposing that compiler writers should try to invent compilers that DWIM (Do What I Mean)?

    What if I -want- my program to calculate Pi, but with an error in the 2,434,243th digit? How can -any- program know such a thing without being fully sentient, and having interviewed the programmer as to his/her intentions? I do NOT want a compiler to ‘deduce’ what my algorithm truly is. Rather, I want it to translate -exactly- what I’ve written into the equivalent machine code.

    I fear you are asking for the impossible — Alan Turing had something to say about programs evaluating the outcomes of other programs. The Halting Problem won’t be solved unless someone revolutionizes fundamental logic in some unforeseen manner.

  2. July 25, 2011 20:00 | #2

    @Russ Magee
    Provided the ‘incorrect’ 2,434,243th digit (plus all the others) of the output was what you expected, would you really go to the trouble of looking at what code the compiler generated? I’m sure compiler vendors will support a DWIS (do what I say) option for the faint of heart. Today’s compilers convert (n / 2)*2 != n to an AND of the least significant bit followed by a comparison, even though the code contains a divide/multiply; if the maintainers of gcc/llvm read this post they might add the optimization of just checking one bit.
    The Halting and related problems encompass all programs that could be written. The undecidability issues go away when the set of programs under consideration is restricted to being finite. As support for more and more algorithms is added to a DWIM compiler, it will eventually enter a phase of diminishing optimization returns and its maintainers will have to find something else to do.

  3. Russ Magee
    July 25, 2011 20:52 | #3

    @Derek

    Fair enough — if it’s for very restricted problem sets, we are not talking Halting Problem. However, if these optimizations are meant to recognize and choose the best algorithm for such specific calculations, why not just inform programmers that there’s a library function over here called ‘foo’, which uses the best-known algorithm for that specific case, and be done with it? That’s what standard libraries and frameworks are for, aren’t they?

    I guess I’m visualizing my IDE/compiler someday popping up Clippy, who then says “I see you’re trying to calculate Pi! …”. I’d rather, as a programmer, be aware that I’m writing that, and have some help from the dev environment and documentation, to find the already-existent library for that so I can make use of it.

  4. Russ Magee
    July 25, 2011 20:57 | #4

    @Derek

    Might I also add, that the optimization of (n/2)*2 != n to an AND operation isn’t really algorithm recognition, in my opinion, but just a basic cancelling out of subexpressions. Recognizing whole algorithms automatically is, I suspect, a much more difficult beast than simple arithmetic simplification.

  5. July 25, 2011 23:20 | #5

    @Russ Magee
    I would not say very restricted, just non-infinite. During initial development, developers are happy to change their code to call lib_magic or whatever, but they are often unwilling to touch existing code and want the compiler to do whatever is necessary. The numbers tool is a step in the direction of algorithm recognition.

    Generating a bit test from (n/2)*2 != n, rather than AND/compare, might be said to be detecting the algorithm ‘is odd’.

  6. Worpz
    July 27, 2011 14:00 | #6

    A strong optimizing compiler has a large “null-space” of source code variations.

    (This is not a new way to describe the quality of code generation; there’s at least one paper that proposes this measure. I can’t re-find the ref right now, will try to post it later.)

  7. July 27, 2011 15:36 | #7

    @Worpz
    Yes, the space of code variation is very large. In the case of 3n+1 the motivating factor is optimization: developers think they are helping the compiler by writing n+n+n because multiply is ‘known’ to be slow. As algorithms get more complicated the compiler will have to deal with ignorance and incompetence, which can result in truly surprising code.

    I would be very interested in a reference to the paper you mention. There has been some work on spotting physics equations and of course there is the work in dimensional analysis.
