Whole-program optimization: there’s gold in them hills
Information is the lifeblood of compiler optimization, and compiler writers are always saying “If only this-or-that were known, we could do this optimization.”
Burrowing down the knowing-this-or-that rabbit hole leads to the conclusion that code generation has to be delayed until link time, because it is only at link time that all the necessary information is available.
Whole-program optimization, as it is known, has been a reality in the major desktop compilers for several years, in part because computers with 64GB+ of memory have become generally available (compiler optimization has always been limited by the amount of memory available on developer computers). By reality, I mean compilers support whole-program-optimization flags and do some optimizations that require holding a representation of the entire program in memory (even Microsoft, not a company known for leading-edge compiler products {although they have leading-edge compiler people in their research group}, supports an option having this name).
It is still early days for optimizations that are “whole-program”, rather like the early days of code optimization (when things like loop unrolling were leading edge and even constant folding was not common).
An annoying developer characteristic is writing code that calls a function every five statements, on average. Calling a function means that all the potentially reusable values loaded into registers before the call cannot be safely used after it (figuring out whether it is worth saving/restoring them around the call is hard; yes developers, it’s all your fault that we compiler writers have such a tough job :-P).
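To make this concrete, here is a made-up C++ fragment (the names are mine): because the compiler cannot see into log_progress, it has to assume the call might modify the global n and clobber caller-saved registers, so n is reloaded from memory on every iteration.

    extern void log_progress(int step);   // opaque: body not visible here
    extern int n;                         // global: the call might change it

    int sum_to_n(void) {
        int total = 0;
        for (int i = 0; i < n; i++) {     // n reloaded from memory each time around
            total += i;
            log_progress(i);              // call forces conservative assumptions
        }
        return total;
    }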
Solutions to the function call problem include inlining and flow analysis to figure out the consequences of the call. However, the called function calls other functions, which in turn call yet more functions, burrowing ever further down the rabbit hole.
With whole-program optimization, all the code is available for analysis; given enough memory and processor time, lots of useful information can be figured out. Most functions are only called once, so there are lots of savings to be had from using known parameter values (many are numeric literals) to figure out whether an if-statement is necessary (i.e., is dead code) and how many times loops iterate.
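A hypothetical sketch (the function names are invented) of what becomes possible when every call site is visible: process is called exactly once, with literal arguments, so the debug branch is dead code and the loop trip count is a compile-time constant.

    extern void expensive_logging(void);
    extern void do_work(int i);

    void process(bool debug, int count) {
        if (debug)                        // dead code once debug is known to be false
            expensive_logging();
        for (int i = 0; i < count; i++)   // trip count becomes a known 100
            do_work(i);
    }

    int main() {
        process(false, 100);              // the only call in the whole program
        return 0;
    }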
More fully applying known techniques is the obvious easy use-case for whole-program optimization, but the savings probably won’t be that big. What about new techniques that previously could not even be attempted?
For instance, walking a data structure until some condition is met is a common operation. Loading the field/member being tested and the next/prev field results in one or more cache lines being filled (on the assumption that values adjacent in storage are likely to be accessed in the immediate future). However, data structures often contain many fields, only a few of which need to be accessed during the search; when the next value needed is in another instance of the struct/record/class, it is unlikely to already be available in a loaded cache line. One useful optimization is to split the data structure into two structures, one holding the fields accessed during the iterative search and the other holding everything else. This data remapping increases the likelihood that cache lines will hold the values needed in the near future; the compiler looks after the details. Performance gains of 27% have been reported.
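Here is a sketch of the kind of splitting a compiler might perform (the layout and field names are invented for illustration): the hot fields used during the search are packed together, so several nodes fit in a 64-byte cache line, while the rarely accessed payload moves behind a pointer.

    // Before: a search touching only 'key' and 'next' still drags the
    // payload into the cache with every node visited.
    struct Node {
        int   key;
        Node *next;
        char  payload[120];   // cold: never accessed during the search
    };

    // After splitting: the hot fields pack several nodes per cache line;
    // the cold payload is reached via one extra indirection.
    struct HotNode {
        int      key;
        HotNode *next;
        char    *payload;     // points at the cold data, stored elsewhere
    };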
One study of C++ found that, on average, 12% of members were dead, i.e., never accessed in the code. Removing these saved 4.4% of storage, but, again, the potential performance gain comes from improving the cache hit ratio.
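A contrived before/after illustration (the types are made up): if whole-program analysis proves that some members are never read or written, removing them shrinks every instance, packing more of them into each cache line.

    struct Customer {         // as written by the developer
        int  id;
        int  legacy_id;       // dead: never accessed anywhere in the program
        char notes[12];       // dead
        int  balance;
    };

    struct Customer_slim {    // the layout a compiler could use instead
        int id;
        int balance;
    };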
The row/column storage layout of arrays is not cache friendly for many access patterns; using a Morton-order layout instead can have a big performance impact.
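For the curious, a Morton index is obtained by interleaving the bits of the row and column coordinates, so elements that are close together in 2-D space tend to be close together in memory; the following minimal sketch uses the standard bit-twiddling approach (and assumes coordinates fit in 16 bits).

    #include <cstdint>

    // Spread the low 16 bits of x apart, inserting a zero between each pair.
    static uint32_t spread_bits(uint32_t x) {
        x &= 0xFFFF;
        x = (x | (x << 8)) & 0x00FF00FF;
        x = (x | (x << 4)) & 0x0F0F0F0F;
        x = (x | (x << 2)) & 0x33333333;
        x = (x | (x << 1)) & 0x55555555;
        return x;
    }

    // Morton index: row bits in the odd positions, column bits in the even ones.
    uint32_t morton_index(uint32_t row, uint32_t col) {
        return (spread_bits(row) << 1) | spread_bits(col);
    }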
There are really, really big savings to be had by providing compilers with a means of controlling the processor’s caches, e.g., instructions to load and flush cache lines. At the moment, researchers are limited to simulations, which show that substantial power savings, plus some performance gains, are possible.
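Some of this control is already exposed on x86; the sketch below (the function name is mine) uses the GCC/Clang builtin __builtin_prefetch and the SSE2 intrinsic _mm_clflush to hint that one address will be needed soon, and to evict another that will not be touched again.

    #include <immintrin.h>    // _mm_clflush (x86, SSE2)

    void cache_hints(const void *soon, const void *finished) {
        __builtin_prefetch(soon, 0, 3);   // read access, keep in all cache levels
        _mm_clflush(finished);            // flush the cache line holding 'finished'
    }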
Go forth and think “whole-program”.