Assessing my predictions for 2009

January 24, 2010 No comments

I have been rather remiss in revisiting the predictions I made for 2009 to see how they fared. Only two out of the six predictions were sufficiently precise to enable an assessment one year later; the other four talking more about trends. What of these two predictions?

The LLVM project will die. Ok, the project is still going and at the end of December the compiler could build itself but the build is not yet in a state to self host (i.e., llvm compiler creates an executable of itself from its own source and this executable can build an executable from source that is identical to itself {modulo date/time stamps etc}). Self hosting is a major event and on some of the projects I have worked on it has been a contractual payment milestone.

Is llvm competition for gcc? While there might not be much commercial competition as such (would Apple be providing funding to gcc if it were not involved in llvm?), I’m sure that developers working on both projects want their respective compiler to be the better one. According to some llvm benchmarks they compile code twice as fast as gcc. If this performance difference holds up across lots of source how might the gcc folk respond? Is gcc compiler time performance an issue that developers care about or is quality of generated code more important? For me the latter is more important and I have not been able to find a reliable, recent, performance comparison. I understand that almost all gcc funding is targeted at code generation related development, so if compile time is an issue gcc may be in a hole.

I still don’t see the logic behind Apple funding this project and continue to think that this funding will be withdrawn sometime, perhaps I was just a year off. I do wish the llvm developers well and look forward to them celebrating a self hosted compiler.

Static analysis will go mainstream. Ok, I was overly optimistic here. There do not seem to have been any casualties in 2009 among the commercial vendors selling static analysis tools and the growth of PHP has resulted in a number of companies selling tools that scan for source security vulnerabilities (.e.g, SQL injection attacks).

I was hoping that the availability of Treehydra would spark the development of static analysis plugins for gcc. Perhaps the neatness of the idea blinded me to what I already knew; many developers dislike the warnings generated by the -Wall option and therefore might be expected to dislike any related kind of warning. One important usability issue is that Treehydra operates on the abstract syntax tree representation used internally by gcc, GIMPLE. Learning about this representation requires a big investment before a plugin developer becomes productive.

One tool that did experience a lot of growth in 2009 was Coccinelle, at least judged by the traffic on its mailing list. The Coccinelle developers continue to improve it and are responsive the questions. Don’t be put off by the low version number (currently 0.2), it is much better than most tools with larger version numbers.

Why does Intel sell compilers?

January 5, 2010 No comments

Intel is a commercial company and the obvious answer to the question of why it sells compilers is: to make money. Does a company that makes billions of dollars in profits really have any interest in making a few million, I’m guessing, from selling compilers? Of course not, Intel’s interest in compilers is as a means of helping them sell more hardware.

How can a compiler be used to increase computer hardware sales? One possibility is improved performance and another is customer perception of improved performance. My company’s first product was a code optimizer and I was always surprised by the number of customers who bought the product without ever performing any before/after timing benchmarks; I learned that engineers are seduced by the image of performance and only a few are ever forced to get involved in measuring it (having been backed into a corner because what they currently have is not good enough).

Intel are not the only company selling x86 chips, AMD and VIA have their own Intel compatible x86 chips. Intel compatible? Doesn’t that mean that programs compiled using the Intel compiler will execute just as quickly on the equivalent chip sold by competitors? Technically the answer is no, but the performance differences are likely to be small in most cases. However, I’m sure there are many developers who have been seduced by Intel’s marketing machine into believing that they need to purchase x86 chips from Intel to make sure they receive this ‘worthwhile’ benefit.

Where do manufacturer performance differences, for the same sequence of instructions, come from? They are caused by the often substantial internal architectural difference between the processors sold by different manufacturers, also Intel and its competitors are continually introducing new processor architectures and processors from the same company will have differences performance characteristics. It is possible for an optimizer to make use of information on different processor characteristics to tune the machine code generated for a particular high-level language construct, with the developer selecting the desired optimization target processor via compiler switches.

Optimizing code for one particular processor architecture is a specialist market. But let’s not forget all those customers who will be seduced by the image of performance and ignore details such as their programs being executed on a wide variety of architectures.

The quality of a compiler’s runtime library can have a significant impact on a program’s performance. The x86 instruction set has grown over time and large performance gains can be had by a library function that adapts to the instructions available on the processor it is currently executing on. The CPUID instruction provides all of the necessary information.

As well as providing information on the kind of processor and its architectural features the CPUID instruction can return information about the claimed manufacturer of the chip (some manufacturers provide a mechanism that allows users to change the character sequence returned by this instruction).

The behavior of some of the functions in Intel’s runtime library depends on the
character sequence returned by the CPUID instruction, producing better performance for the sequence “GenuineIntel”. The US Federal Trade Commission have filed a complaint alleging that this is anti-competitive (more details) and requested that this manufacturer dependency be removed.

I think that removing this manufacturer dependency will have little impact on sales. Any Intel compiler user who is not targeting Intel chips and who is has a real interest in performance can patch the runtime library, the Supercomputer crowd will want to talk to the kind of sophisticated processor/compiler engineers that Intel makes available and for everybody else it is about the perception of performance. In fact Intel ought to agree to a ‘manufacturer free’ runtime library pronto before too many developers have their delusions shattered.

The changing shape of code in the next decade

December 29, 2009 No comments

I think there are two forces that will have a major impact on the shape of code in the next decade:

  • Asian developers. China and India each have a population that is more than twice as large as Europe and the US combined, and software development has been kick-started in these countries by a significant amount of IT out sourcing. I have one comparative data point for software developers who might be of the hacker ilk. A discussion of my C book on a Chinese blog resulted in a download volume that was 50% of the size of the one that occurred when the book appeared as a news item on Slashdot.
  • Scripting languages. Software is written to solve a problem, and there are only so many packaged applications (COTS or bespoke) that can profitably be supported. Scripting languages are generally designed to operate within one application domain, e.g., Bash, numerical analysis languages such as R and graphical plotting languages such as gnuplot.

While markup languages are very widely used they tend to be read and written by programs not people.

Having to read code containing non-alphabetic characters is always a shock the first time. Simply having to compare two sequences of symbols for equality is hard work. My first experience of having to do this in real time was checking train station names once I had travelled outside central Tokyo and the names were no longer also given in Romaji.

其中,ul分别是bootmap_size(bit map的size),start_pfn(开始的页框)
                                max_low_pfn(被内核直接映射的最后一个页框的页框号) ;

Developers based in China and India have many different cultural conventions compared to the West (and each other) and I suspect that these will affect the code they write (my favorite potential effect involves treating time vertically rather than horizontally). Many coding conventions used by a given programming language community exist because of the habits adopted by early users of that language, these being passed on to subsequent users. How many Chinese and Indian developers are being taught to use these conventions, are the influential teachers spreading different conventions? I don’t have a problem with different conventions being adopted, other than that having different communities using different conventions increases the cost for one community to adopt another community’s source.

Programs written in a scripting language tend to be much shorter (often being contained within a single file) and make use of much more application knowledge than programs written in general purpose languages. Their data flow tends to be relatively simple (e.g., some values are read/calculated and passed to a function that has some external effect), while the relative complexity of the control flow seems to depend on the language (I only have a few data points for both assertions).

Because of their specialized nature, most scripting languages will not have enough users to support any kind of third party support tool market, e.g., testing tools. Does this mean that programs written in a scripting language will contain proportionally more faults? Perhaps their small size means that only a small number of execution paths are possible, and these are quickly exercised by everyday usage (I don’t know of any research on this topic).

The Met Office ‘climategate’ Perl code

December 25, 2009 3 comments

In response to the Climategate goings on the UK Meteorological Office has released a subset of its land surface climate station records and some code to process it. The code consists of 397 lines of Perl (station_gridder.perl and pretty printer and kind of implies more than one person doing the editing. And why are some variables names capitalized and other not (the names in subroutine read_station are all lower case, while the names in the surrounding subroutines are mostly upper case)? More than one author is the simplest answer.

One Perl usage caught my eye, the construct unless is rarely used and often recommended against. Without a lot more code being available for analysis there are no obviously conclusions to draw from this usage (apart from it being an indicator of somebody who knows Perl well, most mainstream languages do not support this construct and developers have to use a ‘positive’ construct containing a negated condition rather than a ‘negative’ construct containing a positive condition).

Compiler writing in the next decade

December 22, 2009 No comments

What will be the big issues in compiler writing in the next decade? Compilers sit between languages and hardware, with the hardware side usually providing the economic incentive.

Before we set off to follow the money, what about the side that developers prefer to talk about. The last decade has not seen any convergence to a very small number of commonly used languages, if anything there seems to have been a divergence with more languages in widespread use. I will not attempt to predict whether there will be a new (in the sense of previously limited to a few research projects) concept that is widely integrated and used in many languages (i.e., the integrating of object-oriented features into languages in the 90s).

Where is hardware going?

  • Moore’s law stops being followed. Moore’s law is an economic one that has a number of technical consequences (e.g., less power consumed and until recently increasing clock rates). Will the x86 architecture evolution dramatically slow down once processor manufacturers are no longer able to cram more transistors onto the same amount of chip real estate? Perhaps processor designers will start looking to compiler writers to suggest functionality that could be made use of by compilers to generate more optimal code. To date, my experience of processor designers is that they look to Moore’s law to get a ‘free’ performance boost.

    There are a number of things a compiler code tell the processor, such as when a read or write to a cache line is the last one that will occur for a long time (enabling that line to be moved to the top of the reuse list).

  • Not plugged into the mains. When I made a living writing optimizers, the only two optimizations choices were code size and performance. There are a surprising number of functional areas in which a compiler, given processor support, can potentially generate code that consumes less power. More on this issue in a later post.
  • More than one processor. Figuring out how to distribute a program across multiple, loosely coupled, processors remains a difficult research topic. If anybody ever comes up with a solution to this problem, it might make more commercial sense for them to keep it secret, selling a compiling service rather than selling compilers.
  • Application Specific Instruction-set Processors. Most processors in embedded systems only ever run a single program. The idea of each program being executed on a processor optimized to its requirements sounds seductive. At the moment the economics are such that it is cheaper to take an existing, very low cost, processor and shoe-horn the application onto it. If the economics change, the compiler used for each processor is likely to be automatically generated.

Enough of the hardware, this site is supposed to be about code:

  • New implementation techniques. These include GLR parsing and genetic algorithms to improve the generated code quality. The general availability of development machines containing more than 4G of memory will make it worthwhile for compiler writers to implement more whole program optimizations (which are currently being hemmed in by storage limits)
  • gcc will continue its rise to world domination. The main force at work here is not the quality of gcc, but the disappearance of the competition. Compiler writing is not a big bucks business, and compiler companies are regularly bought up by much larger hardware outfits looking to gain some edge. A few years go by, plans change, the compiler group are not making enough profit to warrant the time being spent on them by upper management, and they are closed down. One less compiler vendor and a bunch of developers are forced to migrate to another compiler, which may or may not be gcc.
  • Figuring out what the developer meant to write based on what they actually wrote, and some mental model of software developers, is my own research interest. This is somewhat leading edge stuff, in other words, nothing major has been achieved so far. Knowledge of developer intent looks like it will open the door to whole classes of new optimization techniques.

Parsing Fortran 95

December 20, 2009 No comments

I have been looking at doing some dimensional analysis of the Climategate code and so needed a Fortran parser.

The last time I used Fortran in anger the modern compilers were claiming conformance to the 1977 standard and since then we have had Fortran 90 (with a minor revision in 95) and Fortran 03. I decided to take the opportunity to learn something about the new features by writing a Fortran parser that did not require a symbol table.

The Eli project had a Fortran 90 grammar that was close to having a form acceptable to bison and a few hours editing and debugging got me a grammar containing 6 shift/reduce conflicts and 1 reduce/reduce conflict. These conflicts looked like they could all be handled using glr parsing. The grammar contained 922 productions, somewhat large, but I was only interested in actively making use of parts of it.

For my lexer I planned to cut and paste an existing C/C++/Java lexer I have used for many projects. Now this sounds like a fundamental mistake, these languages treat whitespace as being significant while Fortran does not. This important difference is illustrated by the well known situation where a Fortran lexer needs to lookahead in the character stream to decide whether the next token is the keyword do or the identifier do5i (if 1 is followed by a comma it must be a keyword):

      do 5 i = 1 , 10
      do 5 i = 1 . 10        ! assign 1.10 to do5i
5     continue

In my experience developers don’t break up literals or identifier names with whitespace, and so I planned to mostly ignore the whitespace issue (it would simplify things if some adjacent keywords were merged to create a single keyword).

In Fortran the I/O is specified in the language syntax, while in C like languages it is a runtime library call involving a string whose contents are interpreted at runtime. I decided to ignore I/O statements by skipping to the end of line (Fortran is line oriented).

Then the number of keywords hit me, around 190. Even with the simplifications, I had made writing a Fortran lexer look like it would be a lot of work; some of the keywords only had this status when followed by a = and I kept uncovering new issues. Cutting and pasting somebody else’s lexer would probably also involve a lot of work.

I went back and looked at some of the Fortran front ends I had found on the Internet. The GNU Fortran front-end was a huge beast and would need serious cutting back to be of use. moware was written in Fortran and used the traditional six character abbreviated names seen in ‘old-style’ Fortran source and not a lot of commenting. The Eli project seemed a lot more interested in the formalism side of things, and Fortran was just one of the languages they claimed to support.

The Open Fortran Parser looked very interesting. It was designed to be used as a parsing skeleton that could be used to produce tools that processed source, and already contained hooks that output diagnostic output when each language production was reduced during a parse. Tests showed that it did a good job of parsing the source I had, although there was one vendor extension used quite often (and not documented in their manual). The tool source, in Java, looked straightforward to follow, and it was obvious where my code needed to be added. This tool was exactly what I needed 🙂

Information content of expressions

December 11, 2009 No comments

Software developers read source code to obtain information. How might the information content of source code be quantified?

Both of the following functions assign the same value to x and if that is the only information a reader of that code is interested in, then the information content of both assignment statements could be said to be the same.

int foo(void)
{
x = 5;
...
}
 
int bar(void)
{
x = 2 + 3;
...

A reader seeking deeper understanding of the above code would ask why the value 5 is built from two values in bar. One reason might be that the author of the function wanted to explicitly call out background information about how the value 5 was derived (this is often done using symbolic names, but the use of literals themselves is sometimes encountered). Perhaps the author of foo did not see the need to expose this information or perhaps the shared value is purely coincidental.

If the two representations denote the same quantity doesn’t the second have a greater information content for a reader seeking deeper understanding?

In the following example:

... x + y & z ...
 
...
 
... num_red + num_white & lower_bits ...

an experienced developer with a knowledge of English is likely to interpret the expression as adding the number of occurrences of two quantities and using bit-wise AND to extract the lower bits. For some readers the second expression has a higher information content. Would use of the names number_of_red further increase the information content?

In the following example the first expression has not added any information that was not already present in the first expression above (except perhaps that the author was not certain of the precedence or perhaps did not expect subsequent readers to be certain).

... ( x + y ) & z ...
 
...
 
... x + ( y & z ) ...

The second expression uses parenthesis to achieve an operand/operator binding that is different from the default. Has this changed the information content of the expression?

There is experimental evidence that developers extract information from the names of variables to help them make decisions about operator precedence. To me the name all_32_bits_one suggests a sequence of bits and I would expect such a representation to be associated with the bit-wise AND operator, not binary plus. With no knowledge of the relative precedence of the two operators in the following expression the name of the middle operand would cause me to misinterpret the code. Does this change the information content of the expression? Does knowledge of the experimental evidence and the correct operator precedence change the information content (i.e., there is a potential fault in the code because the author may have assumed the incorrect precedence)?

... num_red + all_32_bits_one & sign_bit ...

There is experimental evidence that people use the amount of whitespace appearing between operands and their operators to visually highlight operator precedence

The relative quantities of whitespace used in the following two expressions appear to tell very different stories. Do the two expressions have a different information content?

... x  +  y & z ...
 
...
 
... x + y  &  z ...

The idea of measuring the information content of source code is very enticing. However, an accurate measure requires knowledge of the kind of information a reader is trying to obtain and of information that already exists in their brain.

Another question is the easy with which information can be extracted from code. Something that might be labeled as readability, except that readability has connotations of there being an abundant supply of information to extract.

Christmas books for 2009

December 7, 2009 No comments

I thought it would be useful to list the books that gripped me one way or another this year (and may be last year since I don’t usually track such things closely); perhaps they will give you some ideas to add to your Christmas present wish list (please make your own suggestions in the Comments). Most of the books were published a few years ago, I maintain piles of books ordered by when I plan to read them and books migrate between piles until eventually read. Looking at the list I don’t seem to have read many good books this year, perhaps I am spending too much time reading blogs.

These books contain plenty of facts backed up by numbers and an analytic approach and are ordered by physical size.

The New Science of Strong Materials by J. E. Gordon. Ideal for train journeys since it is a small book that can be read in small chunks and is not too taxing. Offers lots of insight into those properties of various materials that are needed to build things (‘new’ here means postwar).

Europe at War 1939-1945 by Norman Davies. A fascinating analysis of the war from a numbers perspective. It is hard to escape the conclusion that in the grand scheme of things us plucky Brits made a rather small contribution, although subsequent Hollywood output has suggested otherwise. Also a contender for a train book.

Japanese English language and culture contact by James Stanlaw. If you are into Japanese culture you will love this, otherwise avoid.

Evolutionary Dynamics by Martin A. Nowak. For the more mathematical folk and plenty of thought power needed. Some very powerful general results from simple processes.

Analytic Combinatorics by Philippe Flajolet and Robert Sedgewick. Probably the toughest mathematical book I have kept at (yet to get close to the end) in a few years. If number sequences fascinate you then give it a try (a pdf is available).

Probability and Computing by Michael Mitzenmacher and Eli Upfal. For the more mathematical folk and plenty of thought power needed. Don’t let the density of Theorems put you off, the approach is broad brush. Plenty of interesting results with applications to solving problems using algorithms containing a randomizing component.

Network Algorithmics by George Varghese. A real hackers book. Not so much a book about algorithms used to solve networking problems but a book about making engineering trade-offs and using every ounce of computing functionality to solve problems having severe resource and real-time constraints.

Virtual Machines by James E. Smith and Ravi Nair. Everything you every wanted to know about virtual machines and more.

Biological Psychology by James W. Kalat. This might be a coffee table book for scientists. Great illustrations, concise explanations, the nuts and bolts of how our bodies runs at the protein/DNA level.

Does the Climategate code produce reliable output?

November 30, 2009 No comments

The source of several rather important commercial programs have been made public recently, or to be more exact programs whose output is important (i.e., the Sequoia voting system and code and data from the Climate Research Unit at University of East Anglia the so called ‘Climategate’ leak). While many technical commentators have expressed amazement at how amateurish the programming appears to be, apparently written with little knowledge of good software engineering practices or knowledge of the programming language being used, those who work on commercial projects know that low levels of software engineering/programming competence is the norm.

The emails included in the Climategate leak provide another vivid example, if one were needed, of why scientific data should be made publicly available; scientists are human and are sometimes willing to hide data that does not fit their pet theory or even fails to validate their theory at all.

The Climategate source has only only recently become available and existing technical commentary has been derived from embarassing comments and the usual complaint about not using the right programming language (Fortran is actually a good choice of language for this problem, it is widely used by climatology researchers and a non-professional programmer is probably makes best of their time by using the one language they know tolerably well rather than attempting to use a new language that nobody else in the research group knows).

An important quality indicator of the leaked software was what was not there, test cases (at least I could not find any). How do we know that a program’s output is correct? One way to gain some confidence in a program’s correctness is to process data for which the correct output is known. This blindness to the importance of program level correctness testing is something that I often encounter in people who are subject area experts rather than professional programmers; they believe that if the output has the form they are expecting it must be correct and will sometimes add ‘faults’ to ‘fix’ output that deviates from what they are expecting.

A quick visual scan through the source showed a tale of two worlds, one of single letter identifier names and liberal use of goto, and the other of what looks like meaningful names, structured code and a non-trivial number of comments. The individuals who have contributed to the code base obviously have very different levels of coding ability. Not having written any Fortran in anger for over 15 years my ability to estimate the impact of more subtle coding practices has atrophied.

What kind of faults might a code review look for in these programs? Common coding errors such as using uninitialized variables and incorrect argument passing are obvious choices and their are tools available to check for these kinds of error. A much more insidious kind of error, which requires people with the mathematical expertise to spot, is created by the approximate nature of floating-point arithmetic.

The source is not huge, but not small either, consisting of around 64,000 lines of Fortran and 16,000 lines of IDL (a language designed for interactive data analysis which to my untrained eye looks a lot like MATLAB). There was no obvious support for building the source included within the leaked files (e.g., no makefiles) and my attempt to manually compile using the GNU Fortran compiler failed miserably. So I cannot say anything reliable about the compiler output warnings.

To me the complete lack of test cases implies that the Climategate code does not produce reliable output. Comments in the code such as ***** APPLIES A VERY ARTIFICIAL CORRECTION FOR DECLINE********* suggests that the authors were willing to patch the code to produce output that matched their expectations; this is the mentality of somebody for whom code correctness is not an important issue and if they don’t believe their code is correct then I don’t either.

Source code in itself is rarely that important, although it might have been expensive to create. The real important information in the leaked files is the climate data. Now that this is available others can apply their analysis skills to provide an interpretation to what, if anything statistically reliable, it is telling us.

Software maintenance via genetic programming

November 27, 2009 No comments

Genetic algorithms have been used to find solution to a wide variety of problems, including compiler optimizations. It was only a matter of time before somebody applied these techniques to fixing faults in source code.

When I first skimmed the paper “A Genetic Programming Approach to Automated Software Repair” I was surprised at how successful the genetic algorithm was, using as it did such a relatively small amount of cpu resources. A more careful reading of the paper located one very useful technique for reducing the size of the search space; the automated software repair system started by profiling the code to find out which parts of it were executed by the test cases and only considered statements that were executed by these tests for mutation operations (they give a much higher weighting to statements only executed by the failing test case than to statements executed by the other tests; I am a bit surprised that this weighting difference is worthwhile). I hate to think of the amount of time I have wasted trying to fix a bug by looking at code that was not executed by the test case I was running.

I learned more about this very interesting system from one of the authors when he gave the keynote at a workshop organized by people associated with a source code analysis group I was a member of.

The search space was further constrained by only performing mutations at the statement level (i.e., expressions and declarations were not touched) and restricting the set of candidate statements for insertion into the code to those statements already contained within the code, such as if (x != NULL) (i.e., new statements were not randomly created and existing statements were not modified in any way). As measurements of existing code show most uses of a construct are covered by a few simple cases and most statements are constructed from a small number of commonly used constructs. It is no surprise that restricting the candidate insertion set to existing code works so well. Of course no fault fix that depends on using a statement not contained within the source will ever be found.

There is ongoing work looking at genetic modifications at the expression level. This
work shares a problem with GA driven test coverage algorithms; how to find ‘magic numbers’ (in the case of test coverage the magic numbers are those that will cause a controlling expression to be true or false). Literals in source code, like those on the web, tend to follow a power’ish law but the fit to Benford’s law is not good.

Once mutated source that correctly processes the previously failing test case, plus continuing to pass the other test cases, has been generated the code is passed to the final phase of the automated software repair system. Many mutations have no effect on program behavior (the DNA term intron is sometimes applied to them) and the final phase removes any of the added statements that have no effect on test suite output (Westley Weimer said that a reduction from 50 statements to 10 statements is common).

Might the ideas behind this very interesting research system end up being used in ‘live’ software? I think so. There are systems that operate 24/7 where faults cost money. One can imagine a fault being encountered late at night, a genetic based system fixing the fault which then updates the live system, the human developers being informed and deciding what to do later. It does not take much imagination to see the cost advantages driving expensive human input out of the loop in some cases.

An on-going research topic is the extent to which a good quality test suite is needed to ensure that mutated fault fixes don’t introduce new faults. Human written software is known to often be remarkably tolerant to the presence of faults. Perhaps ensuring that software has this characteristic is something that should be investigated.