Archive
Predictive Modeling: 15th COW workshop
I was at a very interesting workshop on Predictive Modelling and Search Based Software Engineering on Monday/Tuesday this week, and I am going to say something about the talks that interested me. The talks were recorded, and the videos will appear on the website in a few weeks. The CREST Open Workshop (COW) runs roughly once a month and the group leader, Mark Harman, is always on the lookout for speakers, do let him know if you are in the area.
- Tim Menzies talked about how models built from one data set did well on that dataset but often not nearly as well on another (i.e., local vs global applicability of models). Academics papers usually fail to point out that any results might not be applicable outside the limited domain examined, in fact they often give the impression of being generally applicable.
Me: Industry likes global solutions because it makes life simpler and because local data is often not available. It is a serious problem if, for existing methods, data on one part of a companies’ software development activity is of limited use in predicting something about a different development activity in the same company and completely useless at predicting things at a different company.
- Yuriy Brun talked about something that is so obviously a good idea, it is hard to believe that it had not been done years ago. The idea is to have your development environment be aware of what changes other software developers have made to their local copies of source files you also have checked out from version control. You are warned as soon your local copy conflicts with somebody else’s local copy, i.e., a conflict would occur if you both check in your local copy to the central repository. This warning has the potential to save lots of time by having developers talk to each about resolving the conflict before doing any more work that depends on the conflicting change.
Crystal is a plug-in for Eclipse that implements this functionality, and Visual Studio support is expected in a couple of releases time.
I have previously written about how multi-core processors will change software development tools, and I think this idea falls into that category.
- Martin Shepperd presented a very worrying finding. An analysis of the results published in 18 papers dealing with fault prediction found that the best predictor (over 60%) of agreement between results in different papers was co-authorship. That is, when somebody co-authored a paper with another person, any other papers they published were more likely to agree with other results published by that person than with results published by somebody they had not co-authored a paper with. This suggests that each separate group of authors is doing something different that significantly affects their results; this might be differences in software packages being used, differences in configuration options or tuning parameters, so something else.
It might be expected that agreement between results would depend on the techniques used, but Shepperd et al’s analysis found this kind of dependency to be very small.
An effect is occurring that is not documented in the published papers; this is not how things are supposed to be. There was lots of interest in obtaining the raw data to replicate the analysis.
- Camilo Fitzgerald talked about predicting various kinds of feature request ‘failures’ and presented initial results based on data mined from various open source projects; possible ‘failures’ included a new feature being added and later removed and significant delay (e.g., 1 year) in implementing a requested feature. I have previously written about empirical software engineering only being a few years old and this research is a great example of how whole new areas of research are being opened up by the availability of huge amounts of data on open source projects.
One hint for PhD students: It is no good doing very interesting work if you don’t keep your web page up to date, so people can find out more about it
I talked to people who found other presentations very interesting. They might have failed to catch my eye because my interest or knowledge of the subject is low or I did not understand their presentation (a few gave no background or rationale and almost instantly lost me); sometimes the talks during coffee were much more informative.
What I know of Dennis Ritchie’s involvement with C
News that Dennis Ritchie died last weekend surfaced today. This private man was involved in many ground breaking developments; I know something about one of the languages he designed, C, so I will write about that. Ritchie has written about the development of C language.
Like many language designers the book he wrote “The C Programming Language” (coauthored with Brian Kernighan in 1978) was the definitive reference for users; universally known as K&R. The rapid growth in C’s popularity led to lots of compilers being written, exposing the multiple ways it was possible to interpret some of the wording in K&R.
In 1983 ANSI set up a committee, X3J11, to create a standard for C. With one exception Ritchie was happy to keep out of the fray of standardization; on only one occasion did he feel strongly enough to step in and express an opinion and the noalias keyword disappeared from the draft (the restrict keyword surfaced in C99 as a different kind of beast).
A major contribution to the success of the C Standard was the publication of an “ANSI Standard” version of K&R (there was a red “ANSI C” stamp on the front cover and the text made use of updated constructs like function prototypes and enumeration types), its second edition in 1988. Fans of the C Standard could, and did, claim that K&R C and ANSI C were the same language (anybody using the original K&R was clearly not keeping up with the times).
Ritchie publicly admitted to making one mistake in the design of C. He thinks that the precedence of the & and | binary operators should have been greater than the == operator. I can see his point, but an experiment I ran a few years ago suggested it is amount of experience using a set of operator precedence rules that is the primary contributor to developer knowledge of the subject.
Some language designers stick with their language, enhancing it over the years (e.g., Stroustrup with C++), while others move on to other languages (e.g., Wirth with Pascal, Modula-2 and Oberon). Ritchie had plenty of other interesting projects to spend his time on and took neither approach. As far as I can tell he made little or no direct contribution to C99. As head of the research department that created Plan 9 he must have had some input to the non-Standard features of their C compiler (e.g., no support for #if and support for unnamed structures).
While the modern C world may not be affected by his passing, his ability to find simple solutions to complicated problems will be a loss to the projects he was currently working on.
Memory capacity and commercial compiler development
When I started out in the compiler business in the 80s, many commercial compilers were originally written by one person. A very good person who dedicated himself (I have never heard of a woman doing this) to the job (i.e., minimal social life) could produce a commercially viable product for a non-huge language (e.g., Fortran, Pascal, C, etc, but not C++, Ada, etc) in 12-18 months. Companies who decide to develop a compiler in-house tend to use a team of people, and take longer, because that is how they work, and they don’t want to depend on one person, and anyway such a person might not be available to them.
Commercially viable compiler development stayed within the realm of an individual effort until sometime in the early 90s. What put the single individual out of business was the growth in computer memory capacity into the hundreds of megabytes and beyond. Compilers have to be able to run within the limits of developer machines; you won’t sell many compilers that require 100M of memory if most of your customers don’t have machines with 100M+ of memory.
Code optimizers eat memory, and this prevented many optimizations that had been known about for years appearing in commercial products. Once the machines used for software development commonly contained 100M+ of memory, compiler vendors could start to incorporate these optimizations into their products.
Improvements in processor speed also helped. But developers are usually willing to let the compiler take a long time to optimize the code going into a final build, provided development compiles run at a reasonable speed.
The increase in memory capacity created the opportunity for compilers to improve, and when some did improve they made it harder for others to compete. Suddenly an individual had to spend that 12-18 months, plus many more months implementing fancy optimizations; developing a commercially viable compiler outgrew the realm of individual effort.
Some people claim that the open source model was the primary driver in killing off commercial C compiler development. I disagree. My experience of licensing compiler source was that companies preferred a commercial model they were familiar with, and reacted strongly against the idea of having to make available any changes they made to the code of an open source program. GCC (and recently llvm) succeeded because many talented people contributed fancy optimizations and these could be incorporated because developer machines contained lots of memory. If storage had not dramatically increased, gcc would probably not be the market leader it is today.
Does the UK need the PL/1 Standard?
Like everything else language standards are born and eventually die. IST/5, the UK programming language committee, is considering whether the British Standard for PL/1 should be withdrawn (there are two standards, ISO 6160:1979 which has been reconfirmed multiple times since 1979, most recently in 2008, and a standardized subset ISO 6522:1992, also last confirmed in 2008).
A language standard is born through the efforts of a group of enthusiastic people. A language standard dies because there is no enthusiast (a group of one is often sufficient) to sing its praises (or at least be willing to be a name on a list that is willing to say, every five years, that the existing document should be reconfirmed).
It is 20 years since IST/5 last had a member responsible for PL/1, but who is to say that nobody in the UK is interested in maintaining the PL/1 standard? Unlike many other programming language ISO Standards there was never an ISO SC22 committee responsible for PL/1. All of the work was done by members of the US committee responsible for programming language PL 22 (up until a few years ago this was ANSI committee X3). A UK person could have paid his dues and been involved in the US based work; I don’t have access to a list of committee meeting attendees and so cannot say for sure that there was no UK involvement.
A member of IST/5, David Muxworthy, has been trying to find somebody in the UK with an interest in maintaining the PL/1 standard. A post to the newsgroup comp.lang.pl1 eventually drew a response from a PL/1 developer who said he would not be affected if the British Standard was withdrawn.
GNU compiler development is often a useful source of information. In this case the PL/1 web page is dated 2007.
In 2008 John Klensin, the ISO PL/1 project editor, wrote: “No activities or requests for additions or clarifications during the last year or, indeed, the last decade. Both ISO 6160 and the underlying US national document, ANS X3.53-1976 (now ANSI/INCITS-53/1976), have been reaffirmed multiple times. The US Standard has been stabilized and the corresponding technical activity was eliminated earlier this year”.
It looks like the British Standard for PL/1 is not going to live past the date of its next formal review in 2013. Thirty four years would then be the time span, from publication of last standard containing new material to formal withdrawal of all standards, to outlive. I wonder if any current member of either of the C or C++ committees will live to see this happen to their work?
Automatically improving code
Compared to 20 or 30 years ago we know a lot more about the properties of algorithms and better ways of doing things often exist (e.g., more accurate, faster, more reliable, etc). The problem with this knowledge is that it takes the form of lots and lots of small specific details, not the kind of thing that developers are likely to be interested in, or good at, remembering. Rather than involve developers in the decision-making process, perhaps the compiler could figure out when to substitute something better for what had actually been written.
While developers are likely to be very happy to see what they have written behaving as accurately and reliably as they had expected (ignorance is bliss), there is always the possibility that the ‘less better’ behavior of what they had actually written had really been intended. The following examples illustrate two relatively low level ‘improvement’ transformations:
- this case is probably a long-standing fault in many binary search and merge sort functions; the relevant block of developer written code goes something like the following:
while (low <= high) { int mid = (low + high) / 2; int midVal = data[mid]; if (midVal < key) low = mid + 1 else if (midVal > key) high = mid - 1; else return mid; }
The fault is in the expression
(low + high) / 2which overflows to a negative value, and returns a negative value, if the number of items being sorted is large enough. Alternatives that don’t overflow, and that a compiler might transform the code to, include:low + ((high - low) / 2)and(low + high) >>> 1. - the second involves summing a sequence of floating-point numbers. The typical implementation is a simple loop such as the following:
sum=0.0; for i=1 to array_len sum += array_of_double[i];
which for large arrays can result in
sumlosing a great deal of accuracy. The Kahan summation algorithm tries to take account of accuracy lost in one iteration of the loop by compensating on the next iteration. If floating-point numbers were represented to infinite precision the following loop could be simplified to the one above:sum=0.0; c=0.0; for i = 1 to array_len { y = array_of_double[i] - c; // try to adjust for previous lost accuracy t = sum + y; c = (t - sum) - y; // try and gets some information on lost accuracy sum = t; }
In this case the additional accuracy is bought at the price of a decrease in performance.
Compiler maintainers are just like other workers in that they want to carry on working at what they are doing. This means they need to keep finding ways of improving their product, or at least improving it from the point of view of those willing to pay for their services.
Many low level transformations such as the above two examples would be not be that hard to implement, and some developers would regard them as useful. In some cases the behavior of the code as written would be required, and its transformed behavior would be surprising to the author, while in other cases the transformed behavior is what the developer would prefer if they were aware of it. Doesn’t it make sense to perform the transformations in those cases where the as-written behavior is least likely to be wanted?
Compilers already do things that are surprising to developers (often because the developer does not fully understand the language, many of which continue to grow in complexity). Creating the potential for more surprises is not that big a deal in the overall scheme of things.
C compiler validation is 21 today!
Today, 1 September 2011, is the 21th anniversary of the first formally validated C compilers. The three ‘equal first’ validated compilers were the Model Implementation C Checker from Knowledge Software, Topspeed C from JPI (run by the people who created Turbo Pascal) and the INMOS C compiler (derived from the Norcroft C compiler written by Alan Mycroft+others, the author of the longest response document seen during the review of the C89 draft standard).
Back in the day the British Standards Institution testing group run by John Souter were the world leaders in compiler validation and were very proactive in adding support for a new language. NIST, the equivalent US body, did not offer such a service until a few years later. Those companies in a position to have their compilers validated (i.e., the compiler passed the validation suite) were pressing BSI to be first; the ‘who is first’ issue was resolved by giving all certificates the same date (the actual validation process of a person from BSI, Neil Martin now Director of Test in the Winterop Team at Microsoft, turning up to ‘witness’ the compiler passing the tests happened several weeks earlier).
Testing C compilers was different from other language compilers in that sufficient demand existed to support commercial production and maintenance of test suites (the production of validation suites for previous language compilers had been government funded). After a review of the available test suites BSI chose to use the Plum Hall suite; after a similar review NIST chose to use the Perennial suite (I got involved in trying to figure out for NIST how well this suite covered the requirements contained in the C Standard).
For a while C compiler validation was big business (as in big fish, very small pond). But the compiler validation market is dependent on there being lots of compilers, which requires market fragmentation and to a lesser extent lots of different OSs and hardware platforms (each needing a separate validation). The 1990s saw market consolidation, gcc becoming good enough for commercial use and a shift of developer mind share to C++. Dwindling revenue resulted in BSI’s compiler validation group being shut down after a few years and NIST’s followed in 1998.
Is compiler validation relevant today? When the first C Standard was published a lot of compilers in common use had some significant behavioural differences compared to what the Standard specified. Over time these compilers have either disappeared or been upgraded (a potential customer once asked me the benefits I saw in them licensing the Knowledge Software front end and the reply to one of my responses, “you can tell your customers that the compiler is standard’s compliant”, was that this was not a benefit as they had been claiming this for years). Improvements in Intel’s x86 processor also had a hand in improving compiler Standard’s conformance; the various memory models used by the x86 processor was a huge headache for compiler writers whose products often behaved very differently under different memory models; the arrival of the Pentium with its flat 32-bit address space meant this issue disappeared over time.
These days I suspect that the major compilers targeting platforms where portability is expected (portability is often not a big expectation in the embedded world) are sufficiently compatible that developers are willing to overlook small differences with the Standard. Differences in third party libraries, GUIs and other frameworks have been the big headache for many years now.
Would the ‘platform portability’ compilers, that’s probably gcc, Microsoft, products using EDG’s front end, and perhaps llvm in the coming years, pass the latest version of the PlumHall and Perennial suites?
- The gcc team do not have access to either company’s suite. The gcc regression tests are a poor substitute for a proper compiler validation suite (even though they cost many thousands of dollars commercial compiler writers often buy both companies products because they are good value for money as a testing resource {the Fortran 78 validation suite source gives some idea of how much work is actually involved). I would expect gcc to fail some of the tests but have no idea how many or serious the failures would be.
- Microsoft have said they don’t have plans to support C99 (it took a lot of prodding to get them interested in formally validating against C90).
- I think the llvm team are in the same position as gcc, but perhaps somebody at Apple has access to one or more of the commercial suites (I don’t know).
- EDG are into standard’s conformance and I would expect them to pass both suites.
The certificate is printed on high quality, slightly yellow paper; the template wording is in a subdued gray ink while the customer information is in a very bold black ink. I don’t know whether this is to make life difficult for counterfeiters, but I could not get any half decent photographs and the color scanner had to be switched to black&white.
Validation was good for one year and I saw no worthwhile benefit in paying BSI £5,000 to renew for another year. Few people knew about the one year rule and I did not enlighten them. In the Ada compiler market the one year rule was a major problem, but lets leave that for another time.

sparse can now generate executables
How can you be confident that a source code analysis tool is correct when its analysis of a file does not result in any warning being generated? The best way I know of addressing this issue is for the analysis tool to be capable of generating an executable from the source. Executing a moderately large program requires that the translator get all sorts of complicated analysis correct and provides a huge boost to confidence of correctness compared to an analysis tool that does not have this ability.
Congratulations to the maintainers of sparse (a Linux kernel specific analysis tool started by Linus Torvalds in 2003) for becoming one of the very small number of source code analysis tools capable of generating executable programs.
The Model C Implementation is another tool capable of generating executables. I would love to be able to say that the reason for this was dedication to perfection by the project team, however, the truth is that it started life as a compiler and became an analyzer later.
Clang, the C front end+analyzer to llvm is often referred to as a static analyzer. While it does perform some static checking (like gcc does when the -Wall option is specified) a lot more checks needs to be supported before it can be considered on a par with modern static analysis tools.
GCC’s Treehydra project works via a plug-in to the compiler. This project has yet to live up to its potential, so we can delay discussion of whether it should be classified as a standalone system or an executable generator.
I cannot think of any other ‘full language’ static analysis tools capable of generating executable programs (the C-semantics tool restricts its checks to those required by the C Standard, and I think tools need to do a lot more than this to be considered static analysis tools). Corrections to my lack of knowledge welcome.
I was a little concerned that there were plans afoot to migrate sparse to becoming the build compiler for the Linux kernel. Linus answered my query by saying that this was never a goal.
Halstead’s metrics and flat-Earthers are still with us
I recently discovered a fascinating series of technical reports from the 1970s in the Purdue University e-Pubs archive that shine a surprising light on what are now known as the Halstead metrics.
The first surprises came from Halstead’s A Software Physics Analysis of Akiyama’s Debugging Data; surprising in the size of the data set used (nine measurements of four attributes), the amount of hand waving used to pluck numbers out of thin air, the substantial difference between the counting methods used then and now and the very high correlation found between various measured and calculated attributes.
I disagreed with the numbers Halstead plucked and wrote some R to check Halstead’s results and try out my own numbers. While my numbers significantly changed the effort estimation values, the high correlations between the number of faults and various source attributes remained high. Even changing from the Pearson correlation coefficient (calculating confidence bounds for this coefficient requires that the data be normally distributed, which it is not {program size is now thought to follow a power law/exponential like distribution}) to the Spearman rank correlation coefficient did not have much impact. Halstead seems to have struck luck with this data set.
What did Halstead’s colleagues at Purdue think of these metrics? The report Software Science Revisited: A Critical Analysis of the Theory and Its Empirical Support written four years after Halstead’s flurry of papers contains a lot of background material and points out the lack of any theoretical foundation for some of the equations, that the analysis of the data was weak and that a more thorough analysis suggests theory and data don’t agree. Damming stuff.
If it is known that Halstead’s metrics do not hold up why do writers of books (including Shen, Conte and Dunsmore, the authors of the above report) continue to discuss Halstead’s work? Are they treating this work like Astronomy authors treat Ptolemy’s theory (the Sun and planets revolved around the Earth), i.e., incorrect but part of history and worth a mention?
An observation in the Shen et al report, that Halstead originally proposed the metrics as a way of measuring the complexity of algorithms not programs, explains the background to the report Impurities Found in Algorithm Implementations. Halstead uses the term “impurities” and talks about the need for “purification” in his early work. In this report Halstead points out that the value of metrics for “algorithms written by students” are very different from those for the equivalent programs published in journals and goes on to list eight classes of impurity that need to be purified (i.e., removing or rewriting clumsy, inefficient or duplicate code) in order to obtain results that agree with the theory. Now we know what needs to be done to existing programs to get them to agree with the predictions made by the Halstead metrics!
Did Halstead strike lucky with the data used in his first published analysis of ‘industrial code’, obtaining a very high correlation that caused him to shift focus away from measuring algorithms to measuring programs? Unfortunately, he died soon after publishing the work for which he is now known, so he did not get to comment on how his ideas were used over the subsequent years.
Why are people still attracted to the Halstead metrics, given their poor theoretical foundations and a predictive power that is comparable with using lines of code? Is it because the idea of code volume and length are easy to understand and so are attractive (dimensionally, both of these metrics are the same, a fact that cannot hold for any self-consistent concept of volume and length)? Is it because we don’t have alternative metrics that outperform the poorly performing ones proposed by Halstead, after all Copernicus only won out because his theory gave answers that were more accurate than Ptolomy’s.
Perhaps like the flat Earthers proponents of the Halstead metrics will always be with us.
Ruby becoming an ISO Standard
The Ruby language is going through the process of becoming an ISO Standard (it has been assigned the document number ISO/IEC 30170).
There are two ways of creating an ISO Standard, a document that has been accepted by another standards’ body can be fast tracked to be accepted as-is by ISO or a committee can be set up to write the document. In the case of Ruby a standard was created under the auspices of JISC (Japanese Industrial Standards Committee) and this has now been submitted to ISO for fast tracking.
The fast track process involves balloting the 18 P-members of SC22 (the ISO committee responsible for programming languages), asking for a YES/NO/ABSTAIN vote on the submitted document becoming an ISO Standard. NO votes have to be accompanied by a list of things that need to be addressed for the vote to change to YES.
In most cases the fast tracking of a document goes through unnoticed (Microsoft’s Office Open XML being a recent high profile exception). The more conscientious P-members attempt to locate national experts who can provide worthwhile advice on the country’s response, while the others usually vote YES out of politeness.
Once an ISO Standard is published future revisions are supposed to be created using the ISO process (i.e., a committee attended by interested parties from P-member countries proposes changes, discusses and when necessary votes on them). When the C# ECMA Standard was fast tracked through ISO in 2005 Microsoft did not start working with an ISO committee but fast tracked a revised C# ECMA Standard through ISO; the UK spotted this behavior and flagged it. We will have to wait and see where work on any future revisions takes place.
Why would any group want to make the effort to create an ISO Standard? The Ruby language designers reasons appear to be “improve the compatibility between different Ruby implementations” (experience shows that compatibility is driven by customer demand not ISO Standards) and government procurement in Japan (skip to last comment).
Prior to the formal standards work the Rubyspec project aimed to create an executable specification. As far as I can see this is akin to some of the tools I wrote about a few months ago.
IST/5, the committee at British Standards responsible for language standards is looking for UK people (people in other countries have to contact their national standards’ body) interested in getting involved with the Ruby ISO Standard’s work. I am a member of IST/5 and if you email me I will pass your contact details along to the chairman.
Estimating the reliability of compiler subcomponent
Compiler stress testing can be used for more than finding bugs in compilers, it can also be used to obtain information about the reliability of individual components of a compiler. A recent blog post by John Regehr, lead investigator for the Csmith project, covered a proposal to improve an often overlooked aspect of automated compiler stress testing (removing non-essential code from a failing test case so it is small enough to be acceptable in a bug report; attaching 500 lines of source to a report in a sure fire way for it to be ignored) triggered this post. I hope that John’s proposal is funded and it would be great if the researchers involved also received funding to investigate component reliability using the data they obtain.
One process for estimating the reliability of the components of a compiler, or any other program, is:
- divide the compiler into a set of subcomponents. These components might be a collection of source files obtained through cluster analysis of the source, obtained from a functional analysis of the implementation documents or some other means,
- count the number of times each component executes correctly and incorrectly (this requires associating bugs with components by tracing bug fixes to the changes they induce in source files; obtaining this information will consume the largest amount of the human powered work) while processing lots of source. The ratio of these two numbers, for a given component, is an estimate of the reliability of that component.
How important is one component to the overall reliability of the whole compiler? This question can be answered if the set of components is treated as a Markov chain and the component transition probabilities are obtained using runtime profiling (see Large Empirical Case Study of Architecture–based Software Reliability by Goševa-Popstojanova, Hamill and Perugupalli for a detailed discussion).
Reliability is a important factor in developers’ willingness to enable some optimizations. Information from a component reliability analysis could be used to support an option that only enabled optimization components having a reliability greater than a developer supplied value.
The one big threat to validity of this approach is that stress tests are not representative of typical code. One possibility is to profile the compiler processing lots of source (say of the order of a common Linux distribution) and merge the transition probabilities, probably weighted, to those obtained from stress tests.
Recent Comments