Expected variability in a program’s SLOC

October 2, 2017

If 10 people independently implement the same specification in the same language, how much variation will there be in the length of their programs (measured in lines of code)?

The data I have suggests that the standard deviation of program length is one quarter of the mean length, e.g., 10k mean length, 2.5k standard deviation.
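
As a back-of-the-envelope illustration (the SLOC counts below are invented for this example, not taken from my samples), the claimed ratio is just the coefficient of variation of program length:

# Hypothetical SLOC counts for 10 independent implementations of one specification
sloc <- c(7000, 12500, 9000, 14500, 10500, 6800, 13800, 9500, 11000, 7400)
mean(sloc)             # 10200, the mean length
sd(sloc)               # about 2770, roughly a quarter of the mean
sd(sloc) / mean(sloc)  # about 0.27, the coefficient of variation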

The plot below (code+data) shows six points from the samples I have. The point in the bottom left is based on 6,300 C programs from a programming contest question requiring solutions to the 3n+1 problem, and one of the points on the right comes from five Pascal compilers for the same processor.

[Plot: mean vs standard deviation of sample program SLOC]

Multiple implementations of the same specification, in the same language, are very rare. If you know of any, please let me know.


Experimental method for measuring benefits of identifier naming

September 28, 2017

I recently came across a very interesting experiment in Eran Avidan’s Master’s thesis. Regular readers will know of my interest in identifiers; while everybody agrees that identifier names have a significant impact on the effort needed to understand code, reliably measuring this impact has proven to be very difficult.

The experimental method looked like it would have some impact on subject performance, but I was not expecting a huge impact. Avidan’s advisor was Dror Feitelson, who kindly provided the experimental data, answered my questions and provided useful background information (Dror is also very interested in empirical work and provides a pdf of his book+data on workload modeling).

Avidan asked subjects to figure out what a particular method did, timing how long it took them to work this out. In the control condition a subject saw the original method; in the experimental condition local and parameter names were replaced by single letter identifiers (in both conditions the method name was replaced by xxx). The hypothesis was that subjects would take longer for methods modified to use ‘random’ identifier names.

A wonderfully simple idea that does not involve a lot of experimental overhead and ought to be runnable under a wide variety of conditions, plus the difference in performance is very noticeable.

The think aloud protocol was used, i.e., subjects were asked to speak their thoughts as they processed the code. Having to do this will slow people down, but has the advantage of helping to ensure that a subject really does understand the code. An overall slower response time is not important because we are interested in differences in performance.

Each of the nine subjects sequentially processed six methods, with the methods randomly assigned as controls or experimental treatments (of which there were two, locals first and parameters first).

The procedure, when a subject saw a modified method, was as follows: the subject was asked to explain the method’s purpose; once an answer was given (or 10 mins had elapsed), either the local or the parameter names were revealed and the subject again had to explain the method’s purpose; when an answer was given, the names of both locals and parameters were revealed and a final answer recorded. The time taken for the subject to give a correct answer was recorded.

The summary output of a model fitted using a mixed-effects model is at the end of this post (code+data; original experimental materials). There are only enough measurements to have subject as a random effect on the treatment; no order of presentation data is available to look for learning effects.

Subjects took longer on the modified methods. When parameters were revealed first, subjects were 268 seconds slower (on average), and when locals were revealed first, 342 seconds slower (the standard deviation of the between-subject differences was 187 and 253 seconds, respectively; that this is smaller than the treatment effect is surprising, perhaps a consequence of the progressive reveal of information helping the slower performers).

Why is the slowdown smaller when parameter names are revealed first? My thoughts: parameter names (if well-chosen) provide clues about what the incoming values represent, useful information for figuring out what a method does. Locals are somewhat self-referential, in that they hold local information that is often derived from parameters as initial values.

What other factors could impact subject performance?

The number of occurrences of each name in the body of the method provides an opportunity to deduce information; so I think the time to figure out what the method does should be less when there are many uses of locals/parameters, compared to when there are few.

The ability of subjects to recognize what the code does is also important, i.e., subject code reading experience.

There are lots of interesting possibilities that can be investigated using this low cost technique.
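
For reference, the summary below comes from a model of the following form; a minimal sketch of the call (assuming the lme4 package and the idxx data frame from the code+data link, with columns matching the output):

library(lme4)  # provides lmer()

# response:  time to give a correct answer (seconds)
# func:      which of the six methods was shown
# treatment: control, "locals first" or "parameters first"
# subject:   subject identifier, with a random effect on the treatment
mod <- lmer(response ~ func + treatment + (treatment | subject), data = idxx)
summary(mod)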

Linear mixed model fit by REML ['lmerMod']
Formula: response ~ func + treatment + (treatment | subject)
   Data: idxx
 
REML criterion at convergence: 537.8
 
Scaled residuals:
     Min       1Q   Median       3Q      Max
-1.34985 -0.56113 -0.05058  0.60747  2.15960
 
Random effects:
 Groups   Name                      Variance Std.Dev. Corr
 subject  (Intercept)               38748    196.8
          treatmentlocals first     64163    253.3    -0.96
          treatmentparameters first 34810    186.6    -1.00  0.95
 Residual                           43187    207.8
Number of obs: 46, groups:  subject, 9
 
Fixed effects:
                          Estimate Std. Error t value
(Intercept)                  799.0      110.2   7.248
funcindexOfAny              -254.9      126.7  -2.011
funcrepeat                  -560.1      135.6  -4.132
funcreplaceChars            -397.6      126.6  -3.140
funcreverse                 -466.7      123.5  -3.779
funcsubstringBetween        -145.8      125.8  -1.159
treatmentlocals first        342.5      124.8   2.745
treatmentparameters first    267.8      106.0   2.525
 
Correlation of Fixed Effects:
            (Intr) fncnOA fncrpt fncrpC fncrvr fncsbB trtmntlf
fncndxOfAny -0.524
funcrepeat  -0.490  0.613
fncrplcChrs -0.526  0.657  0.620
funcreverse -0.510  0.651  0.638  0.656
fncsbstrngB -0.523  0.655  0.607  0.655  0.648
trtmntlclsf -0.505 -0.167 -0.182 -0.160 -0.212 -0.128
trtmntprmtf -0.495 -0.184 -0.162 -0.184 -0.228 -0.213  0.673

An Almanac of the Internet

September 22, 2017

My search for software engineering data has turned me into a frequent buyer of second-hand computer books, many costing less than the postage of £2.80. When the following suggestion popped up alongside a search, I could not resist; there must be numbers in there!

[Image: cover of the Internet Almanac]

The concept of an Almanac will probably be a weird idea to readers who grew up with search engines and Wikipedia. But yes, many years ago, people really did make a living by manually collecting information and selling it in printed form.

One advantage of the printed form is that updating it requires a new copy, so the old copy lives on unchanged (unlike web pages); the disadvantage is taking up physical space (one day I will probably abandon this book in a British rail coffee shop).

Where did Internet users hang out in 1997?

[Image: top websites of 1997]

The history of the Internet, as it appeared in 1997.

[Image: Internet history as viewed from 1997]

Of course, a list of web sites is an essential component of an Internet Almanac:

[Image: website list from 1997]


Investing in the gcc C++ front-end

September 12, 2017

I recently found out that RedHat are investing in improving the C++ front-end of gcc, i.e., management have assigned developers to work in this area. What’s in it for RedHat? I’m told there are large companies (financial institutions feature) who think that using some of the features added to recent C++ standards (these have been appearing on a regular basis) will improve the productivity of their developers. So, RedHat are hoping this work will boost their reputation and increase their sales to these large companies. As an ex-compiler guy (ex- in the sense of being promoted to higher levels that require I don’t do anything useful), I am always in favor of companies paying people to work on compilers; go RedHat.

Is there any evidence that features that have been added to any programming language improved developer productivity? The catch to this question is defining programmer productivity. There have been several studies showing that if productivity is defined as number of assembly language lines written per day, then high level languages are more productive than assembler (the lines of assembler generated by the compiler were counted, which is rather compiler dependent).

Of the 327 commits made this year to the gcc C++ front-end, by 29 different people, 295 were made by one of 17 people employed by RedHat (over half of these commits were made by two people and there is a long tail; nine people each made fewer than four commits). Measuring productivity by commit counts has plenty of flaws, but has the advantage of being easy to do (thanks Jonathan).
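
A minimal sketch of the tabulation (assuming a commits.csv file with one row per front-end commit made this year and an author column; not the actual script used):

# commits.csv is assumed to hold one row per gcc C++ front-end commit made this year,
# with an 'author' column giving the committer's name
commits <- read.csv("commits.csv")
author_counts <- sort(table(commits$author), decreasing = TRUE)
length(author_counts)    # number of different committers
head(author_counts, 2)   # the two most prolific committers
sum(author_counts < 4)   # committers making fewer than four commits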


The fuzzy line between reworking and enhancing

August 29, 2017

One trick academics use to increase their publication count is to publish very similar papers in different conferences/journals; they essentially plagiarize themselves. This practice is frowned upon, but unless referees spot the ‘duplication’, it is difficult to prevent such plagiarized versions being published. Sometimes the knock-off paper will include additional authors and may not include some of the original authors.

How do people feel about independent authors publishing a paper where all the interesting material was derived from someone else’s paper, i.e., no joint authors? I have just encountered such a case in empirical software engineering.

“Software Cost Estimation: Present and Future” by Siba N. Mohanty from 1981 (cannot find a non-paywall pdf via Google; must exist because I have a copy) has been reworked to create “Cost Estimation: A Survey of Well-known Historic Cost Estimation Techniques” by Syed Ali Abbas and Xiaofeng Liao and Aqeel Ur Rehman and Afshan Azam and M. I. Abdullah (published in 2012; pdf here); they cite Mohanty as the source of their data, some thought has obviously gone into the reworked material, which I found useful, and there is a discussion of techniques created since 1981.

What makes the 2012 paper stand out as interesting is the depth of analysis of the 1970s models and the data, all derived from the 1981 paper. The analysis of later models is not as interesting and does not include any data.

The 2012 paper did ring a few alarm bells (which rang a lot more loudly after I read the 1981 paper):

  • Why was such a well-researched and interesting paper published in such an obscure (at least to me) journal? I have encountered such cases before and had email conversations with the author(s). The well-known journals have not always been friendly towards empirical research, so an empirical paper appearing in a less than stellar publication is not unusual.

    As regular readers will know, I am always on the look-out for software engineering data and am willing to look far and wide. I judge a paper by its content, not the journal it was published in.

  • Why, in 2012, were researchers comparing effort estimation models proposed in the 1970s? Well, I am, so why not others? It did seem odd that I could not track down papers on some of the models cited, perhaps the pdfs had disappeared since 2012??? I think I just wanted to believe others were interested in what I was interested in.

What now? Retraction Watch offers some advice.

The Journal of Emerging Trends in Computing and Information Sciences has an ethics page; I will email them a link to this post and see what happens (the article in question is listed as their second most cited article last year, with 19 citations).

Microcomputers ‘killed’ Ada

August 25, 2017

In the mid-70s the US Department of Defense decided it could save lots of money by getting all its contractors to write code in the same programming language. In February 1980 a language was chosen, Ada, but by the end of the decade the DoD had snatched defeat from the jaws of victory; what happened?

I think microcomputers are what happened; these created whole new market ecosystems, some of which were much larger than the ecosystems that mainframes and minicomputers sold into.

These new ecosystems sucked up nearly all the available software developer mind-share; the DoD went from being a major employer of developers to a niche player. Developers did not want a job using Ada because they thought that being type-cast as Ada programmers would overly restrict their future job opportunities; Ada was perceived as a DoD-only language (actually there was so little Ada code in the DoD that only by counting new project starts could it get any serious ranking).

Lots of people were blindsided by the rapid rise (to world domination) of microcomputers. Compilers could profitably be sold (in some cases) for hundreds or thousands of dollars/pounds because the markets were large enough for this to be economically viable. In the DoD ecosystems, compilers had to be sold for thousands or hundreds of thousands of dollars/pounds because the markets were small. Micros were everywhere and being programmed in languages other than Ada; cheap Ada compilers arrived after today’s popular languages had taken off. There is no guarantee that cheap compilers would have made Ada a success, but they would have ensured the language was a serious contender in the popularity stakes.

By the start of the 1990s, Ada supporters were reduced to funding studies to produce glowing reports of the advantages of Ada compared to C/C++ and how Ada had many more compilers, tools and training than C++. Even the 1991 mandate “… where cost-effective, all Department of Defense software shall be written in the programming language Ada, in the absence of special exemption by an official designated by the Secretary of Defense.” failed to have an impact and was withdrawn in 1997.

The Ada mandate was cancelled as the rise of the Internet created even bigger markets, which attracted developer mind-share towards even newer languages, further reducing the comparative size of the Ada niche.

Astute readers will notice that I have not said anything about the technical merits of Ada, compared to other languages. Like all languages, Ada has its fanbois; these are essentially much older versions of the millennial fanbois of the latest web languages (e.g., Go and Rust). There is virtually no experimental evidence that any feature of any language is better/worse than any feature in any other language (a few experiments show weak support for stronger typing). To its credit, the DoD did fund a few studies, but these used small samples (there was not yet enough Ada usage to make larger samples possible) that were suspiciously glowing in their support of Ada.


We hereby retract the content of this paper

August 17, 2017

Yesterday I came across a paper in software engineering that had been retracted, the first time I had encountered such a paper (I had previously written about how software engineering is a great discipline for an academic fraudster).

Having an example of the wording used by the IEEE to describe a retracted paper (i.e., “this paper has been found to be in violation of IEEE’s Publication Principles”), I could search for more. I get 24,400 hits listed when “software” is included in the search, but clicking through the pages there are just 71 actual results.

A search of Retraction Watch using “software engineering” returns nine hits, none of which appear related to a software paper.

I was beginning to think that no software engineering papers had been retracted; now I know of one, and if I ever become really interested, the required search terms are known.


Two approaches to arguing the 1969 IBM antitrust case

August 16, 2017

My search for software engineering data has made me a regular customer of second-hand book sellers; a recent acquisition is: “Big Blue: IBM’s use and abuse of power” by Richard DeLamarter, which contains lots of interesting sales and configuration data for IBM mainframes from the first half of the 1960s.

DeLamarter’s case, that IBM systematically abused its dominant market position, looked very convincing to me, but I saw references to work by Franklin Fisher (and others) that, it was claimed, contained arguments for IBM’s position. Keen to find more data and hear alternative interpretations of the data, I bought “Folded, Spindled, and Mutilated” by Fisher, McGowan and Greenwood (by far the cheapest of the several books that have been written on the subject).

The title of the book, Folded, Spindled, and Mutilated, is an apt description of the arguments contained in the book (which is also almost completely devoid of data). Fisher et al obviously recognized the hopelessness of arguing IBM’s case and spent their time giving general introductions to various antitrust topics, arguing minor points or throwing up various smoke-screens.

An example of the contrasting approaches is the calculation of market share. In order to calculate market share, the market has to be defined. DeLamarter uses figures from internal IBM memos (top management were obsessed with maintaining market share) and quotes IBM lawyers’ advice to management on the phrases to use (e.g., ‘Use the term market leadership, … avoid using phrasing such as “containment of competitive threats” and substitute instead “maintain position of leadership.”‘). Fisher et al arm-wave at length and conclude that the appropriate market is the entire US electronic data processing industry (the more inclusive the market used, the lower the overall share that IBM will have; using this definition IBM’s market share drops from 93% in 1952 to 43% in 1972, and a full page graph shows this decline); the existence of the IBM management memos is not mentioned.

Why do academics risk damaging their reputation by arguing these hopeless cases (I have seen it done in other contexts)? Part of the answer is a fat pay check, but also many academics consider consulting for industry akin to supping with the devil (so they get a free pass on any nonsense spouted when “just doing it for the money”).


Books similar to my empirical software engineering book

August 7, 2017

I am sometimes asked which other books are similar to the Empirical Software Engineering book I am working on.

In spirit, the most similar book is “Software Project Dynamics” by Abdel-Hamid and Madnick, based on Abdel-Hamid’s PhD thesis. The thesis/book sets out to create an integrated model of software development projects, using system dynamics (the model can be ‘run’ to produce outputs from inputs, assuming the necessary software is available).

Building a model of the software development process requires figuring out the behavior of all the important factors, and Abdel-Hamid does a thorough job of enumerating the important factors and tracking down the available empirical work (in the 1980s). The system dynamics model, written in Dynamo, appears in an appendix (I have not been able to locate any current implementation).

In the 1980s, I would have agreed with Abdel-Hamid that it was possible to build a reasonably accurate model of software development projects. Thirty years later, I have tracked down a lot more empirical work and know a lot more about how software projects work. All this has taught me is that I don’t know enough to be able to build a model of software development projects; but I still think it is possible, one day.

There have been other attempts to build models of major aspects of software development projects (all using system dynamics), including Madachy’s PhD and later book “Software Process Dynamics”, and Buettner’s PhD (no book, yet???).

There are other books that include some combination of the words empirical, software and engineering in their title. On the whole, these are collections of edited papers, whose chapters are written by researchers promoting their latest work; there is even one that aims to teach students how to do empirical work.

Dag Sjøberg has done some interesting empirical work and is currently working on an empirical book; this should be worth a look.

“R in Action” by Kabacoff is the closest to the statistical material, but at a more general level. “The R Book” by Crawley is the R book I would recommend, but it is not at all like the material I have written.

Signed-magnitude: The integer representation of choice for IoT?

July 28, 2017

What is the best representation to use for integer values in a binary computer? I’m guessing that most people think two’s complement is the answer, because this is the representation that all the computers they know about use (the Univac 1100/2200 series uses one’s complement; I don’t know of any systems currently in use that make use of signed magnitude, pointers welcome).

The C Standard allows implementations to support two’s complement, one’s complement and signed magnitude (the Univac 1100/2200 series has a C compiler). Is it time for the C Standard to drop support for one’s complement and signed magnitude?

Why did two’s complement ‘win’ the integer representation battle and what are the chances that hardware vendors are likely to want to use a different representation in the future?

The advantage of two’s complement over the other representations is that the same hardware circuits can be used to perform arithmetic on unsigned and signed integer values. Not a big issue these days, but a major selling point back when chip real-estate was limited.

I can think of one market where signed magnitude is the ‘best representation’: extremely low power devices, such as those that extract power from the radio waves permeating the environment, or from the vibrations people generate as they move around.

Most of the power consumed by digital devices occurs when a bit flips from zero to one, or from one to zero. An application that spends most of its time processing signals that vary around zero (i.e., take both positive and negative values) will, with a two’s complement representation, experience many bit flips whenever a value changes sign, e.g., going from 0000000000000001 to 0000000000000000 to 1111111111111111 flips 17 bits; with signed magnitude, a change of sign costs only one extra bit-flip, e.g., 0000000000000001 to 0000000000000000 to 1000000000000001 flips 3 bits.
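
A minimal sketch of the bit-flip counting (my own illustration, not from the simulations mentioned below), using R’s 32-bit two’s complement integers and a hypothetical 16-bit signed magnitude encoding:

# Low 16 bits of a value as stored (R integers are 32-bit two's complement)
twos16 <- function(x) as.integer(intToBits(x))[1:16]

# Hypothetical 16-bit signed magnitude encoding: 15 magnitude bits plus a sign bit
signmag16 <- function(x) c(as.integer(intToBits(abs(x)))[1:15], as.integer(x < 0))

# Total bit flips when a sequence of values is stored using a given encoding
flips <- function(encode, values)
            sum(mapply(function(a, b) sum(encode(a) != encode(b)),
                       head(values, -1), tail(values, -1)))

signal <- c(1L, 0L, -1L)  # a signal value crossing zero
flips(twos16, signal)     # 17 flips: 1 (1 -> 0) plus 16 (0 -> -1)
flips(signmag16, signal)  # 3 flips: 1 (1 -> 0) plus 2 (0 -> -1)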

Simulations show around 30% fewer transitions for signed magnitude compared with two’s complement, for certain kinds of problems.

Signed magnitude would appear to be the integer representation of choice for some Internet-of-Things solutions.