The changing shape of code in the next decade
I think there are two forces that will have a major impact on the shape of code in the next decade:
- Asian developers. China and India each have a population that is more than twice as large as Europe and the US combined, and software development has been kick-started in these countries by a significant amount of IT outsourcing. I have one comparative data point for software developers who might be of the hacker ilk: a discussion of my C book on a Chinese blog resulted in a download volume 50% of that generated when the book appeared as a news item on Slashdot.
- Scripting languages. Software is written to solve a problem, and there are only so many packaged applications (COTS or bespoke) that can profitably be supported. Scripting languages are generally designed to operate within one application domain, e.g., Bash, numerical analysis languages such as R and graphical plotting languages such as gnuplot.
While markup languages are very widely used, they tend to be read and written by programs, not people.
Having to read code containing non-alphabetic characters is always a shock the first time. Simply having to compare two sequences of symbols for equality is hard work. My first experience of having to do this in real time was checking train station names once I had travelled outside central Tokyo and the names were no longer also given in Romaji.
其中,ul分别是bootmap_size(bit map的size),start_pfn(开始的页框) max_low_pfn(被内核直接映射的最后一个页框的页框号) ;
(Translation: "Here the ul values are, respectively, bootmap_size (the size of the bit map), start_pfn (the starting page frame) and max_low_pfn (the page frame number of the last page frame directly mapped by the kernel).")
Developers based in China and India have many cultural conventions that differ from those of the West (and from each other's), and I suspect that these will affect the code they write (my favorite potential effect involves treating time vertically rather than horizontally). Many coding conventions used by a given programming language community exist because of the habits adopted by early users of that language, these being passed on to subsequent users. How many Chinese and Indian developers are being taught to use these conventions, or are influential teachers spreading different ones? I don't have a problem with different conventions being adopted, other than that having different communities using different conventions increases the cost for one community to adopt another community's source.
Programs written in a scripting language tend to be much shorter (often being contained within a single file) and make use of much more application knowledge than programs written in general purpose languages. Their data flow tends to be relatively simple (e.g., some values are read/calculated and passed to a function that has some external effect), while the relative complexity of the control flow seems to depend on the language (I only have a few data points for both assertions).
Because of their specialized nature, most scripting languages will not have enough users to support any kind of third party support tool market, e.g., testing tools. Does this mean that programs written in a scripting language will contain proportionally more faults? Perhaps their small size means that only a small number of execution paths are possible, and these are quickly exercised by everyday usage (I don’t know of any research on this topic).
The Met Office ‘climategate’ Perl code
In response to the Climategate goings on, the UK Meteorological Office has released a subset of its land surface climate station records and some code to process them. The code consists of 397 lines of Perl (station_gridder.perl and a second script). The layout varies in a way that suggests the source has not been through a pretty printer, and kind of implies more than one person doing the editing. And why are some variable names capitalized and others not (the names in subroutine read_station are all lower case, while the names in the surrounding subroutines are mostly upper case)? More than one author is the simplest answer.
One Perl usage caught my eye: the unless construct, which is rarely used and often recommended against. Without a lot more code being available for analysis there are no obvious conclusions to draw from this usage, apart from it being an indicator of somebody who knows Perl well; most mainstream languages do not support this construct, and developers have to use a 'positive' construct containing a negated condition rather than a 'negative' construct containing a positive condition.
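As a minimal sketch of that contrast (the function name and the condition are invented for illustration, not taken from the Met Office code), Perl's unless form has to be expressed in C as an if wrapped around a negated condition:

#include <stdio.h>

/* Perl supports the 'negative' construct directly, e.g.:
 *     unless ($count > 0) { warn "no data\n"; }
 * C only offers a 'positive' construct containing a negated condition.
 * The function and the condition are invented for illustration. */
static void check_count(int count)
{
    if (!(count > 0))
        fprintf(stderr, "no data\n");
}

int main(void)
{
    check_count(0);   /* prints the warning */
    check_count(3);   /* prints nothing */
    return 0;
}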
Parsing Fortran 95
I have been looking at doing some dimensional analysis of the Climategate code and so needed a Fortran parser.
The last time I used Fortran in anger the modern compilers were claiming conformance to the 1977 standard, and since then we have had Fortran 90 (with a minor revision in 95) and Fortran 2003. I decided to take the opportunity to learn something about the new features by writing a Fortran parser that did not require a symbol table.
The Eli project had a Fortran 90 grammar that was close to having a form acceptable to bison, and a few hours of editing and debugging got me a grammar containing 6 shift/reduce conflicts and 1 reduce/reduce conflict. These conflicts looked like they could all be handled using GLR parsing. The grammar contained 922 productions, somewhat large, but I was only interested in actively making use of parts of it.
For my lexer I planned to cut and paste an existing C/C++/Java lexer I have used for many projects. Now this sounds like a fundamental mistake: these languages treat whitespace as being significant, while Fortran does not. This important difference is illustrated by the well known situation where a Fortran lexer needs to look ahead in the character stream to decide whether the next token is the keyword do or the identifier do5i (if the 1 is followed by a comma it must be the keyword):
do 5 i = 1 , 10
do 5 i = 1 . 10   ! assign 1.10 to do5i
5 continue
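A lexer makes this decision by scanning ahead for a comma at the top parenthesis level after the =. The following C sketch is my own simplified illustration of that check (it is not code from any of the parsers discussed, and a real Fortran lexer has many more cases to handle):

#include <stdio.h>
#include <string.h>

/* Sketch only: decide whether a statement beginning "do 5 i = ..." is a
 * DO loop (a comma appears after the = outside any parentheses) or an
 * assignment to an identifier such as do5i (no such comma). */
static int is_do_keyword(const char *stmt)
{
    const char *p = strchr(stmt, '=');
    int depth = 0;

    if (p == NULL)
        return 0;
    for (p++; *p != '\0'; p++)
    {
        if (*p == '(')
            depth++;
        else if (*p == ')')
            depth--;
        else if (*p == ',' && depth == 0)
            return 1;   /* top-level comma => DO loop */
    }
    return 0;           /* no top-level comma => assignment */
}

int main(void)
{
    printf("%d\n", is_do_keyword("do 5 i = 1 , 10"));   /* 1: keyword do      */
    printf("%d\n", is_do_keyword("do 5 i = 1 . 10"));   /* 0: identifier do5i */
    return 0;
}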
In my experience developers don’t break up literals or identifier names with whitespace, and so I planned to mostly ignore the whitespace issue (it would simplify things if some adjacent keywords were merged to create a single keyword).
In Fortran the I/O is specified in the language syntax, while in C-like languages it is a runtime library call involving a string whose contents are interpreted at runtime. I decided to ignore I/O statements by skipping to the end of line (Fortran is line oriented).
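To illustrate the difference (the WRITE/FORMAT pair shown in the comment is generic Fortran, not code from the sources I was parsing):

#include <stdio.h>

/* In Fortran, I/O statements and their FORMAT specifications are part of
 * the language syntax, e.g. (fixed form):
 *       WRITE(*, 100) N
 *   100 FORMAT('count = ', I5)
 * In C the equivalent is an ordinary library call; the format string is
 * just character data, interpreted when the program runs. */
int main(void)
{
    int n = 42;

    printf("count = %5d\n", n);
    return 0;
}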
Then the number of keywords hit me: around 190. Even with the simplifications I had made, writing a Fortran lexer looked like it would be a lot of work; some of the keywords only had this status when followed by a =, and I kept uncovering new issues. Cutting and pasting somebody else's lexer would probably also involve a lot of work.
I went back and looked at some of the Fortran front ends I had found on the Internet. The GNU Fortran front-end was a huge beast and would need serious cutting back to be of use. moware was written in Fortran, used the traditional six character abbreviated names seen in 'old-style' Fortran source, and did not contain a lot of commenting. The Eli project seemed a lot more interested in the formalism side of things, and Fortran was just one of the languages they claimed to support.
The Open Fortran Parser looked very interesting. It was designed as a parsing skeleton for building tools that process source, and already contained hooks that emit diagnostic output when each language production is reduced during a parse. Tests showed that it did a good job of parsing the source I had, although there was one vendor extension that was used quite often (and not documented in their manual). The tool source, in Java, looked straightforward to follow, and it was obvious where my code needed to be added. This tool was exactly what I needed 🙂
Information content of expressions
Software developers read source code to obtain information. How might the information content of source code be quantified?
Both of the following functions assign the same value to x
and if that is the only information a reader of that code is interested in, then the information content of both assignment statements could be said to be the same.
int foo(void)
{
x = 5;
...
}

int bar(void)
{
x = 2 + 3;
...
}
A reader seeking deeper understanding of the above code would ask why the value 5
is built from two values in bar
. One reason might be that the author of the function wanted to explicitly call out background information about how the value 5
was derived (this is often done using symbolic names, but the use of literals themselves is sometimes encountered). Perhaps the author of foo
did not see the need to expose this information or perhaps the shared value is purely coincidental.
If the two representations denote the same quantity, doesn't the second have a greater information content for a reader seeking deeper understanding?
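For instance, the symbolic-name option mentioned above might look something like the following (the names, and the scenario they hint at, are invented purely for illustration):

#include <stdio.h>

/* Invented example of calling out the background behind the value 5
 * using symbolic names rather than the bare literals 2 + 3. */
#define BASE_SLOTS   2   /* slots every record always has      */
#define SPARE_SLOTS  3   /* spare slots reserved for expansion */

static int x;

int main(void)
{
    x = BASE_SLOTS + SPARE_SLOTS;   /* same value assigned by foo and bar */
    printf("%d\n", x);              /* prints 5 */
    return 0;
}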
In the following example:
... x + y & z ...

... num_red + num_white & lower_bits ...
an experienced developer with a knowledge of English is likely to interpret the expression as adding the number of occurrences of two quantities and using bit-wise AND to extract the lower bits. For some readers the second expression has a higher information content. Would use of longer names, such as number_of_red, further increase the information content?
In the following example the first expression has not added any information that was not already present in the first expression of the previous example (except perhaps that the author was not certain of the precedence, or did not expect subsequent readers to be certain of it).
... ( x + y ) & z ...

... x + ( y & z ) ...
The second expression uses parentheses to achieve an operand/operator binding that is different from the default. Has this changed the information content of the expression?
There is experimental evidence that developers extract information from the names of variables to help them make decisions about operator precedence. To me the name all_32_bits_one
suggests a sequence of bits and I would expect such a representation to be associated with the bit-wise AND operator, not binary plus. With no knowledge of the relative precedence of the two operators in the following expression the name of the middle operand would cause me to misinterpret the code. Does this change the information content of the expression? Does knowledge of the experimental evidence and the correct operator precedence change the information content (i.e., there is a potential fault in the code because the author may have assumed the incorrect precedence)?
... num_red + all_32_bits_one & sign_bit ...
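For concreteness, in C binary + has a higher precedence than bitwise &, so the expression groups as (num_red + all_32_bits_one) & sign_bit. The following throwaway program (the variable values are arbitrary, picked only so the two groupings give different results) shows the difference:

#include <stdio.h>

/* Values are arbitrary, chosen only so that the two possible groupings
 * of the expression produce different results. */
int main(void)
{
    unsigned int num_red         = 5;
    unsigned int all_32_bits_one = 0xffffffff;
    unsigned int sign_bit        = 0x80000000;

    /* + binds tighter than &, so these two are the same */
    printf("%#x\n", num_red + all_32_bits_one & sign_bit);     /* prints 0 */
    printf("%#x\n", (num_red + all_32_bits_one) & sign_bit);   /* prints 0 */

    /* the grouping the name all_32_bits_one might suggest */
    printf("%#x\n", num_red + (all_32_bits_one & sign_bit));   /* prints 0x80000005 */

    return 0;
}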
There is experimental evidence that people use the amount of whitespace appearing between operands and their operators to visually highlight operator precedence.
The relative quantities of whitespace used in the following two expressions appear to tell very different stories. Do the two expressions have a different information content?
... x+y & z ...

... x + y&z ...
The idea of measuring the information content of source code is very enticing. However, an accurate measure requires knowledge of the kind of information a reader is trying to obtain and of information that already exists in their brain.
Another question is the ease with which information can be extracted from code; something that might be labeled readability, except that readability has connotations of there being an abundant supply of information to extract.
Christmas books for 2009
I thought it would be useful to list the books that gripped me one way or another this year (and maybe last year, since I don't usually track such things closely); perhaps they will give you some ideas to add to your Christmas present wish list (please make your own suggestions in the Comments). Most of the books were published a few years ago; I maintain piles of books ordered by when I plan to read them, and books migrate between piles until eventually read. Looking at the list I don't seem to have read many good books this year, perhaps I am spending too much time reading blogs.
These books contain plenty of facts backed up by numbers and an analytic approach and are ordered by physical size.
The New Science of Strong Materials by J. E. Gordon. Ideal for train journeys since it is a small book that can be read in small chunks and is not too taxing. Offers lots of insight into those properties of various materials that are needed to build things (‘new’ here means postwar).
Europe at War 1939-1945 by Norman Davies. A fascinating analysis of the war from a numbers perspective. It is hard to escape the conclusion that in the grand scheme of things us plucky Brits made a rather small contribution, although subsequent Hollywood output has suggested otherwise. Also a contender for a train book.
Japanese English language and culture contact by James Stanlaw. If you are into Japanese culture you will love this, otherwise avoid.
Evolutionary Dynamics by Martin A. Nowak. For the more mathematical folk; plenty of thought power needed. Some very powerful general results from simple processes.
Analytic Combinatorics by Philippe Flajolet and Robert Sedgewick. Probably the toughest mathematical book I have kept at in years (yet to get close to the end). If number sequences fascinate you then give it a try (a pdf is available).
Probability and Computing by Michael Mitzenmacher and Eli Upfal. For the more mathematical folk; plenty of thought power needed. Don't let the density of theorems put you off, the approach is broad brush. Plenty of interesting results with applications to solving problems using algorithms containing a randomizing component.
Network Algorithmics by George Varghese. A real hackers book. Not so much a book about algorithms used to solve networking problems but a book about making engineering trade-offs and using every ounce of computing functionality to solve problems having severe resource and real-time constraints.
Virtual Machines by James E. Smith and Ravi Nair. Everything you ever wanted to know about virtual machines and more.
Biological Psychology by James W. Kalat. This might be a coffee table book for scientists. Great illustrations, concise explanations, the nuts and bolts of how our bodies run at the protein/DNA level.