June 11, 2010 Derek Jones 2 comments

If you are running a large project written in C and it is decided that in future developers will be allowed to use C++, are there any C++ constructs whose use should be recommended against?

The short answer is that use of any C++ construct in existing C code should be recommended against.

The technical advantages of using C++ come about through use of its large scale code organizational features such as namespaces, classes and perhaps templates. Unless an existing C code base is restructured to use these constructs there is no real technical benefit for moving to C++; there may be non-technical reasons for allowing C++, as I wrote about recently.

My experience of projects where developers have been given permission to start using C++ within an existing C code base has not been positive. Adding namespaces/classes is a lot of work and rarely happens until a major rewrite; in the meantime C++ developers use new instead of malloc (which means the code now has to be compiled by a C++ compiler for a purely trivial reason). In days of old // style comments were sometimes touted as a worthwhile benefit of using C++ (I kid you not; now of course supported by most C compilers), as was the use of inline (even though back then few C++ compilers did much actual inlining), and what respectable programmer would use a C printf when C++’s iostreams provided << (oh, if only camera phones had been available back in the day, what a collection of videos of todays experts singing the praises of iostreams I would have ;-)

There are situations where use of C++ in an existing code base makes sense. For example, if the GCC project is offered a new optimization phase written in C++, then the code will presumably have been structured to make use of the high level C++ code structuring features. There is no reason to turn this code down or request that it be rewritten just because it happens to be in C++.

Categories: Uncategorized Tags:

GCC moves to attract more developers

June 1, 2010 Derek Jones No comments

The GCC steering committee have just approved the use of C++ in GCC (I assume they are using GCC here to refer to gcc, not the Gnu Compiler Collection). On purely technical grounds it does not make much sense to allow developers to start adding C++ constructs to a large, established C code base, in fact I can think of some good reasons why this is a poor decision technically, but I don’t think this decision was based on technical issues.

Software is written by developers and developers have opinions (sometimes very strong ones) about which languages they are willing to use. Any project that relies on unpaid contributions (GCC does have a core of paid developers working on it) has to take developer language opinions into account. In fact even projects staffed purely by paid employees have to take language opinions into account; I was once heavily involved in the Pascal community and employers would tell me they had difficulty attracting staff because being seen as a Pascal developer would limit their future career prospects.

Over the years GCC has had a huge amount of input from various people’s PhD work. I suspect that today’s PhD graduate is much more likely to write in C++ than C and have little interest in a rewrite in C just to have it accepted into the gcc source tree.

What of up and coming developers interested in getting involved in a compiler project, are they willing to work in C? If they want to use C++ rather than C, then until Sunday the compiler project of choice for them would be LLVM.

The GCC steering committee have finally acknowledged that they need to allow the use of C++ in gcc if they are to attract a sufficient number of developers to work on it.

Categories: Uncategorized Tags:

Frequency of floating literals in a given range

May 31, 2010 Derek Jones No comments

Last month I talked about one idea for estimating the ‘interestingness’ of a floating-point literal, counting the number of digits it contains. Another idea is to use the magnitude of the literal’s value; many values seem to cluster in the range -1 to +3 and perhaps a match against a value in this range ought to be filtered in some way.

Assume the two values 1.2 and 1200.0 are in the interesting number database. Both contain the same number of non-zero digits. More matches are likely to occur against the value 1.2, but this does not mean its false positive rate is higher compared to the value 1200 (that information can only come from knowledge of the application domain of the files whose contents are being matched).

Filtering based on the magnitude of a value might be used to reduce the total number of matches reported, e.g., only report the first match of a literal having a value within a certain range.

How often do floating-point literal values occur in source code? The following is based on 4.47 million non-zero floating-point literals (i.e., any literal having value 0.0 was not counted) in a wide variety of numeric source code:

Occurrences of floating-point literal values

The literal values were put into bins whose width were based on powers of 2. For instance, values between 0.25 and 0.5 went into bin -1, between 0.5 and 0 into bin 0, between 0 and 1 when into bin 1 and so on for smaller and larger numbers.

At 21.6% literals between 0.5 and 0.0 were the most common range, followed at 10.4% for literals between 0.0 and 1.0. The plot is slightly skewed towards having more values greater than 1.

Two discontinuities occur in the value frequencies between 0.001-0.02 and 16-64 (with the smaller values occupying a slightly larger range). This unexpected behavior has been added to my list of things to investigate at some point.

Categories: Uncategorized Tags:

Building directly from a .tgz file

May 15, 2010 Derek Jones 3 comments

Working on lots of different code bases means I am forever having to extract the contents of tar/zip files before compiling/analysing the files in the extracted directory tree, then deleting the directory tree when I am done. It is about time development tools such as make and compilers had the ability to build directly from an archive.

Vi (well actually Vim and other editors) supports the editing of files contained within an archive and thanks to libarchive the latest version of Numbers also has this functionality.

This is not about saving disc space, it is a way of working that creates a barrier between files created elsewhere and files created by me; it would make it harder for me to accidentally leave my work files in the directory I happened to be sitting, in the directory tree, when working on the source.

There are some design issues that need to be sorted out if build configuration via .configure files is to work correctly. I leave these design issues to the people who know about configuration management.

We will need to extend the existing directory-path/file syntax to support the specification of a file contained within an archive. How about using ::tgz:: as a file prefix to indicate that the subsequent directory/file specification is to be interpreted as referring to the contents of an archive file, e.g.,

cc /home/stuff/::tgz::app.tgz/src/foo/bar.c

I don’t think a separate prefix is needed for each kind of archive, any character sequence that is sufficiently unused at the moment will do.

Go readers, spread the word!

Categories: Uncategorized Tags:

SEC wants prospectus source code to be published

April 23, 2010 Derek Jones No comments

The US Securities and Exchange Commission are proposing new rules involving the prospectuses for public offerings of asset-backed securities including publishing the source code used to calculate the contractual cash flow provisions.

Requiring that the source code used to perform the financial modeling for a prospectus be made available is an excellent idea. A prospectus document contains a huge number of technical details and more importantly for anybody trying to understand the thinking behind it, a lots of assumption. Writing a program requires that all necessary details be enumerated and appropriately connected together and more importantly creating code that can be meaningfully executed usually means making explicit any assumptions that were previously implicit.

There are parallels here with having access to the source code and data used to make climate predictions.

The authors of the proposals are naive to think that simply requiring source to be written in a language for which there is an open source implementation (i.e., Python) is all they need to specify for others to duplicate the program output generated by the proposer (I have submitted some suggestions to the SEC about the issues that need to be addressed). The suggestions that a formally defined language be used is equally naive.

The availability of this source code opens up some interesting commercial prospects. No, not selling analysis tools to financial institutions but selling them program fault information, e.g., under circumstance X the program incorrectly predicts A will happen when in fact B will actually happen. Of course companies know this will happen and will put a lot more effort into ensuring that their models/code is correct.

Will these disclosure rules change the characteristics of financial software? One characteristic that I’m sure will change is the percentage of swear words in the comments and identifiers.

Categories: Uncategorized Tags: assumptions, climate-gate, faults, Python, SEC, source code

Brief history of syntax error recovery

April 19, 2010 Derek Jones 2 comments

Good recovery from syntax errors encountered during compilation is hard to achieve. The two most common strategies are to insert one or more tokens or to delete one or more tokens. Make the wrong decision and a second syntax error will occur, often leading to another and soon the developer is flooded by a nonsensical list of error messages. Compiler writers soon learn that their first priority is ensuring that syntax error recovery does not result in lots of cascading errors. In languages that use a delimiter to indicate end of statement/declaration, usually a semicolon, the error recovery strategy of deleting all tokens until this delimiter is next encountered is remarkably effective.

The era of very good syntax error recovery was the 1970s and early 1980s. Developers working on mainframes might only be able to achieve one or two compilations per day on a batch oriented mainframe and they were not happy if a misplaced comma or space resulted in a whole day being wasted. Most compilers were rented for lots of money and customer demand resulted in some very fancy error recovery strategies.

Borland’s Turbo Pascal had a very different approach to handling errors in code, it stopped processing the source as soon as one was detected. The combination of amazing compilation rates and an interactive environment (MS-DOS running on the machine in front of the developer) made this approach hugely attractive.

To a large extent syntax error recovery has been driven by the methods commonly used to write parsers. Many compilers use a table driven approach to syntax analysis with the tables being generated by parser generator tools such as Yacc. During the 1970s and 80s a lot of the research on parser generators was aimed at reducing the size of the generated tables. A table of 10k bytes was a significant percentage of available storage for machines that supported a maximum of 64k of memory. Some parser table compression techniques involve assuming the default behavior and then handling any special cases when these defaults are found not to apply, but one consequence is that context information needed for good error recovery is often not available when an error is detected. The last major release of Yacc from AT&T in the early 1990s managed another reduction in table size, just as typical storage sizes were getting into the ten of megabytes, but at the expense of increasing the difficulty of doing good error recovery.

While there are still some application areas where the amount of storage occupied by parser tables is still a big issue, e.g., the embedded market, developers of parser generators such as Bison ought to start addressing the needs of users wanting to do good error recovery and who are willing to accept larger tables.

I am pleased to see that the LLVM project is making an effort to provide good syntax error recovery. A frustrating barrier to providing better error recovery is lack of information on the kinds of syntax errors commonly made by developers; there are a few papers and reports containing small scale measurements of errors made by students. Perhaps the LLVM developers will provide a mechanism for automatically collecting compilation errors and providing users with the option to send the results to the LLVM project.

One of my favorite syntax error recovery techniques (implemented in a PL/1 mainframe compiler; I have never been able to justify implementing it on any project I worked on) is the following:

// Use of an undeclared identifier is a syntax error in C and some other
// languages, while in other languages it is a semantic error.
 
// no identifier with name result visible here
 
   {
   int result;
   ...
   result=...
   ...
   }
...
calc=result*2;  // Error reported by most compilers is use of an undeclared variable

The ‘real’ error is probably the misplaced closing bracket. Other possibilities include result being a misspelled version of another variable or the assignment to calc being in the wrong place.

There seems to be a trend over the last 20 years to create languages that require more and more semantic information during parsing. Deciphering a syntax error today can involve a lot more than figuring out which surrounding tokens have been omitted or misplaced, information on which types are in scope and visible (oh for the days when that meant the same thing) and where they might be found in the umpteen thousand lines of included source has to be distilled and presented to the developer in a helpful message.

For a long time compilers have primarily been benchmarked on the quality of their code. With every diminishing returns from improved optimization, the increasing complexity of languages and the increasing volume of header code pulled in during compilation perhaps the quality of syntax error recovery will grow in importance.

Categories: Uncategorized Tags: compiling, error recovery, mainframe, parsing, syntax error

Number of digits in floating-point literals

April 18, 2010 Derek Jones No comments

Some of the interesting floating-point literals detected by the numbers program not only look uninteresting but plain wrong. For instance, almost every program I analyze appears to contain a literal denoting the ratio of the diameter of the Earth to at least one minor planet. One problem is that most of the numbers contained in the interesting number database are only likely to occur in very specific circumstances and as the size of this database grows the percentage of inappropriate matches grows.

I could (and at some point probably will have to) assign an interestingness level to numbers, but this goes against one of the original aims of identifying the operations performed by unknown source.

An alternative idea is to create a connection between the fuzziness of the matching process and the probability of the literal being encountered in code. For instance, a more exact match might be required for 0.5 because it contains few mantissa digits and sits within a range of values that are commonly encountered, while a much fuzzier match might be used for 1.879623e+3 because it contains more digits and occupies a less commonly encountered range of values.

Floating-point literals often contain leading or trailing zeros, e.g., 0.001, 100.0, 1e+2
or 0.50. Does the presence of these zeros change the probability of a particular mantissa being encountered? For instance the literals 100.0 and 1e+2 have the same numeric value but different numbers of mantissa digits.

Another issue is developer intent. Why did a developer write 0.50, did they simply want two digits to appear after the decimal point because the surrounding literals in the source contain two digits and it makes the visual appearance look better or does this usage denote a quantity whose accuracy is known to two decimal digits?

The following figure is derived from 1 million non-zero floating-point literals contained in ten large, computationally intensive programs.

Significant digits in a floating-point literal

The dashed line denotes the percentage of mantissas containing a given number of characters, including leading/trailing zeros and any decimal point.

The two solid lines denote the digit count of the mantissas with any leading/trailing zeros removed, along with any decimal point, e.g., both 100.0 and 1e+2 would be considered to contain one digit.

It seemed to me that floating-point literals appearing within an initializer attached to a variable definition often contain more digits than literals that appear elsewhere. The solid, triangle tagged, solid line that spends most of its time around 5% are floating-point literals appearing within an initializer (to be exact they are literals separated from another literal by a comma {with some simplistic handling of Fortran line continuation}). The bullet tagged line are all other literals.

I was partially right about the characteristics of floating-point literals in initializers. It turns out the probability of encountering a mantissa containing a given number of digits is approximately constant within an initializer (a more sophisticated analysis might show an upward trend with increasing numbers of digits).

The mantissa digit count outside of initializers has the kind of probability distribution I was looking for. Hopefully this distribution will contribute to a useful measure of interestingness.

Categories: Uncategorized Tags: digits, floating-point, intent, literal mantissa

What language was an executable originally written in?

April 11, 2010 Derek Jones No comments

Apple have recently added an unusual requirement to the iPhone developer agreement “Applications must be originally written in Objective-C, C, C++, or JavaScript …”. As has been pointed out elsewhere the real purpose is stop third party’s from acquiring any control over application development on Apple’s products; the banning of other languages is presumably regarded as acceptable collateral damage.

Is it possible to tell by analyzing an executable what language it was originally written in?

There are two ways in which executables contain source language ‘signatures’. Detecting these signatures requires knowledge of specific compiler behavior, i.e., a database of information about the behavior of compilers capable of creating the executables is needed.

Runtime library. Most compilers make use of a language specific runtime library, rather than generating inline code for some kinds of functionality. For instance, setjmp/longjmp in C and vtables in C++.

The presence of a known C runtime library does not guarantee that the application was originally written in C; it could have been written in Java and converted to C source.

The absence of a known C runtime library could mean that the source was compiled by a C compiler using a runtime system unknown to the analyzer.

The presence of a known Java, for instance, runtime library would suggest that the original source contained some Java. This kind of analysis would obviously require that the runtime library database not restrict itself to the ‘C’ languages.

Compiler behavior patterns. There is usually more than one way in which a source language construct can be translated to machine code and a compiler has to pick one of them. The perfect optimizing compiler would always make the optimal choice, but real compilers follow a fixed pattern of code generation for at least some language constructs (e.g., initialization of registers on function entry).

The presence of known code patterns in an executable is evidence that a particular compiler has been used; how much depends on the likelihood it could have been generated by other means and how many other patterns suggest the same compiler. In the case of the GNU Compiler Collection the source language might also be Fortran, Java or Ada; I don’t know enough about the behavior of GCC to provide an informed estimate of whether it is possible to recognize the source language from the translated form of constructs shared by several languages, I suspect not.

The fact that an executable can be decompiled to C is not a guarantee that it was originally written in C.

Some languages support source language constructs whose corresponding machine code is unlikely to ever be generated by source from another language. The Fortran computed goto allows constructs to be written that have no equivalent in the other languages supported by GCC (none of them allow statement labels appearing in a multi-way jump to appear before the jump test):

10    I=I+1
20    J=J+1
       goto (10, 20, 30, 40) J
30    I=I+3
40    I=I*2

The presence of a compiled form of this kind of construct in the executable would be very suggestive of Fortran source.

Apple are famously paranoid and control freakery. It will be very interesting to see what level of compliance checking they decide to perform on executables submitted to the App Store.

On another note: What does “originally written” mean? For instance, many of the mathematical functions (e.g., sine, log, gamma, etc) contained in R were originally written in Fortran and translated to C for use in R; this C source is what is now maintained. Does this historical implementation decision mean that R cannot be legally ported to the iPhone?

Categories: Uncategorized Tags: Apple, C, executable, Fortran, iPhone, Java, library, Objective-C, runtime, Source language

Program analysis via information leakage

March 31, 2010 Derek Jones No comments

The use of software in high value transactions has created an interesting new field of software research that investigates the leakage of information from programs. The kind of information leaked, so-called sideband information, can take various forms, including:

The amount of time taken to perform some operation. Many developers instinctively do their best to ensure that code does not take any longer to execute than it has to. In the case of one commonly used authentication system, the time taken to fail to authenticate an encryption key provided useful information on how close one trial encryption-key was compared to another (the closer the trial key to the actual key, the longer the authentication took to fail). The obvious implementation technique to foil this kind of attack is to add random delays into the authentication process.
It has even proved possible to perform timing attacks against a remote machine over the Internet to remote
Use of some part of the value of secure information, by a system library function, to create the value passed back to the caller, e.g.,
if (secret_value & 0xf000) // Tell the caller that the top 'secret' four bits are set return 1; else return 0;
if (secret_value & 0xf000) // Tell the caller that the top 'secret' four bits are set return 1; else return 0;
Researchers have been able to analyse the information flow of input values through some very large C programs.
Analyse of network traffic routing information to work out who is talking to who. Various kinds of anonymizers have been created in attempt to make various forms of Internet traffic untraceable.

Any Internet program is accessible to information flow analysis. Using these techniques to analyse the search algorithm used by Google might be overly ambitious. A Google algorithm that might be within reach of is the one used by Adwords; the behavior of this algorithm is of interest to a growing number of people.

Information leakage techniques are becoming more widely known and developers working on programs containing a security component now need to consider how they can prevent information being leaked to attackers who sample program behavior looking for exploitable weaknesses.

Categories: Uncategorized Tags: Google, Information leakage, security, the future, timing

Variations in the literal representation of Pi

March 12, 2010 Derek Jones No comments

The numbers system I am developing attempts to match numeric literals contained in a file against a database of interesting numbers. One of the things I did to quickly build a reasonably sized database of reliable values was to extract numeric literals from a few well known programs that I thought I could trust.

R is a widely used statistical package, and Maxima is a computer algebra system with a long history. Both contain a great deal of functionality and are actively maintained.

To my surprise, the source code of both packages contain a large variety of different literal values for , or to be exact, the number of digits contained in the literals varied by more than I expected. In the following table, the value to the left of the representation is the number of occurrences; values listed in increasing literal order:

     Maxima                              R
   2 3.14159
                                      14 3.141592
   1 3.1415926
   1 3.14159265                        2 3.14159265
   3 3.1415926535
   4 3.14159265358979
  14 3.141592653589793
   3 3.1415926535897932385             3 3.1415926535897932385
   9 3.14159265358979324
                                       1 3.14159265359
                                       1 3.1415927
                                       1 3.141593

The comments in the Maxima source led me to believe that some thought had gone into ensuring that the numerical routines were robust. Over 3/4 of the literal representations of have a precision comparable to at least that of 64-bit floating-point (I’m assuming an IEEE 754 representation in this post).

In the R source, approximately 2/3 of the literal representations of have a precision comparable to that of 32-bit floating-point.

Closer examination of the source suggests one reason for this difference. Both packages make heavy use of existing code (translated from Fortran to Lisp for Maxima and from Fortran to C for R); using existing code makes good sense and because of its use in scientific and engineering applications many numerical libraries have been written in Fortran. Maxima has adapted the slatec library, whereas the R developers have used a variety of different libraries (e.g., specfun).

How important is variation in the representation of Pi?

A calculation based on a literal that is only accurate to 32-bits is likely to be limited to that level of accuracy (unless errors cancel out somewhere).
Inconsistencies in the value used to represent Pi are a source of error. These inconsistencies may be implicit, for instance literals used to denote a value derived from such as often seem to be based on more precise values of Pi than appear in the code.

The obvious solution to this representation issue of creating a file containing definitions of all the frequently used literal values has possible drawbacks. For instance, numerical accuracy is a strange beast, and increasing the precision of one literal without doing the same for other literals appearing in a calculation can sometimes reduce the accuracy of the final result.

Pulling together existing libraries to build a package is often very cost-effective, but numerical accuracy is a slippery beast, and this inconsistent usage of literals suggests that developers from these two communities have not addressed the system level consequences of software reuse.

Update 6 April: After further rummaging around in the R source distribution, I found that things are not as bad as they first appear. Only two of the single precision instances of listed above occur in the C or Fortran source code, the rest appear in support files (e.g., m4 scripts and R examples).

Categories: Uncategorized Tags: floating-point, Fortran, Maxima, Pi, R, testing

Newer Entries Older Entries

The Shape of Code

Archive

Any technical benefits to C++ in a C project?

GCC moves to attract more developers

Frequency of floating literals in a given range

Building directly from a .tgz file

SEC wants prospectus source code to be published

Brief history of syntax error recovery

Number of digits in floating-point literals

What language was an executable originally written in?

Program analysis via information leakage

Variations in the literal representation of Pi

Recent Posts

Recent Comments

Archives

Meta