January 22, 2012 Derek Jones 1 comment

What I found most intriguing about Google Code Search (shutdown Nov 2011) was how quickly searches involving regular expressions returned matches. A few days ago Russ Cox, the implementor of Code Search not only explained how it worked but also released the source and some precompiled binaries. Google’s database of source code did not include the source of R, so I decided to install CodeSearch on my local machine and run some of my previous searches against the latest (v2.14.1) R source.

In 2007 I ran an experiment that showed developers made use of variable names when making binary operator precedence decisions. At about the same time two cognitive psychologists, David Landy and Robert Goldstone, were investigating the impact of spacing on operator precedence decisions (they found that readers showed a tendency to pair together the operands that were visibly closer to each other, e.g., a with b in a+b * c rather than b with c).

As somebody very interested in finding faults in code the psychologists research findings on spacing immediately suggested to me the possibility that ‘incorrectly’ spaced expressions were a sign of failure to write code that had the intended behavior. Feeding some rather complicated regular expressions into Google’s CodeSearch threw up a number of ‘incorrectly’ spaced expressions. However, this finding went no further than an interesting email exchange with Landy and Goldstone.

Time to find out whether there are any ‘incorrectly’ spaced expressions in the R source. cindex (the tool that builds the database used by csearch) took 3 seconds on a not very fast machine to process all of the R source (56M byte) and build the search database (10M byte; the Linux database is a factor of 5.5 smaller than the sources).

The search:

csearch "\w(\+|\-)\w +(\*|\/) +\w"

returned a few interesting matches:

...
modules/internet/nanohttp.c:       used += tv_save.tv_sec + 1e-6 * tv_save.tv_usec;
modules/lapack/dlapack0.f:     $          ( T*( ONE+SQRT( ONE+S / T ) ) ) )
modules/lapack/dlapack2.f:               S = Z( 3 )*( Z( 2 ) / ( T*( ONE+SQRT( ONE+S / T ) ) ) )
modules/lapack/dlapack4.f:     $          ( T*( ONE+SQRT( ONE+S / T ) ) ) )

There were around 15 matches of code like 1e-6 * var (because the pattern \w is for alphanumeric sequences and that is not a superset of the syntax of floating-point literals).

The subexpression ONE+S / T is just the sort of thing I was looking for. The three instances all involved code that processed tridiagonal matrices in various special cases. Google search combined with my knowledge of numerical analysis was not up to the task of figuring out whether the intended usage was (ONE+S)/T or ONE+(S/T).

Searches based on various other combination of operator pairs failed to match anything that looked suspicious.

There was an order of magnitude performance difference for csearch vs. grep -R -e (real 0m0.167s vs. real 0m2.208s). A very worthwhile improvement when searching much larger code bases with more complicated patterns.

Categories: Uncategorized Tags: binary operator, Google, precedence, psychology, R whitespace, regular expression, searching code

Searching code for a specific kind of calculation

December 27, 2008 Derek Jones No comments

I am currently involved in a project that requires locating the line(s) of code in a program that calculate(s) the value 3n+1 (and various other constructs associated with coding the 3n+1 problem). Since there are over 4,000 independently written programs, each containing this calculation, the search effort is non-trivial. The obvious solution is to use grep to search for the expression 3*n+1 (the actual regular expression is a bit more complicated since any whitespace needs to be handled and the identifier n might have a different spelling).

This obvious solution works around 80% of the time (based on my manual analysis of the programs; searching for n*3+1 catches another 10%). For many of the authors of these 4,000+ programs simplicity does not seem to be an overriding goal and various alternative forms of the calculation are used (e.g., n+n+n or (n<<2)+n+1 or n+= (n<<2); n++ or even n+=n+++n). It looks like some authors have been unduly concerned with program performance.

The reason I am doing a manual search is that this problem is way beyond the capabilities of existing code search tools. Existing tools all require that the search pattern be specified in terms that are essential lexical. This would not be too much of a problem if I had a means of automatically enumerating a reasonable subset of the expressions that evaluate to 3n+1. (The problem of optimizing the sequence of operations needed to multiply a variable by a constant is a well known issue in compiler code generation and very good algorithms are known (papers+code and matrix multiplication)).

The existing abstract interpretation tools target complete programs, or at least complete functions, and aim to show that certain conditions are met and/or not violated. An abstract interpretation version of grep sounds like an interesting PhD.

After several thousand searches even the most obtuse methods rarely take me more than a few seconds to spot. I can also be easily ‘reprogrammed’ to search just as effectively for other code sequences having some simple result.

Contemplating the major problems that need to be solved before an automatic tool could perform a similar task I am starting to appreciate, once again, the vast gulf that exists between human and computer analysis of code.

Categories: Uncategorized Tags: abstract interpretation, grep, searching code, tools

The Shape of Code

Archive

Relative spacing of operands affects perception of operator precedence

Searching code for a specific kind of calculation

Recent Posts

Recent Comments

Archives

Meta