Archive

Posts Tagged ‘measuring’

Best tool for measuring lots of source code

November 2, 2025 No comments

Human written source code contains various common usage patterns. This blog has analysed a variety of these patterns, and in a few cases built models of processes that replicate these patterns. The data for this analysis has primarily comes from programs written in C and Java, because these are the languages that researchers most often study (tool availability and herd mentality).

Do these common usage patterns occur in other languages, or at least other C/Java like languages? I think so, and have set out to collect the necessary data. Obtaining this data requires large quantities of code written in many languages, and the ability to analyse code written in these languages.

GitHub contains huge quantities of code. There are two freely available source code analysis tools supporting many languages: Opengrep (the Open source version of semgrep) and CodeQL.

CodeQL’s method of operation had previously put me off trying it. The method is a two stage process: First a database of information is created by extracting information during a project’s build process (e.g., running existing makefiles and host compilers), followed by querying this database using a declarative language (think minimalist SQL with lots of built-in functions). This approach has the huge advantage of not having to worry about handling compiler dialects/options, however, I’m an ingrained user of tools that process individual files.

From the research perspective, CodeQL has a major feature that is not available with other tools. GitHub, who now owns CodeQL, host thousands of project databases and GitHub Actions allows third-parties to scan up to 1,000 databases of the most popular projects. Access to existing CodeQL databases removes the need to download repo/build project/store database locally.

CodeQL, like other static analysis tools, was designed to find issues/problems in code, and so might not support the kind of functionality I needed to extract source code measurements. The best way to find out if the data of interest could be extracted is to try and do it.

In the best developer tradition, I downloaded a prebuilt release (available for Linux, Windows and Mac; called CodeQL Bundles), skimmed the documentation, ran a simple QL script and spent an hour or two trying to figure out why I was getting Java runtime errors, e.g., “no String-argument constructor/factory method to deserialize from String value“.

Progress would have been faster if I had used Visual Studio Code, available free from the owners of GitHub, rather than the command line. The documentation is not command line oriented. Visual Studio handles details like creating a qlpack.yml file (whose necessary existence I eventually found out about). Also, the harmless looking metadata appearing in comments is necessary and had better match the output parameters of the query. How hard is it to warn that a file could not be found, or that metadata is missing?

The code databases are queried using the declarative language QL, which is a kind of minimal SQL (with the select appearing last, rather than first). The import statement specifies the language, or rather the name of a library module.

The imported library contains classes for each language construct (e.g., BlockStmt, Function, ArrayExpr, etc). In the query below, the line “from LocalScopeVariable lv” extracts all local scope variables, which can subsequently be referred to via the name lv. The where line lists conditions that must be met (in this example, not be a parameter and not be accessed; testing for unused variables). The select line invokes methods that return various kinds of information about the class, e.g., the name of the variable, and location within the source.

   /**
    * @id compound-stmt
    * @kind problem
    * @problem.severity warning
    */
 
   import cpp
 
   from LocalScopeVariable lv
   where
     not lv instanceof Parameter and
     not exists(lv.getAnAccess())
   select "", ""+lv.getName()+
          ","+lv.getLocation().getStartLine()+
          ","+lv.getLocation().getEndLine()+
          ","+lv.getEnclosingFunction()+","+bs.getFile()

The output generated is driven by the select, whose number/kind of arguments must match that specified by the metadata.

Developers can write and call functions, such as this one:

   predicate header_suffix(string fstr)
      { fstr = "h" or fstr = "H" or fstr = "hpp" }

The QL language is a declarative logical query language with roots in Datalog (subset of Prolog). The claim that it is an object-oriented language is technically correct, in that it groups functions into things called classes and supports various constructs usually found in object-oriented languages. The language has the feel of an academic project that happened to be used in a tool that was in the right place at the right time. Using host compilers to enable the tool to support many languages must have been very attractive to GitHub.

Coding in a declarative logic language requires a major mindset change. There are no loops, if statements or assignments. The query is one, potentially very long and complicated, predicate. A mindset change is necessary, but not sufficient, some fluency with the library of functions available is also needed. For instance, the isSideEffectFree predicate is true/false, but does not return a value (so there is nothing to print). I wanted to output 0/1, depending on whether a function was side effect free or not. When asked, all the LLMs questioned insisted that QL supported if-statements and assignment, just like other languages. After lots of dead-ends, an LLM claimed that “CodeQL automatically treats boolean expressions in count as 1/0″, and a test run showed this to be the case:

   count(int dummy | dummy = 1 and func.isSideEffectFree() | dummy)

The QL scripts needed to extract all the data of immediate interest to me were easily implemented. Looking at existing scripts has given me some ideas for more patterns I might measure. CodeQL currently supports 10 languages, and their classes appear to be slightly different (my initial focus is C, C++, Java and Python).

Visual Studio Code is required to run multi-repository variant analysis, i.e., scan up to 1,000 project databases on GitHub. It was after installing the CodeQL extension that I discovered how much smoother the process is within this IDE, compared to the command line (and off course the output is slightly different). There may be alternatives to Visual Studio, but I’m sticking with what the official documentation says.

Stepping back, is CodeQL a useful tool?

For me it is currently very useful, because of the large number of project databases. Some practice is needed to achieve some fluency in the use of a declarative logic language, not a major hurdle.

The need to run queries against a project database may be a major inconvenience for some developers, depending on working practices. Those practicing continuous integration should be ok.

Coronavirus: a silver lining for evidence-based software engineering?

March 22, 2020 No comments

People rarely measure things in software engineering, and when they do they rarely hang onto the measurements; this might also be true in many other work disciplines.

When I worked on optimizing compilers, I used to spend time comparing code size and performance. It surprised me that many others in the field did not, they seemed to think that if they implemented an optimization, things would get better and that was it. Customers would buy optimizers without knowing how long their programs took to do a task, they seemed to want things to go faster, and they had some money to spend buying stuff to make them feel that things had gotten faster. I quickly learned to stop asking too many questions, like “how fast does your code currently run”, or “how fast would you like it to run”. Sell them something to make them feel better, don’t spoil things by pointing out that their code might already be fast enough.

In one very embarrassing incident, the potential customer was interested in measuring performance, and my optimizer make their important program go slower! As best I could tell, the size of the existing code just fitted in memory, and optimizing for performance made it larger; the system started thrashing and went a lot slower.

What question did potential customers ask? They usually asked whether particular optimizations were implemented (because they had read about them someplace). Now some of these optimizations were likely to make very little difference to performance, but they were easy to understand and short enough to write articles about. And, yes. I always made sure to implement these ‘minor’ optimizations purely to keep customers happy (and increase the chances of making a sale).

Now I work on evidence-based software engineering, and developers rarely measure things, and when they do they rarely hang onto the measurements. So many people have said I could have their data, if they had it!

Will the Coronavirus change things? With everybody working from home, management won’t be able to walk up to developers and ask what they have been doing. Perhaps stuff will start getting recorded more often, and some of it might be kept.

A year from now it might be a lot easier to find information about what developers do. I will let you know next year.

Information content of expressions

December 11, 2009 No comments

Software developers read source code to obtain information. How might the information content of source code be quantified?

Both of the following functions assign the same value to x and if that is the only information a reader of that code is interested in, then the information content of both assignment statements could be said to be the same.

int foo(void)
{
x = 5;
...
}
 
int bar(void)
{
x = 2 + 3;
...

A reader seeking deeper understanding of the above code would ask why the value 5 is built from two values in bar. One reason might be that the author of the function wanted to explicitly call out background information about how the value 5 was derived (this is often done using symbolic names, but the use of literals themselves is sometimes encountered). Perhaps the author of foo did not see the need to expose this information or perhaps the shared value is purely coincidental.

If the two representations denote the same quantity doesn’t the second have a greater information content for a reader seeking deeper understanding?

In the following example:

... x + y & z ...
 
...
 
... num_red + num_white & lower_bits ...

an experienced developer with a knowledge of English is likely to interpret the expression as adding the number of occurrences of two quantities and using bit-wise AND to extract the lower bits. For some readers the second expression has a higher information content. Would use of the names number_of_red further increase the information content?

In the following example the first expression has not added any information that was not already present in the first expression above (except perhaps that the author was not certain of the precedence or perhaps did not expect subsequent readers to be certain).

... ( x + y ) & z ...
 
...
 
... x + ( y & z ) ...

The second expression uses parenthesis to achieve an operand/operator binding that is different from the default. Has this changed the information content of the expression?

There is experimental evidence that developers extract information from the names of variables to help them make decisions about operator precedence. To me the name all_32_bits_one suggests a sequence of bits and I would expect such a representation to be associated with the bit-wise AND operator, not binary plus. With no knowledge of the relative precedence of the two operators in the following expression the name of the middle operand would cause me to misinterpret the code. Does this change the information content of the expression? Does knowledge of the experimental evidence and the correct operator precedence change the information content (i.e., there is a potential fault in the code because the author may have assumed the incorrect precedence)?

... num_red + all_32_bits_one & sign_bit ...

There is experimental evidence that people use the amount of whitespace appearing between operands and their operators to visually highlight operator precedence

The relative quantities of whitespace used in the following two expressions appear to tell very different stories. Do the two expressions have a different information content?

... x  +  y & z ...
 
...
 
... x + y  &  z ...

The idea of measuring the information content of source code is very enticing. However, an accurate measure requires knowledge of the kind of information a reader is trying to obtain and of information that already exists in their brain.

Another question is the easy with which information can be extracted from code. Something that might be labeled as readability, except that readability has connotations of there being an abundant supply of information to extract.

Using Coccinelle to match if sequences

August 31, 2009 No comments

I have been using Coccinelle to obtain measurements of various properties of C if and switch statements. It is rare to find a tool that does exactly what is desired, but it is often possible to combine various tools to achieve the desired result.

I am interested in measuring sequences of if-else-if statements and one of the things I wanted to know was how many sequences of a given length occurred. Writing a pattern for each possible sequence was the obvious solution, but what is the longest sequence I should search for? A better solution is to use a pattern that matches short sequences and writes out the position (line/column number) where they occur in the code, as in the following Coccinelle pattern:

@ if_else_if_else @
expression E_1, E_2; 
statement S_1, S_2, S_3;
position p_1, p_2;
@@
if@p_1 (E_1)
   S_1
else if@p_2 (E_2)
   S_2
else
   S_3
@
script:python @ expr_1 << if_else_if_else.E_1;
                expr_2 << if_else_if_else.E_2;
                loc_1 << if_else_if_else.p_1;
                loc_2 << if_else_if_else.p_2;
              @@
print "--- ifelseifelse"
print loc_1[0].line, " ", loc_1[0].column, " ", expr_1
print loc_2[0].line, " ", loc_2[0].column, " ", expr_2

noting that in a sequence of source such as:

if (x == 1)
   stmt_1;
else
   if (x == 2)
      stmt_2;
   else
      if (x == 3)
         stmt_3;

the tokens if (x == 2) will be matched twice, the first setting the position metavariable p_2 and then setting p_1. An awk script was written to read the Coccinelle output and merge together adjacent pairs of matches that were part of a longer if-else-if sequence.

The first pattern did not concern itself with the form of the controlling expression, it simply wrote it out. A second set of patterns was used to match those forms of controlling expression I was interested in, but first I had to convert the output into syntactically correct C so that it could be processed by Coccinelle. Again awk came to the rescue, converting the output:

--- ifelseifelse
186   2   op == FFEBLD_opSUBRREF
191   7   op == FFEBLD_opFUNCREF
--- ifelseifelse
1094   3   anynum && commit
1111   8   ( c [ colon + 1 ] == '*' ) && commit

into a separate function for each matched sequence:

void f_1(void) {
// --- ifelseifelse
/* 186   2 */   op == FFEBLD_opSUBRREF ;
/* 191   7 */   op == FFEBLD_opFUNCREF ;
}
void f_2(void) {
// --- ifelseifelse
/* 1094   3 */   anynum && commit ;
/* 1111   8 */   ( c [ colon + 1 ] == '*' ) && commit ;
}

The Coccinelle pattern:

@ if_eq_1 @
expression E_1;
constant C_1, C_2;
position p_1, p_2;
@@
 
E_1 == C_1@p_1 ;
E_1 == C_2@p_2 ;
 
@
script:python @ expr_1<< if_eq_1.E_1;
                const_1 << if_eq_1.C_1;
                const_2 << if_eq_1.C_2;
                loc_1 << if_eq_1.p_1;
                loc_2 << if_eq_1.p_2;
              @@
print loc_1[0].line, " ", loc_1[0].column, " 3 ", expr_1, " == ", const_1
print loc_2[0].line, " ", loc_2[0].column, " 2 ", expr_1, " == ", const_2

matches a sequence of two statements which consist of an expression being compared for equality against a constant, with the expression being identical in both statements. Again positions were written out for post-processing, i.e., joining together matched sequences.

I was interested in any sequence of if-else-if that could be converted to an equivalent switch-statement. Equality tests against a constant is just one form of controlling expression that meets this requirement, another is the between operation. Separate patterns could be written and run over the generated C source containing the extracted controlling expressions.

Breaking down the measuring process into smaller steps reduced the amount of time needed to get a final result (with Coccinelle 0.1.19 the first pattern takes round 70 minutes, thanks to Julia Lawall‘s work to speed things up, an overhead that only has to occur once) and allows the same controlling expression patterns to be run against the output of both the if-else-if and if-if patterns.

At the end of this process I ended up with a list information (line numbers in source code and form of controlling expression) on if-statement sequences that could be rewritten as a switch-statement.