January 26, 2016 Derek Jones No comments

How well do I understand the process of writing source, as performed by human developers? If I could write a program that could generate source code that was indistinguishable from human written code, then I think I could claim to have a decent understanding of the human processes.

An important question is the skill of the entity attempting to distinguish automatically generated and human written code. As of today I think I could fool existing tools (mostly because such tools don’t really exist; n-gram analysis is really all there is), but a human developer with a smattering of coding experience would not be fooled.

I think I know enough to write a tool that could generate single function definitions that would fool most developers, but not a moderately sophisticated analysis tool (I don’t know enough to combine multiple functions into a plausible call graph).

At the CREST workshop today I met Martin Monperrus who shared my view that the only way to prove an understanding of source is to be able to generate it; other people at the workshop seemed more interested in the detection side of things.

Perhaps the time is ripe to kick off a generator/detector arms race would certainly increase our knowledge of the characteristics of human written code.

What counts as automatically generated code? Printing complete function definitions mined from Github is obviously not what most people would associate with the concept of automatic source generation. What about printing statements mined from Github (with each statement coming from a different function definition)? Well, if people could make that work, then good luck to them (it hardly seems worth the bother because most statements are very simple and easily generated on a token by token basis).

Should functions be judged in isolation, or should a detector get to see multiple instances of each case (which would make the detector’s life easier)? It might be best to go with whatever rules make for an interesting arms race.

To start the ball rolling, here is my first entry to the generator side of the race (a rather minimal entry, but still the leader in its field; the recurrent neural network trained on the Linux source would easily beat it, but my interest is in understanding developer behavior and will leave it to others to go down the neural network path).

Categories: Uncategorized Tags: automatic generation, human behavior, source code

Forces driving patterns found in call graphs of human written code

September 30, 2015 Derek Jones No comments

It has been a while since I posted anything about generating source that mimics the characteristics of human written code.

Generating function definitions is the easy bit, although variable selection is fiddly and naming needs to be handled.

The hardest part of mimicking human written code is linking functions together, via calls, in apparently meaningful ways. Within one source file it is probably possible to get away with individual calls to other local functions (most function definitions are called once) and a couple of calls to third party libraries. At the complete program level the code needs to tell a story, and doing this is very hard.

I think function calls can be divided into three categories, all based on the relationship between the developer who writes the code than does the call and the developer(s) who wrote the called function:

Caller developer is the same person as the callee developer. One person gets to decide how things are done,
Caller developer works on the same team as the callee developer. Here an interface needs to be negotiated with one or more other people,
Caller developer is making use of a third-party library. The interface is pre-decided, take it or leave it (the same principle applies to library updates, the caller has to decide whether to stay with the existing version or make the changes needed to upgrade to the new version; Android is the (in)famous example of a frequently changed library that developers are under strong pressure to continually adapt to).

Calls to functions in third-party libraries tend to follow stylized sequences, e.g., open_* occurs before read_*/write_* and close_* appears last. Most of the time a generator could get away not calling functions from third-party libraries, because the number of such calls is often small, but when they occur they had better look correct.

Needing to call a function written by another developer on the team opens up all sorts of possibilities: there is always the option of them modifying the function to make life easier for the new call, but the cost of modifying all the existing call sites (plus any associated call chains) may be too high, perhaps a new global variable can be used to communicate the desired information, or perhaps the proposed usage is a small cog in a large wheel and has to make do (or perhaps they don’t like the person asking for the change and reasons are invented for staying as-is).

Then there is correlation in time and space (this has a big impact on patterns of evolution, at least I think so; models of forest fires, i.e., growth, death, fires of various sizes creating space for new growth, is the obvious parallel):

The longer a function exists the more likely it is to accumulate callers. A function definition can remain unchanged and yet over time become more and more difficult to change.
It can be very expensive to make changes to an existing function definition when there are lots of calls to it.

What do measurements of code have to say? Almost nothing; existing studies mostly do weird things like treating a system’s call graph as-if it were a social network (using a currently trendy metaphor is good for getting papers published, even if the mapping is all wrong), plotting power law-like graphs and sprouting portentous nonsense.

What is really needed is measurements of forked systems; comparing systems derived from a common past will tell us how much natural variation exists due to individual choice.

FreeBSD, NetBSD and OpenBSD are the obvious poster children of forked, common heritage, systems written in C; I cannot think of any such systems written in Java or C++.

Categories: Uncategorized Tags: call graph, code generation, human behavior

Programming using genetic algorithms: isn’t that what humans already do ;-)

October 18, 2013 Derek Jones No comments

Some time ago I wrote about the use of genetic programming to fix faults in software (i.e., insertion/deletion of random code fragments into an existing program). Earlier this week I was at a lively workshop, Genetic Programming for Software Engineering, with some of the very active researchers in this new subfield.

The genetic algorithm works by having a population of different programs, selecting X% of the best (as measured by some fitness function), making random mutations to those chosen and/or combining bits of programs with other programs; these modified programs are fed back to the fitness function and the whole process iterates until an acceptable solution is found (or a maximum iteration limit is reached).

There are lots of options to tweak; the fitness function gets to decide who has children and is obviously very important, but it can only work with what get generated by the genetic mutations.

The idea I was promoting, to anybody unfortunate enough to be standing in front of me, was that the pattern of usage seen in human written code provides lots of very useful information for improving the performance of genetic algorithms in finding programs having the desired characteristics.

I think that the pattern of usage seen in human written code is driven by the requirements of the problems being solved and regular occurrence of the same patterns is an indication of the regularity with which the same requirements need to be met. As a representation of commonly occurring requirements these patterns are pre-tuned templates for genetic mutation and information to help fitness functions make life/death decisions (i.e., doesn’t look human enough, die!)

There is some noise in existing patterns of code usage, generated by random developer habits and larger fluctuations caused by many developers following the style in some popular book. I don’t have a good handle on estimating the signal to noise ratio.

There has been some work comparing the human maintainability of patches that have been written by genetic algorithms/humans. One of the driving forces behind this work is the expectation that the final patch will still be controlled by humans; having a patch look human-written like is thought to increase the likelihood of it being ‘accepted’ by developers.

Genetic algorithms are also used to improve the runtime performance of programs. Bill Langdon reported that the authors of a program ‘he’ had speeded up by a factor of 70 had not responded to his emails. This may be a case of the authors not knowing how to handle something somewhat off the beaten track; it took a while for Linux developers to start responding to batches of fault reports generated as part of software analysis projects by academic research groups.

One area where human-like might not always be desirable is test case generation. It is easy to find faults in compilers by generating random source code (the syntax/semantics of the randomness follows the rules of the language standard). This approach results in an unmanageable number of fault. Is it worth fixing a fault generated by code that looks like it would never be written by a person? Perhaps the generator should stick to producing test cases that at least look like the code might be written by a person.

Categories: Uncategorized Tags: faults, fitness, genetic algorithms, human behavior, maintainability

Newer Entries

The Shape of Code

Archive

Human vs automatically generated source code: an arms race?

Forces driving patterns found in call graphs of human written code

Programming using genetic algorithms: isn’t that what humans already do ;-)

Recent Posts

Recent Comments

Archives

Meta