Readability: we know nothing
Readability is one of those terms that developers use and expect other developers to understand while at the same time being unable to define what it is or how it might be measured. I think all developers would agree that their own code is very readable; if only different developers stopped writing code in different ways the issue would go away 🙂
Because I have written a book containing lots of material on cognitive psychology and how it might apply to programming, developers who have advanced beyond “Write code like me and it will be readable” sometimes ask for my perceived expert view on the subject. Unfortunately my expertise has only advanced to the stage of: 1) having a good idea of what research questions need to be addressed, 2) being able to point at experimental results showing that most claimed good readability tips are at best worthless and may even increase cognitive load during reading.
To a good approximation we know nothing about code readability. What questions need to be answered to change this situation?
The first and most important readability question is: what is the purpose of looking at the code? Is the code being read to gain understanding (likely to involve ‘slow’ and deliberate behavior), or is the reader searching for some construct (likely to involve skimming)? Yes, slow and deliberate reading is more accurate, but people make cost/benefit decisions when deciding which strategies to use; the factors involved in reader strategy selection are another important question.
Next we need to ask what characteristics of developer performance are expected to change with different code organization/layouts. Are we interested in minimizing error, minimizing the time taken to achieve the reader’s purpose or something else?
What source code attributes play a significant role in readability? Possibilities include the order in which various constructs appear (e.g., should variable definitions appear at the start of a function or close to where they are first used), variable names and the position of tokens relative to each other when viewed by the reader.
Questions involving the relative position of tokens probably generate the greatest volume of discussion among developers. To what extent does visual organization of source code affect reader performance? Fluent reading requires a significant amount of practice; perhaps readable code is whatever developers have spent lots of time reading.
If there is some characteristic of the human visual system that generates a worthwhile benefit to splitting long lines so that a binary operator appears at the {end of the last}/{start of the next} line, will it apply the same way to all developers? We could end up with developers having to configure their editor so it displays code in a form that matches the characteristics of their visual system.
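For readers who have not met the line-splitting debate, here is a minimal illustration of the two layouts in question. The variable names and values are made up purely so the snippet runs; nothing here says which layout is easier to read, which is exactly the open question.

```python
# Hypothetical values, chosen only so the snippet runs.
base_price, tax, shipping, discount = 100.0, 8.0, 5.0, 10.0

# Layout A: each binary operator ends the line that is being split.
total_a = (base_price +
           tax +
           shipping -
           discount)

# Layout B: each binary operator starts the continuation line.
total_b = (base_price
           + tax
           + shipping
           - discount)

# Both layouts are the same expression to the compiler; only the
# position of the tokens seen by the reader differs.
print(total_a, total_b)
```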
How might these ‘visual’ questions be answered? I think that eye tracking will play a large role (“Eyetracking Web Usability” by Jakob Nielsen and Kara Pernice is a good read). At the moment there are technical/usability issues that make this kind of research very difficult. Eye trackers capable of continuously supporting enough resolution to know which character on the screen a developer is looking at (e.g., EyeLink 1000) require that the head be held in a fixed position, while those allowing completely free head movement (e.g., S2 Eye Tracker) don’t yet continuously support the required resolution.
Of course any theory derived from eye tracking experiments will still have to be validated by measuring developer performance on various code snippets.
You might enjoy the article here that uses Buse’s data to find that mean readability scores are related to the information content of the snippets themselves:
http://readability.softwareprocess.es
Thanks for the reference to an interesting paper which I have only had time to skim. Buse’s readability experiment asked subjects to rate the readability of code snippets without giving them an algorithm to use; presumably they had previously encountered the issue of readability during their course (they were students). The analysis of the experimental data is actually an analysis of the algorithms used by subjects during the experiment; these algorithms may or may not have some connection to what readability is (even in a supposedly simple experiment subjects can behave in ways that are completely different from what was expected).
Given short snippets of code and asked to rate them for readability, what kind of activities might subjects engage in? They are likely to carefully read the snippet, so we would expect a correlation with number of tokens and the ease with which information could be extracted from identifiers (at a recent workshop I suggested to Buse’s PhD supervisor that the snippets should be manually constructed, rather than extracted from real code, as this would allow things like variable names and number of tokens to be varied in a controlled way).
It would be interesting to rerun the Buse experiment to obtain timing information. Is there any correlation between subjects’ readability ratings and the time they spent reading a snippet and providing a rating?
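If such timing data were collected, a first look at the rating/time question could be as simple as a correlation coefficient. The sketch below computes a plain Pearson correlation by hand; the function name and the numbers are invented for illustration and are not from Buse’s experiment, and a rank-based measure may well be more appropriate for ordinal ratings.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up data: one entry per snippet, a readability rating (1-5) and
# the seconds a subject spent before giving that rating.
ratings = [5, 4, 4, 2, 1]
seconds = [12.0, 15.5, 14.0, 31.0, 40.5]

r = pearson(ratings, seconds)
print(f"rating/time correlation: {r:.2f}")
```

A negative value would mean that snippets rated as more readable were also rated more quickly; on its own that says nothing about why.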
I find it hard to believe that you can really test whether a given readability technique is effective and actually control for all the various confounding factors that come from asking developers whether something is readable or not – are we controlling for the relative experience/skill/opinions/biases of the people? Are the sample sizes big enough even so? Is the example code really representative of the use of a given tip/rule of thumb, or too small to be sure? Perhaps these things can be handled (I am not a statistician), and are handled in the studies, in which case please enlighten me 🙂
Of course you could say the same about any given readability tip: how on earth do you know it makes things more readable? Having said that, you can always contrive an example to demonstrate that things *can* become unreadable, e.g. go through a large program and rename all the variables to random 64-length strings – I think anybody would then agree that was unreadable, so you can set boundaries on the problem, even if they are quite wide.
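The renaming thought experiment is easy to try. Below is a rough sketch using Python’s standard `ast` module (it needs Python 3.9+ for `ast.unparse`); the helper name `scramble_names` and the sample function are hypothetical, and the sketch only renames parameters and assigned variables, ignoring things like keyword arguments at call sites, attributes and function names.

```python
import ast
import random
import string


def scramble_names(source, seed=0):
    """Rename parameters and assigned variables to random 64-letter names."""
    rng = random.Random(seed)
    tree = ast.parse(source)

    # Pass 1: collect names bound by parameters, assignments and loops,
    # so builtins such as len() are left alone.
    bound = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            bound.add(node.arg)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            bound.add(node.id)

    mapping = {
        name: "".join(rng.choices(string.ascii_lowercase, k=64))
        for name in sorted(bound)
    }

    # Pass 2: rewrite every occurrence of a collected name.
    for node in ast.walk(tree):
        if isinstance(node, ast.arg) and node.arg in mapping:
            node.arg = mapping[node.arg]
        elif isinstance(node, ast.Name) and node.id in mapping:
            node.id = mapping[node.id]

    return ast.unparse(tree), mapping


sample = """
def mean(values):
    total = 0
    for v in values:
        total += v
    return total / len(values)
"""

scrambled, mapping = scramble_names(sample)
print(scrambled)
```

The scrambled function still computes the same result, which is the point: behavior is unchanged while the information readers extract from identifiers is gone.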
There are several interesting things at play here; and one thing to note is how hard it is to define readability in papers in general – there are still new types of metrics coming out to define it. And that is perhaps the same kind of thing needed here – something like the Flesch scale for code and advising a general level target. Nothing will be perfect, but that kind of thing will certainly be necessary.
I do think that tools like GNU Indent that enable one to format the code to one’s liking and to an organizational standard will certainly be necessary, as some people are able to manage better with different spacing constructs than others. (For example, I prefer my block braces on separate lines and indented to denote the block; while many like the opening brace at the end of the line creating the block, and the ending brace on its own line indented or not.)
Regardless, I agree this is a field that needs a lot of research and study, and that work will help the software field mature into the rigorous discipline it needs to be.
@TemporalBeing, rather than trying to define what readability is, perhaps the effects it is thought to have should be investigated. Readability is generally claimed to influence developer performance; two possible performance effects are error reduction (exactly what kind of errors remains to be seen) and reduction in time to complete a task.
Experiments measuring developer performance for various kinds of ‘readable’ vs ‘nonreadable’ code will at least give us some idea of the percentage difference in performance levels involved. If the difference is small researchers can move on to another topic, leaving a few fanatics behind in a search for the Holy Grail.
Running different experiments using the same subjects would provide information on developer differences. When comparing X and Y ideally we would want most subjects to be better with one and worse with the other. If 50% of subjects improved with one and the other 50% improved with the other, there would have to be a significant benefit to justify dealing with major readability performance differences between developers.
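As a sketch of what the per-subject comparison might look like, the snippet below splits subjects by which variant they completed faster. The function name, the subject labels and the timings are all invented for illustration; a real analysis would also need to ask whether each difference is larger than a subject’s normal run-to-run variation.

```python
def split_by_faster_variant(times_x, times_y):
    """Group subjects by which code variant they completed faster.

    times_x and times_y map a subject id to task time in seconds.
    Only subjects measured on both variants are compared.
    """
    faster_x, faster_y, tied = [], [], []
    for subject in sorted(times_x.keys() & times_y.keys()):
        diff = times_x[subject] - times_y[subject]
        if diff < 0:
            faster_x.append(subject)
        elif diff > 0:
            faster_y.append(subject)
        else:
            tied.append(subject)
    return faster_x, faster_y, tied


# Made-up timings (seconds) for four hypothetical subjects.
times_x = {"s1": 40.0, "s2": 55.0, "s3": 38.0, "s4": 61.0}
times_y = {"s1": 52.0, "s2": 47.0, "s3": 49.0, "s4": 50.0}

faster_x, faster_y, tied = split_by_faster_variant(times_x, times_y)
print(f"faster with X: {faster_x}, faster with Y: {faster_y}, tied: {tied}")
```

With these invented numbers the subjects split evenly, which is the awkward 50/50 case described above.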