Explaining the decline of German comments in LibreOffice
The LibreOffice suite of office programs traces its origins back to StarWriter, first launched in 1985 by a German company. Given its German parentage it’s no surprise that the source code contains comments written in German.
There has been a push by the LibreOffice folks to convert the German comments to English comments. The plot below shows the number of German comments in the source of LibreOffice over time (release version numbers are at the top, and the red line is a least-squares fit of an exponential; code and data).
I am not only surprised to see such a regular decline in the number of German comments, but also that the decline is exponential.
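For readers who want to try an exponential fit on their own counts, here is a minimal sketch in Python. It fits log(count) against time by ordinary least squares, which is a simpler stand-in and not necessarily the method used for the plot. The function name `fit_exponential` and the sample numbers are made up for illustration; they are not the LibreOffice data.

```python
import math

def fit_exponential(times, counts):
    """Fit counts ~ a * exp(b * t) by least squares on log(counts).

    Returns (a, b). All counts must be positive.
    """
    logs = [math.log(c) for c in counts]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    # Slope and intercept of the straight line through (t, log(count)).
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_t)
    return a, b

# Synthetic example, NOT the LibreOffice data: a count that halves every year.
years = [0, 1, 2, 3, 4, 5]
comments = [40000 * 0.5 ** y for y in years]
a, b = fit_exponential(years, comments)
half_life = math.log(2) / -b  # time for the count to halve
```

A negative `b` means the count is declining, and `half_life` is expressed in whatever units `times` uses.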
The pattern of behavior may be driven by those doing the work:
- people may be motivated by the number of remaining German comments; as the number decreases, people may be less likely to think it is worthwhile converting what is left,
- perhaps those doing the conversion say to themselves: “I will do x% of the comments and then stop”. Having decided on this approach, there would have to be some form of signaling to other involved parties, otherwise the rate of decline would not be so smooth.
Perhaps the issue is the skill required to convert the comments:
- perhaps many comments are easy to convert, with the conversion process getting progressively harder (e.g., exponentially so), while those doing the conversion have roughly the same skill level,
- alternatively, the skill required to convert the comments is roughly the same, but the number of people doing the work who have a given skill level follows an exponential distribution.
I find it hard to believe any of these mechanisms. Suggestions for easier-to-believe mechanisms are welcome.
I’m inclined to agree that the skill-level items have an impact to some degree. There may be other factors at play, such as the location of comments. If there are particular “hotspot” places in the code that are frequently modified, it’s more likely that more people read the associated comments and understand the changing code. The comments may be more likely to be changed at these locations and, assuming most developers are aware of the new commenting style guide, more likely to be changed to English. As time goes by, only comments associated with infrequently modified code remain. The regularity of the decline is still curious.
Here’s a hypothesis: The comments in “core” sections of the code were quickly spotted and translated / replaced. Comments in side modules were re-written as people worked through them, but fewer people work with this code. Comments in the fringes are rarely seen; so nobody even thinks to translate them.
Due to their obscurity, a few comments will likely survive until deletion.
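To see what this kind of core / side / fringe split would do to the curve, here is a rough sketch in Python. It assumes each group loses a fixed fraction of its remaining German comments per time step; the group sizes and rates are hypothetical numbers picked only for illustration, and `expected_remaining` is an invented helper name.

```python
def expected_remaining(groups, n_steps):
    """Expected surviving comments per step for several groups of code.

    groups: list of (initial_count, p_touch) pairs, one per group, where
    p_touch is the assumed fraction of that group's comments rewritten
    in each step.
    """
    return [
        sum(count * (1 - p) ** step for count, p in groups)
        for step in range(n_steps + 1)
    ]

# Hypothetical numbers chosen only for illustration.
groups = [
    (10000, 0.30),  # "core": heavily edited, comments rewritten quickly
    (8000, 0.10),   # side modules: touched less often
    (2000, 0.01),   # fringes: rarely seen
]
totals = expected_remaining(groups, n_steps=30)
# Step-to-step retention; a single exponential would give a constant here.
retention = [later / earlier for earlier, later in zip(totals, totals[1:])]
```

With these made-up rates the retention starts near 0.81 and drifts upward toward the fringe group's 0.99, so the total bends on a log scale instead of following one straight line. Under this model, a very clean single exponential would suggest the rates across the code are not that different.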
Since German is a mostly intelligible subset of English, a few comments will survive “forever” because only a computer would flag them. ;P
Joking aside, the quality of the fit is exceptional. Did you try several fits against data on multiple projects, until one stood out? (i.e. this might happen by chance) Or did you have a hunch first and were merely surprised by how well it held? Are there other datasets that could reproduce this exponential behavior?
@D Herring
A script has been written to detect and list the German comments. I had assumed that lots of people were working to remove the comments. Perhaps this is not so.
I like the idea that comments are rewritten when the code containing them is changed. This would explain the consistent decline over time. The exponential decline would fit what is approximately known about how code changes over time.
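A small simulation makes the link between "rewritten when the code is changed" and an exponential concrete. This is only a sketch in Python under a strong assumption: in every time step each surviving German comment is rewritten independently with the same probability. The name `simulate_decline` and all parameter values are invented for illustration.

```python
import random

def simulate_decline(n_comments, p_touch, n_steps, seed=0):
    """Count surviving German comments after each time step.

    Assumption: in every step each surviving comment is independently
    rewritten with probability p_touch (its surrounding code was edited).
    """
    rng = random.Random(seed)
    remaining = n_comments
    history = [remaining]
    for _step in range(n_steps):
        # A comment survives this step with probability 1 - p_touch.
        remaining = sum(1 for _c in range(remaining) if rng.random() >= p_touch)
        history.append(remaining)
    return history

history = simulate_decline(n_comments=20000, p_touch=0.1, n_steps=20)
# Ratio of successive counts; under this model it hovers near 1 - p_touch.
ratios = [later / earlier for earlier, later in zip(history, history[1:])]
```

A roughly constant ratio between successive counts is exactly what an exponential decline looks like, so a constant per-comment chance of being touched is enough to produce one.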
This is the only project I know of that is changing the language of a subset of its comments.