April 14, 2024 Derek Jones No comments

Are programs written in some programming language shorter/longer, on average, than when written in other languages?

There is a lot of variation in the length of the same program written in the same language, across different developers. Comparing program length across different languages requires a large sample of programs, each implemented in different languages, and by many different developers. This sounds like a fantasy sample, given the rarity of finding the same specification implemented multiple times in the same language.

There is a possible alternative approach to answering this question: Compare the size of commits, in lines of code, for many different programs across a variety of languages. The paper: A Study of Bug Resolution Characteristics in Popular Programming Languages by Zhang, Li, Hao, Wang, Tang, Zhang, and Harman studied 3,232,937 commits across 585 projects and 10 programming languages (between 56 and 60 projects per language, with between 58,533 and 474,497 commits per language).

The data on each commit includes: lines added, lines deleted, files changed, language, project, type of commit, lines of code in project (at some point in time). The paper investigate bug resolution characteristics, but does not include any data on number of people available to fix reported issues; I focused on all lines added/deleted.

Different projects (programs) will have different characteristics. For instance, a smaller program provides more scope for adding lots of new functionality, and a larger program contains more code that can be deleted. Some projects/developers commit every change (i.e., many small commit), while others only commit when the change is completed (i.e., larger commits). There may also be algorithmic characteristics that affect the quantity of code written, e.g., availability of libraries or need for detailed bit twiddling.

It is not possible to include project-id directly in the model, because each project is written in a different language, i.e., language can be predicted from project-id. However, program size can be included as a continuous variable (only one LOC value is available, which is not ideal).

The following R code fits a basic model (the number of lines added/deleted is count data and usually small, so a Poisson distribution is assumed; given the wide range of commit sizes, quantile regression may be a better approach):

alang_mod=glm(additions ~ language+log(LOC), data=lc, family="poisson")
 
dlang_mod=glm(deletions ~ language+log(LOC), data=lc, family="poisson")

Some of the commits involve tens of thousands of lines (see plot below). This sounds rather extreme. So two sets of models are fitted, one with the original data and the other only including commits with additions/deletions containing less than 10,000 lines.

These models fit the mean number of lines added/deleted over all projects written in a particular language, and the models are multiplicative. As expected, the variance explained by these two factors is small, at around 5%. The two models fitted are (code+data):

$meanLinesAdded=78*language*LOC^{0.11}$ or $meanLinesAdded=17*language*LOC^{0.13}$ , and $meanLinesDeleted=57*language*LOC^{0.09}$ or $meanLinesDeleted=8*language*LOC^{0.15}$ , where the value of is listed in the following table, and LOC is the number of lines of code in the project:

                    Original          0 < lines < 10000
    Language     Added     Deleted     Added   Deleted
    C              1.0       1.0         1.0     1.0
    C#             1.7       1.6         1.5     1.5
    C++            1.9       2.1         1.3     1.4
    Go             1.4       1.2         1.3     1.2
    Java           0.9       1.0         1.5     1.5
    Javascript     1.1       1.1         1.3     1.6
    Objective-C    1.2       1.4         2.0     2.4
    PHP            2.5       2.6         1.7     1.9
    Python         0.7       0.7         0.8     0.8
    Ruby           0.3       0.3         0.7     0.7

These fitted models suggest that commit addition/deletion both increase as project size increases, by around $LOC^{0.1}$ , and that, for instance, a commit in Go adds 1.4 times as many lines as C, and delete 1.2 as many lines (averaged over all commits). Comparing adds/deletes for the same language: on average, a Go commit adds $78*1.4=109.2*LOC^{0.11}$ lines, and deletes $57*1.2=68.4*LOC^{0.09}$ lines.

There is a strong connection between the number of lines added/deleted in each commit. The plot below shows the lines added/deleted by each commit, with the red line showing a fitted regression model $deleted approx added^{0.82}$ (code+data):

Number of lines added/deleted by each of 3 million commits, with fitted regression line.

What other information can be included in a model? It is possible that project specific behavior(s) create a correlation between the size of commits; the algorithm used to fit this model assumes zero correlation. The glmer function, in the R package lme4, can take account of correlation between commits. The model component (language | project) in the following code adds project as a random effect on the language variable:

del_lmod=glmer(deletions ~ language+log(LOC)+(language | project), data=lc_loc, family=poisson)

It takes around 24hr of cpu time to fit this model, which means I have not done much experimentation…

Categories: Uncategorized Tags: code size, commit, data analysis, LOC

The Shape of Code

Archive

Average lines added/deleted by commits across languages

Recent Posts

Recent Comments

Archives

Meta