The Shape of Code

Modeling the distribution of method sizes

October 12, 2025 Derek Jones

The number of lines of code in a method/function follows the same pattern in the three languages for which I have measurements: C, Java, Pharo (derived from Smalltalk-80).

The number of methods containing a given number of lines is a power law, with an exponent of 2.8 for C, 2.7 for Java and 2.6 for Pharo.

This behavior does not appear to be consistent with a simplistic model of method growth, in lines of code, based on the following three kinds of steps over a 2-D lattice: moving right with probability $R$, moving up and to the right with probability $U$, and moving down and to the right with probability $D$. The start of an if or for statement is an example of a coding construct that produces a $U$ step, followed by a $D$ step at the end of the statement; $R$ steps are any non-compound statement. The image below shows the distinct paths for a method containing four statements:

Number of distinct silhouettes for a function containing four statements
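The paths in this model can be enumerated by brute force. The following is a minimal sketch, assuming that a path is a sequence of R/U/D steps that starts at nesting level 0, never drops below it, and ends back at level 0; the function name `distinct_paths` is made up for illustration, and counting one lattice step per statement is an assumption that may differ from how the figure counts statements.

```python
from itertools import product

# Change in nesting level produced by each kind of step.
STEP_DELTA = {"R": 0, "U": 1, "D": -1}


def distinct_paths(n_steps):
    """Return every R/U/D sequence of n_steps steps that starts at level 0,
    never drops below level 0, and ends back at level 0."""
    paths = []
    for steps in product("RUD", repeat=n_steps):
        level = 0
        for step in steps:
            level += STEP_DELTA[step]
            if level < 0:
                break  # a D step with nothing open: not a valid path
        else:
            # Only reached when the walk never went below level 0.
            if level == 0:
                paths.append("".join(steps))
    return paths


if __name__ == "__main__":
    for path in distinct_paths(4):
        print(path)
```

Under these assumptions the path counts for 0, 1, 2, ... steps are the Motzkin numbers (1, 1, 2, 4, 9, ...).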

For this model, if $U < D$ the probability of returning to the origin after taking $n$ steps is a complicated expression with an exponentially decaying tail, and the case $U = D$ is a well studied problem in 1-D random walks (the probability of returning to the origin after taking $n$ steps is $P(n) \approx n^{-1.5}$).
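The $n^{-1.5}$ behavior can be checked numerically for the simplest case. The sketch below assumes a simple symmetric walk with no R steps (up or down with probability 1/2 each), for which the probability of first returning to the origin at step $2n$ is $\binom{2n}{n}/((2n-1)4^n)$; the function names are hypothetical, and the lattice model above, with its R steps, is not simulated here.

```python
import math


def log_first_return_prob(n):
    """Log of the probability that a simple symmetric walk first returns
    to the origin at step 2n: f(2n) = C(2n, n) / ((2n - 1) * 4**n)."""
    log_binom = math.lgamma(2 * n + 1) - 2 * math.lgamma(n + 1)
    return log_binom - n * math.log(4) - math.log(2 * n - 1)


def local_exponent(n):
    """Slope of log f against log n, estimated between n and 2n."""
    rise = log_first_return_prob(2 * n) - log_first_return_prob(n)
    return rise / math.log(2)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, local_exponent(n))
```

The estimated slope approaches -1.5 as $n$ grows, which is the exponent quoted for the $U = D$ case.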

Possible changes to this model to more closely align its behavior with source statement production include:

include terms for the correlation between statements, e.g., assigning to a local variable implies a later statement that reads from that variable,
include context terms in the up/down probabilities, e.g., nesting level.

Measuring statement correlation requires handling lots of special cases, while measurements of up/down steps are easily obtained.

How can $U$/$D$ probabilities be written such that step length has a power law with an exponent greater than two?

ChatGPT 5 told me that the Langevin equation and Fokker–Planck equation could be used to derive probabilities that produced a power law exponent greater than two. I had no idea how they might be used, so I asked ChatGPT, Grok, Deepseek and Kimi to suggest possible equations for the $U$/$D$ probabilities.

The physics model corresponding to this code-related problem involves the trajectories of particles at the bottom of a well, with the steepness of the wall varying with height. This model is widely studied in physics, where it is known as a potential well.

Reaching a possible solution involved refining the questions I asked, following suggestions that turned out to be hallucinations, and trying to work out what a realistic solution might look like.

One ChatGPT suggestion that initially looked promising used a Metropolis–Hastings approach, and a logarithmic potential well. However, it eventually dawned on me that $U \approx \left(\frac{y}{y+1}\right)^a$, where $y$ is nesting level, and $a$ some constant, is unlikely to be realistic (I expect the probability of stepping up to decrease with nesting level).

Kimi proposed a model based on what it called algebraic divergence:

$R(y)=\frac{r}{z(y)},\quad U(y)=\frac{u_0 y^{1-2/\alpha}}{z(y)},\quad D(y)=\frac{d_0 y^{1-2/\alpha}}{z(y)}$

where: $z(y)$ normalises the probabilities so that they sum to one, $z(y)=r+u_0 y^{1-2/\alpha}+d_0 y^{1-2/\alpha}$, $u_0$ is the up probability at nesting 0, $d_0$ is the down probability at nesting 0, and $\alpha$ is the desired power law exponent (e.g., 2.8).

For C, $\alpha=2.8$, giving $R(y)=\frac{r}{z(y)},\quad U(y)=\frac{u_0 y^{0.29}}{z(y)},\quad D(y)=\frac{d_0 y^{0.29}}{z(y)}$
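The formulas for $R(y)$, $U(y)$ and $D(y)$ can be evaluated directly. This is a minimal sketch; the function name `step_probabilities` and the values r=0.66, u0=0.15, d0=0.19 are assumptions chosen only for illustration, not fitted values. Levels are evaluated from $y = 1$ upward, because a literal reading of the formulas makes the power term zero at $y = 0$.

```python
def step_probabilities(y, r, u0, d0, alpha):
    """Return (R, U, D) at nesting level y, following the formulas above."""
    scale = y ** (1.0 - 2.0 / alpha)   # y^(1 - 2/alpha), about y^0.29 for alpha = 2.8
    z = r + u0 * scale + d0 * scale    # normalising factor z(y)
    return r / z, u0 * scale / z, d0 * scale / z


if __name__ == "__main__":
    # Hypothetical parameter values, with r + u0 + d0 = 1 and u0 < d0.
    r, u0, d0, alpha = 0.66, 0.15, 0.19, 2.8
    for y in range(1, 6):
        R, U, D = step_probabilities(y, r, u0, d0, alpha)
        print(f"y={y}  R={R:.3f}  U={U:.3f}  D={D:.3f}  sum={R + U + D:.3f}")
```

At $y = 1$ the scale factor is one, so the three probabilities reduce to $r$, $u_0$ and $d_0$.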

The average length of a method, in LOC, is given by:

$E[LOC]=\frac{\alpha r}{2(d_0-u_0)}+O(e^{\lambda}-1)$, where: $\lambda=\frac{2(d_0-u_0)}{d_0+u_0}$

For C, the mean function length is 26.4 lines, and the values of $r$, $u_0$, and $d_0$ need to be chosen subject to the constraint $r+u_0+d_0=1$.
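One way to pick values is to fix $r$ and solve the two constraints for $u_0$ and $d_0$. The sketch below assumes that the $O(e^{\lambda}-1)$ correction is ignored, so only the leading term $\alpha r/(2(d_0-u_0))$ is matched to the mean; the function names and the choice r=0.66 are hypothetical.

```python
def solve_up_down(r, alpha, mean_loc):
    """Return (u0, d0) such that r + u0 + d0 = 1 and the leading term
    alpha*r / (2*(d0 - u0)) equals mean_loc (correction term ignored)."""
    gap = alpha * r / (2.0 * mean_loc)   # required d0 - u0
    total = 1.0 - r                      # required d0 + u0
    if gap >= total:
        raise ValueError("no positive u0 satisfies both constraints")
    return (total - gap) / 2.0, (total + gap) / 2.0


def leading_mean(r, u0, d0, alpha):
    """Leading term of E[LOC]."""
    return alpha * r / (2.0 * (d0 - u0))


def lam(u0, d0):
    """The lambda appearing in the correction term."""
    return 2.0 * (d0 - u0) / (d0 + u0)


if __name__ == "__main__":
    r, alpha, mean_loc = 0.66, 2.8, 26.4   # r is an illustrative assumption
    u0, d0 = solve_up_down(r, alpha, mean_loc)
    print(u0, d0, leading_mean(r, u0, d0, alpha), lam(u0, d0))
```

With these inputs $d_0 - u_0 = 0.035$ and $d_0 + u_0 = 0.34$, so $u_0 = 0.1525$ and $d_0 = 0.1875$; since the correction term is dropped, these values are only a starting point.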

Combining the normalization factor $z(y)$ with the requirement $u_0 < d_0$ shows that as $y$ increases, $U(y)$ slowly decreases and $D(y)$ slowly increases.

One way to judge how closely a model matches reality is to use it to make predictions about behavior patterns that were not used to create the model. The behavior patterns used to build this model were: function/method length is a power law with exponent greater than 2. The mean length, $E[LOC]$, is a tuneable parameter.

Ideally a model works across many languages, but to start, given the ease of measuring C source (using Coccinelle), this one language will be the focus.

I need to think of measurable source code patterns that are not an immediate consequence of the power law pattern used to create the model. Suggestions welcome.

It’s possible that the impact of factors not included in this model (e.g., statement correlation) is large enough to hide any nesting-related patterns that are there. While different kinds of compound statements (e.g., if vs. for) may have different step probabilities, in C, and I suspect other languages, if-statement use dominates (Table 1713.1: if 16%, for 4.6%, while 2.1%, non-compound statements 66%).

