Home > Uncategorized > A model of fault experiences for a single functionality program

A model of fault experiences for a single functionality program

Some programs perform one basic task, e.g., analyse input to calculate some quantity. These programs often take a set of input values and produce some output.

If a program contains a single coding mistake m_1 and the probability of producing incorrect output, for one set of inputs, is p_1, the probability of the d‘th output being incorrect is: p_1*(1-p_1)^(d-1). This is a geometric distribution (an exponential distribution is a good enough approximation). The value d can be thought of as the distance between incorrect outputs, measured in the number of distinct inputs correctly processed.

If the program contains a second, different, coding mistake, m_2 whose probability of producing incorrect output, for a given input, is p_2, the probability of incorrect output after d inputs is now: (p_1+p_2)*[(1-p_1)*(1-p_2)]^(d-1). And so on for each distinct coding mistake.

The value of p_1 is driven by the likelihood that the input values cause the program’s flow of control to reach the code containing the coding mistake, and then for the execution of the coding mistake to produce a value that percolates through the executed code to produce an incorrect output value. It might be said that users cause faults by providing the necessary input values.

Each p_i in the expression (p_1+p_2+...+p_n)*[(1-p_1)*(1-p_2)*(...)*(1-p_n)]^(d-1) is created by the n distinct coding mistakes, m_n.

In practice the number of distinct coding mistakes, n, is unknown, and a single probability is assigned to incorrect output: q=p_1+p_2+...+p_n, giving: q*(1-q)^(d-1) (the substitution (1-q)=(1-p_1)*(1-p_2)*(...)*(1-p_n) is a good enough approximation because the p_i are very small).

How accurate is this model in practice?

The analysis below uses data from my top, must read, paper on software fault analysis and a recent study of LLM driven N-version programming.

The N-Version Programming with Coding Agents tested 1-million inputs/outputs, and 14 of the programs in the published results contained 421 incorrect outputs (data kindly provided by Javier Ron). Before discussing the same incorrect counts, does the ‘distance’ between incorrect outputs have an exponential distribution?

This question can be answered using an exponential Q-Q plot (values predicted by theory on the x-axis, actual values on the y-axis). The plot below shows the actual values (blue/green) and a fitted regression line (data from running the claude_code__m_claude-haiku-4.5__l_rust__run000 generated code; code+data):

Distance between incorrect outputs: theory (x-axis) and practice (y-axis), with fitted regression line.

If the data had an exponential distribution the exponent of a fitted power law would be exactly 1. However, the fitted exponent is 1.07, and is statistically significant (for the sample size and standard error of the fit).

The above analysis assumes that the probability of incorrect output, q, is constant. In practice q varies across different input values, but over very many inputs is assumed to be clustered around an unchanging average value. It is possible that the input values are divided into two clusters whose incorrect output probabilities are q_1 and q_2, i.e., a mixture of two exponentials. The worst case scenario is a mixture of umpteen exponentials.

Fitting the distance data to a mixture of two exponentials gives: 0.1e^{-0.002}+(1-0.1)e^{-0.00037} (code+data). This corresponds to 10% of the input having an average distance between incorrect output of 482, and 90% having a distance of 2,735 (average 2,482), i.e., one cluster of input values is a lot more likely to produce incorrect output than the other cluster. In the published data the average distance is 2,385.

The specification used for the N-version studies came from a study by Nagel and Skrivan who tracked down each of the coding mistakes in the programs tested. They found that around 80% of incorrect outputs were caused by the same coding mistake, and 16% by a second coding mistake (see figure 4.3.7.1-1; an analysis of the Knight & Leveson coding mistakes). Once these two coding mistakes were fixed, other coding mistakes caused incorrect output.

For the LLM generated code, 14 out of the 59 programs had 421 incorrect outputs. Did the LLMs all generate code containing the same coding mistake? I have yet to track down any of the coding mistakes, or get an LLM to do it for me. Once a few coding mistakes have been fixed, ‘allowing’ other coding mistakes to produce incorrect output, will partially correct LLM generated programs stop having correlated failures?

How much does the number of incorrect output change when a different set of random inputs are used?

The random seed used in the published results is: 42. I replicated this output for all 56 programs (it takes around 8.5 hours on my system), and then ran the 1-million inputs on just four programs using the seeds: 101, 20101, 321, 789, and 6000 (which each took around 1-hour). The table below shows the number of incorrect outputs for the various seeds (each numeric column corresponds to a different seed):

        Generation information                 Number of incorrect outputs
claude_code__m_claude-haiku-4.5__l_rust      421   432   374  404  424   452
codex__m_gpt-5.1-codex__l_pascal             421   431   372  402  424   451
claude_code__m_claude-sonnet-4.6__l_pascal  1222  1224  1214 1237 1207  1282
codex__m_gpt-5.4-mini__l_python            10052 10063 10181 9944 9963 10122

Two (claude_code__m_claude-haiku-4.5__l_rust and codex__m_gpt-5.1-codex__l_pascal) of the 14 programs that had the same number of incorrect outputs with seed 42, had slightly different number of incorrect outputs with other seeds. This suggests that their coding mistakes are slightly different.

All the generated programs had some deviation from the pure single exponential model discussed above.

This simple model for ‘distance’ between incorrect outputs is a good fit to reality because the program performs one basic function and is relatively short (around 500 lines). Larger programs supporting a selection of functionality are going to require much more complicated models.

  1. No comments yet.
  1. No trackbacks yet.