Home > Uncategorized > Analysis of when refactoring becomes cost-effective

Analysis of when refactoring becomes cost-effective

In a cost/benefit analysis of deciding when to refactor code, which variables are needed to calculate a good enough result?

This analysis compares the excess time-code of future work against the time-cost of refactoring the code. Refactoring is cost-effective when the reduction in future work time is less than the time spent refactoring. The analysis finds a relationship between work/refactoring time-costs and number of future coding sessions.

Linear, or supra-linear case

Let’s assume that the time needed to write new code grows at a linear, or supra-linear rate, as the amount of code increases (1 <= x):

C=B+k_1{L_c}^x

where: B is the base time for writing new code on a freshly refactored code base, L_c is the number of lines of code that have been written since the last refactoring, and k_1 and x are constants to be decided.

The total time spent writing code over n sessions is:

T=nB+k_1sum{i=1}{n}{(iL_i)^x}

If the same number of new lines is added in every coding session, L_s, and x is an integer constant, then the sum has a known closed form, e.g.:

x=1, sum{i=1}{n}{(nL_s)^1}={n(n+1)}/2L_s; x=2, sum{i=1}{n}{(nL_s)^2}={n(n+1)(2n+1)}/6{L_s}^2

Let’s assume that the time taken to refactor the code written after n sessions is:

R=k_2(nL_s)^y

where: k_2 and y are constants to be decided.

The reason for refactoring is to reduce the time-cost of subsequent work; if there are no subsequent coding sessions, there is no economic reason to refactor the code. If we assume that after refactoring, the time taken to write new code is reduced to the base cost, B, and that we believe that coding will continue at the same rate for at least another f sessions, then refactoring existing code after n sessions is cost-effective when:

k_2(nL_s)^y < k_1sum{i=n+1}{n+f}{(iL_s)^x}

assuming that f is much smaller than n, setting y=x+c, and rearranging we get:

k_2/k_1 < {L_s}^x/{{L_s}^x{L_s}^c}fn^x/{{n^x}n^c}

after rearranging we obtain a lower limit on the number of future coding sessions, f, that must be completed for refactoring to be cost-effective after session n::

k_2/k_1 {L_s}^c n^c< f

It is expected that k_1 < k_2; the contribution of code size, at the end of every session, in the calculation of C and R is equal (i.e., {L_c}^x=(nL_s)^y), and the overhead of adding new code is very unlikely to be less than refactoring all the newly written code.

With 1 < k_2/k_1, c must be close to zero; otherwise, the likely relatively large value of L_s (e.g., 100+) would produce surprisingly high values of f.

Sublinear case

What if the time overhead of writing new code grows at a sublinear rate, as the amount of code increases?

Various attributes have been found to strongly correlate with the log of lines of code. In this case, the expressions for C and R become:

C=B+k_1 log{L_c}
R=k_2 log(nL_s)

and the cost/benefit relationship becomes:

k_2 log(nL_s) < k_1sum{i=n+1}{n+f}{log(iL_s)}

applying Stirling’s approximation and simplifying (see Exact equations for sums at end of post for details) we get:

k_2(log{n} +log{L_s}) < k_1(f(log(n+f)-1)+f log{L_s})

{k_2}/{k_1} {log{n} +log{L_s}}/{log(n+f)+log{L_s}-1} < f

applying the series expansion (for 1<x): x/{x-1} right 1+1/x+1/{x^2}+1/{x^3}..., we get

{k_2}/{k_1} (1+1/{log{n} +log{L_s}}) < f

Discussion

What does this analysis of the cost/benefit relationship show that was not obvious (i.e., the relationship {k_2}/{k_1} < f is obviously true)?

What the analysis shows is that when real-world values are plugged into the full equations, all but two factors have a relatively small impact on the result.

A factor not included in the analysis is that source code has a half-life (i.e., code is deleted during development), and the amount of code existing after n sessions is likely to be less than the nL_s used in the analysis (see Agile analysis).

As a project nears completion, the likelihood of there being f more coding sessions decreases; there is also the every present possibility that the project is shutdown.

The values of k_2 and k_1 encode information on the skill of the developer, the difficulty of writing code in the application domain, and other factors.

Exact equations for sums

The equations for the exact sums, for x=1,2,3,0.5, are:

sum{i=n+1}{n+f}{i^1}=f/2(2n+f+1)
sum{i=n+1}{n+f}{i^2}=f/6(6n^2+6n+2f^2+f(6n+3)+1)
sum{i=n+1}{n+f}{i^3}=f/4(2n+f+1)(2n(n+1)+2fn+f+f^2)
sum{i=n+1}{n+f}{sqrt{i}}=zeta(-0.5,n+1)-zeta(-0.5, f+n+1), where zeta is the Hurwitz zeta function.

Sum of a log series: sum{i=n+1}{n+f}{log{iL_s}}=log{{(n+f)!}/{n!}}+f log{L_s}
using Stirling’s approximation we get
log{((n+f)!)}-log(n!) approx (n+f-0.5)log(n+f)-(n+f)-((n-0.5)log n-n)
simplifying
log{((n+f)!)}-log(n!) approx (n-0.5)log(1+f/n)+f log(n+f)-f
and assuming that f is much smaller than n gives
log{((n+f)!)}-log(n!) approx f(log(n+f)-1)

  1. No comments yet.
  1. No trackbacks yet.