Impact of developer uncertainty on estimating probabilities
For over 50 years, it has been known that people tend to overestimate the likelihood of uncommon events/items occurring, and underestimate the likelihood of common events/items. This behavior has been replicated in many experiments and is sometimes listed as a so-called cognitive bias.
Cognitive bias has become the term used to describe situations where the human response to a problem (in an experiment) fails to match the response produced by the mathematical model that researchers believe produces the best output for this kind of problem. The possibility that these mathematical models do not reflect the reality of the contexts in which people actually have to solve such problems (outside of psychology experiments) goes against the grain of the idealised world in which many researchers work.
When models take into account the messiness of the real world, the responses are a closer match to the patterns seen in human responses, without requiring any biases.
The 2014 paper Surprisingly Rational: Probability theory plus noise explains biases in judgment by F. Costello and P. Watts (shorter paper) showed that including noise in a probability estimation model produces behavior that follows the human behavior patterns seen in practice.
If a developer is asked to estimate the probability that a particular event, E, occurs, they may not have all the information needed to make an accurate estimate. They may fail to take into account some Es, and incorrectly include other kinds of events as being Es. This noise, d, introduces a pattern into the developer estimate:

P_est(E) = (1-2d)*P(E) + d

where: P_est(E) is the developer's estimated probability of event E occurring, P(E) is the actual probability of the event, and d is the probability that noise produces an incorrect classification of an event as E or not-E (for simplicity, the impact of noise is assumed to be the same in both cases).
The plot below shows actual event probability against developer estimated probability for various values of d, with a red line showing that at d=0, the developer estimate matches reality (code):

The effect of noise is to increase probability estimates for events whose actual probability is less than 0.5, and to decrease the estimate when the actual probability is greater than 0.5. All estimates move towards 0.5.
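As a concrete sketch of this single-event noise model (the formula is from the Costello & Watts paper; the function and variable names are mine):

```python
# Noise model from Costello & Watts: P_est = (1 - 2d)*P_actual + d,
# where d is the probability of misclassifying an event.
def estimated_probability(p_actual, d):
    """Developer's estimated probability under classification noise d."""
    return (1 - 2*d)*p_actual + d

# Noise moves every estimate towards 0.5:
print(round(estimated_probability(0.1, 0.2), 2))  # 0.26: rare event overestimated
print(round(estimated_probability(0.9, 0.2), 2))  # 0.74: common event underestimated
print(round(estimated_probability(0.5, 0.2), 2))  # 0.5: the fixed point
```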
What other estimation behaviors does this noise model predict?
If there are two events, say A and B, then the noise model (and probability theory) specifies that the following relationship holds:

P(A) + P(B) - P(A and B) - P(A or B) = 0

where: P denotes the probability of its argument.
The experimental results show that this relationship does hold, i.e., the noise model is consistent with the experiment results.
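A quick numeric check (a sketch; the transform P_est(X) = (1-2d)*P(X) + d is from the Costello & Watts model, the example probabilities are mine) shows why noise preserves this identity: the added d terms cancel.

```python
# The identity P(A) + P(B) - P(A and B) - P(A or B) = 0 survives the
# noise transform P_est(X) = (1-2d)*P(X) + d: the four +d/-d terms cancel,
# leaving (1-2d) times the original (zero-valued) sum.
def noisy(p, d):
    return (1 - 2*d)*p + d

p_a, p_b, p_ab = 0.6, 0.5, 0.3      # example values; A and B need not be independent
p_a_or_b = p_a + p_b - p_ab         # addition law gives 0.8
d = 0.15
total = noisy(p_a, d) + noisy(p_b, d) - noisy(p_ab, d) - noisy(p_a_or_b, d)
print(abs(total) < 1e-12)  # True
```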
This noise model can be used to explain the conjunction fallacy, i.e., Tversky & Kahneman's famous 1980s "Linda is a bank teller" experiment.
What predictions does the noise model make about the estimated probability of experiencing m (m > 1) occurrences of the event E in a sequence of n assorted events (the previous analysis deals with the case m=1)?
An estimation model that takes account of noise gives an equation connecting the estimated and actual probabilities, in which: P'_est is the developer's estimated probability of experiencing m Es in a sequence of length n, and P' is the actual probability of there being m Es.
The plot below shows actual m-event probability against developer estimated probability for various values of d, with a red line showing that at d=0, the developer estimate matches reality (code):

This predicted behavior, which is the opposite of the case where m=1, follows the same pattern seen in experiments, i.e., actual probabilities less than 0.5 are decreased (towards zero), while actual probabilities greater than 0.5 are increased (towards one).
There have been replications and further analysis of the predictions made by this model, along with alternative models that incorporate noise.
To summarise:
When estimating the probability of a single event/item occurring, noise/uncertainty will cause the estimated probability to be closer to 50/50 than the actual probability.
When estimating the probability of multiple events/items occurring, noise/uncertainty will cause the estimated probability to move towards the extremes, i.e., zero and one.
Modelling estimate/actual including uncertainty in the estimate
What is an effective technique for modelling the relationship between the time estimated to implement a task and the actual time taken to implement that task?
A regression model is the obvious approach. However, an important assumption made by the commonly used regression techniques is not met by estimate/actual project data.
The commonly used regression techniques involve two kinds of variables: the explanatory variable and the response variable (also known as the independent and dependent variables). For instance, in the equation y = a + b*x, x is the explanatory variable and y is the response variable.
When fitting a regression model to measurement data, the fitted equation is assumed to have a form such as: y = a + b*x + e_y, where e_y is the uncertainty in the value of y, with the x values assumed to have no uncertainty; a and b are constants fitted by the modelling process. The values returned by the model fitting process include an estimate for e_y, as well as estimates for a and b.
When running an experiment, the values of the explanatory variables (e.g., x) are chosen by the experimenter, with the subject providing the value of the response variable, e.g., y.
What does this technical detail have to do with estimation data?
The task estimate/actual values are both provided by the subject (i.e., the developer); there is no experimenter providing one of the values. In fact, there is no experiment: these are measurements of things that happened. Both the estimate and the actual are response variables, and both contain some amount of uncertainty, which the fitting process needs to take into account. The appropriate regression technique to use in this case is an errors-in-variables model, which fits the equation y = a + b*(x + e_x) + e_y, with e_x being the uncertainty in x.
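To illustrate why the choice of technique matters, here is a minimal simulation sketch (synthetic data; all values and names are mine, not from any dataset discussed here) showing how uncertainty in the explanatory variable attenuates an ordinary least-squares slope, and a simple correction when the error variance is assumed known:

```python
# Sketch: uncertainty in the explanatory variable biases ordinary
# least-squares towards zero; dividing by the reliability ratio corrects it.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
x_true = rng.normal(10, 3, n)               # true (unobserved) values
x_obs = x_true + rng.normal(0, 2, n)        # observed values contain uncertainty
y = 1.0 + 2.0*x_true + rng.normal(0, 1, n)  # true slope is 2.0

b_ols = np.polyfit(x_obs, y, 1)[0]          # attenuated slope, ~1.38
# Reliability ratio: var(x_true) / var(x_obs); here assumed known.
reliability = 3**2 / (3**2 + 2**2)
b_corrected = b_ols / reliability           # ~2.0
print(round(b_ols, 2), round(b_corrected, 2))
```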
A previous post discussed the surprising behavior that can occur when failing to use errors-in-variables regression when the data does not contain any explanatory variables, i.e., all the variables contain uncertainty.
The process of fitting an errors-in-variables regression model requires additional input: a value for e_x has to be specified. Taking the example of task estimation, possible uncertainties in the estimate include: misunderstanding of the requirement(s), faded memory of the actual time previously taken by very similar tasks, an inaccurate model of developer skills, and a preference for using round numbers.
What data is available on the uncertainty of individual task estimates? I know of one study where, unknown to them, the individuals estimated the same task twice (in fact, seven people each estimated the same six distinct tasks twice, over a period of three months). The plot below shows the first/second estimate made by each person for each of the six tasks, with the grey line showing where first==second estimate (code+data):

Assuming that the estimation uncertainty in this experiment's data is roughly equal to the estimation uncertainty present in other datasets of tasks taking up to 20 hours, how might it be used to calculate a value for the uncertainty in estimated values?
Two possibilities include:
- Assuming that the uncertainty in both the first and second estimates is equal, a model can be fitted using Deming regression (which treats both variables as having the same uncertainty), and the residual standard error of this model used as the value of e_x. This value for a fitted multiplicative model is 0.6 (code+data).
- Using the mean of the relative errors between the first and second estimates; its value is 0.55.
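Deming regression with equal error variances has a closed-form solution; the sketch below uses made-up estimate pairs (not the study's data) to show the calculation:

```python
# Deming regression assuming equal error variance in both variables
# (the orthogonal-regression special case, delta = 1).
import numpy as np

def deming_slope(x, y):
    """Slope when x and y have the same error variance."""
    sxx = np.var(x, ddof=1)
    syy = np.var(y, ddof=1)
    sxy = np.cov(x, y, ddof=1)[0, 1]
    return (syy - sxx + np.sqrt((syy - sxx)**2 + 4*sxy**2)) / (2*sxy)

first  = np.array([2.0, 4.0, 5.0, 8.0, 10.0, 16.0])  # hypothetical first estimates (hours)
second = np.array([2.5, 3.5, 6.0, 7.0, 12.0, 15.0])  # hypothetical second estimates
b = deming_slope(first, second)
a = np.mean(second) - b*np.mean(first)
print(round(b, 2), round(a, 2))  # slope ~0.97, intercept ~0.4
```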
How different are the models built using linear regression and errors-in-variables regression, for small task estimates?
A basic linear regression model was fitted to the SiP estimation dataset. Updating this model, using SIMEX, to take into account uncertainty in the value of the estimate gives one fitted model for an uncertainty error of 0.55, and another for an uncertainty error of 0.60. The coefficients of the two models are essentially the same (code+data).
The exponent value is the noticeable difference between the linear regression and errors-in-variables regression models. Adding the assumed amount of uncertainty (based on data from one experiment) to the estimated value leads to a model where estimate/actual are very close to having a linear relationship.
Is this errors-in-variables model any closer to reality than the linear regression model? The model shows that the estimate/actual relationship is closer to linear than was previously thought. Until more data becomes available, we won’t know how close this relationship actually is.
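The SIMEX procedure used above can be sketched as follows: add increasing amounts of extra noise to the explanatory variable, refit at each noise level, then extrapolate the fitted coefficient back to the zero-measurement-error case (synthetic data; the quadratic extrapolant is one common choice):

```python
# Rough sketch of the SIMEX idea: inflate the measurement error by a
# factor (1 + lambda), refit, and extrapolate back to lambda = -1,
# which corresponds to no measurement error at all.
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
sigma_u = 2.0                                   # assumed-known measurement error
x_true = rng.normal(10, 3, n)
x_obs = x_true + rng.normal(0, sigma_u, n)
y = 1.0 + 2.0*x_true + rng.normal(0, 1, n)      # true slope is 2.0

lambdas = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
slopes = []
for lam in lambdas:
    x_sim = x_obs + rng.normal(0, np.sqrt(lam)*sigma_u, n)
    slopes.append(np.polyfit(x_sim, y, 1)[0])

# Fit the slope as a quadratic in lambda, extrapolate to lambda = -1
coef = np.polyfit(lambdas, slopes, 2)
simex_slope = np.polyval(coef, -1.0)
# Noticeably closer to the true slope of 2.0 than the naive fit (~1.4)
print(round(simex_slope, 2))
```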
The people who made the estimates in the SiP data also performed the work that took the recorded actual time. Assigning a task to a different person could produce both a different estimate and a different actual, but these possible values are unknown. On a larger scale, different companies bidding on the same contract specify different amounts and have different implementation times; there is data showing these differences.