Data analysis with a manual mindset
A lot of software engineering data continues to be analysed using techniques designed for manual implementation (i.e., executed without a computer). Yes, these days computers are being used to do the calculation, but they are being used to replicate the manual steps.
Statistical techniques are often available that are more powerful than the ‘manual’ techniques. They were not used during the manual era because they were too computationally expensive to perform by hand, or had not yet been invented; the bootstrap springs to mind.
What is the advantage of these needs-a-computer techniques?
The main advantage is not requiring that the data have a Normal distribution. While data having a Normal, or Normal-like, distribution is common in the social sciences (a big consumer of statistical analysis), it is less common in software engineering. Software engineering data is often skewed (at least the data I have analysed) and what appear to be outliers are common.
It seems like every empirical paper I read uses a Mann-Whitney test or Wilcoxon signed-rank test to compare two samples; sometimes this is preceded by a statement that the data is close to Normal, more often the paper is silent on the topic, and occasionally some effort is put into showing the data is Normal, or removing outliers to bring it closer to a Normal distribution.
Why not use a bootstrap technique and not have to bother about what distribution the data has?
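As an illustration, here is a minimal sketch in Python (using numpy) of one common bootstrap approach: a percentile confidence interval for the difference in means of two samples, with no assumption about the shape of their distributions. The sample data, sizes, and number of resamples are made up for the example.

```python
# Minimal sketch: bootstrap confidence interval for the difference in means
# of two samples, without assuming either sample is Normally distributed.
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical skewed samples, e.g., task completion times in minutes.
sample_a = rng.lognormal(mean=3.0, sigma=0.8, size=50)
sample_b = rng.lognormal(mean=3.2, sigma=0.8, size=45)

n_boot = 10_000
diffs = np.empty(n_boot)
for i in range(n_boot):
    # Resample each group with replacement and record the difference in means.
    boot_a = rng.choice(sample_a, size=len(sample_a), replace=True)
    boot_b = rng.choice(sample_b, size=len(sample_b), replace=True)
    diffs[i] = boot_b.mean() - boot_a.mean()

# Percentile 95% confidence interval for the difference in means.
low, high = np.percentile(diffs, [2.5, 97.5])
print(f"observed difference in means: {sample_b.mean() - sample_a.mean():.2f}")
print(f"95% bootstrap CI: [{low:.2f}, {high:.2f}]")
# If the interval excludes zero, there is evidence of a difference between
# the two samples, with no distributional assumptions required.
```

The same resampling loop works for medians or any other statistic; just swap the summary function, which is something the rank-based manual tests cannot offer.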
I’m not sure whether the reason is lack of knowledge about the bootstrap, or lack of confidence in not following the herd (i.e., what will everybody say if my paper does not use the techniques that everybody else uses?).
If you are living on a desert island and don’t have a computer, then you will want to use the manual techniques. But then you probably won’t be interested in analysing software engineering data.