Ordinary Least Squares is dead to me
Most books that discuss regression modeling start out, and often finish, with Ordinary Least Squares (OLS) as the technique to use; Generalized Linear Models (GLM) sometimes get a mention near the back. This is all well and good if the readers’ data has the characteristics required for OLS to be an applicable technique. A lot of data in the social sciences has these characteristics, or so I’m told; lots of statistics books are written for social science students, as a visit to a bookshop will confirm.
Software engineering datasets often range over several orders of magnitude or involve low-valued count data, not the kind of data that is ideally suited to analysis using OLS. For this kind of data GLM is probably the correct technique to use (the difference in the curves fitted by the two techniques is often small enough to be ignored for many practical problems, but the confidence bounds and p-values often differ in important ways).
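The following small simulation (made-up count data, chosen purely for illustration) shows the pattern: the fitted trends from lm and glm come out broadly similar, but the inferential output differs.

# Simulated low-valued count data (illustrative only)
set.seed(42)
x <- runif(200, 0, 4)
y <- rpois(200, lambda = exp(0.2 + 0.4 * x))   # counts generated from a Poisson model

fit_ols <- lm(y ~ x)                        # OLS fit
fit_glm <- glm(y ~ x, family = poisson)     # GLM fit (log link)

# The fitted values are broadly similar over this range...
head(cbind(ols = fitted(fit_ols), glm = fitted(fit_glm)))
# ...but the standard errors and p-values differ (note the glm
# coefficients are on the log scale, so not directly comparable)
summary(fit_ols)$coefficients
summary(fit_glm)$coefficients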
The target audience for my book, Empirical Software Engineering with R, is working software developers who have better things to do than learn lots of statistics. However, there is no getting away from the fact that I am going to have to make extensive use of GLM, which means having to teach readers about the differences between OLS and GLM and under what circumstances OLS is applicable. What a pain.
Then I had a brainwave, or a moment of madness (time will tell). Why bother covering OLS? Why not tell readers to always use GLM, or rather use the R function that implements it, glm. The default glm behavior is equivalent to lm (the R function that implements OLS); the calculation is not being done by hand but by a computer (i.e., who cares if it is more complicated).
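A minimal sketch of this equivalence, on made-up data (glm defaults to family=gaussian):

set.seed(1)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100)                      # simulated linear data
coef(lm(y ~ x))                                  # OLS
coef(glm(y ~ x))                                 # default family=gaussian
all.equal(coef(lm(y ~ x)), coef(glm(y ~ x)))     # TRUE: identical coefficients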
Perhaps there is an easy way to explain this to software developers: glm is the generic template that can handle everything, and lm is a specialized template that is tuned to handle certain kinds of data (the exact technical term will need tweaking for different languages).
There is one user interface issue: models built using glm do not come with an easy to understand goodness of fit number (lm has the R-squared value). AIC is good for comparing models, but as a single (unbounded) number it is not that helpful for the uninitiated. Will the demand for R-squared be such that I will be forced to tell readers about lm? We will see.
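One possible workaround (my suggestion, nothing official): report the proportion of deviance explained, a pseudo-R-squared that reduces to R-squared in the gaussian case.

set.seed(1)
x <- 1:40
y <- rpois(40, lambda = exp(0.1 * x))    # made-up count data
fit <- glm(y ~ x, family = poisson)
1 - fit$deviance / fit$null.deviance     # proportion of deviance explained
AIC(fit)                                 # fine for comparing models, opaque on its own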
How do I explain the fact that so many statistics books concentrate on OLS and often don’t mention GLM? Hey, they are for social scientists; software engineering data requires more sophisticated techniques. I will have to be careful with this answer as it plays on software engineers’ somewhat jaded view of social scientists (some of whom have made very major contributions to CRAN).
All the software engineering data I have seen is small enough that the performance difference between glm/lm is not a problem. If performance is a real issue then readers will search the net and find out about lm; sorry guys, but I want to minimise what the majority of readers need to know.
Short comment: when I first read GLS, I thought of Generalized Least Squares, which is also linked to in the above article. It took me some time to understand that you are talking about GLM, i.e., generalized linear models, which at least for me is something different. It might be worthwhile to make the distinction clearer, but maybe we have been “raised” with different terms for different things.
@Lama
Fixed. Thanks for pointing out my ‘least squares’ fixation.
Or you could feed your fixation by calling them IRLS (iterated re-weighted least squares) models since that’s how they’re usually fitted. For an extra bonus, OLS is a special case of IRLS with fixed weights and one iteration. Now, that would be silly, but no sillier than calling a linear model OLS.
By the way, ch 3 of Angrist and Pischke (2009) http://www.mostlyharmlesseconometrics.com/ argues that ‘OLS’ is in any case always ‘applicable’.
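For the curious, a minimal sketch of the IRLS loop mentioned above, for a Poisson GLM with log link (hand-rolled illustration on made-up data; glm() does all of this internally):

irls_poisson <- function(X, y, iterations = 25)
   {
   beta <- rep(0, ncol(X))               # start from zero coefficients
   for (i in 1:iterations)
      {
      eta <- as.vector(X %*% beta)       # linear predictor
      mu  <- exp(eta)                    # inverse of the log link
      w   <- mu                          # working weights (Poisson variance = mean)
      z   <- eta + (y - mu) / mu         # working response
      # one weighted least squares step; with a gaussian family and identity
      # link the weights are constant and a single step reproduces OLS
      beta <- solve(t(X) %*% (w * X), t(X) %*% (w * z))
      }
   as.vector(beta)
   }

set.seed(1)
x <- runif(50)
y <- rpois(50, lambda = exp(1 + 2 * x))
irls_poisson(cbind(1, x), y)             # close to coef(glm(y ~ x, family=poisson))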
http://en.wikipedia.org/wiki/Generalized_least_squares
As already mentioned above, GLS is not GLM.
I completely agree that GLM is superior; however, for large data sets the IWLS approximation algorithm can be overkill for simple problems.
In contrast, LM has a nice analytic solution (sketched below) and should be used whenever appropriate.
I recently started working with Excel/VBA. Work, right? I needed some improvements on the LINEST function and wrote a nice LM function in an hour giving all I need. In contrast, I tried implementing IWLS and gave up, as I am not too keen on reading books on that and it would consume a lot of time. I also tried copying code from R, but I ended up at Fortran code.
I agree with you that GLM is the more general case, and I admire binomial/logit models that are simple (they were developed ages ago) yet powerful. What I disagree with is that somebody should be taught GLM without understanding LM.
Also, the distributions of GLM estimates are asymptotic, which means they hold only for a large number of observations, while LM has nice t-distributions that work for a small number of observations.
Regards
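The closed-form solution mentioned in the comment above, sketched in R on made-up data (lm() itself uses a QR decomposition rather than the normal equations):

set.seed(1)
x <- runif(30)
y <- 1 + 2 * x + rnorm(30)
X <- cbind(1, x)                         # design matrix with an intercept column
beta <- solve(t(X) %*% X, t(X) %*% y)    # normal equations: (X'X) beta = X'y
beta
coef(lm(y ~ x))                          # same values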
@Dominykas Grigonis
I suspect that size will not be an important issue for software engineering data for a few years. Readers are more likely to encounter size issues when they try to apply techniques from the book to non-software engineering domains, e.g., the application they are working on.
Would there be so much discussion about LM if it did not have the history it has? Surely it is better to teach one technique that will work everywhere (ok, not mixed models, time series, …) and leave learning about other techniques to the small percentage of people who require them.
@Will
Are Angrist and Pischke really saying “always ‘applicable’” about econometrics datasets?
The nice feature of OLS is that the estimator for the coefficient vector does not depend on assumptions about the variance. If you fit a probit/logit model (or in general a non-linear model), you often need to make such an assumption, and if that one happens to be wrong, it is difficult to know if your “beta” still means what you think it means. The same applies to GLS: you need an estimate for your variance-covariance matrix. If your estimate for this matrix is wrong, you may not know how your coefficients of interests are affected. These issues are very important when you talk about glms, and while there are ways to address them, a failure of these assumptions or a failure to incorporate them appropriately might crucially affect your parameter estimates (in ways you might not know). This is not an issue in the linear case, where a failure of homoskedasticity assumptions merely affects inferential statements, but not the estimated coefficients themselves. I am not sure how relevant heteroskedasticity issues are for engineering, but for social scientists, it definitely and almost always is an issue, which is why many people prefer OLS (at least when the outcome is continuous, and some even when it is not).
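A toy simulation of the point about heteroskedasticity (illustrative data, not from any real study): the OLS slope estimate stays unbiased, but its true sampling variability is not what the usual constant-variance standard error reports.

set.seed(42)
n <- 200
slopes <- replicate(1000,
   {
   x <- runif(n, 1, 10)
   y <- 2 + 3 * x + rnorm(n, sd = x)     # error spread grows with x
   coef(lm(y ~ x))["x"]
   })
mean(slopes)   # close to the true slope of 3
sd(slopes)     # the real sampling variability, which the naive
               # constant-variance standard error misstates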
Derek,
You might enjoy the following reference from Stephen Stigler, who argues that “Least squares is still and will remain King” due to its seemingly “magical” properties. Don’t count it out yet! https://files.nyu.edu/ts43/public/research/Stigler.pdf
@Lama
Most software engineers have a long way to go before they start to be interested in things like covariance matrices. I am looking to upgrade people to use constructs like:
glm(..., family=gaussian(link="log"))
and
glm(..., family=poisson).
I’m still not sure what to say about non-linear model building, mainly because I have not had to use it much (so-called power-law relationships seem to be the main potential use; most don’t stand up).
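A hypothetical example of the two constructs side by side: both model a multiplicative trend, but they make different variance assumptions.

set.seed(1)
x <- 1:50
y <- rpois(50, lambda = exp(1 + 0.05 * x))   # made-up count data with a multiplicative trend
# constant variance, log link (start values supplied because y may contain zeros)
fit_gauss <- glm(y ~ x, family = gaussian(link = "log"), start = c(1, 0.05))
# variance equal to the mean, log link
fit_pois <- glm(y ~ x, family = poisson)
coef(fit_gauss)
coef(fit_pois)                               # similar trends, different standard errors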
Nope. They’re saying they’re always applicable. Seriously, read the chapter. Also, there’s no such thing as an econometrics dataset.
Everyone above has commented on this, but I’ll try to summarize.
To avoid confounding, it would be nice to make a clear distinction between the following three ideas:
1) The Model: e.g., Generalized Linear Model, Normal Linear Model
2) The Estimator: e.g., MLE, Generalized Least Squares, OLS
3) The Estimand: e.g., coefficients of a model, response to be predicted
Rod Little has championed this useful classification quite extensively.
@Mike
Thank you for this very useful summary. I am guilty of confounding these distinct issues into lm vs glm. Some of the commenters have concentrated on this confounding rather than my central point, which is that it is more efficient to teach one solution that can be widely used than two solutions where one may be fast/(lovely maths)/misleading (at least for the analysis of software engineering data). Yes, the attention-getting title did not help (I am prone to these).