StatsModels: the first nail in R’s coffin
In 2012, when I decided to write a book on evidence-based software engineering, R was the obvious system to use for data analysis. At the time, lots of new books had “using R” or “with R” added at the end of their titles; I chose “using R”.
When developers tell me they need to do some statistical analysis, and ask whether they should use Python or R, I tell them to use Python if statistics is a small part of the program, otherwise use R.
If I started work on the book today, I would still choose R. If I were starting five years from now, I could be choosing Python.
To understand why I think Python will eventually take over the niche currently occupied by R, we need to understand the unique selling points of both systems.
R’s strengths are that it supports a way of thinking that is a good fit for doing data analysis and has an extensive collection of packages that simplify the task of applying a wide variety of analysis techniques to data.
Python also has packages supporting the commonly used data analysis techniques. But nearly all the Python packages provide a developer-mentality interface (i.e., they provide an API like any other package), whereas R provides data-analysis-mentality interfaces. R supports a way of thinking that data analysts can identify with.
Python’s strengths, over R, are a much larger base of developers and language support for writing large programs (R is really a scripting language). Yes, Python has a package ecosystem supporting the full spectrum of application domains, but this is not relevant for analysing a successful invasion of R’s niche market (although it is relevant for enticing new developers who are still making up their mind).
StatsModels is a Python package based around R’s data-analysis-mentality interface. When I discovered this package a few months ago, I realised the first nail had been hammered into R’s coffin.
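As a rough illustration of what a data-analysis-mentality interface looks like in Python, the sketch below fits a linear model using the StatsModels formula interface, which mirrors R’s lm(y ~ x, data=df). The data frame and its column names are made up for the example; nothing here comes from the book’s data.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data for illustration: y is roughly 1 + 2*x.
df = pd.DataFrame({
    "x": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "y": [1.1, 2.9, 5.1, 6.9, 9.1, 10.9, 13.1, 14.9, 17.1, 18.9],
})

# R equivalent: fit <- lm(y ~ x, data=df)
fit = smf.ols("y ~ x", data=df).fit()

# Coefficients are indexed by name ("Intercept" and "x"), much like coef(fit) in R.
print(fit.params)
```

The model is described by a formula string and a data frame, rather than by building a design matrix by hand, which is the part that feels familiar to an R user.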
Yes, today R has nearly all the best statistical analysis packages and a large chunk of the leading edge stuff. But packages can be reimplemented (C code can be copy-pasted, the R code mapped to Python); there is no magic involved. Leading edge has a short shelf life, and what proves to be useful can be duplicated; the market for leading edge code in a mature market (e.g., data analysis) is tiny.
A bunch of bright young academics looking to make a name for themselves will see the major trees in the R forest have been felled. The trees in the Python data-analysis-mentality forest are still standing; all it takes is a few people wanting to be known as the person who implemented the Python package that everybody uses for XYZ analysis.
A collection of packages supporting the commonly (and eventually not so commonly) used data analysis techniques, with a data-analysis-mentality interface, removes a major selling point for using R. Python is a bigger developer market with support for many other application domains.
The flow of developers starting out with R will slow down; casual R users will have nothing to lose from trying out another language when the right project comes along (another language on the CV looks good and Python is a bigger market). There will be groups where everybody uses R and will continue to use R because that is what everybody else in the group uses. Ten to twenty years from now, R developers could be working in a ghost town.
I fiddled around with Python for a while. Found it frustrating: too much like old-style programming, which I hate. As far as I know there isn’t anything comparable to RStudio for Python (is there?). Native documentation for R is pretty bad but (and this is just my opinion) it’s even worse for Python. I still have Python installed on my machine but I won’t go any further unless it can really be shown that ascending the learning curve, which is steep, is worth the time and effort.
@Douglas Skinner
Yes, at the moment data analysis in Python is traditional programming, but that appears to be changing (hence this post).
There are umpteen IDEs for Python; IPython/Jupyter probably has the most brand name recognition.
Python’s general documentation is not great, but it is better than R’s; R easily wins the ‘how to do statistics in’ documentation category.
It is still early days. I would wait a year or two before checking out Python again.
I use both R and Python. I think it is still too soon to tell which will come out on top. Other than having to remind myself whether I should use len() or length(), I am generally happy with both. Each language has its strengths and its share of problems. Each language has a user community that is generally helpful to new people. I especially like the #rstats twitter feed for keeping up to date.
The problem comes in support for the core. The recent resignation of Python’s originator and BDFL because of infighting over the latest release highlights the need for support for infrastructure, having codes of conduct, and a process to get big teams to agree without wanting to strangle each other…
@John R Minter
Language use is a winner-take-all game. What does R have that cannot be duplicated in Python (e.g., C provides the ability to get close to the metal, something that Python cannot duplicate)? Will R pivot to become a general purpose language (those involved have shown every sign of having little interest in things outside of data analysis)?
Why do languages evolve? I think it’s because people like to add their own stuff.
I think there will always be Python variants. R is unusual in not having any common variants; most languages have variant implementations (Perl is another example of a language that doesn’t).
These days language ecosystems are driven by packages. What features a language supports is becoming less and less relevant.
Will people really want algorithms re-implemented in new languages with all the risks of bugs etc?
What about a hybrid, calling R from Python and Python from R (reticulate package in R)?
@Nick F
Yes, developers love to reimplement in their favorite language, and yes, there will be bugs. The end user gets what they are given.
Supporting two intermixed runtime systems would be a nightmare, and is only done in practice when there is no other choice (e.g., source code is not available).
Well … I think it happens sooner than five years, and it’s not Python. It’s TensorFlow called from any language, just like there are Jupyter kernels for every language.
R was always a niche language – I’ve been writing R since 0.90 and I *still* have to do bash coding occasionally. I’ve never learned Python and I probably haven’t written Perl in five years, but TBH, without the RStudio / R Markdown ecosystem I would have abandoned R about five years ago.
What I found interesting is that StatsModels groups all the different regression/classification models in one big package (neatly divided into subpackages) and the html documentation is very clear, with a nice TOC linking to all those different types of models. So you can start exploring different models that you might not even have known about beforehand. In R you already have to know all the types of models and know in which package they live (usually with a very bad, non-descriptive name) and then go through the vignettes, which usually aren’t a big help either. To know which types of models there are you have to go through a bunch of stats books or blog posts. All this means that this is very scattered. In Python it’s all neatly centralized and documented. That’s a huge win over R!
R could benefit from a statsModels package that groups all the different packages and with clear online documentation just like Hadley is doing with the tidyverse and his online docs/books.
@Steven Sagaert
StatsModels has the benefit of learning from the R experience.
R has been remarkably stable, compared to most other language systems. This is good from the point of view of ‘old’ code continuing to work, but lack of evolution does leave things stuck in the past (I tell people that plot output is produced by a time-machine from the 1970s). A system designed and implemented by statisticians has produced a great user interface for their needs, but ignoring the infrastructure for so long is far from good.
As an R package developer, what I find frustrating with Python is that if I have C code behind my R package and want to develop a front end to that C code for Python and package it up, I have to create it for a specific Python+C installation. Then only other users with that specific Python+C installation can use the package. The huge benefit that R has for it is the ease of package development and distribution (huge thanks to the CRAN maintainers!). Python has a good start with PyPy but wanting to integrate low-level code in your package and distribute it is a nightmare.
@Rebecca K
A recent attempt to add the stable distribution to SciPy taught me how awkward it is to add C-based code to a Python package.
The advantage of R’s single implementation is that there is one way of doing things. The advantage of Python’s multiple implementations is that competition improves the ecosystem and provides choice.
There are downsides to both approaches, e.g., your problem with Python using C in packages.
Statsmodels can barely do a t test. I wrote a video course describing statistics with Python and was so surprised at how little statsmodels can do.
Even the big packages like Scikit-Learn are getting detractors. Academics tend to write the R packages so the packages have that extra credibility. Sometimes you find things in Scikit-Learn and no one knows what the theoretical basis for it is, if there even is one.
@Curtis Miller
Yes, the state of statistical analysis in the current Python universe is rather lamentable. StatsModels is a step in the right direction.
This is why I would recommend R today. But in five years’ time? All it takes is a handful of knowledgeable people who are willing to put in the effort.
Reimplement is too kind. Python just COPIES from R. Yes, I said COPY. From Pandas to Numpy to Matplotlib, there is not one single function that Python came up with. When you read Pandas’ doc, it justifies itself by saying that’s how R did it.
You can’t fine Python for infringement and copyright, since R is open source. How soon Python will take over R depends on how fast Python will copy R — it’s just a matter of time.
Note: the weirdest captcha box you have, it’s more like a gotcha box.
@Mike
Your complaint about Python copying R echoes the complaints that proprietary vendors make about open source copying them (e.g., like when R copied from S). The R designers have come up with some great solutions, they should be copied.
The weird captcha is a necessary annoyance to stop spammers (who easily crack the simpler ones).
As a jobbing academic, the number one reason for me to use R instead of Python is that in RStudio I have a very easy to install and portable environment that the entire lab can install themselves. Python installations are still a pain. My computers are a mess with slightly different Python installations. Anaconda is a step forward for sure but does not do the opencv/dlib combo yet AFAIK. It’s all very fragile feeling.
@Mike and Derek
A correction to the copying comment.
statsmodels, pandas and the rest of the data analysis Python stack, and most of the scientific Python stack, are BSD/MIT licensed, which does NOT allow copying code from GPL-licensed packages. So code is **reimplemented** from documentation and examples to have a similar API and set of functionality, where R or other packages already have a good structure. In a similar approach, in the early days of the scientific Python stack with numpy, scipy and matplotlib, many functions were designed in similarity to MATLAB without translating any proprietary code.
There are a few exceptions when R package authors give explicit permission to translate code from their package into Python.
Overall, this slows down the development for statistics in Python (BSD) and Julia (mostly MIT) because reusing the code developed for R is NOT allowed and code needs to be implemented from scratch.
R is useful for documentation and unit tests because it is open source and widely available, but referring to R functions and using reference numbers in unit tests computed in R does not constitute copying.
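A minimal sketch of what using a reference number computed in R looks like in a unit test. The choice of statistic is made up for illustration and is not taken from the statsmodels test suite: R’s var(c(1, 2, 3, 4)) prints 1.666667 (the sample variance, 5/3), and the test only records that number.

```python
import math
import statistics

# Reference value as printed by R for var(c(1, 2, 3, 4)).
# Only the number is recorded; no R source code is involved.
R_REFERENCE_VAR = 1.666667


def test_variance_matches_r_reference():
    data = [1, 2, 3, 4]
    # statistics.variance is the sample variance (n - 1 denominator), like R's var().
    result = statistics.variance(data)
    # Tolerance reflects the seven significant digits R prints by default.
    assert math.isclose(result, R_REFERENCE_VAR, rel_tol=1e-6)


test_variance_matches_r_reference()
```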
Specific to statsmodels, as a developer and maintainer of it:
I am a big fan of the consistent interface to models in Stata, in contrast to R’s largely inconsistent collection of functions and packages; see Steven Sagaert’s comment above.
Besides Stata, I also regularly consult the documentation of SAS, SPSS, NCSS/PASS and specific packages like GPower+ to see what functions should be implemented in statsmodels and what options and interface might be appropriate.
Example: sandwich covariances in statsmodels are built into the models directly, and all further results, like Wald tests, use them. This is similar to Stata but different from R, which requires separate packages and explicit use of the created covariance matrix in any hypothesis tests.
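The sandwich covariance point can be sketched as follows, assuming simulated heteroskedastic data (the variable names and noise model are made up for illustration). The robust covariance is requested once, at fit time, and later hypothesis tests pick it up without the covariance matrix being passed around.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data whose noise spread grows with x (heteroskedastic).
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 * x)
df = pd.DataFrame({"x": x, "y": y})

model = smf.ols("y ~ x", data=df)
plain = model.fit()                  # classical (nonrobust) covariance
robust = model.fit(cov_type="HC3")   # sandwich covariance chosen at fit time

# Same coefficient estimates, different standard errors.
print(plain.bse["x"], robust.bse["x"])

# The test below uses the HC3 covariance stored in the results object.
print(robust.t_test("x = 2"))
```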
@Derek Jones
When I had to use Python as a developer, these are the things I missed the most: lazy evaluation, computing on the language, and the vector as a primary object.
The last one I can easily work around with loops, but what about the first two?
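For what it is worth, here is a rough sketch of how the first two are sometimes approximated in Python using only the standard library. It is not equivalent to what R does: the ast module can inspect an expression without evaluating it (loosely like quote/substitute), and passing zero-argument callables delays evaluation until a value is needed (loosely like R’s lazy arguments). The helper names are hypothetical.

```python
import ast


def variables_in(expr):
    """Return the names used in a Python expression, without evaluating it."""
    tree = ast.parse(expr, mode="eval")
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}


def pick(use_first, first, second):
    """Both arguments are zero-argument callables; only the chosen one runs."""
    return first() if use_first else second()


calls = []


def expensive():
    calls.append("expensive")
    return 42


# Inspect the expression as data; log, y, x1 and x2 need not exist.
names = variables_in("log(y) + x1 * x2")

# expensive() is never called, because the first branch is taken.
result = pick(True, lambda: 1, expensive)
```

The caller has to opt in explicitly (a string to parse, a lambda to defer), whereas in R both behaviours come with ordinary function calls.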
@Josef
Thanks for the background information.
In the comment I was using copying in the sense of copying ideas, but in the main article I was using it in the sense of copying code (or at least creating a direct mapping). As you correctly point out, licensing prevents direct copying of code.