R needs some bureaucracy
Writing a program in R is almost bureaucracy free: variables don’t need to be declared, the language does a reasonable job of guessing the type a value might need to be automatically converted to, there is no need to create a function having a special name that gets called at program startup, the commonly used library functions are ready and waiting to be called and so on.
Not having a bureaucracy is all well and good when programs are small or short lived. Large programs need a bureaucracy to provide compartmentalization (most changes to X need to be prevented from having an impact outside of X; doing this without appropriate language support eventually burns out anybody juggling it all in their head) and long lived programs need a bureaucracy to provide version control (because R and its third-party libraries change over time).
Automatically installing a package from CRAN always fetches the latest version. This is all well and good during initial program development. But will the code still work in six months’ time? Perhaps the author of one of the packages used in the program submits a new version of that package to CRAN and this new version behaves slightly differently, breaking the previously working program. Once the problem is located the developer has either to update their code or manually install the older version of the package. Life would be easier if it were possible to specify the required package version number in the call to the library function.
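A rough approximation of this can be sketched in base R. The helper below, library_version, is hypothetical (it is not part of R); it assumes that an exact version match is what the developer wants, uses utils::packageVersion to look up the installed version, and refuses to attach the package on a mismatch. It only checks what is installed and makes no attempt to fetch an older release.

```r
# Hypothetical helper (not part of base R): attach `pkg` only when the
# installed version is exactly `version`.
library_version <- function(pkg, version) {
  installed <- utils::packageVersion(pkg)
  if (installed != version) {
    stop(sprintf("%s %s is installed, but version %s is required",
                 pkg, as.character(installed), version))
  }
  library(pkg, character.only = TRUE)
  invisible(installed)
}

# The stats package ships with R, so it is used here as a stand-in.
current <- as.character(utils::packageVersion("stats"))
library_version("stats", current)   # versions match, so the package is attached

# Asking for a version that is not installed is reported as an error.
outcome <- tryCatch(library_version("stats", "0.0.1"),
                    error = function(e) conditionMessage(e))
print(outcome)
```

Failing loudly at the call to the helper at least moves the discovery of a version mismatch to program startup, rather than to whenever the changed behavior happens to surface.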
Discovering that my code depends on a particular version of a CRAN package is an irritation. Discovering that two packages I use each have a dependency on different versions of the same package is a nightmare. Having to square this circle is known in the Microsoft Windows world as DLL hell.
There is a new paper out proposing a system of dependency versioning for package management. The author proposes adding a version parameter to the library function, plus lots of other potentially useful functionality.
Apart from changing the behavior of functions a program calls, what else can a package author do to break developer code? They can create new functions and variables. The following is some code that worked last week:
    library("bar")  # The function give_answer_42 is in this package
    library("foo")  # The function get_question is in this package

    (the_question=get_question())
    give_answer_42(the_question)
Between last week and today the author of package foo (or perhaps the author of one of the packages that foo has a dependency on) has added a function named give_answer_42, and whenever foo is attached after bar it is this function that will now get called by this code (switching the order of the calls to library changes which version wins). What developers need to be able to write is:
    library("bar", import=c("give_answer_42"))  # The function give_answer_42 is in this package
    library("foo", import=c("get_question"))    # The function get_question is in this package

    (the_question=get_question())
    give_answer_42(the_question)
to stop this happening.
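The masking behavior behind this breakage can be reproduced without installing anything. The sketch below is an illustration under stated assumptions: fake_bar and fake_foo are made-up lists standing in for the packages bar and foo, and attach is used in place of library because both put their contents at position 2 of the search path.

```r
# Two stand-ins for packages; the names and functions are illustrative only.
fake_bar <- list(give_answer_42 = function(question) "answer from bar")
fake_foo <- list(get_question   = function() "the question",
                 give_answer_42 = function(question) "answer from foo")

# Each attach() inserts its list directly after the global environment,
# so whatever is attached last is found first.
attach(fake_bar, name = "fake:bar", warn.conflicts = FALSE)
attach(fake_foo, name = "fake:foo", warn.conflicts = FALSE)

answer <- give_answer_42(get_question())
print(answer)   # the version from fake_foo wins

# Looking the function up in a named environment ignores the attach order.
pinned <- get("give_answer_42", envir = as.environment("fake:bar"))(get_question())
print(pinned)   # the version from fake_bar

# Tidy up the search path.
detach("fake:foo", character.only = TRUE)
detach("fake:bar", character.only = TRUE)
```

With warn.conflicts left at its default, the second attach would report the masked name, which is the same kind of message library prints for real packages.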
The import parameter enables developers to introduce some compartmentalization into their programs. Yes, R does have namespace management for packages, and I’m pleased to see that its use will be mandatory in R version 3.0.0, but this does not protect programs from functions the package author intends to export.
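Something close to this is available in later releases of R: from R 3.6.0 onwards the library function accepts include.only and exclude arguments, and the pkg::name form references a single function without attaching the package at all. A minimal sketch, using the tools package that ships with R as a stand-in for foo and bar:

```r
# Requires R >= 3.6.0 for the include.only argument.
# Attach file_ext and nothing else; the package's other exports are not
# attached by this call.
library(tools, include.only = "file_ext")
extension <- file_ext("analysis.R")

# Anything else is requested by its fully qualified name.
stem <- tools::file_path_sans_ext("analysis.R")

print(c(extension, stem))
```

Either form means that a function newly exported by a later version of the package cannot silently take over a name the program already uses.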
I’m not sure whether this import suggestion will connect with R users (who look very laissez-faire to me), but I get very twitchy watching a call to library go off and install lots of other stuff and generate warnings about this and that being masked.
Excellent idea. Where I work has strong testing and validation procedures. So I can only develop on a fixed environment. So when I start a project, I have to download the current version of R and all (4000+) packages from CRAN to use with that version of R.
— Perhaps the author of one of the packages used in the program submits a new version of that package to CRAN and this new version behaves slightly differently, breaking the previously working program.
But, but, but… R is supposed to be both functional and OO. In both cases, one is not supposed to change the return/output from a class/method/function (other than its value; but even doing that implies the previous version was bugged). A change in value from a precedent class/method/function shouldn’t cause a failure in the caller. Unless the caller was depending on a Magic Number to be returned.
The prime directive of OO is that implementation details are shielded from the caller.
Anyway, what R needs is a BDFL (Larry, Guido, Linus). It won’t get one because R is stuck in a time warp: it’s a software analog to the 1982 PC, which was designed as a “personal” computer. At the time that meant the user would write programs for his own use on his machine. It wasn’t until 1-2-3 came around that the PC began to morph into a device that the user used to run someone else’s programs. R is built to the user-as-coder paradigm.
@Robert Young
Many software design methodologies contain features whose purpose is to hide implementation details from the caller. Sometimes developers use these features, sometimes they don’t. R is not an OO language, it is a language that has some OO features.
It seems to me that R development is dependent on volunteers chipping away at person-sized problems. A shift to a paid core team of developers could have a huge impact on what can be done. I don’t know anything about the finances of the R Foundation, but I suspect what R really needs is a good fundraiser to bring in $15-20 million.
Package versioning and dependency versioning are supported in nearly every Linux distribution. It works well when developers are not lazy and they actually go through with a fine-tooth comb and CHECK which versions they REALLY NEED. What often happens is that developers are lazy and they don’t check, so they just set the version to whatever they personally happen to build against, and then the end user is actually worse off.
In all cases, keeping your own in-house source repository is a very good idea for all open source products, if you care about being able to go back and repeat what you have done.
I might point out by the way, that if you look at pretty much any XML file, you see a DOCTYPE tag containing a URL, so every single one of those files is subject to the same vagaries of possible random changes in someone else’s repository. Even reading the XML file may become impossible if that URL goes away (and yes there are many workarounds for this based on local cache designs, none of them particularly comforting).