Push hard on a problem here and it might just pop up over there
One thing I have noticed when reading other people’s R code is that their functions are often a lot longer than mine. Writing overly long functions is a common novice programmer mistake, but the code I am reading does not look like it was written by novices (based on the wide variety of base functions they are using, something a novice is unlikely to do, and by extrapolating my knowledge of novice behavior in other languages to R). I have a possible explanation for these longer functions: R users’ cultural belief that use of global variables is taboo.
Where did this belief originate? I think it can be traced back to the designers of R being paid-up members of the functional programming movement of the early 80’s. This movement sought to mathematically prove programs correct but had to deal with the limitation that existing mathematical techniques were not really up to handling programs that contained states (e.g., variables that were assigned different values at different points in their execution). The solution was to invent a class of programming languages, functional languages, that did not provide any mechanisms for creating states (i.e., no global or local variables), and using such languages was touted as the solution to buggy code. The first half of the 80’s was full of computing PhD students implementing functional languages that had been designed by their supervisor, with the single application written in nearly all these languages being their own compiler.
Having to use a purely functional language to solve nontrivial problems proved to be mindbogglingly hard, and support crept in for local variables, for reading/writing files (which hold state) and of course for global variables. You were not supposed to use the globals, because that would generate a side-effect; pointing to a use of a global variable in some postgrad’s code would result in agitated arm waving and references to a technique described in so-and-so’s paper which justified this particular use.
The functional world has moved on, or to be exact mathematical formalisms now exist that are capable of handling programs that have state. Modern users of functional languages don’t have any hangup about using global variables. The R community is something of a colonial outpost hanging on to views from a homeland of many years ago.
Isn’t the use of global variables recommended against in other languages? Yes and No. Many languages have different kinds of global variables, such as private and public (terms vary between languages); it is the use of public globals that may raise eyebrows, it may be ok to use them in certain ways but not others. The discussion in other languages revolves around higher level issues like information hiding and controlled access, ideas that R does not really have the language constructs to support (because R programs tend to be short there is rarely a need for such constructs).
Let’s reformulate the question: “Is the use of global variables in R bad practice?”
The real question is: Given two programs, having identical external behavior, one that uses global variables and one that does not use global variables, which one will have the lowest economic cost? Economic cost here includes the time needed to figure out how to write the code and time to fix any bugs.
I am not aware of any empirical evidence, in any language, that answers this question (if you know of any please let me know). Any analysis of this question requires enumerating those problems where a solution involving a global variable might be thought to be worthwhile and comparing the global/nonglobal code; I know of a few snippets of such analysis in other languages.
Coming back to these long R functions, they often contain several for loops. Why are developers using for loops rather than the *ply functions? Is it because the *ply solution might require the use of a global variable, a cultural taboo that can be avoided by having everything in one function and using a for loop?
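A sketch of the sort of case this question points at, with made-up data and hypothetical function names: a calculation whose result depends on state carried from one element to the next. The for loop keeps that state in a local variable, while the sapply version has its anonymous function write to a variable outside itself using <<-.

```r
# Hypothetical example: count how often a running maximum is exceeded.
readings <- c(3, 1, 4, 1, 5, 9, 2, 6)

# Version 1: a for loop inside one function; all state is local.
count_records_loop <- function(x) {
  best <- -Inf
  n_records <- 0
  for (v in x) {
    if (v > best) {
      best <- v
      n_records <- n_records + 1
    }
  }
  n_records
}

# Version 2: sapply, with the running state held in a variable outside
# the anonymous function, which updates it via <<-.
best_so_far <- -Inf
is_record <- sapply(readings, function(v) {
  if (v > best_so_far) {
    best_so_far <<- v
    TRUE
  } else {
    FALSE
  }
})

loop_result <- count_records_loop(readings)
apply_result <- sum(is_record)
```

Both versions give the same count; which one is cheaper to write and maintain is exactly the open question.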
Next time somebody tells you that using global variables is bad practice you should ask for some evidence that backs that statement up.
I’m not saying that the use of global variables is good or bad, but that the issue is a complicated one. Enforcing a ‘no globals’ policy might just be moving the problem it was intended to solve to another place (inside long functions).
In my experience, novice R programmers use lots of “for” loops, and more experienced R programmers tend to use “lapply” and the related functions. I’d be curious to see some of this code you’ve seen by experienced R users. Could these users perhaps be experienced in another language (e.g. python, in which “for” loops are common) but novices in R (in which “for” loops are less common)?
@Zachary Mayer
Yes, the code I have been reading might have been written by developers with lots of experience in other languages who are not wearing a no for-loop hair-shirt; I am not experienced enough to tell. I think novice users are likely to use a limited repertoire of base functions. A developer who is a novice in R but expert in another language would be expected to freely use global variables.
To note, *ply functions do not need global variables. They each have a ‘…’ parameter through which you can pass copies of the required global objects. This is particularly important when using the parallel versions of the *ply functions. Using the ‘…’ parameter means seamless replication of data to the worker nodes and a closer to pure functional paradigm.
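A minimal sketch of this usage with sapply; the helper function and its argument names are hypothetical, chosen only for illustration. Extra named arguments given to sapply are forwarded to the function being applied, so it never has to read anything from the global workspace.

```r
# Hypothetical helper: scale a value and then shift it.
scale_value <- function(v, multiplier, shift) {
  v * multiplier + shift
}

values <- c(1, 2, 3)

# multiplier and shift travel through sapply's '...' to scale_value.
scaled <- sapply(values, scale_value, multiplier = 10, shift = 1)
```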
@Shea Parkes
Good point, most variable accesses are for reading not writing and ‘…’ handles this common case. I don’t recall seeing any instances of this kind of usage but will make a point of looking for it in future.
Why use the functional paradigm, or any paradigm for that matter? Like many other activities where there are many ways of performing a task, people adopt a fashion that they are comfortable with. Do you really want to get closer to a paradigm that does not permit the use of local variables? It’s an interesting style to attempt for a programming exercise, but not something you would want to have to do for any length of time.
@Derek Jones
I think having a clear programming paradigm/strategy is useful to guide your coding. As for pure functional, I rather like what John Cook discussed: http://www.johndcook.com/blog/2010/04/15/85-functional-language-purity/
@Shea Parkes
Functional programming really consists of two parts: 1) what you are not allowed to do (e.g., no locals or globals to hold changing state) and 2) language features not usually associated with other ‘movements’ or paradigms (e.g., first-class functions, everything is an expression {rather than some things being expressions and other things being statements} and the ability to store/manipulate different environments).
What developers mean when they say they are 85% pure functional programmers (as per your link) is that they pay lip service to (1) and live the good life using (2).
R’s language design (from a language perspective) comes from John Chambers, who was more interested in statistics than in functional programming purity. The original design was S, from the mid 70’s; that became S-Plus, and then R came along (version 1.0 was released in 2000) as an implementation rewrite, by Ross Ihaka and Robert Gentleman, but very much intended to be compatible.
It is very easy to use global variables in R. If you say “x <- 3” inside a function you are using a local variable, but if you say “x <<- 3” R assigns to an x in an enclosing scope, creating a global x if it finds none. Simple as that. Go for your life.
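A small sketch of the difference between the two assignment operators; the function names are made up for illustration. Note that <<- looks through the enclosing environments for an existing x and only creates one in the global environment when none is found.

```r
x <- 1  # defined outside both functions

set_local <- function() {
  x <- 3   # creates a new x local to this call; the outer x is untouched
  x
}

set_outer <- function() {
  x <<- 3  # assigns to the x found outside the function
  x
}

local_result  <- set_local()
x_after_local <- x   # the outer x has not changed
outer_result  <- set_outer()
x_after_outer <- x   # the outer x has been overwritten
```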
“Given two programs, having identical external behavior, one that uses global variables and one that does not use global variables, which one will have the lowest economic cost?”
Because R is primarily used as an interactive shell, when you create a global variable you immediately do NOT have identical external behaviour. The user at the console can type “ls()” to see their current global namespace — that is to say the functions and variables that they have defined and the things they are working with. If a function creates a global variable, this pollutes the user’s namespace which is weird and annoying unless the user expects this to happen.
Because the R packaging system is designed to be neat and modular, it would be frowned upon to write a package that spontaneously introduces global variables into the user’s namespace, especially without clear documentation of when and why this happens. However, a large number of alternative options are provided.
For example, you can create an object and attach data to the object; you can even attach a namespace (known in R as an “environment”) to the object and put your globals there. There is no intrinsic difference, but it just keeps things neat. This is the same as the “Singleton pattern” in Java, or “Monads”, or all the other polite ways of saying, “I’m gonna use a global variable (but I’m not gonna make it too obvious).”
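As a rough sketch of the environment approach (the names state, call_count and record_call are hypothetical): shared state lives inside one environment object, so the only name added to the workspace is the environment itself.

```r
# Hypothetical state holder; the names are illustrative only.
state <- new.env()
state$call_count <- 0

record_call <- function() {
  # Environments are modified in place, so this update is visible
  # to every later call without any <<- assignment.
  state$call_count <- state$call_count + 1
  invisible(state$call_count)
}

record_call()
record_call()
calls_so_far <- state$call_count
```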
Another option is lexical scoping, so you can create a function inside another function and the inner function has access to local variables in the outer function. This means that when using the various members of the apply() family of functions, you don’t need to define a global. Java calls this an “inner class” which is very commonly used for Swing callbacks, as an example.
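A minimal sketch of lexical scoping used with sapply; the function name and data are made up for illustration. The anonymous inner function reads threshold, a local variable of the outer function, so nothing needs to be defined globally.

```r
# Hypothetical example: flag values above the mean of the vector.
flag_above_mean <- function(x) {
  threshold <- mean(x)      # local to flag_above_mean
  sapply(x, function(v) {
    v > threshold           # the inner function sees the outer local
  })
}

flags <- flag_above_mean(c(2, 4, 6, 8))
```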
Yet another feature of R is a way of hiding some global stuff into special places. For example, typing “ls()” shows you just what the user created, but typing “ls( .BaseNamespaceEnv )” shows you a whole heap more than that. The package management system has namespaces for packages and an import/export system, all of which is global, but keeps everything well organized.
There are also a few very special globals like what you see when you type “par()” for example, or “Sys.getenv()” which is useful for scripting.
Finally, the issue with for() loops is that they are usually significantly slower than taking advantage of the intrinsic vector capabilities of the R engine. That said, early versions of R would run for() loops strictly as an interpreter, but newer versions byte-compile code, which probably speeds them up a bit. Anyhow, the programming tradition probably came from the older days of interpretive R.
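To make the contrast concrete, here is a small sketch with made-up data showing the same calculation written as an element-by-element loop and as a single vectorised expression. No timings are claimed; the difference only matters for much longer vectors.

```r
values <- c(1.5, 2.5, 3.5, 4.5)

# Element-by-element loop, filling a preallocated result vector.
squares_loop <- numeric(length(values))
for (i in seq_along(values)) {
  squares_loop[i] <- values[i]^2
}

# Vectorised equivalent: the arithmetic is applied to the whole vector at once.
squares_vec <- values^2
```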