My R naming nemesis
When learning a new language I try to make an effort to write it like a native developer. R has one language feature that has been severely testing my desire to write like a native. This afternoon I realized that most of the people reading my code will experience the same jarring sensation on encountering this construct, so I am not going to use it any more.
What is this language feature that induces a Stroop effect in my mind? It is the use of the period character as part of an identifier’s name (e.g., foo.bar). In almost all of the hundreds of thousands of lines of code I have read over the years this character is used as an operator that selects a member/field of a struct/record. I’m sure that if I tried long enough and hard enough I could get used to this character being part of an identifier; after a year or so writing Cobol I got used to the arithmetic minus character being permitted within identifiers (e.g., foo-bar), but that was 20 years ago and my neurons will probably take much longer to adapt this time around.
Most of the R I am writing will be distributed with my book Empirical software engineering with R and I think readers will experience the same jarring sensation I do (apart from those who have not yet been exposed to large amounts of non-R code). I have convinced myself that this is a good enough reason to give up trying to figure out how to use . in identifier names. I have been concocting all sorts of rules involving . being used to separate the primary part of the name and _ the secondary parts, e.g., total.red_light [yes, I should get out more often]. The underscore vs. camel case debate still erupts every now and again; let’s avoid creating more debate by introducing more choice.
Those R functions that include a . in their name will stand out from the crowd; [arm waving on] perhaps this will help differentiate them as ‘statistics stuff’ [arm waving off]. There is always plan B if my unilateral naming decision looks too unilateral: a global renaming script.
Perhaps the use of periods in identifiers can be used as a test for being a native R developer. A simple timing test would do: a sequence of characters appears on a screen and the developer has to respond as quickly as possible with the number of identifiers being displayed. I’m sure I would be much slower to give a ‘1’ response to total.count than to total_count; displaying total count and total.count on two separate lines and asking me to quickly specify which line contained the most identifiers would turn me into a nervous wreck. Responses from a dozen or so different sequences ought to be enough to distinguish Johnny Foreigner from the natives.
I don’t have a problem with $, which R uses as the column/list item selection operator, even though some compilers for commonly used languages permit this character as part of an identifier. This is because I have not read lots of code containing this identifier naming usage.
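To make the contrast concrete, here is a minimal sketch of the two characters side by side. All of the names (total.count, totals, df) are made up for illustration.

```r
# Hypothetical names, for illustration only.

# The dot is an ordinary character here: total.count is ONE identifier.
total.count <- 42

# `$` is the selection operator: it picks the element named count out of a list.
totals <- list(count = 7)
selected <- totals$count

# Both at once: a single `$` selection of a column whose name contains a dot.
df <- data.frame(total.count = c(1, 2, 3))
column_sum <- sum(df$total.count)
```

Note that total.count and totals$count are unrelated objects; nothing in the language ties a dotted name to a member of anything.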
For my previous book I did a survey of the linguistic and cognitive psychology issues involved in identifier naming. This did a good job of debunking existing ideas about what constitutes good naming practices, but did not come up with any concrete recommendations to replace them (nature abhors a vacuum and the existing pop psychology naming ideas remained).
These days people write PhDs on identifier naming issues (method names, (not yet completed) correlation with quality and code comprehension to name a few); there is even a subfield within this field, how best to split an identifier into its component parts (e.g., refPtr is probably an abbreviation of reference pointer).
Now, how do we get used to the mix of noun.verb and verb.noun naming?
As an engineer myself, I personally don’t have a problem with the ‘.’ in names, but I agree that it can be jarring at first. However, I think the real issue is to be obviously consistent with whatever naming convention you use. For instance, I have a problem with ‘foo’ being an object by itself with ‘foo.bar’ being a different and completely unrelated object. I see this a lot in people’s R code, and it makes reading very difficult for me. Some popular programming languages use the ‘.’ to denote membership where ‘foo.bar’ indicates [bar] as a member of object/class [foo], so I avoid doing this in my own code.
I’ve read a number of books on R, and everyone seems to use a slightly different naming convention. The difference between the books I liked and the ones I struggled with had mostly to do with how obvious the naming conventions were. The more obvious and consistent the naming convention, the easier the code was for me to read.
It seems to me that what is missing from this discussion is the function of dots in R naming: dots in R function names serve as a means to realize R’s oldest class system (S3 classes). A METHODNAME follows the pattern GENERICNAME.CLASSNAME as in summary.lm, where calling GENERICNAME(OBJECT OF CLASSNAME) will dispatch to GENERICNAME.CLASSNAME if present and to GENERICNAME.default otherwise. This allows for something like overloading in other languages.
(resubmitted with placeholders in uppercase since angle brackets disappear here)
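The GENERICNAME.CLASSNAME pattern described above can be sketched in a few lines. The generic describe and the class invoice are hypothetical names chosen for illustration; only UseMethod and structure are standard R.

```r
# Hypothetical generic and class names, for illustration only.

# The generic: UseMethod looks for describe.<class of x>, then describe.default.
describe <- function(x, ...) UseMethod("describe")

# Fallback used when no class-specific method exists.
describe.default <- function(x, ...) "no specific method"

# Method for objects of class "invoice" (GENERICNAME.CLASSNAME).
describe.invoice <- function(x, ...) paste("invoice for", x$amount)

inv <- structure(list(amount = 120), class = "invoice")

from_method  <- describe(inv)   # dispatches to describe.invoice
from_default <- describe(1:3)   # no integer/numeric method, so describe.default
```

The caller only ever writes describe(...); which function actually runs is decided by the class attribute of the argument.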
@Jens Oehlschlägel
That is certainly a useful naming convention and further adds to the feeling that dot is an operator. However, there is nothing to stop developers using this character in other contexts and I have seen some who seem to use this character where others would use underscore.
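One consequence of nothing stopping other uses of the dot is sketched below, with entirely hypothetical names. A function whose author used the dot where others would use an underscore can end up being picked up by S3 dispatch.

```r
# Hypothetical names throughout; a sketch, not taken from any real package.
report <- function(x, ...) UseMethod("report")
report.default <- function(x, ...) "default report"

# Written as a standalone helper, with the dot used like an underscore...
report.all <- function(x, ...) "report on everything"

# ...but an object that happens to have class "all" now dispatches to it.
obj <- structure(list(), class = "all")
result <- report(obj)
```

Here the name report.all alone does not tell a reader whether it was meant as a method for class all or as an independent function.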
It annoys me when a language designer decides to break with decades of tradition and employ a character for a radically different purpose.
It smacks of arrogance, crankiness, and a lack of genuinely original thought.