Parsing R code: Freedom of expression is not always a good idea
With my growing interest in R it was inevitable that I would end up writing a parser for it. The fact that the language is relatively small (the add-on packages do the serious work) hastened the event because it did not look like much work; famous last words. I knew that R’s design and implementation were strongly influenced by the world view of functional programming, and this should have set alarm bells ringing; that community has a history of willfully ignoring some of the undesirable consequences of its penchant for taking simple ideas and over-generalizing them (i.e., I should have expected hidden complications).
While the official R language definition contains only a tiny fraction of the information needed to create a full implementation, I decided to use it rather than ‘cheat’ and look at the R project implementation sources. I took as my oracle of correctness the substantial amount of R source code in its 3,000+ package library. This approach would help me uncover some of the incorrect preconceived ideas I had about how R source fits together.
I started with a C lexer and chopped and changed (it is difficult to do decent error recovery in automatically generated lexers and I prefer to avoid them). A few surprises cropped up: `**` is supported as an undocumented form of `^`, and by default `]]` must be treated as two tokens (e.g., two `]` in `a[b[c]]` but one `]]` in `d[[e]]`, an exception to the very commonly used maximal munch rule).
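A sketch of the kind of context the lexer has to track (a toy Python illustration, not the actual R lexer): keep a stack of open brackets and only treat `]]` as a single token when the innermost opener was `[[`.

```python
# Toy illustration of context-sensitive ']]' tokenization.
# Not the R lexer: everything except brackets is ignored here.

def tokenize_brackets(src):
    """Return the bracket tokens in src, keeping ']]' as one token
    only when the matching opener on the stack is '[['."""
    tokens, stack, i = [], [], 0
    while i < len(src):
        ch = src[i]
        if ch == '[':
            if src[i:i + 2] == '[[':
                tokens.append('[['); stack.append('[['); i += 2
            else:
                tokens.append('['); stack.append('['); i += 1
        elif ch == ']':
            if src[i:i + 2] == ']]' and stack and stack[-1] == '[[':
                tokens.append(']]'); stack.pop(); i += 2
            else:
                tokens.append(']')
                if stack:
                    stack.pop()
                i += 1
        else:
            i += 1  # skip identifiers, operators, whitespace, etc.
    return tokens

# a[b[c]] -> two separate ']' tokens; d[[e]] -> one ']]' token
print(tokenize_brackets('a[b[c]]'))   # ['[', '[', ']', ']']
print(tokenize_brackets('d[[e]]'))    # ['[[', ']]']
```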
The R grammar is all about expressions with some statement bits and pieces thrown in. R operator precedence follows that of Fortran, except that the precedence of unary plus/minus has been increased to be above multiply/divide (instead of below). Easy peasy, cut and paste an existing expression grammar and done by tea time :-). Several tea times later I have a grammar that parses all of the R packages (apart from 80+ real syntax errors contained therein and a handful of kinky operator combinations I’m not willing to contort the grammar to support). So what happened?
Two factors account for most of the difference between my estimate of the work required and the actual work required:
- my belief that a well-written grammar has no ambiguities (while zero is a common goal for many projects, a handful might be acceptable if the building is on fire and I have to leave). A major advantage of automatic generation of parser tables from a grammar specification is being warned about ambiguities in the grammar (either shift/reduce or reduce/reduce conflicts). At an early stage I was pulling my hair out over having 59 conflicts, so I relented, looked at the R project source, and was amazed to find that its grammar has 81 ambiguities!
I have managed to get the number of ambiguities down to the mid-30s, not good at all but it will have to do for the time being.
- some of my preconceptions about how R syntax worked were seriously wrong. In some cases I spotted my mistake quickly because I recognized the behavior from other languages I know; other misconceptions took a lot longer to understand and handle because I did not believe anybody would design expression evaluation to work that way.
The root cause of the difference can more concretely be traced to the approach to specifying language syntax. The R project grammar is written using the form commonly seen in functional language implementations and introductory compiler books. This form has the advantage of being very short and apparently simple; the following is a cut down example in a form of BNF used by many parser generators:
```
expr : expr op expr | IDENTIFIER ;

op : '&' | '==' | '>' | '+' | '-' | '*' | '/' | '^' ;
```
This specifies a sequence of `IDENTIFIER`s separated by binary operators and is ambiguous whenever the expression contains more than one operator; e.g., `a + b * c` can be parsed in more than one way. Parser generators such as Yacc will complain about and flag any ambiguity, picking one of the possibilities as the default behavior for handling it; developers can specify additional grammar information in the file read by Yacc to guide its behavior when deciding how to resolve specific ambiguities. For instance, the relative precedence of operators can be specified, and this information would be used by Yacc to decide that the ambiguous expression `a + b * c` should be parsed as if it had been written `a + (b * c)` rather than `(a + b) * c`. The R project grammar is short, highly ambiguous, and relies on explicitly specified relative operator precedence and associativity directives to resolve the ambiguities.
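The mechanism behind such precedence directives can be illustrated with a toy precedence-climbing parser (a Python sketch using an illustrative operator table, not R’s full 19-level one):

```python
# Toy precedence-climbing parser: a table-driven way of resolving
# the ambiguity in 'expr : expr op expr', as Yacc's %left directives do.
# The precedence values here are illustrative, not R's.

PREC = {'==': 1, '+': 2, '-': 2, '*': 3, '/': 3}  # higher binds tighter

def parse(tokens, min_prec=0):
    """Parse a left operand, then keep consuming operators whose
    precedence is at least min_prec; parenthesize to show grouping."""
    lhs = tokens.pop(0)                        # an IDENTIFIER
    while tokens and PREC.get(tokens[0], -1) >= min_prec:
        op = tokens.pop(0)
        rhs = parse(tokens, PREC[op] + 1)      # +1 gives left associativity
        lhs = f'({lhs} {op} {rhs})'
    return lhs

print(parse('a + b * c'.split()))   # (a + (b * c))
print(parse('a * b + c'.split()))   # ((a * b) + c)
```

A right-associative operator (such as `^`) would recurse with `PREC[op]` instead of `PREC[op] + 1`.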
An alternative method of specifying the grammar is to have a separate list of grammar rules for each level of precedence (I always use this approach). In this approach there is no ambiguity, the precedence and associativity are implicitly specified by how the grammar is written. Obviously this approach creates much longer grammars, there will be at least two rules for every precedence level (19 in R, many with multiple operators). The following is a cut down example containing just multiple, divide, add and subtract:
```
...

multiplicative_expression:
      cast_expression
    | multiplicative_expression '*' cast_expression
    | multiplicative_expression '/' cast_expression
    ;

additive_expression:
      multiplicative_expression
    | additive_expression '+' multiplicative_expression
    | additive_expression '-' multiplicative_expression
    ;

...
```
The advantages of this approach are that, because there are no ambiguities, the developer can see exactly how the grammar behaves, and that if an ambiguity is accidentally introduced during editing it will be noticed when the parser generator reports a problem where previously there was none (rather than the new ambiguity being hidden in the barrage of existing ones that are ignored because they are so numerous).
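The layered-rules style maps directly onto a recursive descent implementation, one function per precedence level; a toy Python sketch (not R’s parser):

```python
# One function per precedence level, mirroring the layered grammar:
# precedence and left associativity are implicit in the call structure,
# so there is nothing for a conflict report to complain about.

def primary(toks):
    return toks.pop(0)  # IDENTIFIER

def multiplicative(toks):
    node = primary(toks)
    while toks and toks[0] in ('*', '/'):
        op = toks.pop(0)
        node = (op, node, primary(toks))      # left associative
    return node

def additive(toks):
    node = multiplicative(toks)               # '*'/'/' bind tighter
    while toks and toks[0] in ('+', '-'):
        op = toks.pop(0)
        node = (op, node, multiplicative(toks))
    return node

# a + b * c parses as ('+', 'a', ('*', 'b', 'c')): multiply binds tighter
print(additive(['a', '+', 'b', '*', 'c']))
```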
My first big misconception about R syntax was to think that R had statements; it doesn’t. What other languages would treat as statements R always treats as expressions. The `if`, `for` and `while` constructs have values (e.g., `2*(if (x == y) 2 else 4)`). No problem, I used Algol 68 as an undergraduate, which supports this kind of usage. I assumed that when an `if` appeared as an operand in an expression it would have to be bracketed using `()` or `{}` to avoid creating a substantial number of parsing ambiguities; WRONG. No brackets need be specified: the R expression `if (x == y) 2 else 4+1` is ambiguous (it could be treated as if it had been written `if (x == y) 2 else (4+1)` or `(if (x == y) 2 else 4)+1`) and the R project grammar relies on its precedence specification to resolve the conflict (in favor of the first possibility listed above).
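Python’s conditional expression happens to resolve the same ambiguity the same way, which makes the grouping easy to demonstrate (a Python illustration, not R code):

```python
# Python's expression-valued conditional groups the trailing '+ 1'
# with the else branch, the same choice R's grammar makes.
x = y = 1

r = 2 if x == y else 4 + 1        # parsed as: 2 if x == y else (4 + 1)
print(r)                          # 2 (the condition is true)

s = (2 if x == y else 4) + 1      # the other possible grouping
print(s)                          # 3

print(2 * (2 if x == y else 4))   # 4 -- a conditional used as an operand
```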
My next big surprise came from the handling of unary operators. Most modern languages give all unary operators the same precedence, generally higher than any binary operator. Following Fortran, the R unary operators have a variety of different precedence levels; however, R did not adopt the restrictions Fortran places on where unary operators can occur.
I assumed that R had adopted the restrictions used by other languages containing unary operators at different precedence levels, e.g., not allowing a unary operator token to follow a binary operator token (i.e., there has to be an intervening opening parenthesis); WRONG. R allows me to write `1 == !1 < 0`, while Fortran (and Ada, etc.) require that a parenthesis be inserted between the operators `==` and `!` (hopefully resulting in the intent being clearer).
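Python happens to side with Fortran on this point, which makes the restriction easy to demonstrate (a Python illustration; `not` is Python’s low-precedence unary operator):

```python
# A low-precedence unary operator may not directly follow a binary
# operator in Python: '1 == not 1' is rejected at compile time, while
# the bracketed form is accepted -- the restriction R chose not to adopt.
try:
    compile('1 == not 1', '<demo>', 'eval')
    rejected = False
except SyntaxError:
    rejected = True

print(rejected)        # True: the unbracketed form is a syntax error
print(eval('1 == (not 1)'))   # the bracketed form parses and evaluates
```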
I had to think a bit to come up with an explicit set of grammar rules to handle the R unary operators’ freedom of expression (without creating any ambiguities).
Stepping back from the details: my view is that programming language syntax should be designed to reduce the number of mistakes developers make. Requiring that brackets appear in certain contexts helps prevent mistakes by the original author and subsequent readers of the code.
Claims that R (or any other language) syntax is ‘natural’ are clearly spurious, and really no more than a statement of preference by the speaker. Our DNA has not yet been found to equip us to handle one programming language better than another.
Over the coming months I hope to have the time to analyse R source looking for faults that might not have occurred had brackets been used, and to estimate how much code might be broken if R started to require brackets in certain contexts.
An example of the difference that brackets can make is provided by the handling of the unary `!` operator in R and C/C++/Java/etc. Take the expression `!x > y`, which R parses as if written `!(x > y)` and C/C++/Java/etc. as if written `(!x) > y`. I would not claim that either is better than the other from the point of view of developers getting the behavior right; I know that some C programmers get it wrong and I suspect that some R programmers do too.
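Python’s `not` groups the same way as R’s `!`, so the two readings can be compared directly (a Python illustration, not R or C code):

```python
# For x = 0, y = 1 the two groupings of '!x > y' give different answers.
x, y = 0, 1

r_style = not (x > y)    # R/Python grouping: negate the comparison
c_style = (not x) > y    # C grouping: negate x, then compare

print(not x > y)         # True  -- Python parses it the R way
print(r_style)           # True
print(c_style)           # False -- the C reading disagrees
```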
By increasing the precedence of unary plus/minus the R designers ensured that `8/-2/2` was parsed like `(8/-2)/2` rather than `8/(-2/2)`.
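Python happens to group this expression the same way, which makes the arithmetic easy to check (a Python illustration, not R):

```python
# Division is left associative and the unary minus binds to the 2,
# so 8/-2/2 means (8/-2)/2, not 8/(-2/2).
print(8 / -2 / 2)        # -2.0, same value as (8 / -2) / 2
print((8 / -2) / 2)      # -2.0
print(8 / (-2 / 2))      # -8.0, the rejected right-grouped reading
```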
Interesting. The everything-is-an-expression nature of R is pretty useful. But the stuff about precedence is scary! Most of my issues with R syntax/semantics are at a slightly higher level, in particular the way lazy evaluation allows use of bare words in ways that are highly confusing. I don’t think they’re very complex from the lexer/parser’s point of view, though.
Have you taken a look at the nascent Julia language yet? http://julialang.org/
@Harlan
Yes, I have read a few blog posts about Julia and had a brief look at it. I think there are too many languages and people should stop inventing new ones. As somebody whose business is source code I always welcome new languages, as they are a potential source of revenue; from a user’s point of view they are mostly cost with probably no benefit.
R is taking off in a big way and the community needs to recognise that maintenance of R programs is going to be a big issue in the years to come. Features need to be added to the language that help catch errors made during maintenance. Probably the most important new feature I would add is variable declarations; they provide information about how a variable is intended to be used, and knowing when a variable is being used in an unintended way is generally useful. As the above post points out, requiring brackets in a few contexts would help reduce confusion.
People should stop? You’ll love this then 🙂
http://xianblog.wordpress.com/2010/09/13/simply-start-over-and-build-something-better/
There are a number of papers and presentations on the new work here:
http://www.stat.auckland.ac.nz/~ihaka/?Papers_and_Talks
@Mike
Thanks for the links (I had read the first one). The performance penalty of call by value is well known as is the solution, add support for call by reference. Have the R code team got the functional programming ‘religion’ that regards call by reference as a mortal sin? With Ihaka promoting Lisp as the future base of R one has to wonder.
I would add support for a `&` unary operator which, when added to an identifier in a function definition, specifies that the argument is passed by reference (rather than by value).
Also the R interpreter is a lot slower than it need be because of its 1980s design. Somebody on the R team needs to read Smith & Nair’s “Virtual machines” to learn how to write an interpreter that is tuned to the characteristics of modern processors.
Great post. As someone just starting to learn R you’ve pointed out things that I didn’t realize were issues. I need to go back through my code and make sure I’ve not added bugs with my C-centric view of programming.
Very interesting stuff. Do you think anything from your new parser could be incorporated into R’s existing parser?
@Richie Cotton
I have got the number of shift/reduce conflicts down to 30, which is still way too many. As the `if` example above shows, R is intrinsically ambiguous (or, to be technically correct, not context free). I think I should be able to come up with a grammar that has a small number of conflicts, but I’m not sure how small ‘small’ will be. The advantage of having a small number (fingers of one hand would be good) of conflicts is that people can remember what they are and spot when changes break the existing grammar. So until somebody comes up with a grammar that only contains a small number of conflicts there is no advantage to changing over.
Thank you! I’ve had this problem with programming languages for decades. If I can’t look at code and clearly distinguish what happens when, what the operands and operations are and what the orderings are, it’s bad code!
I have the same problem with Ruby, Perl in “poetry mode” and CoffeeScript. JavaScript is dicey, but at least a coder can be taught to behave via JSLint, JSHint and “JavaScript – The Good Parts”.
Just out of curiosity, how bad is R (lexical scoping) relative to the other dialects of S (dynamic scoping)?
@M. Edward (Ed) Borasky
Dynamic scoping seems to make more sense than lexical scoping in a language that does not support explicit variable declarations. However, if most programs are short (i.e., a few hundred lines at most, as R programs tend to be) then the inconvenience of lexical scoping is not that great; also, in a library-call based language (which R is), forcing all information to pass through parameters probably results in fewer bugs than allowing information to be passed via dynamically scoped variables.
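The difference can be sketched in Python, which (like R) is lexically scoped; under dynamic scoping the free variable would instead be looked up in the caller’s environment:

```python
# Lexical scoping resolves a free variable in the environment where the
# function was *defined*; dynamic scoping would use the environment where
# it was *called*.
scale = 10

def times_scale(x):
    return x * scale     # 'scale' found in the defining (global) environment

def caller():
    scale = 2            # a local 'scale', invisible to times_scale
    return times_scale(3)

print(caller())          # 30 -- lexical scoping uses the global scale = 10
# under dynamic scoping the call would see the caller's scale and return 6
```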
I don’t know if these thoughts were in the minds of the R designers at the time they made the switch or whether it was some major implementation issue that caused the switch in behavior. There is also the issue of language fashion, dynamic scoping goes in and out of being fashionable.