Is Earley parsing now practical?
Language parsing was once a hot topic within computing research. The discovery of LALR parsing, quickly followed by yacc becoming available on Unix, resulted in this approach to language parsing dominating developer mind-share (helped by the first half of most compiler books being devoted to the theory of LR parsing). Until maybe 10 years ago the received wisdom was to implement parsers using Bison (the GNU successor to yacc); this tool automatically generates arrays of values that a generic parser reads to decide how to process the tokens fed to it by a lexer. The accepted wisdom has now shifted to creating hand-written recursive descent parsers (or some variant), where the developer writes code that decides what to do next based on the current token(s); developers are back doing things the way they were done before yacc was written in the early 1970s.
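To make the difference concrete, the following is a minimal sketch of what a hand-written recursive descent parser looks like, for a toy arithmetic grammar; the grammar, function names and error handling are invented for this post, not taken from any production compiler. Each grammar rule becomes a function that looks at the current token (here a single character) and decides what to do next.

    /* Hypothetical sketch of a hand-written recursive descent parser:
     * one function per grammar rule, each deciding what to do next by
     * looking at the current character.
     */
    #include <stdio.h>
    #include <stdlib.h>

    static const char *input;        /* current position in the text being parsed */

    static int parse_expr(void);     /* expr   : term   { '+' term   } */
    static int parse_term(void);     /* term   : factor { '*' factor } */

    static int parse_factor(void)    /* factor : digit | '(' expr ')' */
    {
        if (*input == '(') {
            input++;                 /* consume '(' */
            int v = parse_expr();
            if (*input++ != ')') {
                fprintf(stderr, "expected )\n");
                exit(1);
            }
            return v;
        }
        if (*input >= '0' && *input <= '9')
            return *input++ - '0';
        fprintf(stderr, "unexpected character\n");
        exit(1);
    }

    static int parse_term(void)
    {
        int v = parse_factor();
        while (*input == '*') {      /* current token decides: keep going or stop */
            input++;
            v *= parse_factor();
        }
        return v;
    }

    static int parse_expr(void)
    {
        int v = parse_term();
        while (*input == '+') {
            input++;
            v += parse_term();
        }
        return v;
    }

    int main(void)
    {
        input = "1+2*(3+4)";
        printf("%d\n", parse_expr()); /* prints 15 */
        return 0;
    }

Operator precedence is encoded directly in which function calls which, rather than in generated tables.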
Is this change of implementation choice driven by fashion (despite heroic efforts, nobody has been able to produce an industrial-strength LALR-based parser for C++; all C++ compilers that I am aware of use recursive descent and, sad to say, C++ is a trend-setter), by existing languages outgrowing existing parsing technology, or just by developers forgetting what a maintenance nightmare recursive descent can be?
I’m a fan of using tools, and the big advantage parser generators have over hand-written parsers is that they warn about ambiguities in the grammar, i.e., potential faults in the specification or implementation. A hand-written recursive descent parser is just code that does whatever it was written to do.
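As a concrete example of the kind of warning involved, feeding Bison the cut-down grammar below, which contains the classic dangling-else ambiguity, produces a shift/reduce conflict report (the token names are stand-ins invented for illustration; Bison resolves the conflict by defaulting to a shift, which attaches the ELSE to the nearest IF):

    /* dangling_else.y: which IF does the ELSE belong to? Running   */
    /* "bison -v dangling_else.y" reports a shift/reduce conflict.  */
    %token IF EXPR ELSE OTHER

    %%

    stmt : IF EXPR stmt
         | IF EXPR stmt ELSE stmt
         | OTHER
         ;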
The big disadvantage of LALR parsing is the restrictions on the form of the grammars it accepts (in practice the tools usually complain that an ambiguity cannot be resolved and fall back on some default behavior to handle it). Transforming a grammar into a form acceptable to tools such as Bison, without too many warnings being generated, can take a lot of work by an experienced compiler developer. I once spent a month creating a workable LALR grammar for all of SQL-92 and could have written a recursive descent parser in less time (grammar transformations are as much a potential source of faults as hand-written parsers are).
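To give a feel for what these transformations look like, the textbook fix for the dangling-else conflict above is to split statements into matched and unmatched forms, so that an ELSE can only ever attach to the nearest unmatched IF; I believe Bison accepts this version without complaint, but the rewritten rules are noticeably harder to relate back to the original language specification (which is how transformation-induced faults creep in):

    /* dangling_else2.y: same language, rewritten to remove the ambiguity. */
    %token IF EXPR ELSE OTHER

    %%

    stmt      : matched
              | unmatched
              ;

    matched   : IF EXPR matched ELSE matched
              | OTHER
              ;

    unmatched : IF EXPR stmt
              | IF EXPR matched ELSE unmatched
              ;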
Introductions to parsing sometimes mention how much easier life would be using Earley parsing, if only its performance was not so appalling. It turns out that a linear-time algorithm for Earley parsing was published in 1991, followed by various useful refinements in 2001 (all discussed in what is effectively the encyclopedia of parsing sitting on my shelf waiting to be read). Theory will sit on the shelf until somebody implements it, and a few days ago I found out about Marpa, a linear-time Earley parser.
So why does Earley parsing make life so much easier, at least for those implementing parsers, than LALR parsing? Earley parsing has far fewer restrictions on the form of the grammars it accepts. This means no more spending a month transforming a grammar into something acceptable to the tool being used (at least in theory; I have not tried any large grammars yet, although somebody has written one for C).
Another benefit of using an Earley parser is the potential for improved syntax error recovery; the drive to reduce the size of the arrays generated by yacc/Bison resulted in information essential for good error recovery being thrown away (the original LALR theory threw some useful information away, and over the years several PhDs were awarded to researchers who figured out how to throw even more away). When things go wrong, Earley parsers have lots of useful information available to them.
To check out the hype I’m jumping in at the deep end with the grammar for C++14: can I really cut-and-paste the grammar from the appendix of the standard, add some Marpa syntax, and start parsing C++? I will let you know whether I sink or swim.