
Taking a new GLR parser generator for a spin

It’s been 10 years since I last wrote about parsing tools, and pycparser, the C parser I took for a test drive back then, is still actively maintained. This week I read a post on Gecko, a new parser generator. Its author, Vladimir Makarov, implemented his first parser generator in 1985.

Gecko generates GLR (Generalized LR) parsers. In 2009, I predicted that GLR parsing was the future. It might still be the future, but since I made that prediction the major compilers (e.g., gcc and llvm) have switched to handwritten parsers using some form of recursive descent. Bison, the almost invisible market leader for parser generation, has supported GLR parsers for almost 20 years. The other ‘generalized’ technique, Earley parsing, produces parsers that are much slower and are memory hogs.

GLR parsers can handle any context-free (Type-2) grammar in the Chomsky hierarchy, including ambiguous ones. The LR parsers supported by yacc-compatible tools (e.g., the Bison default mode), and the LL parsers generated by ANTLR, handle deterministic subsets of the Type-2 languages, while regular expressions match Type-3 languages.

Programming language grammars are often ambiguous (developers sometimes describe this as context-sensitivity, because context is needed to select the intended parse), i.e., there is more than one way of parsing a given sequence of input tokens. The classic example is the C statement: T *p;, which could be a declaration of p as a pointer to T, or a multiplication of T by p whose result is discarded. This ambiguity can be resolved by maintaining a list of identifiers currently declared as typedef names, and having the lexer/parser look up the status of an identifier in the contexts where a typedef name could occur. This is not a big deal for compilers, which have to build a symbol table anyway. However, it’s very inconvenient when only syntax analysis is needed, i.e., no semantic analysis of the source.
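
As a minimal sketch of how this lookup can work (the names are invented for illustration, and C’s block scoping of typedef names is ignored):

#include <stdio.h>
#include <string.h>

enum token_kind { TOK_IDENTIFIER, TOK_TYPEDEF_NAME };

#define MAX_TYPEDEFS 100
static const char *typedef_names[MAX_TYPEDEFS];
static int num_typedefs;

/* Called by the parser when it processes a typedef declaration. */
static void register_typedef(const char *name)
{
    if (num_typedefs < MAX_TYPEDEFS)
        typedef_names[num_typedefs++] = name;
}

/* Called by the lexer for every identifier it scans. */
static enum token_kind classify_identifier(const char *name)
{
    for (int i = 0; i < num_typedefs; i++)
        if (strcmp(typedef_names[i], name) == 0)
            return TOK_TYPEDEF_NAME;
    return TOK_IDENTIFIER;
}

int main(void)
{
    register_typedef("T");   /* as if "typedef int T;" had been parsed */
    /* "T *p;" now lexes as typedef-name '*' identifier, i.e., a declaration */
    printf("T is a %s\n", classify_identifier("T") == TOK_TYPEDEF_NAME
                          ? "typedef name" : "plain identifier");
    return 0;
}

A real implementation also has to handle scope: an inner declaration of T as a variable hides the typedef.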

An alternative approach is to parse all the possibilities, and hope that eventually only one parse remains syntactically possible. This could work on the following example, because there is a subsequent use of T in a non-typedef context (I’m not aware of any tools that do this):

T *p;  // Is this a declaration of p as a pointer to T?
T++;   // No!  It's a multiplication of T by p

Another approach is to choose the most likely parse. Redundant multiplications are rare, and a declaration is the most likely usage. The token sequence f(x); is most likely to be a function call with one argument, rather than redundant parentheses around a declaration of x to have type f.
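
The two readings, in the style of the earlier example (f and x are placeholder names):

f(x);  // Most likely: a call of f, passing the argument x
       // But if f is a typedef name: a declaration of x with type f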

Taking Gecko for a test drive requires a lexer and a grammar. Fortunately, one of the Gecko test cases includes a C lexer/grammar, and I adapted this to try out some C syntax test cases (code). My comparison point for these tests is my memory of testing out Bison with GLR enabled.

Developers make coding mistakes, and I made mistakes when adapting the existing Gecko C grammar. Perhaps it’s because I’m new to it, but I did not find Gecko’s minimalist error reporting helpful. Lots of debug information is available, but it is oriented towards somebody developing the innards of a parser generator. Hopefully, now that Gecko is up and working, the focus will shift to improving developer diagnostics.

When Bison cannot merge multiple parses into a single parse, it reports a failure. Gecko appears not to fail (it’s difficult to tell); it returns a parse tree.

Coding mistakes are sometimes syntax errors, and without some form of error recovery, syntax errors often cascade to produce lots of spurious errors. Recovering from syntax errors is hard, but skipping to the next semicolon works remarkably well as a catch-all.
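
As a minimal sketch of this catch-all (the token names and the canned token sequence are invented for illustration):

#include <stdio.h>

enum token_kind { TOK_SEMICOLON, TOK_EOF, TOK_OTHER };

/* Stand-in for a real lexer: walk a fixed token sequence. */
static enum token_kind tokens[] = { TOK_OTHER, TOK_OTHER, TOK_SEMICOLON,
                                    TOK_OTHER, TOK_EOF };
static int pos;
static enum token_kind next_token(void) { return tokens[pos++]; }

/* On a syntax error: discard tokens up to and including the next
   semicolon, so parsing can resume at what is probably the start of
   the next statement. */
static void skip_to_next_semicolon(void)
{
    enum token_kind tok;
    do
        tok = next_token();
    while (tok != TOK_SEMICOLON && tok != TOK_EOF);
}

int main(void)
{
    skip_to_next_semicolon();   /* pretend an error was just reported */
    printf("parsing resumes at token %d\n", pos);
    return 0;
}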

In Bison, syntax error recovery has to be hand-coded into the grammar and parser. Gecko supports an automatic syntax error recovery process. Based on a small sample, this automatic process failed to handle the common syntax errors (e.g., a missing identifier, or a missing operator in an expression) that I tried it on (code). It did handle the example in the documentation. Perhaps this is a work in progress.
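
For comparison, this is the usual hand-coded Bison idiom (a grammar fragment, not a complete parser): the builtin error token matches at the point of a syntax error, tokens are discarded until a semicolon can be shifted, and the yyerrok macro re-enables normal error reporting.

statement : expression ';'
          | error ';'   { yyerrok; }  /* skip to the next ';' and resume */
          ;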

The Gecko source built and passed all of its own tests. My tests are intended to check the handling of ambiguous constructs and the quality of error recovery; as such, they are not pass/fail.

The main functional difference between Gecko and Bison is that Gecko is compiled into the program, which can then read and process a grammar at runtime. Bison processes the grammar ahead of time, producing tables that are compiled into the program as part of its build.

This difference enables Gecko to handle grammars that are created or updated at application runtime. This approach also simplifies the process of handling multiple grammars.
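
A sketch of the difference in C (the type and function names below are hypothetical, invented for illustration; they are not Gecko’s actual API):

/* Hypothetical runtime parser-generator interface (not Gecko's real API). */
typedef struct parser parser_t;
typedef struct parse_tree parse_tree_t;

/* The grammar text is turned into parse tables while the program is
   running; it could have been built or modified by the application
   itself. */
parser_t *parser_from_grammar(const char *grammar_text);
parse_tree_t *parser_parse(parser_t *p, const char *input);
void parser_free(parser_t *p);

/* With Bison, the equivalent of parser_from_grammar runs at build
   time, generating C tables that are compiled into the program;
   changing the grammar means rebuilding. */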

While on the subject of parser generators: I have been following the progress of Marpa, but have not tried it yet. The author has some interesting things to say about parsing.

Is Earley parsing now practical?

September 10, 2014

Language parsing was once a hot topic within computing research. The discovery of LALR parsing, quickly followed by yacc becoming available on Unix, resulted in this approach to language parsing dominating developer mind-share (helped by the first half of most compiler books being devoted to the theory of LR parsing). Until maybe 10 years ago the received wisdom was to implement parsers using Bison (the GNU successor to yacc); this tool automatically creates arrays of values that are read by a parser to decide how to process the tokens fed to it by a lexer. The accepted wisdom has now shifted to writing handwritten recursive descent parsers (or some variant), where the developer writes code that decides what to do next based on the current token(s); developers are back doing things the way they were done before yacc was written in the early 1970s.

Is this change of implementation choice driven by fashion (despite heroic efforts, nobody has been able to produce an industrial-strength LALR-based parser for C++; all the C++ compilers I am aware of use recursive descent and, sad to say, C++ is a trend setter), by existing languages outgrowing existing parsing technology, or just by developers forgetting what a maintenance nightmare recursive descent can be?

I’m a fan of using tools, and the big advantage parser generators have over handwritten parsers is that they warn about ambiguities in the syntax, i.e., potential faults in the specification or implementation. A handwritten recursive descent parser is just code that does what is written.

The big disadvantage of LALR parsing is the restrictions it places on the form of the grammars it accepts (in practice the tools usually complain that an ambiguity cannot be resolved, and make use of some default behavior to handle it). Transforming a grammar into a form acceptable to tools such as Bison, without too many warnings being generated, can take a lot of work by an experienced compiler developer. I once spent a month creating a workable LALR grammar for all of SQL-92, and could have written a recursive descent parser in less time (grammar transformations are as much a potential source of faults as handwritten parsers are).
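
The classic small example is the dangling-else. Bison reports a shift/reduce conflict for a grammar fragment like the following, which then has to be restructured (or the default resolution accepted, which binds each else to the nearest if):

statement : IF '(' expression ')' statement
          | IF '(' expression ')' statement ELSE statement
          | other_statement
          ;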

Introductions to parsing sometimes mention how much easier life would be using Earley parsing, if only its performance was not so appalling. It turns out that an algorithm giving linear-time Earley parsing for a large class of grammars was published in 1991, followed by various useful refinements in 2001 (all discussed in what is effectively the encyclopedia of parsing sitting on my shelf waiting to be read). Theory will sit on the shelf until somebody implements it, and a few days ago I found out about Marpa, a linear-time Earley parser.

So why does Earley parsing make life so much easier, at least for those implementing parsers, than LALR parsing? Earley parsing has far fewer restrictions on the form of the grammars it accepts. This means no more spending a month transforming a grammar into something acceptable to the tool being used (at least in theory; I have not tried any large grammars yet, but somebody has written one for C).
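
For instance, a naturally written expression grammar like the fragment below generates shift/reduce conflict warnings from Bison, and has to be rewritten with explicit precedence levels (or have precedence declarations added); an Earley parser accepts it as written, returning every possible parse:

expression : expression '+' expression
           | expression '*' expression
           | NUMBER
           ;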

Another benefit of using an Earley parser is the potential for improved syntax error recovery. The drive to reduce the size of the arrays generated by yacc/Bison resulted in information essential for good error recovery being thrown away (the original LALR theory threw some useful information away, and over the years several PhDs were awarded to researchers who figured out how to throw even more away). When things go wrong, Earley parsers have lots of useful information available to them.

To check out the hype I’m jumping in at the deep end with the grammar for C++14: can I really cut-and-paste the grammar from the appendix, add in some Marpa syntax, and start parsing C++? I will let you know whether I sink or swim.