
Taking a new GLR parser generator for a spin

May 3, 2026

It’s been 10 years since I last wrote about parsing tools, and pycparser, the C parser I took for a test drive back then, is still actively maintained. This week I read a post on Gecko, a new parser generator. Its author, Vladimir Makarov, implemented his first parser generator in 1985.

Gecko generates GLR (Generalized LR) parsers. In 2009, I predicted that GLR parsing was the future. It might still be the future, but since I made that prediction, handwritten parsers, using some form of recursive descent, are what the major compilers (e.g., gcc and llvm) have been updated to use. Bison, the almost invisible market leader for parser generation, has supported GLR parsers for almost 20 years. The other ‘generalized’ technique, Earley parsing, produces parsers that are much slower and are memory hogs.

GLR parsers can handle any Type-2 (context-free) language in the Chomsky hierarchy, including languages with ambiguous grammars. The LR parsers supported by yacc compatible tools (e.g., the Bison default mode), and the LL parsers generated by ANTLR, handle deterministic subsets of the Type-2 languages, and regular expressions match Type-3 languages.

Programming language grammars are often ambiguous (developers also use this term for context-sensitivity), i.e., there is more than one way of parsing a given sequence of input tokens. The classic example is the C statement T *p;, which could be a declaration of p as a pointer to T, or a redundant multiplication of T by p. This ambiguity can be resolved by maintaining a list of identifiers currently defined as typedefs, and having the lexer/parser look up the status of an identifier in contexts where a typedef name could occur. This is not a big deal for compilers, which have to build a symbol table anyway. However, it’s very inconvenient when only syntax analysis is needed, i.e., no semantic analysis of the source.

An alternative approach is to parse all the possibilities, in the hope that eventually only one parse remains syntactically possible. The following example could be disambiguated this way, because there is a subsequent use of T in a non-typedef context (I’m not aware of any tools that do this):

T *p;  // Declaration of p as a pointer to T, or multiplication of T by p?
T++;   // T is used as a variable here, so the previous statement was a multiplication

Another approach is to choose the most likely parse. Redundant multiplications are rare, and a declaration is the most likely usage. The token sequence f(x); is most likely to be a function call with one argument, rather than redundant parentheses around a declaration of x to have type f.
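Bison’s GLR mode supports this most-likely-parse strategy directly: when two parses of the same input merge, the alternative with the higher %dprec value wins. A sketch (the rule names are illustrative, not from a real C grammar):

```yacc
%glr-parser
%%
statement
    : expression ';'    %dprec 1   /* T * p; as a multiplication */
    | declaration ';'   %dprec 2   /* preferred: T * p; as a declaration */
    ;
```

Here the declaration reading is always preferred when both parses survive, which matches the heuristic that redundant multiplications are rare.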

Taking Gecko for a test drive requires a lexer and a grammar. Fortunately, one of the Gecko test cases includes a C lexer/grammar, and I adapted this to try out some C syntax test cases (code). My comparison point for these tests is my memory of testing out Bison with GLR enabled.

Developers make coding mistakes, and I made mistakes when adapting the existing Gecko C grammar. Perhaps it’s because I’m new to the tool, but I did not find Gecko’s minimalist error reporting helpful. Lots of debug information is available, but it is oriented towards somebody developing the innards of a parser generator. Hopefully, now that Gecko is up and working, the focus will shift to improving developer diagnostics.

When Bison fails to merge multiple parses into a single parse, it reports a failure. Gecko appears not to fail (it’s difficult to tell); it returns a parse tree.

Coding mistakes are sometimes syntax errors, and without some form of error recovery, syntax errors often cascade to produce lots of spurious errors. Recovering from syntax errors is hard, but skipping to the next semicolon works remarkably well as a catch-all.

In Bison, syntax error recovery has to be hand-coded into the grammar and parser. Gecko supports an automatic syntax error recovery process. Based on a small sample, this automatic process failed to handle the common syntax errors (e.g., missing identifier or missing operator in an expression) I tried it on (code). It did handle the example in the documentation. Perhaps this is a work in progress.
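The hand-coded Bison approach uses the reserved error token: the parser pops states until it can shift error, then discards input tokens until it can shift the following semicolon. A sketch of the skip-to-semicolon catch-all (rule names are illustrative):

```yacc
statement
    : declaration ';'
    | expression ';'
    | error ';'        { yyerrok; }   /* discard input up to the next ';' */
    ;
```

The yyerrok action tells Bison to resume normal error reporting immediately, so a second syntax error soon after the first is still reported.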

The Gecko source built and passed all of its own tests. My tests are intended to check for handling of ambiguous constructs and error handling. As such, they are not pass/fail.

The main functional difference between Gecko and Bison is that Gecko is compiled into the program and can then be used to read and process a grammar at program runtime. Bison processes the grammar to produce tables that are included as part of the build process of a program.

This difference enables Gecko to handle grammars that are created or updated at application runtime. This approach also simplifies the process of handling multiple grammars.

While on the subject of parser generators, I have been following the progress of Marpa, but have not tried it yet. The author has some interesting things to say about parsing.

Parsing ambiguous grammars (part 1)

March 4, 2009

Parsing a language is often much harder than people think, perhaps because they have only seen examples that use a simple language that has been designed to make explanation easy. Most languages in everyday use contain a variety of constructs that make the life of a parser writer difficult. Yes, there are parser generators, tools like bison, that automate the process of turning a grammar into a parser, and a language’s grammar is often found in the back of its reference manual. However, these grammars are often written to make the life of the programmer easier, not the life of the parser writer.

People may have spotted technical terms like LL(1), LR(1) and LALR(1); what they all have in common is the 1 in brackets: they all operate by looking one token ahead in the input stream. There is a big advantage to limiting the lookahead to one token: the generated tables are much smaller (back in the days when these tools were first created, 64K was considered to be an awful lot of memory, and today simple programs in embedded processors, with limited memory, often use simple grammars to parse communications traffic). Most existing parser generators operate within this limit and rely on compiler writers to sweat over, and contort, grammars to make them fit.

A simple example is provided by PL/1 (most real life examples tend to be more complicated) which did not have keywords, or to be exact did not restrict the spelling of identifiers that could be used to denote a variable, label or procedure. This meant that in the following code:

IF x THEN y = z; ELSE = w;

when the ELSE was encountered the compiler did not know whether it was the start of the alternative arm of the previously seen if-statement or an assignment statement. The token appearing after the ELSE needed to be examined to settle the question.

In days gone by, the person responsible for parsing PL/1 would have gotten up to some jiggery-pokery, such as having the lexer spot that an ELSE had been encountered and process the next token before reporting back what it had found to the syntax analyser.

A few years ago bison was upgraded to support GLR parsing. Rather than looking ahead at more tokens, a GLR parser detects that there is more than one way to parse the current input and promptly starts parsing each possibility (it is usually implemented by making copies of the appropriate data structures and updating each copy according to the particular parse being followed). The hope is that eventually all but one of these multiple parsers will reach a point where they cannot successfully parse the input tokens and can be killed off, leaving the one true parse (the case where multiple parses continue to exist was discussed a while ago; actually in another context).