Why can't C++ be parsed with a LR(1) parser?

C++ Problem Overview

I was reading about parsers and parser generators and found this statement in wikipedia's LR parsing -page:

> Many programming languages can be parsed using some variation of an LR parser. One notable exception is C++.

Why is it so? What particular property of C++ causes it to be impossible to parse with LR parsers?

Using google, I only found that C can be perfectly parsed with LR(1) but C++ requires LR(∞).

C++ Solutions

Solution 1 - C++

LR parsers can't handle ambiguous grammar rules, by design. (Made the theory easier back in the 1970s when the ideas were being worked out).

C and C++ both allow the following statement:

x * y ;

It has two different parses:

It can be the declaration of y, as pointer to type x
It can be a multiply of x and y, throwing away the answer.

Now, you might think the latter is stupid and should be ignored. Most would agree with you; however, there are cases where it might have a side effect (e.g., if multiply is overloaded). but that isn't the point. The point is there are two different parses, and therefore a program can mean different things depending on how this should have been parsed.

The compiler must accept the appropriate one under the appropriate circumstances, and in the absence of any other information (e.g., knowledge of the type of x) must collect both in order to decide later what to do. Thus a grammar must allow this. And that makes the grammar ambiguous.

Thus pure LR parsing can't handle this. Nor can many other widely available parser generators, such as Antlr, JavaCC, YACC, or traditional Bison, or even PEG-style parsers, used in a "pure" way.

There are lots of more complicated cases (parsing template syntax requires arbitrary lookahead, whereas LALR(k) can look ahead at most k tokens), but only it only takes one counterexample to shoot down pure LR (or the others) parsing.

Most real C/C++ parsers handle this example by using some kind of deterministic parser with an extra hack: they intertwine parsing with symbol table collection... so that by the time "x" is encountered, the parser knows if x is a type or not, and can thus choose between the two potential parses. But a parser that does this isn't context free, and LR parsers (the pure ones, etc.) are (at best) context free.

One can cheat, and add per-rule reduction-time semantic checks in the to LR parsers to do this disambiguation. (This code often isn't simple). Most of the other parser types have some means to add semantic checks at various points in the parsing, that can be used to do this.

And if you cheat enough, you can make LR parsers work for C and C++. The GCC guys did for awhile, but gave it up for hand-coded parsing, I think because they wanted better error diagnostics.

There's another approach, though, which is nice and clean and parses C and C++ just fine without any symbol table hackery: GLR parsers. These are full context free parsers (having effectively infinite lookahead). GLR parsers simply accept both parses, producing a "tree" (actually a directed acyclic graph that is mostly tree like) that represents the ambiguous parse. A post-parsing pass can resolve the ambiguities.

We use this technique in the C and C++ front ends for our DMS Software Reengineering Tookit (as of June 2017 these handle full C++17 in MS and GNU dialects). They have been used to process millions of lines of large C and C++ systems, with complete, precise parses producing ASTs with complete details of the source code. (See https://stackoverflow.com/a/17393852/120163">the AST for C++'s most vexing parse.)

Solution 2 - C++

There is an interesting thread on Lambda the Ultimate that discusses the LALR grammar for C++.

It includes a link to a PhD thesis that includes a discussion of C++ parsing, which states that:

> "C++ grammar is ambiguous, > context-dependent and potentially > requires infinite lookahead to resolve > some ambiguities".

It goes on to give a number of examples (see page 147 of the pdf).

The example is:

int(x), y, *const z;

meaning

int x;
int y;
int *const z;

Compare to:

int(x), y, new int;

meaning

(int(x)), (y), (new int));

(a comma-separated expression).

The two token sequences have the same initial subsequence but different parse trees, which depend on the last element. There can be arbitrarily many tokens before the disambiguating one.

Solution 3 - C++

The problem is never defined like this, whereas it should be interesting :

what is the smallest set of modifications to C++ grammar that would be necessary so that this new grammar could be perfectly parsed by a "non-context-free" yacc parser ? (making use only of one 'hack' : the typename/identifier disambiguation, the parser informing the lexer of every typedef/class/struct)

I see a few ones:

Type Type; is forbidden. An identifier declared as a typename cannot become a non-typename identifier (note that struct Type Type is not ambiguous and could be still allowed).

There are 3 types of names tokens :

types : builtin-type or because of a typedef/class/struct
template-functions
identifiers : functions/methods and variables/objects

Considering template-functions as different tokens solves the func< ambiguity. If func is a template-function name, then < must be the beginning of a template parameter list, otherwise func is a function pointer and < is the comparison operator.

Type a(2); is an object instantiation. Type a(); and Type a(int) are function prototypes.
int (k); is completely forbidden, should be written int k;
typedef int func_type(); and typedef int (func_type)(); are forbidden.

A function typedef must be a function pointer typedef : typedef int (*func_ptr_type)();

template recursion is limited to 1024, otherwise an increased maximum could be passed as an option to the compiler.
int a,b,c[9],*d,(*f)(), (*g)()[9], h(char); could be forbidden too, replaced by int a,b,c[9],*d; int (*f)();

int (*g)()[9];

int h(char);

one line per function prototype or function pointer declaration.

An highly preferred alternative would be to change the awful function pointer syntax,

int (MyClass::*MethodPtr)(char*);

being resyntaxed as:

int (MyClass::*)(char*) MethodPtr;

this being coherent with the cast operator (int (MyClass::*)(char*))

typedef int type, *type_ptr; could be forbidden too : one line per typedef. Thus it would become

typedef int type;

typedef int *type_ptr;
sizeof int, sizeof char, sizeof long long and co. could be declared in each source file. Thus, each source file making use of the type int should begin with

#type int : signed_integer(4)

and unsigned_integer(4) would be forbidden outside of that #type directive this would be a big step into the stupid sizeof int ambiguity present in so many C++ headers

The compiler implementing the resyntaxed C++ would, if encountering a C++ source making use of ambiguous syntax, move source.cpp too an ambiguous_syntax folder, and would create automatically an unambiguous translated source.cpp before compiling it.

Please add your ambiguous C++ syntaxes if you know some!

Solution 4 - C++

As you can see in my answer here, C++ contains syntax that cannot be deterministically parsed by an LL or LR parser due to the type resolution stage (typically post-parsing) changing the order of operations, and therefore the fundamental shape of the AST (typically expected to be provided by a first-stage parse).

Solution 5 - C++

I think you are pretty close to the answer.

LR(1) means that parsing from left to right needs only one token to look-ahead for the context, whereas LR(∞) means an infinite look-ahead. That is, the parser would have to know everything that was coming in order to figure out where it is now.

Solution 6 - C++

The "typedef" problem in C++ can be parsed with an LALR(1) parser that builds a symbol table while parsing (not a pure LALR parser). The "template" problem probably cannot be solved with this method. The advantage of this kind of LALR(1) parser is that the grammar (shown below) is an LALR(1) grammar (no ambiguity).

/* C Typedef Solution. */
  
/* Terminal Declarations. */
  
   <identifier>	=> lookup();  /* Symbol table lookup. */
  
/* Rules. */
  
   Goal		   -> [Declaration]... <eof>			   +> goal_
 
   Declaration -> Type... VarList ';'				   +> decl_
	           -> typedef Type... TypeVarList ';'	   +> typedecl_
  
   VarList	   -> Var /','...     
   TypeVarList -> TypeVar /','...
  
   Var		   -> [Ptr]... Identifier 
   TypeVar	   -> [Ptr]... TypeIdentifier								
 
   Identifier	  -> <identifier>	    +> identifier_(1)	   
   TypeIdentifier -> <identifier>      =+> typedefidentifier_(1,{typedef})
  
// The above line will assign {typedef} to the <identifier>,  
// because {typedef} is the second argument of the action typeidentifier_(). 
// This handles the context-sensitive feature of the C++ language.
  
   Ptr			-> '*'					+> ptr_
 
   Type			-> char					+> type_(1)
				-> int					+> type_(1)
				-> short				+> type_(1)
				-> unsigned				+> type_(1)
				-> {typedef}			+> type_(1)
  
/* End Of Grammar. */

The following input can be parsed without a problem:

 typedef int x;
 x * y;

 typedef unsigned int uint, *uintptr;
 uint    a, b, c;
 uintptr p, q, r;

The LRSTAR parser generator reads the above grammar notation and generates a parser that handles the "typedef" problem without ambiguity in the parse tree or AST. (Disclosure: I am the guy who created LRSTAR.)

Content Type	Original Author	Original Content on Stackoverflow
Question	Cheery	View Question on Stackoverflow
Solution 1 - C++	Ira Baxter	View Answer on Stackoverflow
Solution 2 - C++	Rob Walker	View Answer on Stackoverflow
Solution 3 - C++	reuns	View Answer on Stackoverflow
Solution 4 - C++	Sam Harwell	View Answer on Stackoverflow
Solution 5 - C++	casademora	View Answer on Stackoverflow
Solution 6 - C++	user1524750	View Answer on Stackoverflow