Are GCC and Clang parsers really handwritten?

C Parsing Compiler Construction Compilation

C Problem Overview

It seems that GCC and LLVM-Clang are using handwritten recursive descent parsers, and not machine generated, Bison-Flex based, bottom up parsing.

Could someone here please confirm that this is the case? And if so, why do mainstream compiler frameworks use handwritten parsers?

Update: interesting blog on this topic here

C Solutions

Solution 1 - C

There's a folk-theorem that says C is hard to parse, and C++ essentially impossible.

It isn't true.

What is true is that C and C++ are pretty hard to parse using LALR(1) parsers without hacking the parsing machinery and tangling in symbol table data. GCC in fact used to parse them, using YACC and additional hackery like this, and yes it was ugly. Now GCC uses handwritten parsers, but still with the symbol table hackery. The Clang folks never tried to use automated parser generators; AFAIK the Clang parser has always been hand-coded recursive descent.

What is true, is that C and C++ are relatively easy to parse with stronger automatically generated parsers, e.g., GLR parsers, and you don't need any hacks. The Elsa C++ parser (http://www.scottmcpeak.com/elkhound/) is one example of this. Our C++ Front End (http://www.semanticdesigns.com/Products/FrontEnds/CppFrontEnd.html) is another (as are all our "compiler" front ends; GLR is pretty wonderful parsing technology).

Our C++ front end isn't as fast as GCC's, and certainly slower than Elsa; we've put little energy into tuning it carefully because we have other more pressing issues (nonetheless it has been used on millions of lines of C++ code). Elsa is likely slower than GCC simply because it is more general. Given processor speeds these days, these differences might not matter a lot in practice.

But the "real compilers" that are widely distributed today have their roots in compilers of 10 or 20 years ago or more. Inefficiencies then mattered much more, and nobody had heard of GLR parsers, so people did what they knew how to do. Clang is certainly more recent, but then folk theorems retain their "persuasiveness" for a long time.

You don't have to do it that way anymore. You can very reasonably use GLR and other such parsers as front ends, with an improvement in compiler maintainability.
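
To illustrate the "no hacks" idea, here is a minimal sketch in C of deferring the decision: the parse step records both readings of `Foo * bar;` without looking at any symbol table, and a later pass picks one. This is not how Elsa, the Semantic Designs front end, or any particular GLR engine represents ambiguity; the node layout and the names `build_ambiguous` and `resolve` are assumptions made up for the example.

```c
#include <string.h>

enum node_kind { NODE_DECLARATION, NODE_MULTIPLY, NODE_AMBIGUOUS };

/* Hypothetical tree node: just enough to hold both readings of `lhs * rhs;`. */
struct node {
    enum node_kind kind;
    const char *lhs;                    /* type name, or left operand */
    const char *rhs;                    /* declared name, or right operand */
    const struct node *alternatives[2]; /* used only by NODE_AMBIGUOUS */
};

/* Parse step: keep both readings; no symbol table is consulted here. */
static void build_ambiguous(struct node *amb, struct node *decl,
                            struct node *mul,
                            const char *lhs, const char *rhs) {
    decl->kind = NODE_DECLARATION; decl->lhs = lhs; decl->rhs = rhs;
    mul->kind = NODE_MULTIPLY;     mul->lhs = lhs;  mul->rhs = rhs;
    amb->kind = NODE_AMBIGUOUS;    amb->lhs = lhs;  amb->rhs = rhs;
    amb->alternatives[0] = decl;
    amb->alternatives[1] = mul;
}

/* Later pass: keep the reading that agrees with the known type names. */
static const struct node *resolve(const struct node *n,
                                  const char *const *type_names, int count) {
    if (n->kind != NODE_AMBIGUOUS) return n;
    for (int i = 0; i < count; i++)
        if (strcmp(type_names[i], n->lhs) == 0)
            return n->alternatives[0];  /* declaration reading */
    return n->alternatives[1];          /* multiplication reading */
}
```

The point of the split is that the grammar stays a plain grammar: the knowledge about which names are types lives entirely in `resolve`, which runs after the tree exists.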

What is true, is that getting a grammar that matches your friendly neighborhood compiler's behavior is hard. While virtually all C++ compilers implement (most) of the original standard, they also tend to have lots of dark corner extensions, e.g., DLL specifications in MS compilers, etc. If you have a strong parsing engine, you can spend your time trying to get the final grammar to match reality, rather than trying to bend your grammar to match the limitations of your parser generator.

EDIT November 2012: Since writing this answer, we've improved our C++ front end to handle full C++11, including ANSI, GNU, and MS variant dialects. While there was lots of extra stuff, we didn't have to change our parsing engine; we just revised the grammar rules. We did have to change the semantic analysis; C++11 is semantically very complicated, and this work swamps the effort to get the parser to run.

EDIT February 2015: ... now handles full C++14. (See https://stackoverflow.com/questions/17388771/get-human-readable-ast-from-c-code/17393852#17393852 for GLR parses of a simple bit of code, and C++'s infamous "most vexing parse").

EDIT April 2017: Now handles (draft) C++17.

Solution 2 - C

Yes:

GCC used a yacc (bison) parser once upon a time, but it was replaced with a hand-written recursive descent parser at some point in the 3.x series: see http://gcc.gnu.org/wiki/New_C_Parser for links to relevant patch submissions.
Clang also uses a hand-written recursive descent parser: see the section "A single unified parser for C, Objective C, C++ and Objective C++" near the end of http://clang.llvm.org/features.html .

Solution 3 - C

Clang's parser is a hand-written recursive-descent parser, as are several other open-source and commercial C and C++ front ends.

Clang uses a recursive-descent parser for several reasons:

Performance: a hand-written parser allows us to write a fast parser, optimizing the hot paths as needed, and we're always in control of that performance. Having a fast parser has allowed Clang to be used in other development tools where "real" parsers are typically not used, e.g., syntax highlighting and code completion in an IDE.
Diagnostics and error recovery: because you're in full control with a hand-written recursive-descent parser, it's easy to add special cases that detect common problems and provide great diagnostics and error recovery (e.g., see http://clang.llvm.org/features.html#expressivediags). With automatically generated parsers, you're limited to the capabilities of the generator.
Simplicity: recursive-descent parsers are easy to write, understand, and debug. You don't need to be a parsing expert or learn a new tool to extend/improve the parser (which is especially important for an open-source project), yet you can still get great results.
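
To make the diagnostics and simplicity points concrete, here is a minimal sketch of a hand-written recursive-descent parser for a toy arithmetic grammar, with one special-cased diagnostic for an unmatched parenthesis. It is not Clang code; the `struct parser` layout, the function names, and the message wording are all assumptions chosen for the example.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical parser state: input cursor plus the first error message. */
struct parser {
    const char *src;   /* start of input, used for column numbers */
    const char *p;     /* current position */
    bool failed;
    char error[96];
};

static long parse_expr(struct parser *ps);

static void skip_spaces(struct parser *ps) {
    while (isspace((unsigned char)*ps->p)) ps->p++;
}

/* factor := NUMBER | '(' expr ')' */
static long parse_factor(struct parser *ps) {
    skip_spaces(ps);
    if (isdigit((unsigned char)*ps->p)) {
        long value = 0;
        while (isdigit((unsigned char)*ps->p))
            value = value * 10 + (*ps->p++ - '0');
        return value;
    }
    if (*ps->p == '(') {
        int open_col = (int)(ps->p - ps->src) + 1;
        ps->p++;
        long value = parse_expr(ps);
        skip_spaces(ps);
        if (*ps->p == ')') { ps->p++; return value; }
        /* Hand-written special case: point back at the unmatched '('. */
        if (!ps->failed) {
            ps->failed = true;
            snprintf(ps->error, sizeof ps->error,
                     "expected ')' to match '(' at column %d", open_col);
        }
        return 0;
    }
    if (!ps->failed) {
        ps->failed = true;
        snprintf(ps->error, sizeof ps->error,
                 "expected a number or '(' at column %d",
                 (int)(ps->p - ps->src) + 1);
    }
    return 0;
}

/* term := factor ('*' factor)* */
static long parse_term(struct parser *ps) {
    long value = parse_factor(ps);
    for (;;) {
        skip_spaces(ps);
        if (ps->failed || *ps->p != '*') return value;
        ps->p++;
        value *= parse_factor(ps);
    }
}

/* expr := term (('+' | '-') term)* */
static long parse_expr(struct parser *ps) {
    long value = parse_term(ps);
    for (;;) {
        skip_spaces(ps);
        if (ps->failed || (*ps->p != '+' && *ps->p != '-')) return value;
        char op = *ps->p++;
        long rhs = parse_term(ps);
        value = (op == '+') ? value + rhs : value - rhs;
    }
}

/* Parse a whole string; returns true and stores the value on success. */
static bool parse_string(const char *text, long *out, struct parser *ps) {
    ps->src = ps->p = text;
    ps->failed = false;
    ps->error[0] = '\0';
    *out = parse_expr(ps);
    skip_spaces(ps);
    if (!ps->failed && *ps->p != '\0') {
        ps->failed = true;
        snprintf(ps->error, sizeof ps->error, "unexpected '%c' at column %d",
                 *ps->p, (int)(ps->p - ps->src) + 1);
    }
    return !ps->failed;
}
```

Each grammar rule is one function, so the "unmatched parenthesis" message is just an ordinary `if` at the place where the parser already knows which `(` it is trying to close.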

Overall, for a C++ compiler, it just doesn't matter much: the parsing part of C++ is non-trivial, but it's still one of the easier parts, so it pays to keep it simple. Semantic analysis---particularly name lookup, initialization, overload resolution, and template instantiation---is orders of magnitude more complicated than parsing. If you want proof, go check out the distribution of code and commits in Clang's "Sema" component (for semantic analysis) vs. its "Parse" component (for parsing).

Solution 4 - C

Weird answers there!

C/C++ grammars aren't context free. They are context sensitive because of the Foo * bar; ambiguity. We have to build a list of typedefs to know if Foo is a type or not.
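
A minimal sketch of that typedef list in C, assuming a flat, fixed-size table (a real compiler needs scoped symbol tables); the names `add_typedef`, `is_typedef_name`, and `classify_star_statement` are made up for illustration:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical flat typedef table; real compilers track scopes as well. */
#define MAX_TYPEDEFS 64

static const char *typedef_names[MAX_TYPEDEFS];
static int typedef_count = 0;

/* Called when the parser finishes a `typedef ... name;` declaration. */
static bool add_typedef(const char *name) {
    if (typedef_count >= MAX_TYPEDEFS) return false;
    typedef_names[typedef_count++] = name;
    return true;
}

static bool is_typedef_name(const char *name) {
    for (int i = 0; i < typedef_count; i++)
        if (strcmp(typedef_names[i], name) == 0) return true;
    return false;
}

enum stmt_kind { STMT_DECLARATION, STMT_EXPRESSION };

/* Decide how to read `<first_ident> * <ident> ;` from what has been seen so far. */
static enum stmt_kind classify_star_statement(const char *first_ident) {
    return is_typedef_name(first_ident) ? STMT_DECLARATION : STMT_EXPRESSION;
}
```

With an empty table, `Foo * bar;` is read as a multiplication; after `add_typedef("Foo")`, the same tokens are read as the declaration of a pointer named `bar`.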

Ira Baxter: I don't see the point of your GLR thing. Why build a parse tree that contains ambiguities? Parsing means resolving ambiguities and building the syntax tree. You resolve these ambiguities in a second pass, so this isn't less ugly. For me it is far more ugly ...

Yacc is an LR(1) parser generator (or LALR(1)), but it can easily be modified to be context sensitive. And there is nothing ugly in that. Yacc/Bison was created to help in parsing the C language, so it probably isn't the ugliest tool for generating a C parser ...

Until GCC 3.x, the C parser was generated by yacc/bison, with the typedef table built during parsing. With "in parse" typedef table building, the C grammar becomes locally context free and, furthermore, "locally LR(1)".

Now, in GCC 4.x, it is a recursive descent parser. It is exactly the same parser as in GCC 3.x: it is still LR(1) and has the same grammar rules. The difference is that the yacc parser has been rewritten by hand, the shift/reduce actions are now hidden in the call stack, and there is no "state454 : if (nextsym == '(') goto state398" as in GCC 3.x's yacc parser, so it is easier to patch, to handle errors and print nicer messages, and to perform some of the next compilation steps during parsing. This comes at the price of much less "easy to read" code for a GCC noob.

Why did they switch from yacc to recursive descent? Because it is pretty much necessary to avoid yacc in order to parse C++, and because GCC aspires to be a multi-language compiler, i.e. sharing as much code as possible between the different languages it can compile. This is why the C++ and C parsers are written in the same way.

C++ is harder to parse than C because it isn't "locally" LR(1) as C is; it is not even LR(k). Look at func<4 > 2> which is a template function instantiated with 4 > 2, i.e. func<4 > 2> has to be read as func<1>. This is definitely not LR(1). Now consider func<4 > 2 > 1 > 3 > 3 > 8 > 9 > 8 > 7 > 8>. This is where recursive descent can easily resolve the ambiguity, at the price of a few more function calls (parse_template_parameter is the ambiguous parser function. If parse_template_parameter(17tokens) fails, try again with parse_template_parameter(15tokens), parse_template_parameter(13tokens) ... until it works).

I don't know why it wouldn't be possible to add into yacc/bison recursive sub grammars, maybe this will be the next step in gcc/GNU parser development?

Solution 5 - C

GCC's parser is handwritten. I suspect the same for Clang. This is probably for a few reasons:

Performance: something that you've hand-optimized for your particular task will almost always perform better than a general solution. Abstraction usually has a performance hit.
Timing: at least in the case of GCC, GCC predates a lot of free developer tools (came out in 1987). There was no free version of yacc, etc. at the time, which I'd imagine would've been a priority to the people at the FSF.

This is probably not a case of "not invented here" syndrome, but more along the lines of "there was nothing optimized specifically for what we needed, so we wrote our own".

Solution 6 - C

> It seems that GCC and LLVM-Clang are using handwritten recursive descent parsers, and not machine generated, Bison-Flex based, bottom up parsing.

I don't think Bison in particular can handle the grammar without parsing some things ambiguously and resolving them in a second pass later.

I know Haskell's Happy allows for monadic (i.e. state-dependent) parsers that can resolve the particular issue with C syntax, but I know of no C parser generators that allow a user-supplied state monad.

In theory, error recovery would be a point in favor of a handwritten parser, but my experience with GCC/Clang has been that the error messages are not particularly good.

As for performance - some of the claims seem unsubstantiated. Generating a big state machine using a parser generator should result in something that's O(n) and I doubt parsing is the bottleneck in much tooling.
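
As a rough illustration of the O(n) point, here is a minimal table-driven state machine in C that takes exactly one transition per input character. It is only a toy recognizer for identifiers and integers, not a parser and not the output of Bison or any other generator; the states, character classes, and the name `run_machine` are assumptions for the example.

```c
#include <ctype.h>
#include <stddef.h>

enum state { S_START, S_IDENT, S_NUMBER, S_REJECT, STATE_COUNT };
enum char_class { C_LETTER, C_DIGIT, C_OTHER, CLASS_COUNT };

/* transition[current state][class of next character] -> next state */
static const enum state transition[STATE_COUNT][CLASS_COUNT] = {
    /* S_START  */ { S_IDENT,  S_NUMBER, S_REJECT },
    /* S_IDENT  */ { S_IDENT,  S_IDENT,  S_REJECT },
    /* S_NUMBER */ { S_REJECT, S_NUMBER, S_REJECT },
    /* S_REJECT */ { S_REJECT, S_REJECT, S_REJECT },
};

static enum char_class classify(char c) {
    if (isalpha((unsigned char)c) || c == '_') return C_LETTER;
    if (isdigit((unsigned char)c)) return C_DIGIT;
    return C_OTHER;
}

/* Runs the table over the input; *steps counts the transitions taken. */
static enum state run_machine(const char *text, size_t *steps) {
    enum state s = S_START;
    *steps = 0;
    for (const char *p = text; *p != '\0'; p++) {
        s = transition[s][classify(*p)];
        (*steps)++;
    }
    return s;
}
```

The loop body is a single table lookup, so the work grows linearly with the input length regardless of how large the table is.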

Content Type	Original Author	Original Content on Stackoverflow
Question	JCLL	View Question on Stackoverflow
Solution 1 - C	Ira Baxter	View Answer on Stackoverflow
Solution 2 - C	Matthew Slattery	View Answer on Stackoverflow
Solution 3 - C	Doug	View Answer on Stackoverflow
Solution 4 - C	reuns	View Answer on Stackoverflow
Solution 5 - C	Rafe Kettler	View Answer on Stackoverflow
Solution 6 - C	Vanessa McHale	View Answer on Stackoverflow