How to write a Parser in C#?

C#ParsingXml ParsingInterpreter

C# Problem Overview


How do I go about writing a Parser (Recursive Descent?) in C#? For now I just want a simple parser that parses arithmetic expressions (and reads variables?). Though later I intend to write an xml and html parser (for learning purposes). I am doing this because of the wide range of stuff in which parsers are useful: Web development, Programming Language Interpreters, Inhouse Tools, Gaming Engines, Map and Tile Editors, etc. So what is the basic theory of writing parsers and how do I implement one in C#? Is C# the right language for parsers (I once wrote a simple arithmetic parser in C++ and it was efficient. Will JIT compilation prove equally good?). Any helpful resources and articles. And best of all, code examples (or links to code examples).

Note: Out of curiosity, has anyone answering this question ever implemented a parser in C#?

C# Solutions


Solution 1 - C#

I have implemented several parsers in C# - hand-written and tool generated.

A very good introductory tutorial on parsing in general is Let's Build a Compiler - it demonstrates how to build a recursive descent parser; and the concepts are easily translated from his language (I think it was Pascal) to C# for any competent developer. This will teach you how a recursive descent parser works, but it is completely impractical to write a full programming language parser by hand.

You should look into some tools to generate the code for you - if you are determined to write a classical recursive descent parser (TinyPG, Coco/R, Irony). Keep in mind that there are other ways to write parsers now, that usually perform better - and have easier definitions (e.g. TDOP parsing or Monadic Parsing).

On the topic of whether C# is up for the task - C# has some of the best text libraries out there. A lot of the parsers today (in other languages) have an obscene amount of code to deal with Unicode etc. I won't comment too much on JITted code because it can get quite religious - however you should be just fine. IronJS is a good example of a parser/runtime on the CLR (even though its written in F#) and its performance is just shy of Google V8.

Side Note: Markup parsers are completely different beasts when compared to language parsers - they are, in the majority of the cases, written by hand - and at the scanner/parser level very simple; they are not usually recursive descent - and especially in the case of XML it is better if you don't write a recursive descent parser (to avoid stack overflows, and because a 'flat' parser can be used in SAX/push mode).

Solution 2 - C#

Sprache is a powerful yet lightweight framework for writing parsers in .NET. There is also a Sprache NuGet package. To give you an idea of the framework here is one of the samples that can parse a simple arithmetic expression into an .NET expression tree. Pretty amazing I would say.

using System;
using System.Linq.Expressions;
using Sprache;

namespace LinqyCalculator
{
    static class ExpressionParser
    {
        public static Expression<Func<decimal>> ParseExpression(string text)
        {
            return Lambda.Parse(text);
        }

        static Parser<ExpressionType> Operator(string op, ExpressionType opType)
        {
            return Parse.String(op).Token().Return(opType);
        }

        static readonly Parser<ExpressionType> Add = Operator("+", ExpressionType.AddChecked);
        static readonly Parser<ExpressionType> Subtract = Operator("-", ExpressionType.SubtractChecked);
        static readonly Parser<ExpressionType> Multiply = Operator("*", ExpressionType.MultiplyChecked);
        static readonly Parser<ExpressionType> Divide = Operator("/", ExpressionType.Divide);

        static readonly Parser<Expression> Constant =
            (from d in Parse.Decimal.Token()
             select (Expression)Expression.Constant(decimal.Parse(d))).Named("number");

        static readonly Parser<Expression> Factor =
            ((from lparen in Parse.Char('(')
              from expr in Parse.Ref(() => Expr)
              from rparen in Parse.Char(')')
              select expr).Named("expression")
             .XOr(Constant)).Token();

        static readonly Parser<Expression> Term = Parse.ChainOperator(Multiply.Or(Divide), Factor, Expression.MakeBinary);

        static readonly Parser<Expression> Expr = Parse.ChainOperator(Add.Or(Subtract), Term, Expression.MakeBinary);

        static readonly Parser<Expression<Func<decimal>>> Lambda =
            Expr.End().Select(body => Expression.Lambda<Func<decimal>>(body));
    }
}

Solution 3 - C#

C# is almost a decent functional language, so it is not such a big deal to implement something like Parsec in it. Here is one of the examples of how to do it: http://jparsec.codehaus.org/NParsec+Tutorial

It is also possible to implement a combinator-based http://pdos.csail.mit.edu/~baford/packrat/thesis/">Packrat</a>;, in a very similar way, but this time keeping a global parsing state somewhere instead of doing a pure functional stuff. In my (very basic and ad hoc) implementation it was reasonably fast, but of course a code generator like http://code.google.com/p/peg-sharp/">this</a> must perform better.

Solution 4 - C#

I know that I am a little late, but I just published a parser/grammar/AST generator library named Ve Parser. you can find it at http://veparser.codeplex.com or add to your project by typing 'Install-Package veparser' in Package Manager Console. This library is kind of Recursive Descent Parser that is intended to be easy to use and flexible. As its source is available to you, you can learn from its source codes. I hope it helps.

Solution 5 - C#

In my opinion, there is a better way to implement parsers than the traditional methods that results in simpler and easier to understand code, and especially makes it easier to extend whatever language you are parsing by just plugging in a new class in a very object-oriented way. One article of a larger series that I wrote focuses on this parsing method, and full source code is included for a C# 2.0 parser: http://www.codeproject.com/Articles/492466/Object-Oriented-Parsing-Breaking-With-Tradition-Pa

Solution 6 - C#

Well... where to start with this one....

First off, writing a parser, well that's a very broad statement especially with the question your asking.

Your opening statement was that you wanted a simple arithmatic "parser" , well technically that's not a parser, it's a lexical analyzer, similar to what you may use for creating a new language. ( http://en.wikipedia.org/wiki/Lexical_analysis ) I understand however exactly where the confusion of them being the same thing may come from. It's important to note, that Lexical analysis is ALSO what you'll want to understand if your going to write language/script parsers too, this is strictly not parsing because you are interpreting the instructions as opposed to making use of them.

Back to the parsing question....

This is what you'll be doing if your taking a rigidly defined file structure to extract information from it.

In general you really don't have to write a parser for XML / HTML, beacuse there are already a ton of them around, and more so if your parsing XML produced by the .NET run time, then you don't even need to parse, you just need to "serialise" and "de-serialise".

In the interests of learning however, parsing XML (Or anything similar like html) is very straight forward in most cases.

if we start with the following XML:

    <movies>
      <movie id="1">
        <name>Tron</name>
      </movie>
      <movie id="2">
        <name>Tron Legacy</name>
      </movie>
    <movies>

we can load the data into an XElement as follows:

    XElement myXML = XElement.Load("mymovies.xml");

you can then get at the 'movies' root element using 'myXML.Root'

MOre interesting however, you can use Linq easily to get the nested tags:

    var myElements = from p in myXML.Root.Elements("movie")
                     select p;

Will give you a var of XElements each containing one '...' which you can get at using somthing like:

    foreach(var v in myElements)
    {
      Console.WriteLine(string.Format("ID {0} = {1}",(int)v.Attributes["id"],(string)v.Element("movie"));
    }

For anything else other than XML like data structures, then I'm afraid your going to have to start learning the art of regular expressions, a tool like "Regular Expression Coach" will help you imensly ( http://weitz.de/regex-coach/ ) or one of the more uptodate similar tools.

You'll also need to become familiar with the .NET regular expression objects, ( http://www.codeproject.com/KB/dotnet/regextutorial.aspx ) should give you a good head start.

Once you know how your reg-ex stuff works then in most cases it's a simple case case of reading in the files one line at a time and making sense of them using which ever method you feel comfortable with.

A good free source of file formats for almost anything you can imagine can be found at ( http://www.wotsit.org/ )

Solution 7 - C#

For the record I implemented parser generator in C# just because I couldn't find any working properly or similar to YACC (see: http://sourceforge.net/projects/naivelangtools/).

However after some experience with ANTLR I decided to go with LALR instead of LL. I know that theoretically LL is easier to implement (generator or parser) but I simply cannot live with stack of expressions just to express priorities of operators (like * goes before + in "2+5*3"). In LL you say that mult_expr is embedded inside add_expr which does not seem natural for me.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionApprenticeHackerView Question on Stackoverflow
Solution 1 - C#Jonathan DickinsonView Answer on Stackoverflow
Solution 2 - C#Martin LiversageView Answer on Stackoverflow
Solution 3 - C#SK-logicView Answer on Stackoverflow
Solution 4 - C#000View Answer on Stackoverflow
Solution 5 - C#Ken BeckettView Answer on Stackoverflow
Solution 6 - C#shawtyView Answer on Stackoverflow
Solution 7 - C#greenoldmanView Answer on Stackoverflow