Resources for lexing, tokenising and parsing in Python

Tags: Python, Parsing, Resources, Lex

Python Problem Overview


Can people point me to resources on lexing, parsing and tokenising with Python?

I'm doing a little hacking on an open source project (hotwire) and wanted to make a few changes to the code that lexes, parses and tokenises the commands entered into it. As it is real working code, it is fairly complex and a bit hard to work out.

I haven't worked on code to lex/parse/tokenise before, so I was thinking one approach would be to work through a tutorial or two on this aspect. I would hope to learn enough to navigate around the code I actually want to alter. Is there anything suitable out there? (Ideally it could be done in an afternoon without having to buy and read the dragon book first ...)

Edit: (7 Oct 2008) None of the answers below quite gives me what I want. With them I could generate parsers from scratch, but I want to learn how to write my own basic parser from scratch, without using lex and yacc or similar tools. Having done that, I can then understand the existing code better.

So could someone point me to a tutorial where I can build a basic parser from scratch, using just Python?

Python Solutions


Solution 1 - Python

I'm a happy user of PLY. It is a pure-Python implementation of Lex & Yacc, with lots of small niceties that make it quite Pythonic and easy to use. Since Lex & Yacc are the most popular lexing & parsing tools and are used in the most projects, PLY has the advantage of standing on giants' shoulders. A lot of knowledge exists online on Lex & Yacc, and you can freely apply it to PLY.

PLY also has a good documentation page with some simple examples to get you started.

For a listing of lots of Python parsing tools, see this.

Solution 2 - Python

This question is pretty old, but maybe my answer will help someone who wants to learn the basics. I find this resource to be very good. It is a simple interpreter written in Python without the use of any external libraries, so it will help anyone who would like to understand the inner workings of parsing, lexing, and tokenising:

"A Simple Interpreter from Scratch in Python:" Part 1, Part 2, Part 3, and Part 4.

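To give a flavour of what a from-scratch approach looks like, here is a minimal sketch of a hand-written lexer and recursive-descent parser for arithmetic expressions, using only the standard library. It is not taken from the linked series; the grammar, the token names and the helpers (tokenize, Parser, evaluate) are all made up for illustration.

```python
import operator
import re

# Token specification as (name, regex) pairs; these names are illustrative.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("OP", r"[+\-*/()]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Lexer: turn a string into a list of (kind, value) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":  # whitespace is recognised but dropped
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


class Parser:
    """Recursive-descent parser, one method per grammar rule:

    expr   := term (('+' | '-') term)*
    term   := factor (('*' | '/') factor)*
    factor := NUMBER | '(' expr ')'
    """

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return ("EOF", "")

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def parse(self):
        tree = self.expr()
        if self.peek()[0] != "EOF":
            raise SyntaxError(f"unexpected token {self.peek()[1]!r}")
        return tree

    def expr(self):
        node = self.term()
        while self.peek() in (("OP", "+"), ("OP", "-")):
            op = self.advance()[1]
            node = (op, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in (("OP", "*"), ("OP", "/")):
            op = self.advance()[1]
            node = (op, node, self.factor())
        return node

    def factor(self):
        kind, value = self.advance()
        if kind == "NUMBER":
            return int(value)
        if (kind, value) == ("OP", "("):
            node = self.expr()
            if self.advance() != ("OP", ")"):
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {value!r}")


OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def evaluate(node):
    """Walk the tuple-based syntax tree and compute its value."""
    if isinstance(node, int):
        return node
    op, left, right = node
    return OPS[op](evaluate(left), evaluate(right))


tree = Parser(tokenize("2 + 3 * (4 - 1)")).parse()
print(tree)            # ('+', 2, ('*', 3, ('-', 4, 1)))
print(evaluate(tree))  # 11
```

The lexer and parser are kept separate on purpose: the lexer only knows about characters, the parser only about tokens, and operator precedence falls out of which rule calls which.
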
Solution 3 - Python

For medium-complex grammars, PyParsing is brilliant. You can define grammars directly within Python code, with no need for code generation:

>>> from pyparsing import Word, alphas
>>> greet = Word( alphas ) + "," + Word( alphas ) + "!" # <-- grammar defined here
>>> hello = "Hello, World!"
>>> print(hello, "->", greet.parseString( hello ))
Hello, World! -> ['Hello', ',', 'World', '!']

(Example taken from the PyParsing home page).

With parse actions (functions that are invoked when a certain grammar rule is triggered), you can convert parses directly into abstract syntax trees, or any other representation.

There are many helper functions that encapsulate recurring patterns, like operator hierarchies, quoted strings, nesting or C-style comments.

Solution 4 - Python

Pygments is a source code syntax highlighter written in Python. It has lexers and formatters, and its source may be interesting to peek at.

Solution 5 - Python

Solution 6 - Python

Have a look at the standard module shlex and modify a copy of it to match the syntax you use for your shell. It is a good starting point.

If you want all the power of a complete solution for lexing/parsing, ANTLR can generate Python too.
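
Before modifying a copy, it may help to see what shlex does out of the box. This is a small sketch, assuming Python 3.6 or later (for punctuation_chars); the command strings are made up for illustration.

```python
import shlex

command = 'ls -l "my file.txt" | grep foo'

# shlex.split applies POSIX shell quoting rules: quotes group words
# and are stripped from the resulting tokens.
split_tokens = shlex.split(command)
print(split_tokens)  # ['ls', '-l', 'my file.txt', '|', 'grep', 'foo']

# For finer control, drive the lexer object directly. punctuation_chars=True
# makes characters such as | ; & < > separate tokens even when they are not
# surrounded by whitespace.
lexer = shlex.shlex("cat notes.txt|sort", posix=True, punctuation_chars=True)
tokens = list(lexer)
print(tokens)  # ['cat', 'notes.txt', '|', 'sort']
```

The lexer object also exposes attributes such as wordchars, quotes and whitespace, which is where you would start when adapting it to a different shell syntax.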

Solution 7 - Python

Federico Tomassetti has a good, concise write-up covering everything from BNF to binary deciphering, including:

  • lexers,
  • parsers,
  • abstract syntax trees (ASTs), and
  • Construct/code generators.

He even mentions the newer Parsing Expression Grammars (PEG).

https://tomassetti.me/parsing-in-python/

Solution 8 - Python

I suggest http://www.canonware.com/Parsing/, since it is pure Python and you don't need to learn a grammar, but it isn't widely used and has comparatively little documentation. The heavyweights are ANTLR and PyParsing. ANTLR can also generate Java and C++ parsers, as well as AST walkers, but you will have to learn what amounts to a new language.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Hamish Downer | View Question on Stackoverflow
Solution 1 - Python | Eli Bendersky | View Answer on Stackoverflow
Solution 2 - Python | Saad | View Answer on Stackoverflow
Solution 3 - Python | Torsten Marek | View Answer on Stackoverflow
Solution 4 - Python | nilamo | View Answer on Stackoverflow
Solution 5 - Python | Tony Arkles | View Answer on Stackoverflow
Solution 6 - Python | PW. | View Answer on Stackoverflow
Solution 7 - Python | John Greene | View Answer on Stackoverflow
Solution 8 - Python | nimish | View Answer on Stackoverflow