Looking for a clear definition of what a "tokenizer", "parser" and "lexer" are and how they are related to each other and used?

Tags: parsing, lexer, tokenize

Parsing Problem Overview


I am looking for a clear definition of what a "tokenizer", "parser" and "lexer" are and how they are related to each other (e.g., does a parser use a tokenizer or vice versa)? I need to create a program that will go through .c/.h source files to extract data declarations and definitions.

I have been looking for examples and can find some info, but I am really struggling to grasp the underlying concepts like grammar rules, parse trees and abstract syntax trees, and how they interrelate. Eventually these concepts need to be represented in an actual program, but 1) what do they look like, and 2) are there common implementations?

I have been looking at Wikipedia on these topics and at programs like Lex and Yacc, but, having never gone through a compiler class (EE major), I am finding it difficult to fully understand what is going on.

Parsing Solutions


Solution 1 - Parsing

A tokenizer breaks a stream of text into tokens, usually by looking for whitespace (tabs, spaces, new lines).
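For instance, a first-cut tokenizer really can be little more than a whitespace split. A minimal C sketch (illustrative only, not a production scanner):

    #include <stdio.h>
    #include <string.h>

    /* Split a buffer into whitespace-separated tokens.
       Note: strtok modifies the buffer in place. */
    int main(void) {
        char input[] = "int x = 1;";
        for (char *tok = strtok(input, " \t\n"); tok != NULL;
             tok = strtok(NULL, " \t\n")) {
            printf("token: %s\n", tok);
        }
        return 0;
    }

Note that this prints '1;' as a single token: whitespace alone is too crude for source code, which is one reason real scanners match token patterns (identifiers, numbers, punctuation) rather than just splitting on spaces.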

A lexer is basically a tokenizer, but it usually attaches extra context to the tokens -- this token is a number, that token is a string literal, this other token is an equality operator.
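In code, "attaching extra context" usually means pairing each lexeme with a token type. A possible shape in C (the names here are illustrative, not taken from any particular tool):

    typedef enum {
        TOK_KEYWORD,     /* e.g. "int" */
        TOK_IDENTIFIER,  /* e.g. "x"   */
        TOK_OPERATOR,    /* e.g. "="   */
        TOK_NUMBER,      /* e.g. "1"   */
        TOK_SEMICOLON    /* ";"        */
    } TokenType;

    typedef struct {
        TokenType   type;  /* the extra context the lexer adds */
        const char *text;  /* the underlying lexeme            */
    } Token;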

A parser takes the stream of tokens from the lexer and turns it into an abstract syntax tree (AST) representing the program (usually) described by the original text.
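The tree itself is typically a structure of tagged nodes. One hypothetical node layout for a declaration like "int x = 1;" (the field names are assumptions for illustration):

    /* A possible AST node for a variable declaration. */
    typedef struct AstNode {
        enum { NODE_DECLARATION /* , NODE_ASSIGNMENT, ... */ } kind;
        const char     *type_name;   /* "int" */
        const char     *identifier;  /* "x"   */
        struct AstNode *initializer; /* subtree for "1", or NULL */
    } AstNode;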

Last I checked, the best book on the subject was "Compilers: Principles, Techniques, and Tools" usually just known as "The Dragon Book".

Solution 2 - Parsing

Example:

int x = 1;

A lexer or tokeniser will split that up into tokens 'int', 'x', '=', '1', ';'.

A parser will take those tokens and use them to build up some understanding (a concrete sketch follows the list):

  • we have a statement
  • it's a definition of an integer
  • the integer is called 'x'
  • 'x' should be initialised with the value 1
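Using a typed-token representation like the one sketched in Solution 1, the lexer's output for that line might look like this (hand-written for illustration):

    /* Hypothetical lexer output for "int x = 1;" */
    Token tokens[] = {
        { TOK_KEYWORD,    "int" },
        { TOK_IDENTIFIER, "x"   },
        { TOK_OPERATOR,   "="   },
        { TOK_NUMBER,     "1"   },
        { TOK_SEMICOLON,  ";"   }
    };
    /* The parser recognizes the pattern
       keyword identifier '=' number ';'
       as a variable definition. */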

Solution 3 - Parsing

I would say that a lexer and a tokenizer are basically the same thing, and that they smash the text up into its component parts (the 'tokens'). The parser then interprets the tokens using a grammar.
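The grammar and the parser code often mirror each other closely. Here is a minimal recursive-descent sketch for a single rule, reusing the Token type from above (the rule is a simplification for illustration, not real C grammar):

    /* Rule (informal BNF):
       declaration := KEYWORD IDENTIFIER '=' NUMBER ';' */
    #include <stdbool.h>

    /* Consume one token if it has the expected type. */
    static bool expect(const Token **cur, TokenType t) {
        if ((*cur)->type != t) return false;
        (*cur)++;
        return true;
    }

    /* Matches the declaration rule against the token stream. */
    static bool parse_declaration(const Token **cur) {
        return expect(cur, TOK_KEYWORD)
            && expect(cur, TOK_IDENTIFIER)
            && expect(cur, TOK_OPERATOR)
            && expect(cur, TOK_NUMBER)
            && expect(cur, TOK_SEMICOLON);
    }

Each alternative in a grammar rule typically becomes a branch in the corresponding parse function, which is why hand-written parsers of this style are called recursive descent.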

I wouldn't get too hung up on precise terminological usage though - people often use 'parsing' to describe any action of interpreting a lump of text.

Solution 4 - Parsing

(adding to the given answers)

  • The tokenizer will also remove any comments, passing only tokens on to the lexer.
  • The lexer will also define scopes for those tokens (variables/functions) -- though in many compilers scope resolution actually happens after parsing, during semantic analysis.
  • The parser will then build the code/program structure (see the sketch below).
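Tying the earlier sketches together, the stages run in this order (everything here reuses the illustrative tokens[] and parse_declaration from above; in a real pipeline the lexer would already have discarded comments by this point):

    #include <stdio.h>

    int main(void) {
        const Token *cursor = tokens;    /* lexer output from above */
        if (parse_declaration(&cursor))  /* parser checks structure */
            printf("recognized a variable definition\n");
        return 0;
    }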

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Question: lordhog (View Question on Stackoverflow)
Solution 1 - Parsing: Roger Lipscombe (View Answer on Stackoverflow)
Solution 2 - Parsing: anon (View Answer on Stackoverflow)
Solution 3 - Parsing: Will Dean (View Answer on Stackoverflow)
Solution 4 - Parsing: MCH (View Answer on Stackoverflow)