On Wed, May 29, 2013 at 10:20 AM, Matijn Woudt <[email protected]> wrote:
> It is possible to write a whole parser as a single regex, being it terribly
> long and complex.
>
While regular expressions are often used in the lexer--the part that scans
the input stream and breaks it up into meaningful tokens like
{ keyword: "function" }
{ operator: "+" }
and
{ identifier: "$foo" }
that form the building blocks of the language--they aren't combined into a
single expression. Instead, a lexer generator is used to build a state
machine that switches the active expressions to check based on the previous
tokens and context. Each expression recognizes a different type of token,
and many times these aren't even regular expressions.
The second stage--combining tokens based on the rules of the grammar--is
more complex and beyond the abilities of regular expressions. There are
plenty of books on the subject and tools [1] to build the pieces such as
Lex, Yacc, Flex, and Bison. Someone even asked this question on Stack
Overflow [2] a few years ago. And I'm sure if you look you can find someone
that did a masters thesis proving that regular expressions cannot handle a
context-free grammar. And finally I leave you with Jeff Atwood's article
about (not) parsing HTML with regex. [3]
Peace,
David
[1] http://dinosaur.compilertools.net/
[2]
http://stackoverflow.com/questions/3487089/are-regular-expressions-used-to-build-parsers
[3]
http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html