Parsec: A convenient DSL for writing Parsers

# Parsec: A convenient DSL for writing Parsers ### By: Kendrick, Kang Liang, Chee Yuan --- # Overview 1. Introduction, Monad Transformers, Input Streams 1. Parsec Basics (no combinators) 1. Parsec Combinators, Streams, ParsecT --- # Difference from lecture content - Offering a different perspective - More complete look at Parsec - Peek into source of actual Parsec library --- # Introduction Parsec is a **Domain-Specific Language (DSL)** for writing Parsers - A DSL is a customised language targeted to a particular kind of problem - HTML, Regular Expressions, Parsec - Parsec is an *internal* DSL, which uses the host language to give itself the feel of another language - Regex is an *external* DSL, independent of its programming language. - Parsec deals with the problem of *Parser Combinators* - defining basic character parsers and ways to combine them into complex parsers --- # Introduction Hackage: "Parsec is a *monad transformer* that can be stacked on arbitrary monads, and is parametric in the *input stream* type." - Monad Transformer? - Input Stream? --- # Monad Transformer *Monad Transformers are functions that accept a (different) monad, and wrap around it to make a bigger, better monad.* - E.g Add **Maybe** powers to **IO** computations - Exit an IO operation is Nothing is encountered - Naming convention: If I add Maybe powers, call the Monad Transformer MaybeT ```digit :: Stream s m Char => ParsecT s u m Char``` - ParsecT with stream type `s`, user state type `u`, underlying monad `m`, return type `a` - "Giving parsec powers" to underlying monad `m` --- # Input Stream ```digit :: Stream s m Char => ParsecT s u m Char``` - Will be explained in-depth later - Think of it as a data structure with a list of tokens - The parser (over `m`) can remove a single token at a time --- # Parsec's Libraries ![200 x 200](https://i.imgur.com/GmuxU2N.png) - Text.ParserCombinators is a legacy library, highlighted = covered in lecture --- # Parsec: Basics Some initial setup: ```haskell import Text.Parsec.String (Parser) import Text.Parsec regularParse :: Stream s Identity t => Parsec s () a -> s -> Either ParseError a regularParse p = parse p "" ``` --- ```haskell satisfy :: Stream s m Char => (Char -> Bool) -> ParsecT s u m Char ``` Satisfy is one of the primitive functions in Parsec. It looks at the next Char from the current input, and pops it if (Char -> Bool) returns true. ```haskell regularParse (satisfy (=='a')) "a" >> Right 'a' regularParse (satisfy (=='b')) "a" >> Left (line 1, column 1): >> unexpected "a" >> expecting "b" regularParse (satisfy (`elem` "abc")) "a" >> Right 'a' ``` --- We can use Satisfy to build our own rules, which is done in the original source code ```haskell -- | Parses a digit. Returns the parsed character. digit :: (Stream s m Char) => ParsecT s u m Char digit = (satisfy isDigit) <?> "digit" -- | Parses a white space character (any character which satisfies 'isSpace') -- Returns the parsed character. space :: (Stream s m Char) => ParsecT s u m Char space = (satisfy isSpace) <?> "space" = satisfy isSpace <?> "space" ``` - Source code for http://hackage.haskell.org/package/parsec-3.1.3/docs/src/Text-Parsec-Char.html --- ## Parse multiple characters ```haskell num :: Stream s m Char => ParsecT s u m [Char] num = many1 digit regularParse num "1" >> Right "1" regularParse num "123456a" >> Right "123456" ``` - Other than parsing individual characters, we also need to parse multiple characters - We do so using combinators, which would be further elaborated later on - `many1`: Parses 1 or more of the given parser --- ## Whitespace Whitespaces are generally ignored in most languages. Thus, we need a parser to consume and discard whitespaces. ```haskell whitespace = many $ oneOf " \n\t" regularParse whitespace " " >> Right " " num :: Stream s m Char => ParsecT s u m [Char] num = do whitespace e <- many1 digit whitespace return e regularParse num " 123 " >> Right "123" ``` --- ## Parentheses To parse an Integer within a pair of parentheses, we would do something like this: ```haskell parens :: Stream s m Char => ParsecT s u m [Char] parens = do whitespace char '(' whitespace e <- many1 digit whitespace char ')' whitespace return e regularParse parens " ( 123 ) " >> Right "123" ``` --- ## Lexeme ```haskell lexeme :: Stream s m Char => ParsecT s u m b -> ParsecT s u m b lexeme p = do x <- p whitespace return x ``` - 'lexeme' parsing - Every token parser should consume and ignore any trailing whitespace. - Allows skipping whitespace exactly once wherever it needs to be skipped - More elegant than spamming all the parsing code with many calls to whitespace --- ## Comparison ```haskell parens :: Stream s m Char => ParsecT s u m [Char] parens = do whitespace char '(' whitespace e <- many1 digit whitespace char ')' whitespace return e regularParse parens " ( 123 ) " >> Right "123" ``` ```haskell parensL :: Stream s m Char => ParsecT s u m [Char] parensL = do whitespace lexeme $ char '(' e <- lexeme $ many1 digit lexeme $ char ')' return e regularParse parensL " ( 123 ) " >> Right "123" ``` --- # Parsec: Combinators - Combinators make use of simple parser building blocks - Allow us to build up more complex parsers by combining simpler ones --- # Stream ```haskell class Monad m => Stream s m t | s -> t where uncons :: s -> m (Maybe (t, s)) ``` - An instance of Stream has stream type s, underlying monad m and token type t determined by the stream - Can be thought as a data structure holding a stream of tokens t - Has the instance `Monad m => Stream [tok] m tok` - Has the instance `Monad m => Stream ByteString m Char` - Responsible for maintaining the "position within the stream" in the stream state s which is trivial in many cases --- # ParsecT - `ParsecT s u m a` is a parser with stream type `s`, user state type `u`, underlying monad `m` and return type `a` - Has Functor, Applicative, Monad instance - For our demonstration, we can think of the stream type as `String`, user state as `()`, underlying monad as `Identity`, and a return type of `Char` ```haskell type Parsec s u = ParsecT s u Identity ``` --- # Parsing ```haskell regularParse :: Stream s Identity t => Parsec s () a -> s -> Either ParseError a regularParse p = parse p "" parseA :: Stream s m Char => ParsecT s u m Char parseA = char 'a' parseB :: Stream s m Char => ParsecT s u m Char parseB = char 'b' ``` --- # Combinators ```haskell many1 :: (Stream s m t) => ParsecT s u m a -> ParsecT s u m [a] many1 p = do { x <- p; xs <- many p; return (x:xs) } parseManyA = many1 parseA regularParse parseManyA "aaaabaaaa" >> Right "aaaa" ``` ```haskell choice :: (Stream s m t) => [ParsecT s u m a] -> ParsecT s u m a choice ps = foldr (<|>) mzero ps parseAOrB = choice [parseA, parseB] regularParse parseAOrB "a" >> Right 'a' regularParse parseAOrB "b" >> Right 'b' regularParse parseAOrB "c" >> unexpected "c" expecting "a" or "b" ``` ```haskell chainl1 :: (Stream s m t) => ParsecT s u m a -> ParsecT s u m (a -> a -> a) -> ParsecT s u m a chainl1 p op = do{ x <- p; rest x } where rest x = do { f <- op; y <- p; rest (f x y)} <|> return x ``` --- # The End --- <style> .reveal { font-size: 24px; } </style>