Haskell Project: More Parsers
Going back to the todo.txt haskell project, this post will continue on with more advanced parsers and how to test them.
Previous project post: Intro to Parsers
Testing our simple example
Lets look back at our simple example of Priority. It is a single letter surrounded by parenthesis. For consistency we upper case the letter when we return our type Tasks.Priority
.
-- |Priority: (A) priority :: Parser Tasks.Priority priority = do char '(' p <- letter char ')' return $ toUpper p
The function we’ll be using to test our code is the parse
command from Text.Parsec.Prim.
parse :: Stream s Identity t => Parsec s () a -> SourceName -> s -> Either ParseError a
To refresh on the Parsec type, our Parsec s () a
definition is:
s
is the stream type (String
)u
is the user state type (Unit()
)a
is the return type (Tasks.Priority
)
SourceName
is a String
used to provide pretty errors. We pass in our stream value and get back either an error or our return type.
>>> Text.Parsec.Prim.parse priority "" "(A)" Right 'A' >>> Text.Parsec.Prim.parse priority "" "(Z)" Right 'Z' >>> Text.Parsec.Prim.parse priority "" ")A(" Left (line 1, column 1): unexpected ")" expecting "(" >>>
The next step is to create an automated unit test. Using HSpec (see the previous post for more unit testing details) we can create a few simple tests to make sure things are working.
{-# LANGUAGE OverloadedStrings #-} module ParserSpec where import Parser import Tasks import Text.Parsec.Prim (parse) import Test.Hspec (Spec, describe, it, shouldBe, shouldNotBe) -- | Required for auto-discovery spec :: Spec spec = describe "Parsers" $ do describe "Priority" $ do it "Matches (A) as Priority A" $ do parse priority "" "(A)" `shouldBe` Right ('A' :: Priority) it "Doesn't match A as Priority A" $ do parse priority "" "A" `shouldNotBe` Right ('A' :: Priority)
We are using the shouldBe
and shouldNotBe
functions from Test.Hspec.Expectations. In our test we verify that passing in a valid priority "(A)" works while an invalid priority does not. You should extend your tests to do more of a range, or at least upper and lower bounds to make the coverage better but for now we have a working parser and a unit test.
Variable length parser
Our example parser was pretty straight forward. It either matched the three characters or it didn’t. But in some cases you could have a varying description of what you would consider a valid token.
We will be allowing dates to be entered with the following rules:
- 2 to 4 digit years, where 2 digits represent 20XX
- 1 or 2 digit month and day
We’d consider 2017-09-01
, 2017-9-1
, and 17-9-1
all to be the same date. To do this we need to read each set of digits as individual pieces that vary in length.
-- |Date: 2017-02-23 -- Supports 2 or 4 digit year, and 1 or 2 digit month and day. date :: Parser Tasks.Date date = do year <- twoToFourDigits _ <- char '-' month <- oneToTwoDigits _ <- char '-' day <- oneToTwoDigits return $ Tasks.Date (convertYear $ read year) (read month) (read day) where oneToTwoDigits = do x <- digit y <- option ' ' $ digit return (x:y:[]) twoToFourDigits = do w <- digit x <- digit y <- option ' ' $ digit z <- option ' ' $ digit return (w:x:y:z:[]) convertYear x = if x < 100 then x + 2000 else x
In the main section of date
we see three operations to parse the groups of digits along with two operations to throw out some hyphens. Pretty straight forward. Where the real magic happens is in our helper functions oneToTwoDigits
and twoToFourDigits
.
oneToTwoDigits = do x <- digit y <- option ' ' $ digit return (x:y:[])
The value for x
is read as a digit
. The value for y
is given as an optional value, either as a digit
or as the space character. option
tries to apply the parser, and if it fails the optional value is returned.
In the failure case the stream data is not consumed, which is important. Think of how the month is handled. If we have the following date 2017-09-01
we expect x
to be '0'
and y
to be '9'
. When we go to parse the hyphen we are in good shape. But if we had the date 2017-9-1
and the y
value consumed the character after '9'
we would not get a valid parse of date.
The last step is returning an list of Char
, either '0' : '9' : []
or '9' :
' ' : []
The function read
trims off white space before converting the string to the return type, so no need to differentiate between the two cases.
Making choices
In our previous example we had variation that was homomorphic. Our token changed but it was always just expecting more or less digits. What if we wanted to have variations on an token that were much more diverse?
Choice of tokens
Projects in todo.txt are defined as a CamelCased, alpha-numeric string starting with a plus sign. In many implementations of todo.txt, the idea of sub-projects was introduce by using either a hyphen or dot to break up the project/sub-projects.
Lets think about what the valid characters are in a project. We start with a plus sign +
. We then expect alpha-numeric characters if we are just dealing with a project. If we are working with a sub-project then we’d have optional hyphens and dots. In the sub-project case, we still need to have the main project name exist, before we have a hyphen/dot and our sub-project name.
-- |Project: +ProjectName +Project.SubProject +Project-Sub-SubSub project :: Parser Tasks.StringTypes project = do _ <- char '+' c <- alphaNum s <- many alphaNumDashDot _ <- whiteSpace return $ Tasks.SProject ([c] ++ s) where alphaNumDashDot = choice [alphaNum, oneOf "-."]
We parse out the +
, and then ask for a single alpha numeric character. This would provide the minimum for our project name.
s <- many alphaNumDashDot
This line says that we expect multiple alphaNumDashDot
(which we define later). The many
call returns a list with zero or more of the tokenized atoms it parses. (many1
requires one or more).
alphaNumDashDot = choice [alphaNum, oneOf "-."]
Our alphaNumDashDot
is defined as a choice between multiple parsers. The first we know, alphaNum
. The second one, oneOf
takes a string of characters and attempts to parse one of its characters from your stream.
The next step is to write some unit tests.
describe "Project" $ do it "Matches +ProjectName" $ do parse Parser.project "" "+ProjectName" `shouldBe` Right (Tasks.SProject "ProjectName") it "Doesn't match ProjectName" $ do parse Parser.project "" "ProjectName" `shouldNotBe` Right (Tasks.SProject "ProjectName") it "Doesn't match @ProjectName" $ do parse Parser.project "" "@ProjectName" `shouldNotBe` Right (Tasks.SProject "ProjectName") it "Matches +ProjectName.SubProjectName" $ do parse Parser.project "" "+ProjectName.SubProjectName" `shouldBe` Right (Tasks.SProject "ProjectName.SubProjectName") it "Matches +ProjectName-SubProjectName" $ do parse Parser.project "" "+ProjectName-SubProjectName" `shouldBe` Right (Tasks.SProject "ProjectName-SubProjectName") it "Doesn't match +ProjectName..SubProjectName" $ do parse Parser.project "" "+ProjectName..SubProjectName" `shouldNotBe` Right (Tasks.SProject "ProjectName..SubProjectName") it "Doesn't match +ProjectName--SubProjectName" $ do parse Parser.project "" "+ProjectName--SubProjectName" `shouldNotBe` Right (Tasks.SProject "ProjectName--SubProjectName")
We run our tests and…
Failures:
test/ParserSpec.hs:51:
1) Parser.Parsers.Project Doesn't match +ProjectName..SubProjectName
not expected: Right +ProjectName..SubProjectName
test/ParserSpec.hs:54:
2) Parser.Parsers.Project Doesn't match +ProjectName--SubProjectName
not expected: Right +ProjectName--SubProjectName
Our code works initially, but it doesn’t prevent us from having multiple hyphens or dots together with no sub-projects between them. We can rewrite this parser so that we expect a project and zero or more sub-projects.
project :: Parser Tasks.StringTypes project = do _ <- char '+' p <- many1 alphaNum s <- many subProjects _ <- whiteSpace return $ Tasks.SProject (concat $ p : s) where subProjects = do x <- oneOf "-." sub <- many1 alphaNum return $ x : sub
Our value p
contains one or more alpha-numeric characters, and s
contains a list of zero or more sub-projects. We concatenate those together to make the single string and pass it to the constructor.
Putting many
, many1
, choice
, oneOf
, and noneOf
together allows us to create a fairly complex token from very fundamental components. Inside the ParsecT
monad we can write a lot of custom logic, but you’ll tend to find that the fundamental operations mimic a lot of what regular expression operations do which will get you a long way.
Choices with Union Types
In our code we break down the words in our task into different types of strings. Projects are strings that start with +
, while Context are strings starting with @
. We defined a Key/Value pair by having two strings with colon :
in the middle. Each of these are part of Tasks.StringTypes
.
To make the logic in our code easier to understand, we want to group all of our union types into a single parser. This way we can say ask for multiple string types and get back Context, Projects, Key/Values and just plain old strings. Makes for some pretty simple code. Using the choice
function we can use our created parsers to increase the complexity.
project :: Parser Tasks.StringTypes context :: Parser Tasks.StringTypes keyvalue :: Parser Tasks.StringTypes other :: Parser Tasks.StringTypes -- |Parse all string types stringTypes :: Parser Tasks.StringTypes stringTypes = choice [ project , context , keyvalue , other ]
Order is important here. Since Project and Context start with a special character which can be found inside any other string, they must be the first choices with other strings as fall back. Key/Value Pairs have a special character too so we want to parse it before we consider the word to just be a plain old string.
We can go a step further and define a parser that gives us all the different string types in a list.
-- |Parse a lot of string types lots :: Parser [Tasks.StringTypes] lots = do l <- many1 stringTypes return l
Here we parse many1
(one or more) stringTypes
and return a list.
Complex Tokens
This is the point when it should be very obvious that each of our parsers are just a grouping of other, smaller parsers. We increase complexity by building in stages. Lets look at our incomplete task parser.
-- |Incomplete Task -- This supports an optional priority, and an optional start date. incompleteTask :: Parser Tasks.Task incompleteTask = do pri <- optionMaybe priority _ <- whiteSpace startDate <- optionMaybe date _ <- whiteSpace rest <- lots _ <- many endOfLine return $ Tasks.Incomplete pri startDate rest
First lets notice how extremely compact it is, and how it reads pretty easily. Without even looking at the other functions you can easily see what it does.
pri <- optionMaybe priority
The function optionMaybe
is similar to option
. It attempts to match a token based on the given parser, but instead of taking an optional value, it uses the Maybe
monad to return a Nothing
or Just a
.
_ <- whiteSpace startDate <- optionMaybe date _ <- whiteSpace
Toss some white space, check for a date and return Maybe Date
.
rest <- lots
We have passed all the optional tokens and are now gathering all of the string types left. With our gradually increasing complex parser we just make one simple function call and we get back a list of a bunch of complex tokens.
We group together incomplete and completed tasks since they both will reside in the same text file.
task :: Parser Tasks.Task task = do _ <- try (many (char '\n')) t <- choice [ completedTask, incompleteTask ] return t
And we parse a list of tasks (completed and incomplete).
tasks :: Parser [Tasks.Task] tasks = do tx <- option [] . try $ many task return tx
Notice how we handle parsing a list of tasks. We request many
and get back a list of zero or more. This would be fine except for cases when an error occurs in the parsing. We could throw an exception, have tons of handlers for calling tasks
, or we could handle Either
in a more idiomatic Haskell way. But in our code we don’t care to display the error, we just want to make a new, empty list.
We call try
which runs a parser as usual. But in cases where it fails, it acts as if no parsing took place at all. We call option
and supply it with the default empty list and now if we fail to parse we just return an empty list.
Put it all together
In the end we want to boil this whole process down to a single function. We can read lines from a file, pass it to our parser and in return we get a list of Tasks.Task
.
parseLines :: String -> String -> Either Parser.ParseError [Tasks.Task] parseLines path lns = parse tasks path (pack lns)