Haskell Project: More Parsers

:: haskell, tutorial, parsec

Going back to the todo.txt haskell project, this post will continue on with more advanced parsers and how to test them.

Previous project post: Intro to Parsers

Testing our simple example

Lets look back at our simple example of Priority. It is a single letter surrounded by parenthesis. For consistency we upper case the letter when we return our type Tasks.Priority.

-- |Priority: (A)
priority :: Parser Tasks.Priority
priority = do
  char '('
  p <- letter
  char ')'
  return $ toUpper p

The function we’ll be using to test our code is the parse command from Text.Parsec.Prim.

parse :: Stream s Identity t => Parsec s () a -> SourceName -> s -> Either ParseError a

To refresh on the Parsec type, our Parsec s () a definition is:

  • s is the stream type (String)
  • u is the user state type (Unit ())
  • a is the return type (Tasks.Priority)

SourceName is a String used to provide pretty errors. We pass in our stream value and get back either an error or our return type.

>>> Text.Parsec.Prim.parse priority "" "(A)"
Right 'A'
>>> Text.Parsec.Prim.parse priority "" "(Z)"
Right 'Z'
>>> Text.Parsec.Prim.parse priority "" ")A("
Left (line 1, column 1):
unexpected ")"
expecting "("
>>>

The next step is to create an automated unit test. Using HSpec (see the previous post for more unit testing details) we can create a few simple tests to make sure things are working.

{-# LANGUAGE OverloadedStrings #-}
module ParserSpec where

import Parser
import Tasks

import Text.Parsec.Prim (parse)
import Test.Hspec (Spec, describe, it, shouldBe, shouldNotBe)

-- | Required for auto-discovery
spec :: Spec
spec =
  describe "Parsers" $ do
    describe "Priority" $ do
      it "Matches (A) as Priority A" $ do
        parse priority "" "(A)" `shouldBe` Right ('A' :: Priority)

      it "Doesn't match A as Priority A" $ do
        parse priority "" "A" `shouldNotBe` Right ('A' :: Priority)

We are using the shouldBe and shouldNotBe functions from Test.Hspec.Expectations. In our test we verify that passing in a valid priority "(A)" works while an invalid priority does not. You should extend your tests to do more of a range, or at least upper and lower bounds to make the coverage better but for now we have a working parser and a unit test.

Variable length parser

Our example parser was pretty straight forward. It either matched the three characters or it didn’t. But in some cases you could have a varying description of what you would consider a valid token.

We will be allowing dates to be entered with the following rules:

  • 2 to 4 digit years, where 2 digits represent 20XX
  • 1 or 2 digit month and day

We’d consider 2017-09-01, 2017-9-1, and 17-9-1 all to be the same date. To do this we need to read each set of digits as individual pieces that vary in length.

-- |Date: 2017-02-23
-- Supports 2 or 4 digit year, and 1 or 2 digit month and day.
date :: Parser Tasks.Date
date = do
    year <- twoToFourDigits
    _ <- char '-'
    month <- oneToTwoDigits
    _ <- char '-'
    day <- oneToTwoDigits
    return $ Tasks.Date (convertYear $ read year) (read month) (read day)
  where oneToTwoDigits = do
          x <- digit
          y <- option ' ' $ digit
          return (x:y:[])
        twoToFourDigits = do
          w <- digit
          x <- digit
          y <- option ' ' $ digit
          z <- option ' ' $ digit
          return (w:x:y:z:[])
        convertYear x = if x < 100
                        then x + 2000
                        else x

In the main section of date we see three operations to parse the groups of digits along with two operations to throw out some hyphens. Pretty straight forward. Where the real magic happens is in our helper functions oneToTwoDigits and twoToFourDigits.

oneToTwoDigits = do
  x <- digit
  y <- option ' ' $ digit
  return (x:y:[])

The value for x is read as a digit. The value for y is given as an optional value, either as a digit or as the space character. option tries to apply the parser, and if it fails the optional value is returned.

In the failure case the stream data is not consumed, which is important. Think of how the month is handled. If we have the following date 2017-09-01 we expect x to be '0' and y to be '9'. When we go to parse the hyphen we are in good shape. But if we had the date 2017-9-1 and the y value consumed the character after '9' we would not get a valid parse of date.

The last step is returning an list of Char, either '0' : '9' : [] or '9' : ' ' : [] The function read trims off white space before converting the string to the return type, so no need to differentiate between the two cases.

Making choices

In our previous example we had variation that was homomorphic. Our token changed but it was always just expecting more or less digits. What if we wanted to have variations on an token that were much more diverse?

Choice of tokens

Projects in todo.txt are defined as a CamelCased, alpha-numeric string starting with a plus sign. In many implementations of todo.txt, the idea of sub-projects was introduce by using either a hyphen or dot to break up the project/sub-projects.

Lets think about what the valid characters are in a project. We start with a plus sign +. We then expect alpha-numeric characters if we are just dealing with a project. If we are working with a sub-project then we’d have optional hyphens and dots. In the sub-project case, we still need to have the main project name exist, before we have a hyphen/dot and our sub-project name.

-- |Project: +ProjectName +Project.SubProject +Project-Sub-SubSub
project :: Parser Tasks.StringTypes
project = do
    _ <- char '+'
    c <- alphaNum
    s <- many alphaNumDashDot
    _ <- whiteSpace
    return $ Tasks.SProject ([c] ++ s)
  where
    alphaNumDashDot = choice [alphaNum, oneOf "-."]

We parse out the +, and then ask for a single alpha numeric character. This would provide the minimum for our project name.

s <- many alphaNumDashDot

This line says that we expect multiple alphaNumDashDot (which we define later). The many call returns a list with zero or more of the tokenized atoms it parses. (many1 requires one or more).

alphaNumDashDot = choice [alphaNum, oneOf "-."]

Our alphaNumDashDot is defined as a choice between multiple parsers. The first we know, alphaNum. The second one, oneOf takes a string of characters and attempts to parse one of its characters from your stream.

The next step is to write some unit tests.

describe "Project" $ do
  it "Matches +ProjectName" $ do
    parse Parser.project "" "+ProjectName" `shouldBe`
      Right (Tasks.SProject "ProjectName")

  it "Doesn't match ProjectName" $ do
    parse Parser.project "" "ProjectName" `shouldNotBe`
      Right (Tasks.SProject "ProjectName")

  it "Doesn't match @ProjectName" $ do
    parse Parser.project "" "@ProjectName" `shouldNotBe`
      Right (Tasks.SProject "ProjectName")

  it "Matches +ProjectName.SubProjectName" $ do
    parse Parser.project "" "+ProjectName.SubProjectName" `shouldBe`
      Right (Tasks.SProject "ProjectName.SubProjectName")

  it "Matches +ProjectName-SubProjectName" $ do
    parse Parser.project "" "+ProjectName-SubProjectName" `shouldBe`
      Right (Tasks.SProject "ProjectName-SubProjectName")

  it "Doesn't match +ProjectName..SubProjectName" $ do
    parse Parser.project "" "+ProjectName..SubProjectName" `shouldNotBe`
      Right (Tasks.SProject "ProjectName..SubProjectName")

  it "Doesn't match +ProjectName--SubProjectName" $ do
    parse Parser.project "" "+ProjectName--SubProjectName" `shouldNotBe`
      Right (Tasks.SProject "ProjectName--SubProjectName")

We run our tests and…

Failures:

  test/ParserSpec.hs:51:
  1) Parser.Parsers.Project Doesn't match +ProjectName..SubProjectName
       not expected: Right +ProjectName..SubProjectName

  test/ParserSpec.hs:54:
  2) Parser.Parsers.Project Doesn't match +ProjectName--SubProjectName
       not expected: Right +ProjectName--SubProjectName

Our code works initially, but it doesn’t prevent us from having multiple hyphens or dots together with no sub-projects between them. We can rewrite this parser so that we expect a project and zero or more sub-projects.

project :: Parser Tasks.StringTypes
project = do
    _ <- char '+'
    p <- many1 alphaNum
    s <- many subProjects
    _ <- whiteSpace
    return $ Tasks.SProject (concat $ p : s)
  where
    subProjects = do
      x <- oneOf "-."
      sub <- many1 alphaNum
      return $ x : sub

Our value p contains one or more alpha-numeric characters, and s contains a list of zero or more sub-projects. We concatenate those together to make the single string and pass it to the constructor.

Putting many, many1, choice, oneOf, and noneOf together allows us to create a fairly complex token from very fundamental components. Inside the ParsecT monad we can write a lot of custom logic, but you’ll tend to find that the fundamental operations mimic a lot of what regular expression operations do which will get you a long way.

Choices with Union Types

In our code we break down the words in our task into different types of strings. Projects are strings that start with +, while Context are strings starting with @. We defined a Key/Value pair by having two strings with colon : in the middle. Each of these are part of Tasks.StringTypes.

To make the logic in our code easier to understand, we want to group all of our union types into a single parser. This way we can say ask for multiple string types and get back Context, Projects, Key/Values and just plain old strings. Makes for some pretty simple code. Using the choice function we can use our created parsers to increase the complexity.

project :: Parser Tasks.StringTypes
context :: Parser Tasks.StringTypes
keyvalue :: Parser Tasks.StringTypes
other :: Parser Tasks.StringTypes

-- |Parse all string types
stringTypes :: Parser Tasks.StringTypes
stringTypes = choice [ project
                     , context
                     , keyvalue
                     , other
                     ]

Order is important here. Since Project and Context start with a special character which can be found inside any other string, they must be the first choices with other strings as fall back. Key/Value Pairs have a special character too so we want to parse it before we consider the word to just be a plain old string.

We can go a step further and define a parser that gives us all the different string types in a list.

-- |Parse a lot of string types
lots :: Parser [Tasks.StringTypes]
lots = do
  l <- many1 stringTypes
  return l

Here we parse many1 (one or more) stringTypes and return a list.

Complex Tokens

This is the point when it should be very obvious that each of our parsers are just a grouping of other, smaller parsers. We increase complexity by building in stages. Lets look at our incomplete task parser.

-- |Incomplete Task
-- This supports an optional priority, and an optional start date.
incompleteTask :: Parser Tasks.Task
incompleteTask = do
  pri <- optionMaybe priority
  _ <- whiteSpace
  startDate <- optionMaybe date
  _ <- whiteSpace
  rest <- lots
  _ <- many endOfLine
  return $ Tasks.Incomplete pri startDate rest

First lets notice how extremely compact it is, and how it reads pretty easily. Without even looking at the other functions you can easily see what it does.

pri <- optionMaybe priority

The function optionMaybe is similar to option. It attempts to match a token based on the given parser, but instead of taking an optional value, it uses the Maybe monad to return a Nothing or Just a.

_ <- whiteSpace
startDate <- optionMaybe date
_ <- whiteSpace

Toss some white space, check for a date and return Maybe Date.

rest <- lots

We have passed all the optional tokens and are now gathering all of the string types left. With our gradually increasing complex parser we just make one simple function call and we get back a list of a bunch of complex tokens.

We group together incomplete and completed tasks since they both will reside in the same text file.

task :: Parser Tasks.Task
task = do
  _ <- try (many (char '\n'))
  t <- choice [ completedTask, incompleteTask ]
  return t

And we parse a list of tasks (completed and incomplete).

tasks :: Parser [Tasks.Task]
tasks = do
  tx <- option [] . try $ many task
  return tx

Notice how we handle parsing a list of tasks. We request many and get back a list of zero or more. This would be fine except for cases when an error occurs in the parsing. We could throw an exception, have tons of handlers for calling tasks, or we could handle Either in a more idiomatic Haskell way. But in our code we don’t care to display the error, we just want to make a new, empty list.

We call try which runs a parser as usual. But in cases where it fails, it acts as if no parsing took place at all. We call option and supply it with the default empty list and now if we fail to parse we just return an empty list.

Put it all together

In the end we want to boil this whole process down to a single function. We can read lines from a file, pass it to our parser and in return we get a list of Tasks.Task.

parseLines :: String -> String -> Either Parser.ParseError [Tasks.Task]
parseLines path lns = parse tasks path (pack lns)

References