Skip to main content

What if parsing was ...? Aspirations to a world that is easier to parse.


Parsing is one of those activities that never goes away. It comes up in a surprisingly diverse range of programmer tasks, from compiler writing, to log analysis, to API communication. It's been studied ad nauseum by computer scientists and enormous numbers of important theoretical results are known about every corner of the discipline. And yet, the tools that we use on a day to day basis as practitioners seem strangely inflexible, weak and brittle.

The most widely used parsing tools are regexps. And of course, regexps are great for what they do. If what you need is to recognise a regular language, or perhaps even extract some information from a regular language, then you've got the perfect tool for the job. Not only that, but most languages give you a fast and convenient way to do it.

Beyond that however, thinks start getting fairly arcane. There are of course good tools out there, all of which make slightly different assumptions about what you might like to parse and how to go about doing it.

Among the various tools that I've used, the ones that I've been most comfortable with using, and found the most flexible, are monadic-parser-combinator libraries for functional programming languages such as Parsec and the DCG(Declarative Clause Grammar) style parsers of prolog.

Parsec and friends are really great general tools, but from a readability and writability perspective, I actually prefer DCGs. I'd recommend them wholeheartedly if it weren't for a few basic problems that render them much less usable in practice, than as a toy.

What do these DCG's look like and how would they compare to a parser combinator approach? Well let's see a parser in the wild written as a DCG.

digit("0") --> "0".
digit("1") --> "1".
digit("2") --> "2".
digit("3") --> "3".
digit("4") --> "4".
digit("5") --> "5".
digit("6") --> "6".
digit("7") --> "7".
digit("8") --> "8".
digit("9") --> "9".

DCG's are written as productions. In the case above for example, you succeed as a digit, if you're able to do one of the things to the right of the arrow of one of the 'digit(X)' clauses. There are 10 clauses, one for each of the base 10 number system we are familiar with. In actual usage, one will apply this predicate to an input stream as:

phrase(digit(X), ['0']).

In which case the unbound X would be bound to '0' as the successful digit would be when it was able to consume a '0' from the input list (which here has only one character). In this way we can not only consume the input string in ways that depend on the productions, but we can build up auxiliary structures as we match the stream on our way through, perhaps building up an AST if we are parsing a programming language, or a structured record which will inform of us of a request in the case of an API call.

oneDigitNatural(N) --> digit(A), { number_string(N,A) }.
 
twoDigitNatural(N) --> digit(A), digit(B), { string_concat(A,B,C), number_string(N,C) } .

threeDigitNatural(N) --> digit(A), digit(B), digit(C), 
                         { string_concat(A,B,I), string_concat(I,C,D), number_string(N,D) } . 

fourDigitNatural(N) --> digit(A), digit(B), digit(C), digit(D),
   { string_concat(A,B,S1), string_concat(S1,C,S2), 
                          string_concat(S2,D,S3), number_string(N,S3) } . 
    
digits(S) --> digit(S) .
digits(T) --> digit(X), digits(S),
       { string_concat(X, S, T) } .

natural(N) --> digits(S),
        { number_string(N,S) }.

And we don't have to stop with simply recognising as we go along as with regexps. We can structure these productions recursively as we see here with 'digits(X)', which accepts either a digit, or a digit followed by another digit, and then, using the '{','}' characters, breaks from our DCG DSL into standard prolog where it concatenates the input characters to a string.

The external interface is the 'natural(X)' predicate which simply takes these digits and converts them into the internal prolog representation of numbers, and binds the argument of the predicate to that number. This enables us to consume a list of numeric characters, and return an integer.

So what's wrong with DCGs and why aren't they more widely used?


The first problem is that they are only a thin syntactic veneer on prolog, and are not designed to carry around the kind of information that is often needed while parsing in practice. The monad we get is the list monad as that's what comes out of the box with prolog. It is more than just a simple monad, as you get disjunction, and sometimes this is called monad-plus. But you don't carry around the character number and line number of failure or the potential cause of failure. Instead failure to parse is simply silent failure with no explanation as to why. It's simply too anemic in its reporting to use as, for instance, a parser for a programming language.

This means that you end up writing your own parsing library in prolog, and it ends up feeling a lot like implementing one anywhere else. Not a great state of affairs and less useful than our slightly more awkward parser combinator libraries. In prolog got the easy stuff easily, but the hard stuff become nearly impossible. whereas with Parsec, everything is a bit more ugly, but we get what we need in practice.

DCG's in prolog are also written for traversing difference lists. It doesn't seems necessary or sensible to convert all strings to lists of characters in order to parse them. I've often wondered why, given that we can assume parsing operations on strings can not mutate the string, we don't merely keep track of the current character in a shared string buffer.

The relational approach to programming which prolog makes available enticingly dangles in front of us the possibility of reversability, only to steal it away from us again. Productions should not only give us a way to parse an input stream, but they should also be generative. They should be able to give us streams that are possible to parse. This would mean we could get a pretty printer from the same code that we use to parse!

Unfortunately two things intervene to spoil the fun. One is that prolog clauses can not be reordered, but are merely run in the order textually presented. This means that running predicates "backwards" often simply does not work. This is exacerbated by the lack of attention to mode information.

The second problem, is that with recursion in prolog, we have to be very careful that we order things such that we get terminating solutions. It's quite possible to run things backwards only to find the stream we are building is infinitely large. Think for instance of running the regex 'a*' backwards as a generator. Which stream do we give first? Hopefully not the one with infinite number of 'a' characters!

Lastly, in our DCGs we may also have a lot of statically provided information that is not exploited to remove dead branches and optimise our parser. This is a very common scenario in practice and it would be a shame if we could not exploit it to make our parsers perform well.

What does my utopia look like?


How could we solve these problems? Unfortunately I don't have complete answers to this, but I've some vague notions that I'd like to put down in order to attempt greater clarity.

The first problem as I see it, is that we want DSLs that allow us to obtain various forms of non-determinism with a syntax which is not too painful. Unfortunately prolog succeeds on the front of being nice syntactically but unable to give us anything easily other than the list monad. Parsec et al. give us an only slightly less nice syntax, but more flexible success / failure. Unfortunately it also suffers from not having Prologs friendly unification and it likewise can not be reordered or run backwards as a generator.

What we need is a new DSL which allows us to effeciently read streams, specify reversible parsing which can be both optimised and reordered in as declarative a fashion as possible, while retaining information about how the causes of failure. Is such a thing possible?

Stay tuned for the following blog-post where I'll talk about how we might do this.

Comments

Popular posts from this blog

Decidable Equality in Agda

So I've been playing with typing various things in System-F which previously I had left with auxiliary well-formedness conditions. This includes substitutions and contexts, both of which are interesting to have well typed versions of. Since I've been learning Agda, it seemed sensible to carry out this work in that language, as there is nothing like a problem to help you learn a language. In the course of proving properties, I ran into the age old problem of showing that equivalence is decidable between two objects. In this particular case, I need to be able to show the decidability of equality over types in System F in order to have formation rules for variable contexts. We'd like a context Γ to have (x:A) only if (x:B) does not occur in Γ when (A ≠ B). For us to have statements about whether two types are equal or not, we're going to need to be able to decide if that's true using a terminating procedure. And so we arrive at our story. In Coq, equality is ...

Teagrey

I was ironing my shirt today, which I almost never do. Because of this I don't have an ironing board so I tried to make a make-shift ironing board on my floor using a towel and some books. I grabbed the heaviest books on the nearest shelf, which happened to be Organic Chemistry, Stalingrad and an annotated study bible containing the old and new testament. As I pulled out the bible, a flower fell out which had been there for over 17 years now. I know that because it was put there by my first wife, Daniel, who killed herself in April about 17 years ago. It fell from Thessalonians to which it had been opened partially. Highlighted was the passage: "Ye are all sons of the light and sons of the day." I guess the passage gave her solace. Daniel was a complicated woman. She had serious mental health issues which plagued her for her entire life. None of them were her fault. She was dealt an absolutely awful hand in life, some truly nasty cards. She had some considerable c...

Total Functional Programming

Recently on Lambda the Ultimate there was a post called "Total Functional Programming".  The title didn't catch my eye particularly, but I tend to read the abstract of any paper that is posted there that doesn't sound terribly boring.  I've found that this is a fairly good strategy since I tend to get very few false positives this way and I'm too busy for false negatives. The paper touches directly and indirectly on subjects I've posted about here before.  The idea is basically to eschew the current dogma that programming languages should be Turing-complete, and run with the alternative to the end of supplying "Total Functional Programming" .  At first glance this might seem to be a paper aimed at "hair-shirt" wearing functional programming weenies.  My reading was somewhat different. Most hobbiest mathematicians have some passing familiarity with "Turing's Halting Problem" and the related "Goedel's Inco...