Skip to main content

What if parsing was ...? Aspirations to a world that is easier to parse.

Parsing is one of those activities that never goes away. It comes up in a surprisingly diverse range of programmer tasks, from compiler writing, to log analysis, to API communication. It's been studied ad nauseum by computer scientists and enormous numbers of important theoretical results are known about every corner of the discipline. And yet, the tools that we use on a day to day basis as practitioners seem strangely inflexible, weak and brittle.

The most widely used parsing tools are regexps. And of course, regexps are great for what they do. If what you need is to recognise a regular language, or perhaps even extract some information from a regular language, then you've got the perfect tool for the job. Not only that, but most languages give you a fast and convenient way to do it.

Beyond that however, thinks start getting fairly arcane. There are of course good tools out there, all of which make slightly different assumptions about what you might like to parse and how to go about doing it.

Among the various tools that I've used, the ones that I've been most comfortable with using, and found the most flexible, are monadic-parser-combinator libraries for functional programming languages such as Parsec and the DCG(Declarative Clause Grammar) style parsers of prolog.

Parsec and friends are really great general tools, but from a readability and writability perspective, I actually prefer DCGs. I'd recommend them wholeheartedly if it weren't for a few basic problems that render them much less usable in practice, than as a toy.

What do these DCG's look like and how would they compare to a parser combinator approach? Well let's see a parser in the wild written as a DCG.

digit("0") --> "0".
digit("1") --> "1".
digit("2") --> "2".
digit("3") --> "3".
digit("4") --> "4".
digit("5") --> "5".
digit("6") --> "6".
digit("7") --> "7".
digit("8") --> "8".
digit("9") --> "9".

DCG's are written as productions. In the case above for example, you succeed as a digit, if you're able to do one of the things to the right of the arrow of one of the 'digit(X)' clauses. There are 10 clauses, one for each of the base 10 number system we are familiar with. In actual usage, one will apply this predicate to an input stream as:

phrase(digit(X), ['0']).

In which case the unbound X would be bound to '0' as the successful digit would be when it was able to consume a '0' from the input list (which here has only one character). In this way we can not only consume the input string in ways that depend on the productions, but we can build up auxiliary structures as we match the stream on our way through, perhaps building up an AST if we are parsing a programming language, or a structured record which will inform of us of a request in the case of an API call.

oneDigitNatural(N) --> digit(A), { number_string(N,A) }.
twoDigitNatural(N) --> digit(A), digit(B), { string_concat(A,B,C), number_string(N,C) } .

threeDigitNatural(N) --> digit(A), digit(B), digit(C), 
                         { string_concat(A,B,I), string_concat(I,C,D), number_string(N,D) } . 

fourDigitNatural(N) --> digit(A), digit(B), digit(C), digit(D),
   { string_concat(A,B,S1), string_concat(S1,C,S2), 
                          string_concat(S2,D,S3), number_string(N,S3) } . 
digits(S) --> digit(S) .
digits(T) --> digit(X), digits(S),
       { string_concat(X, S, T) } .

natural(N) --> digits(S),
        { number_string(N,S) }.

And we don't have to stop with simply recognising as we go along as with regexps. We can structure these productions recursively as we see here with 'digits(X)', which accepts either a digit, or a digit followed by another digit, and then, using the '{','}' characters, breaks from our DCG DSL into standard prolog where it concatenates the input characters to a string.

The external interface is the 'natural(X)' predicate which simply takes these digits and converts them into the internal prolog representation of numbers, and binds the argument of the predicate to that number. This enables us to consume a list of numeric characters, and return an integer.

So what's wrong with DCGs and why aren't they more widely used?

The first problem is that they are only a thin syntactic veneer on prolog, and are not designed to carry around the kind of information that is often needed while parsing in practice. The monad we get is the list monad as that's what comes out of the box with prolog. It is more than just a simple monad, as you get disjunction, and sometimes this is called monad-plus. But you don't carry around the character number and line number of failure or the potential cause of failure. Instead failure to parse is simply silent failure with no explanation as to why. It's simply too anemic in its reporting to use as, for instance, a parser for a programming language.

This means that you end up writing your own parsing library in prolog, and it ends up feeling a lot like implementing one anywhere else. Not a great state of affairs and less useful than our slightly more awkward parser combinator libraries. In prolog got the easy stuff easily, but the hard stuff become nearly impossible. whereas with Parsec, everything is a bit more ugly, but we get what we need in practice.

DCG's in prolog are also written for traversing difference lists. It doesn't seems necessary or sensible to convert all strings to lists of characters in order to parse them. I've often wondered why, given that we can assume parsing operations on strings can not mutate the string, we don't merely keep track of the current character in a shared string buffer.

The relational approach to programming which prolog makes available enticingly dangles in front of us the possibility of reversability, only to steal it away from us again. Productions should not only give us a way to parse an input stream, but they should also be generative. They should be able to give us streams that are possible to parse. This would mean we could get a pretty printer from the same code that we use to parse!

Unfortunately two things intervene to spoil the fun. One is that prolog clauses can not be reordered, but are merely run in the order textually presented. This means that running predicates "backwards" often simply does not work. This is exacerbated by the lack of attention to mode information.

The second problem, is that with recursion in prolog, we have to be very careful that we order things such that we get terminating solutions. It's quite possible to run things backwards only to find the stream we are building is infinitely large. Think for instance of running the regex 'a*' backwards as a generator. Which stream do we give first? Hopefully not the one with infinite number of 'a' characters!

Lastly, in our DCGs we may also have a lot of statically provided information that is not exploited to remove dead branches and optimise our parser. This is a very common scenario in practice and it would be a shame if we could not exploit it to make our parsers perform well.

What does my utopia look like?

How could we solve these problems? Unfortunately I don't have complete answers to this, but I've some vague notions that I'd like to put down in order to attempt greater clarity.

The first problem as I see it, is that we want DSLs that allow us to obtain various forms of non-determinism with a syntax which is not too painful. Unfortunately prolog succeeds on the front of being nice syntactically but unable to give us anything easily other than the list monad. Parsec et al. give us an only slightly less nice syntax, but more flexible success / failure. Unfortunately it also suffers from not having Prologs friendly unification and it likewise can not be reordered or run backwards as a generator.

What we need is a new DSL which allows us to effeciently read streams, specify reversible parsing which can be both optimised and reordered in as declarative a fashion as possible, while retaining information about how the causes of failure. Is such a thing possible?

Stay tuned for the following blog-post where I'll talk about how we might do this.


Popular posts from this blog

Generating etags automatically when needed

Have you ever wanted M-. (the emacs command which finds the definition of the term under the cursor) to just "do the right thing" and go to the most current definition site, but were in a language that didn't have an inferior process set-up to query about source locations correctly (as is done in lisp, ocaml and some other languages with sophisticated emacs interfaces)?

Well, fret no more. Here is an approach that will let you save the appropriate files and regenerate your TAGS file automatically when things change assuring that M-. takes you to the appropriate place.

You will have to reset the tags-table-list or set it when you first use M-. and you'll want to change the language given to find and etags in the 'create-prolog-tags function (as you're probably not using prolog), but otherwise it shouldn't require much customisation.

And finally, you will need to run etags once manually, or run 'M-x create-prolog-tags' in order to get the initia…

Managing state in Prolog monadically, using DCGs.

Prolog is a beautiful language which makes a lot of irritating rudimentary rule application and search easy. I have found it is particularly nice when trying to deal with compilers which involve rule based transformation from a source language L to a target language L'.

However, the management of these rules generally requires keeping track of a context, and this context has to be explicitly threaded through the entire application, which involves a lot of irritating and error prone sequence variables. This often leads to your code looking something a bit like this:

compile(seq(a,b),(ResultA,ResultB),S0,S2) :- compile(a,ResultA,S0,S1), compile(b,ResultB,S1,S2).
While not the worst thing, I've found it irritating and ugly, and I've made a lot of mistakes with incorrectly sequenced variables. It's much easier to see sequence made explicitly textually in the code.

While they were not designed for this task, but rather for parsing, DCGs turn out to be a convenient …

Formalisation of Tables in a Dependent Language

I've had an idea kicking about in my head for a while of making query plans explicit in SQL in such a way that one can be assured that the query plan corresponds to the SQL statement desired. The idea is something like a Curry-Howard in a relational setting. One could infer the plan from the SQL, the SQL from the plan, or do a sort of "type-checking" to make sure that the plan corresponds to the SQL.

The devil is always in the details however. When I started looking at the primitives that I would need, it turns out that the low level table joining operations are actually not that far from primitive SQL statement themselves. I decided to go ahead and formalise some of what would be necessary in Agda in order get a better feel for the types of objects I would need and the laws which would be required to demonstrate that a plan corresponded with a statement.

Dependent types are very powerful and give you plenty of rope to hang yourself. It's always something of…