
Strategies for automatic translation

I've been hacking away at my program to test a theory I have about machine translation. I wrote a bit about it in a previous post but I was fairly vague. I thought I'd describe in more detail exactly how the technique would work (I'm still in phase 1).

The idea is simple. The first phase is to take a corpus in a single language. Take each sentence of the source (or some other sized chunk; currently I'm limited by computational tractability to a single sentence) and recombine the elements of the sentence into every possible string of n-grams. If you play with it a bit you'll realise that there are 2^(N-1) of these for a string of N words. One way to think about it is that there are N-1 indexes into the spaces between words in the string. You can then think of each n-gram string as a collection of indexes at which we combine words. The collection of all of them is the power set of the set of indexes {1,2,3...N-1}, and hence there are 2^(N-1) of them. It turns out, however, that it is nice to have a special word meaning "beginning of sentence" and another for "end of sentence", so we end up starting with N+2 words and getting 2^(N+1) strings. That can be a big number!
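To make the counting argument concrete, here is a minimal sketch in Python (not the program I'm actually working on). It chooses the gaps at which to cut rather than the gaps at which to combine, which is just the complementary subset and gives the same count. The function names and the "<s>" and "</s>" marker tokens are assumptions made up for the illustration.

```python
from itertools import combinations

# Hypothetical marker tokens for "beginning of sentence" and "end of sentence".
BOS, EOS = "<s>", "</s>"


def with_markers(sentence):
    """Split a sentence on whitespace and add the two marker words."""
    return [BOS, *sentence.split(), EOS]


def segmentations(words):
    """Yield every way of grouping `words` into contiguous n-grams.

    Each subset of the N-1 gaps between words is one choice of cut points,
    so a list of N words yields 2^(N-1) n-gram strings.
    """
    n = len(words)
    gaps = range(1, n)  # the N-1 spaces between adjacent words
    for k in range(n):  # k = number of cut points, from 0 to N-1
        for cuts in combinations(gaps, k):
            bounds = (0, *cuts, n)
            yield [tuple(words[i:j]) for i, j in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    for seg in segmentations(["the", "cat", "sat"]):
        print(seg)
```

For a three-word list this prints four groupings, from the single trigram down to three separate unigrams. With the two markers added, the same sentence has five tokens and therefore 2^4 = 16 groupings.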

So now that we have our n-grams for each sentence, we want to look at transition probabilities between n-grams. The reason for this is that the various parts of a sentence have unpredictable size. In the absence of a full NL parsing system there is no way to figure out what a syntactic unit (a noun phrase, for instance) will be. This process completely obviates the need for an NL parser. This in itself is a huge win, since NL parsing is at least difficult and probably impossible to do correctly because of idioms and variations in dialect.

With the n-grams in hand we can now look at transition frequencies among the various n-grams in each of the different patterns in which they were combined. At this point we enter the information into a database which stores the transition probability between every two n-grams. Let us assume that we ignore sentences larger than 12 words. Twelve words plus the two markers gives 14, so we have 2^13, or 8192, n-gram strings for a large sentence. Relating every one of these to every other gives about 67,000,000 (8192^2) entries in our transition frequency matrix.

O.K. So this is looking fairly intractable. If we decide that we will only look at correlations between neighbors and next neighbors, however, we are back in the realm of possibility. This limitation has a certain justification beyond making things computationally feasible: every element of the sentence will be a next nearest neighbor of an element in one of the n-gram sentences, thereby relating every possible syntactic unit. It should even be possible, given this information and a large enough corpus, to "guess" a parse based on our frequencies.
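A rough sketch of the neighbor and next-neighbor counting might look like the following. This is an illustration under assumptions, not the database I'm building: the names transition_counts and transition_probability, the max_gap and max_words parameters, and the use of an in-memory Counter are all invented for the example.

```python
from collections import Counter
from itertools import combinations


def segmentations(words):
    """Yield every grouping of `words` into contiguous n-grams."""
    n = len(words)
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [tuple(words[i:j]) for i, j in zip(bounds, bounds[1:])]


def transition_counts(sentences, max_gap=2, max_words=12):
    """Count how often one n-gram is followed by another.

    gap=1 is a direct neighbor, gap=2 a next neighbor. Sentences longer
    than `max_words` are skipped to keep the enumeration tractable.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        if len(tokens) > max_words:
            continue
        words = ["<s>", *tokens, "</s>"]  # hypothetical marker tokens
        for seg in segmentations(words):
            for i, left in enumerate(seg):
                for gap in range(1, max_gap + 1):
                    if i + gap < len(seg):
                        counts[(left, seg[i + gap], gap)] += 1
    return counts


def transition_probability(counts, left, right, gap=1):
    """Relative frequency of `right` among everything seen `gap` after `left`."""
    total = sum(c for (l, _, g), c in counts.items() if l == left and g == gap)
    return counts[(left, right, gap)] / total if total else 0.0


if __name__ == "__main__":
    counts = transition_counts(["the cat sat", "the dog sat"])
    print(transition_probability(counts, ("the",), ("cat",)))
```

Keying the table by (left, right, gap) stores only the pairs actually observed, which is the point of restricting attention to neighbors and next neighbors instead of filling a full matrix.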

Stage 2 revolves around extracting information from a parallel corpus. We will simply perform a nearly identical procedure across the two corpora of a parallel pair.
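I haven't pinned down the details of stage 2 yet, so the following is only one guess at what "nearly identical" could mean: count how often a source n-gram and a target n-gram turn up in the same aligned sentence pair. The function names, the max_n cap on n-gram length, and the toy sentence pairs are all assumptions for the sake of the sketch.

```python
from collections import Counter


def ngrams(words, max_n=3):
    """Return the set of contiguous n-grams of `words` up to length `max_n`."""
    return {
        tuple(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    }


def cooccurrence_counts(pairs, max_n=3):
    """Count aligned sentence pairs in which a source and target n-gram both occur.

    `pairs` is an iterable of (source_sentence, target_sentence) strings
    that are assumed to be translations of one another.
    """
    counts = Counter()
    for source, target in pairs:
        source_grams = ngrams(source.split(), max_n)
        target_grams = ngrams(target.split(), max_n)
        for s in source_grams:
            for t in target_grams:
                counts[(s, t)] += 1
    return counts


if __name__ == "__main__":
    # Toy parallel data, made up for the illustration.
    pairs = [("the cat", "le chat"), ("the dog", "le chien")]
    counts = cooccurrence_counts(pairs)
    print(counts[(("the",), ("le",))], counts[(("cat",), ("chat",))])
```

Raw counts like these would still need normalising into probabilities before they could be combined with the intra-language transition frequencies.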

When stage 1 and stage 2 are completed, we can use the probabilities of co-occurrence from the parallel corpus in conjunction with the intra-language transition frequencies to generate "most probable" sentences.

We'll see how it goes.

