Sunday, 16 July 2006

Strategies for automatic translation

I've been hacking away at my program to test a theory I have about machine translation. I wrote a bit about it in a previous post but I was fairly vague. I thought I'd describe in more detail exactly how the technique would work (I'm still in phase 1).

The idea is simple. The first phase is to take a corpus in a language. Take each sentence of the source (or some other sized chunk, currently I'm limited by computational tractability to a single sentence) and recombine each element of the sentence into every possible string of n-grams. If you play with it a bit you'll realise that there are 2(N-1) of these for a string of size N. One way to think about it is that there are N-1 indexes into the spaces between words in the string. You can then think of each sentence as being a collection of indexes at which we combine words. This is obviously the power set of the set of indexes {1,2,3...N-1} and hence there are 2(N-1). It turns out however that it is nice to have a special word meaning "beggining of sentence" and another for "end of sentence", so we end up starting with N+2 words, and getting 2(N+1). That can be a big number!

So now that we have our n-grams for each sentence we want to look at transition probabilities between n-grams. The reason for this is that various parts of a sentence have unpredictable size. In the absense of a full NL parsing system there is no way to figure out what a syntactic unit (a noun phrase for instance) will be. This process completely obviates the need for an NL parser. This in itself is a huge win since NL parsing is at least difficult and probably impossible to do correctly because of idioms and variations in dialect. With the n-grams in hand we can now look at transition frequencies amoung the various n-grams in each of the different patterns in which they were combined. At this point we enter the information into a database which stores the transition probability between every two n-grams. Let us assume that we ignore sentences larger than 12 words. This means that we have 213 or 8192 words for a large sentence. This gives us 67,000,000 entries in our transition frequency matrix. O.K. So this is looking fairly intractable. If we decided that we will only look at correlations between neighbors and next neighbors however, we are back in the realm of possibility. This limitation has a certain justification beyond making things computationally feasible in that every element of the sentence will be a next nearest neighbor with an element of one of the n-gram sentences therby relating every possible syntactic unit. It should even be possible given this information to "guess" a parse based on our frequencies given a large enough corpus.

Stage 2 revolves around extracting information from a parallel corpus. We will simply perform a nearly identical procedure between two parallel corpuses.

When stage 1 and stage 2 are completed, we can use the probabilities of co-occurance from the parallel corpus in conjunction with the intra-language transition frequencies to generate "most probable" sentences.

We'll see how it goes.