      ___           ___           ___           ___     
     /\  \         /\  \         /\__\         /\  \    
    /::\  \       /::\  \       /::|  |        \:\  \   
   /:/\:\  \     /:/\ \  \     /:|:|  |         \:\  \  
  /::\~\:\  \   _\:\~\ \  \   /:/|:|__|__       /::\  \ 
 /:/\:\ \:\__\ /\ \:\ \ \__\ /:/ |::::\__\     /:/\:\__\
 \/__\:\/:/  / \:\ \:\ \/__/ \/__/~~/:/  /    /:/  \/__/
      \::/  /   \:\ \:\__\         /:/  /    /:/  /     
       \/__/     \:\/:/  /        /:/  /     \/__/      
                  \::/  /        /:/  /                 
                   \/__/         \/__/                  
Prolog Statistical Machine Translation

Bryan McEleney, Dublin, Ireland

PSMT is an unsophisticated statistical machine translation program written in Prolog. It is free software, available under the GNU Lesser General Public License (LGPL). Statistical machine translation is used by Google, for example, but while such services are free to use, there is still a need for a translation system that is open and unmediated. Much of the research in statistical machine translation is undertaken by universities. So far, however, only one system, Moses, which is still in development, promises to provide a complete set of training and decoding programs as open source software.

PSMT is not yet of much practical use. Try the online demo.

System Requirements
Download from sourceforge
Description
Online Demo
How to use
Future work
Links




System requirements

PSMT is written for SWI-Prolog (which is free) and should therefore run wherever SWI-Prolog runs, i.e. GNU/Linux, Macintosh and Windows, although it has only been tested on GNU/Linux.


Description

PSMT consists of three main parts. The language model learner takes example sentences in the target language and learns a language model based on trigrams. The dictionary learner learns word-for-word translations. Finally, the search program uses the data from the first two parts to translate a source sentence into the target language.

Language Model Learner
The language model learner uses the standard technique of recording trigrams from example sentences in the target language. Bigrams and unigrams are also recorded for "backoff", that is, for use when no data is available to cover the higher-order n-gram.
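
The counting and backoff idea can be illustrated with a short Python sketch. This is not PSMT's Prolog code: the function names, the "<s>" and "</s>" padding tokens and the plain relative-frequency estimates (no discounting or backoff weights) are all assumptions made for illustration.

```python
from collections import Counter

def train_ngrams(sentences):
    """Count unigrams, bigrams and trigrams over tokenised sentences."""
    counts = {1: Counter(), 2: Counter(), 3: Counter()}
    for words in sentences:
        # Assumed padding so that sentence-initial words also get a trigram context.
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for n in (1, 2, 3):
            for i in range(len(padded) - n + 1):
                counts[n][tuple(padded[i:i + n])] += 1
    return counts

def backoff_prob(counts, w1, w2, w3):
    """Estimate P(w3 | w1, w2), falling back to lower orders when unseen."""
    if counts[3][(w1, w2, w3)] > 0:
        return counts[3][(w1, w2, w3)] / counts[2][(w1, w2)]
    if counts[2][(w2, w3)] > 0:
        return counts[2][(w2, w3)] / counts[1][(w2,)]
    total = sum(counts[1].values())
    return counts[1][(w3,)] / total if total else 0.0
```

A real language model would normally discount the higher-order estimates so that the backed-off probabilities still sum to one; the sketch omits this to keep the backoff order visible.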

Dictionary Learner
The dictionary learner uses a bootstrapping technique to learn the dictionary. Given a partially learned dictionary, and an example pair of sentences, the most probable pairing of words for those sentences is selected according to the dictionary. This pairing is then used to update the dictionary. At first the pairings are random, but eventually the dictionary settles on the correct pairings.
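
The bootstrapping loop can be sketched in Python as follows. This is only a guess at the general shape, not PSMT's Prolog code: the names `best_pairing` and `learn_dictionary`, the count-based scoring, the independent choice of a target word for each source word, and the `passes` and `seed` parameters are all assumptions.

```python
import random
from collections import defaultdict

def best_pairing(counts, source, target, rng):
    """Pair each source word with its most probable target word in the sentence."""
    pairs = []
    if not target:
        return pairs
    for s in source:
        candidates = list(target)
        rng.shuffle(candidates)  # random order, so ties are broken at random
        best = max(candidates, key=lambda t: counts.get(s, {}).get(t, 0))
        pairs.append((s, best))
    return pairs

def learn_dictionary(sentence_pairs, passes=5, seed=0):
    """Repeatedly pair words using the current dictionary, then update it."""
    rng = random.Random(seed)
    counts = defaultdict(dict)  # counts[source_word][target_word] -> pairings seen
    for _ in range(passes):
        for source, target in sentence_pairs:
            for s, t in best_pairing(counts, source, target, rng):
                counts[s][t] = counts[s].get(t, 0) + 1
    # Normalise the counts into translation probabilities per source word.
    return {s: {t: c / sum(row.values()) for t, c in row.items()}
            for s, row in counts.items()}
```

With an empty dictionary every candidate scores zero, so the first pairings are random, as described above. Whether a scheme this simple settles on the correct pairings depends on the data and on how the pairing is scored.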

Search Program
The search program uses a beam search over the space of sentences in the target language. The search is directed by the language model and the word translations.
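
A minimal Python sketch of such a beam search is given below. It is not PSMT's Prolog implementation: the hypothesis representation, the free choice of which source word to translate next, the `translations` table, the `lm_prob` callback and the `beam_width` parameter are assumptions made for illustration.

```python
import math

def beam_search(source, translations, lm_prob, beam_width=5):
    """Return up to beam_width (log_score, target_words) pairs, best first.

    translations maps a source word to {target_word: probability};
    lm_prob(w1, w2, w3) returns an estimate of P(w3 | w1, w2).
    """
    # A hypothesis is (log score, source positions still untranslated, target words so far).
    beam = [(0.0, frozenset(range(len(source))), ())]
    for _ in source:
        expanded = {}
        for score, remaining, words in beam:
            context = ("<s>", "<s>") + words
            for i in remaining:
                for t, p_trans in translations.get(source[i], {}).items():
                    p_lm = lm_prob(context[-2], context[-1], t)
                    if p_trans <= 0 or p_lm <= 0:
                        continue  # skip extensions with zero probability
                    key = (remaining - {i}, words + (t,))
                    new_score = score + math.log(p_trans) + math.log(p_lm)
                    expanded[key] = max(new_score, expanded.get(key, -math.inf))
        # Prune: keep only the best beam_width hypotheses for the next step.
        ranked = sorted(((s, r, w) for (r, w), s in expanded.items()),
                        key=lambda h: h[0], reverse=True)
        beam = ranked[:beam_width]
    return [(score, list(words)) for score, _, words in beam]
```

Each step translates one more source word, so every surviving hypothesis covers the whole sentence after as many steps as there are source words. Letting the search pick source words in any order is what allows the language model to reorder the output. If some source word has no dictionary entry, this sketch returns an empty list.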


How to use the system

In the "main.pl" file there is a predicate that can be used to call the learning modules. The directory name of the language pair must be supplied, and the training data is taken from this directory. In the "search_translation.pl" file, use load_data to load the data into memory, and then use the search predicate with a list of words. An n-best beam will be returned. To do all of this in one go and translate the sentence "le chat bleu" into English, call the predicate quick_test in the "main.pl" file.

There are two sets of data that can be used with the system. A small testing set called french_english_test is good for a quick test of the system. A larger set of 12,000 example pairs from the European Parliament corpus can also be used. This larger set produces a dictionary with quite a high perplexity, which makes the search algorithm quite slow. When the language model and dictionary are trained on this data, translation is slow and the results are poor.


Future work

Training Data
The system may be free, but the parallel corpora required as training material often are not. For the moment, a set of 12,000 sentence pairs from the European Parliament corpus has been used to train the dictionary. Currently the system uses no additional data to train the language model. The OPUS project seeks to provide aligned data from parallel texts taken from the web as open source material. PSMT could be adapted to process OPUS data.

Efficiency
The system is written entirely in Prolog. This is fine for the learning part of the system, but the translation search needs to run in real time. A reimplementation of the search algorithm in C++ is a possibility for the future.

Named Entities
There is no provision for handling named entities such as place names and names of people.


Links

The Moses system is currently in development, and it is hoped that it will provide a complete set of tools: a decoder (search algorithm), a translation model learner (dictionary learner) and a language model learner. Hopefully someone somewhere (such as OPUS) can provide free parallel corpora as well.

A list of machine translation software at Wikipedia.

Freshmeat is a free software index. It can be searched for machine translation projects.

A similar search can be run at SourceForge.