Per Tunedal <[email protected]> writes:

> Hi,
> See below.
> Yours,
> Per Tunedal
>
> On Sun, Dec 16, 2012, at 23:07, Francis Tyers wrote:
>> On Sun, 16 Dec 2012 at 14:21 +0100, Per Tunedal wrote:
>> > Hi,
>> > I consider info on primarily domain and secondly style useful for
>> > disambiguation. As a first step it would be very nice to be able to add
>> > a domain-tag to words. Adding info on style would make it possible to
>> > further improve translation results.
>> >
>> > The translation would improve considerably if the user could choose the
>> > appropriate domain when requesting a translation. Consider e.g. the
>> > example of translating the English word "key", or similarly the French
>> > word "clé/clef", to Swedish. If the domain is e.g.
>> > Tourism/accommodation/real estate or similar, the word would most likely
>> > translate to "nyckel" (to lock/unlock the door of a house). On the other
>> > hand if the domain is e.g. information technology (or even music) the
>> > word would most likely translate to "tangent" (on your keyboard or
>> > piano). Obviously, a lexical selector/disambiguator could be trained on
>> > a corpus from a specific domain as well, further improving the
>> > translation.
>>
>> I did this in my thesis. It's quite effective. It's possible to tune the
>> vocabulary to a domain with either parallel or monolingual corpora using
>> apertium-lex-tools.[1] You won't be interested in it though as it
>> doesn't work with the Java version.
>
> What will work with the Java version? And what will not? What's the
> problem?
No one has re-implemented the apertium-lex-tools package in Java yet.

> What I would like to do is:
> - add info about domain in the dictionaries
> - do some training on an appropriate corpus

You want to first add the domain-specific translation manually, and then
have the system automatically discover the domain-specific translation?
That sounds like duplicating work, and what do you do if the training
and dictionaries don't agree?

The way lex-tools training works is: The English word "key" is listed in
the en-sv bilingual dictionary with both "nyckel" and "tangent" as
possible translations. You then give the bilingual dictionary and a
corpus to the lex-tools training scripts, which give you a .lrx file.
You can run the training scripts twice, once with a "general" corpus and
once with a "music-domain" corpus, in order to get both a general .lrx
file and a music-domain .lrx file. The training scripts don't need any
manual specification of what domain you're in; they learn what the best
translation is from the (domain-specific) corpus.

> - let the user choose a suitable domain (if any), as an alternative to
> the "general" domain

That'd be easy to do by adding a new translation mode, e.g.
en-sv_music.mode would point to a different .lrx file from the general
en-sv.mode.

> - let Apertium use info from the dictionaries and the training to solve
> ambiguities.
>
> BTW Would it make any difference if you trained the tagger on a domain
> corpus?

It might.

-- 
Kevin Brubeck Unhammer

GPG: 0x766AC60C

_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
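To make the workflow described in the reply a bit more concrete, here is a toy Python sketch of the two ideas: learning a preferred translation per domain from a corpus, and picking a domain-specific mode with a fallback to the general one. This is not how apertium-lex-tools works internally and it does not read or write .lrx files; it only counts how often each candidate appears in a target-language corpus. The function names `learn_defaults` and `mode_name`, the in-memory dictionary, and the two tiny Swedish "corpora" are all made up for illustration.

```python
import re
from collections import Counter

# Toy stand-in for the en-sv bilingual dictionary entry from the thread:
# "key" has two candidate translations. (Hypothetical data structure.)
BIDIX = {"key": ["nyckel", "tangent"]}

# Tiny invented target-language corpora, one general and one music-domain.
GENERAL_CORPUS = "hon tappade sin nyckel och letade efter sin nyckel hela dagen"
MUSIC_CORPUS = (
    "pianisten tryckte ner en tangent och varje tangent gav en ton "
    "men en nyckel saknades"
)


def learn_defaults(bidix, target_corpus):
    """For each source word, pick the candidate seen most often in the corpus.

    Ties (including candidates that never occur) fall back to dictionary
    order, because max() returns the first maximal element.
    """
    counts = Counter(re.findall(r"\w+", target_corpus.lower()))
    return {
        source: max(candidates, key=lambda c: counts[c])
        for source, candidates in bidix.items()
    }


def mode_name(pair, domain=None, available=()):
    """Return a domain mode such as 'en-sv_music' if it exists, else the pair."""
    candidate = f"{pair}_{domain}" if domain else pair
    return candidate if candidate in available else pair


if __name__ == "__main__":
    modes = {"en-sv", "en-sv_music"}
    print(learn_defaults(BIDIX, GENERAL_CORPUS))
    print(learn_defaults(BIDIX, MUSIC_CORPUS))
    print(mode_name("en-sv", "music", modes))
    print(mode_name("en-sv", "law", modes))
```

The point of the sketch is that no domain label is attached to the dictionary entries: the same entry yields different preferred translations depending only on which corpus is used, and the user's domain choice just selects which trained result to apply.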
