Re: [Apertium-stuff] Domain and style/genre

Kevin Brubeck Unhammer Mon, 17 Dec 2012 07:00:59 -0800

Per Tunedal <[email protected]>
writes:

> Hi,
> See below.
> Yours,
> Per Tunedal
>
> On Sun, Dec 16, 2012, at 23:07, Francis Tyers wrote:
>> On Sun, 16 Dec 2012 at 14:21 +0100, Per Tunedal wrote:
>> > Hi,
>> > I consider info on primarily domain and secondly style useful for
>> > disambiguation. As a first step it would be very nice to be able to add
>> > a domain-tag to words. Adding info on style would make it possible to
>> > further improve translation results.
>> > 
>> > The translation would improve considerably if the user could choose the
>> > appropriate domain when demanding a translation. Consider e.g. the
>> > example of translating the English word "key", or similarly the French
>> > word "clé/clef", to Swedish. If the domain is e.g.
>> > Tourism/accommodation/real estate or similar, the word would most likely
>> > translate to "nyckel" (to lock/unlock the door of a house). On the other
>> > hand if the domain is e.g. information technology (or even music) the
>> > word would most likely translate to "tangent" (on your keyboard or
>> > piano). Obviously, a lexical selector/disambiguator could be trained on
>> > a corpus from a specific domain as well, further improving the
>> > translation.
>> 
>> I did this in my thesis. It's quite effective. It's possible to tune the
>> vocabulary to a domain with either parallel or monolingual corpora using
>> apertium-lex-tools.[1] You won't be interested in it though as it
>> doesn't work with the Java version.
>
> What will work with the java-version? And what will not? What's the
> problem?


No one has re-implemented the apertium-lex-tools package in Java yet.

> What I would like to do is:
> - adding info about domain in the dictionaries
> - do some training on an appropriate corpus

You want to first add the domain-specific translation manually, and then
have the system automatically discover the domain-specific translation?
That sounds like duplicating work, and what do you do if the training
and dictionaries don't agree?


The way lex-tools training works is:

The English word "key" is listed in the en-sv bilingual dictionary with
both "nyckel" and "tangent" as possible translations. You then give the
bilingual dictionary and a corpus to the lex-tools training scripts,
which give you a .lrx file.

You can run the training scripts twice, once with a "general" corpus and
once with a "music-domain" corpus in order to get both a general .lrx
file and a music-domain .lrx file.

The training scripts don't need any manual specification of what domain
you're in; they learn what the best translation is from the
(domain-specific) corpus.
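
As a rough illustration of the idea (not the actual lex-tools algorithm or
the .lrx format), the sketch below picks, for each ambiguous source word,
the candidate translation seen most often in a target-language corpus. All
names here (BIDIX, train_defaults, the two toy corpora) are made up for the
example.

```python
from collections import Counter

# Toy stand-in for a bilingual dictionary: source word -> candidate translations.
BIDIX = {"key": ["nyckel", "tangent"]}

def train_defaults(bidix, target_corpus):
    """Pick, per source word, the candidate seen most often in the corpus."""
    counts = Counter(target_corpus.lower().split())
    defaults = {}
    for source, candidates in bidix.items():
        # max() returns the first candidate on ties, so dictionary order
        # acts as the fallback when the corpus gives no evidence.
        defaults[source] = max(candidates, key=lambda c: counts[c])
    return defaults

# Hypothetical one-sentence "corpora", just large enough to show the effect.
general_corpus = "hon tappade sin nyckel och hittade en annan nyckel under mattan"
music_corpus = "tryck ned varje tangent mjukt och släpp sedan varje tangent"

general_rules = train_defaults(BIDIX, general_corpus)
music_rules = train_defaults(BIDIX, music_corpus)
```

The same function run on two corpora yields two different preference
tables, which is the point: nothing in the dictionary says which domain a
translation belongs to.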

> - let the user choose a suitable domain (if any), as an alternative to
> the "general" domain

That'd be easy to do by adding a new translation mode: e.g.
en-sv_music.mode would point to a different .lrx file from the one the
general en-sv.mode uses.
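
In spirit, the mode name just decides which set of learned preferences gets
consulted. A toy sketch: the mode names mirror the ones above, but the rule
tables and the select_translation helper are invented for illustration and
say nothing about how real .mode or .lrx files are laid out.

```python
# Hypothetical per-mode preference tables (stand-ins for separate .lrx files).
RULES_BY_MODE = {
    "en-sv": {"key": "nyckel"},
    "en-sv_music": {"key": "tangent"},
}

def select_translation(word, mode="en-sv"):
    """Return the preferred translation of `word` under the chosen mode.

    Falls back to the general mode when the domain has no preference,
    and to the word itself when neither table knows it.
    """
    general = RULES_BY_MODE["en-sv"]
    domain = RULES_BY_MODE.get(mode, general)
    return domain.get(word, general.get(word, word))
```

Calling select_translation("key") uses the general table, while
select_translation("key", "en-sv_music") uses the music one.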

> - let Apertium use info from the dictionaries and the training to solve
> ambiguities.
>
> BTW Would it make any difference if you trained the tagger on a domain
> corpus?

It might. 


-- 
Kevin Brubeck Unhammer

GPG: 0x766AC60C

