Dear Mikel,
 
Thanks a lot for your feedback and your questions.
I've added answers and a bit more detailed explanations to my proposal.
 
Summary:
Q: Which problems are not correctly handled and would be better handled with the new approach?
A: The shallow function labeller should help to handle some existing problems in translating between languages that are not closely related and belong to different types. For instance, when working with an ergative language, it is useful to know whether an absolutive argument is the subject or the object. There are also cases like the classic Russian example "Мать любит дочь", which can equally mean "Mother loves daughter" or "Daughter loves mother"; machine translation systems typically prefer the first reading, but the intended meaning depends on the syntactic functions of the words. So shallow function labelling is a good way to reach better translation quality for "ergative - nominative", "synthetic - analytic" and "(comparatively) free word order - strict word order" language pairs.
I also believe that a shallow function labelling stage can make the chunking stage of translation easier and more accurate.
 
Q: Which language pairs are you thinking of?
A: To be honest, I planned to finalise the exact list of languages during the community bonding period, but I have been thinking of implementing the idea for Basque, English, Russian/Finnish, and possibly Spanish.
 
Q: How do you plan to treat possible discrepancies in the tagset of the UD bank and the Apertium tagset(s)?
A: I'm going to write a script that parses a UD treebank into a training dataset. During this step I will also replace the UD tags with the corresponding Apertium tags, so the dataset will consist of sequences of Apertium morphological tags.
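A rough sketch of what such a script could look like, assuming the standard CoNLL-U layout; the UD-to-Apertium mapping below is an invented fragment for illustration only, not the real tagset correspondence:

```python
# Sketch: turn CoNLL-U text into sequences of Apertium-style tags.
# The mapping is a made-up fragment, not the actual Apertium tagset.
UD_TO_APERTIUM = {
    "NOUN": "n",
    "VERB": "vblex",
    "ADJ": "adj",
    "PRON": "prn",
}

def conllu_to_tag_sequences(conllu_text):
    """Return one list of Apertium-style tags per sentence."""
    sentences, current = [], []
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:                      # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):          # skip comment lines
            continue
        fields = line.split("\t")
        upos = fields[3]                  # UPOS is the 4th CoNLL-U column
        current.append(UD_TO_APERTIUM.get(upos, upos.lower()))
    if current:
        sentences.append(current)
    return sentences
```

A real version would of course need the full, per-language tag mapping and would also have to carry over morphological features, not just the part of speech.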
 
Q: How will you ensure that the machine-learned module, when inserted, will not slow down the Apertium pipeline too much?
A: Firstly, I can implement the idea using the Keras library (it has a Seq2Seq add-on) with a Theano backend; these libraries are not as heavyweight as TensorFlow, which is more commonly used for building seq2seq models, so the labeller should run comparatively fast. Secondly, I will create a separate model for each language, and the models will be trained only on sequences of morphological tags, not, for example, on tokens plus tags. This means the models should not be too heavy.
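To illustrate why tag-only models stay small: the input vocabulary is just the tag inventory (a few dozen items rather than thousands of word forms), so each sentence reduces to a short integer sequence before it ever reaches the seq2seq model. A minimal, library-free sketch of that encoding step (the tag names are invented examples):

```python
def build_vocab(sequences, pad="<pad>"):
    """Map each distinct tag to an integer id; id 0 is reserved for padding."""
    vocab = {pad: 0}
    for seq in sequences:
        for tag in seq:
            vocab.setdefault(tag, len(vocab))
    return vocab

def encode(sequences, vocab, max_len):
    """Encode each sequence as a fixed-length list of ids, padded with 0."""
    encoded = []
    for seq in sequences:
        ids = [vocab[tag] for tag in seq][:max_len]
        encoded.append(ids + [0] * (max_len - len(ids)))
    return encoded
```

The resulting integer arrays are what a Keras embedding-plus-recurrent seq2seq model would consume; with a vocabulary this small, both the embedding table and the model itself remain lightweight.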
 
Q: Why should we adopt a corpus-based component?
A: I believe that a shallow syntactic function labeller trained on corpus data is a simpler and more effective way to label sentences than a rule-based approach, because writing a sufficiently good set of rules for determining the syntactic function of a word seems nearly impossible even for a single language.
 
21.03.2017, 12:05, "Mikel L. Forcada" <[email protected]>:

Dear Anna,

I hope it is OK if I give my feedback here.

While it is true that better syntactic handling would give Apertium a better chance at producing better translations, particularly for languages that are not closely related, your proposal would need to be more convincing as to why adding a corpus-based shallow function labeller would be better than approaches already present, such as statistical (HMM or sliding-window) part-of-speech tagging, constraint grammar, or pattern-based syntactic transfer.

Which problems are not correctly handled and would be better handled with the new approach? Which language pairs are you thinking of? How do you plan to treat possible discrepancies in the tagset of the UD bank and the Apertium tagset(s)? How will you ensure that the machine-learned module, when inserted, will not slow down too much the Apertium pipeline?

Also, Apertium started as a rule-based machine translation system with one corpus-based component: an HMM part-of-speech tagger. Later on, some languages have been endowed with rule-based part-of-speech tagging based on constraint grammars, a move which clearly makes Apertium more rule-based and more transparent. Therefore, the adoption of a corpus-based component needs to be better justified.

I hope this helps

Mikel

On 20/03/17 at 23:08, Anna Kondratjeva wrote:
Hello everyone,
I'm a prospective GSoC student and I'm very interested in the Apertium projects.
I have written a draft of my proposal and would be extremely happy if someone could take a look at it and give me some constructive feedback.
 
 
Thanks in advance!
 
------------------------------------------------------------------------------------------------------
Best regards,
Anna Kondratjeva
 
 
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
 
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
 
-- 
Mikel L. Forcada  http://www.dlsi.ua.es/~mlf/
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03690 Sant Vicent del Raspeig
Spain
Office: +34 96 590 9776

 
 
------------------------------------------------------------------------------------------------------
Best regards,
Anna Kondratjeva
 
