Re: [Apertium-stuff] GSoC 14, 'Apertium assimilation evaluation toolkit' Project

Mikel Forcada Fri, 14 Mar 2014 12:21:07 -0700

Hi Binay,

That's terrific. You know what would make it easier to test? Putting everything in a single .zip file! Or somewhere I can download it all in one go, instead of going file by file. Could you do that for us?

Also, I don't think it makes much sense to have a project on SourceForge with the name "Basic Assimilation Toolkit for eu-en". I think it would make much more sense for you to temporarily host it somewhere else (e.g. GitHub), and to make it language-independent. But that is your GSoC work, I guess...


Let me know when you have a nice installer for your toolkit.

Cheers

Mikel

On 03/14/2014 12:37 PM, Binay Neekhra wrote:

Respected mentors and Apertium community,
My name is Binay Neekhra.
I am interested in doing the 'Apertium assimilation evaluation toolkit' as a GSoC project.
I have attempted the coding challenge. You may have a look at the SourceForge link:
sourceforge.net/projects/basicassimilationtoolkit/files/?source=navbar <http://sourceforge.net/projects/basicassimilationtoolkit/files/?source=navbar>
In brief:
For this task, I have taken the Basque-English pair (thanks to Mr. Tyres for
suggesting this). I have written a basic Python program, basicToolkit.py.

There are 6 files:
1. README file codepad.org/iEO8PkWa <http://codepad.org/iEO8PkWa>
2. 'Source Sentences.txt' codepad.org/HwRIZsLx <http://codepad.org/HwRIZsLx>
    contains Basque sentences taken from the news site berria.info <http://berria.info>.
3. 'Apertium Translation.txt' codepad.org/cL8JXY6L <http://codepad.org/cL8JXY6L>
    contains the Apertium eu-en machine translation output for
    'Source Sentences.txt'.
4. 'Reference Translation.txt' codepad.org/eZqNmlMK <http://codepad.org/eZqNmlMK>
    contains the Google Translate output for the source sentences. This
    output is used for evaluation purposes.
5. basicToolkit.py codepad.org/a7KAC7U2 <http://codepad.org/a7KAC7U2>
    takes sentences from 'Source Sentences.txt', 'Apertium Translation.txt'
    and 'Reference Translation.txt' and, based on the hint level chosen by
    the user, shows the relevant hints and asks the user to complete the
    cloze test. The response is recorded in a separate file (userOutput.txt).
6. userOutput.txt codepad.org/mN7m8H0x <http://codepad.org/mN7m8H0x>
    contains the output of the assimilation evaluation performed on the
    above files: the user input, % of holes successfully filled, % of
    blanks left, reference sentences, hint level and a few other details.

I have read the paper suggested by Prof. Forcada (Peeking through the
language barrier: the development of open-source gisting system for
Basque to English based on apertium.org <http://apertium.org>), along with H. Somers and E. Wild's
paper 'Evaluating Machine Translation: the Cloze Procedure Revisited'.
I have also briefly gone through the Apertium documentation and the specification of its modules. I have installed Apertium and am running it for the Basque-English
and Esperanto-English pairs.

I have the following observations/ideas:
The toolkit can provide the following options:

For the masking procedure:
1. An option to select what % of the words is to be masked.
2. An option to mask the words randomly or at regular intervals.
3. Words may also be selected on the basis of their POS tags.
The system may provide an option to select the distribution of POS tags
used to mask the words,
e.g. 20% nouns/pronouns, 40% verbs, 40% adjectives, etc.
For this we need to integrate the part-of-speech tagger of Apertium with the
toolkit.
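
As a rough illustration of the first two masking options above (percentage of words, random or regular spacing), a minimal Python sketch might look like the following. This is not the code in basicToolkit.py; the names choose_positions and mask_sentence, the whitespace tokenisation and the "_____" hole marker are all assumptions made for the example. POS-based selection is left out because it would need tagger output.

```python
import random

HOLE = "_____"  # hypothetical marker shown to the user in place of a masked word


def choose_positions(n_words, fraction, mode="random", seed=None):
    """Return the sorted indices of the words to mask."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    n_holes = round(n_words * fraction)
    if n_holes == 0:
        return []
    if mode == "random":
        # Sample distinct positions; a seed makes the test reproducible.
        rng = random.Random(seed)
        return sorted(rng.sample(range(n_words), n_holes))
    if mode == "regular":
        # Spread the holes evenly, one in the middle of each stretch of words.
        step = n_words / n_holes
        return [int(i * step + step / 2) for i in range(n_holes)]
    raise ValueError("mode must be 'random' or 'regular'")


def mask_sentence(sentence, fraction, mode="random", seed=None):
    """Mask a share of the words; return the cloze text and the answers."""
    tokens = sentence.split()  # naive whitespace tokenisation (assumption)
    positions = choose_positions(len(tokens), fraction, mode, seed)
    answers = {i: tokens[i] for i in positions}
    masked = [HOLE if i in answers else tok for i, tok in enumerate(tokens)]
    return " ".join(masked), answers
```

For example, mask_sentence("a b c d e f g h i j", 0.2, mode="regular") masks the third and eighth words and returns them in the answers dictionary, keyed by word position.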

For evaluation purposes, the system can use a synonym list to look up
similar words (acceptable answers), or use binary (exact-match) evaluation.
For a proper name, figure or date, it may be difficult for the user to guess the
correct output; these fields may be handled separately, in which case a
plausible but wrong guess may be acceptable.
In the paper, H. Somers and Wild mention that "we feel confident that the exact-answer scoring method is adequate, and that allowing near synonyms and so on does not give a different result". I feel that in the case of gisting, however, using 'acceptable answers' will be significant. This is also reflected in the results obtained in the 'Peeking through...
...on apertium.org <http://apertium.org>' paper. Am I correct?
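
To make the difference between the two scoring modes concrete, here is a small Python sketch, under stated assumptions, that scores one cloze test either by exact match or with a list of acceptable answers. The function name score_cloze, the shape of the synonym dictionary and the case-insensitive comparison are illustrative choices, not something taken from either paper or from basicToolkit.py.

```python
def score_cloze(answers, responses, synonyms=None):
    """Score one cloze test.

    answers   -- {position: reference word}
    responses -- {position: word typed by the user}; missing or empty = blank
    synonyms  -- optional {lower-cased reference word: [acceptable alternatives]}
                 (hypothetical format); None means exact-match scoring
    """
    synonyms = synonyms or {}
    correct = 0
    blank = 0
    for position, expected in answers.items():
        given = responses.get(position, "").strip().lower()
        if not given:
            blank += 1
            continue
        # Exact match is always accepted; listed synonyms are accepted as well.
        accepted = {expected.lower()}
        accepted.update(s.lower() for s in synonyms.get(expected.lower(), ()))
        if given in accepted:
            correct += 1
    total = len(answers)
    if total == 0:
        return {"filled_correctly": 0.0, "left_blank": 0.0}
    return {
        "filled_correctly": 100.0 * correct / total,
        "left_blank": 100.0 * blank / total,
    }
```

With four holes where the user types "home" for "house", gets one word exactly right, leaves one blank and gets one wrong, exact-match scoring gives 25% filled correctly, while passing synonyms={"house": ["home"]} gives 50%. Running both modes on the same responses is one way to check whether acceptable answers change the ranking of two systems.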
I still do not have a very clear idea about what counts as a 'correct' answer, and how do we calculate an effective 'score' when comparing two machine translation systems?
I am very interested in doing this project. How should I proceed further?
My language preference for this project is Python (flexible). I want to use Python for both the text and web-based formats (using the web2py or Django framework), as it will allow better maintenance and code fixing. (I have done projects in the web2py framework.) If needed, I will be able to develop the toolkit in Ruby on Rails too. (I am familiar with Ruby.)
I tried to open the Apertium Bugs page on Bugzilla (link on the Ideas page). The page is showing an 'Internal Server Error'. Is there any other address for the Apertium
bug listing?

About Me:
My name is Binay Neekhra. I am pursuing a B.Tech + M.S. (by research) in
Computer Science and Engineering at the International Institute of Information
Technology-Hyderabad, India. I am pursuing my M.S. in the Language Technology
Research Centre, IIIT-H. My research interests are Machine Translation, Natural Language Processing, Artificial Intelligence and Theoretical Computer Science.
-Binay Neekhra
IRC nick: niks, binayneekhra



_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff



--
Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03071 Alacant, Spain
Phone: +34 96 590 9776
Fax: +34 96 590 9326
