Hi Binay,
that's terrific. You know what would make it easier to test? Putting everything in a single .zip file! Or somewhere I can download all in one go, instead of going file by file. Could you do that for us?

Also, I don't think it makes much sense to have a project in SourceForge with the name "Basic Assimilation Toolkit for eu-en". I think it would make much more sense for you to temporarily host it somewhere else (e.g. GitHub), and to make it language independent. But that is your GSoC work I guess ...

Let me know when you have a nice installer for your toolkit.

Cheers

Mikel

On 03/14/2014 12:37 PM, Binay Neekhra wrote:
Respected mentors and Apertium community,
My name is Binay Neekhra.
I am interested in doing 'Apertium assimilation evaluation toolkit' as a GSoC project.

I have attempted the coding challenge. You may have a look at the
Sourceforge link sourceforge.net/projects/basicassimilationtoolkit/files/?source=navbar <http://sourceforge.net/projects/basicassimilationtoolkit/files/?source=navbar>

In brief:
For this task, I have taken the Basque-English pair (thanks to Mr. Tyres for
suggesting this). I have written a basic Python program, basicToolkit.py.

There are 6 files,
1.  README file codepad.org/iEO8PkWa <http://codepad.org/iEO8PkWa>
2. 'Source Sentences.txt' codepad.org/HwRIZsLx <http://codepad.org/HwRIZsLx>
    contains Basque sentences, taken from the news site berria.info <http://berria.info>
3. 'Apertium Translation.txt' codepad.org/cL8JXY6L <http://codepad.org/cL8JXY6L>
    contains Apertium eu-en machine translation output of the 'Source
    Sentences.txt'
4. 'Reference Translation.txt' codepad.org/eZqNmlMK <http://codepad.org/eZqNmlMK>
    which contains Google Translate output of the Source Sentences. This
    output is taken for evaluation purpose.
5. basicToolkit.py codepad.org/a7KAC7U2 <http://codepad.org/a7KAC7U2>
    It takes sentences from 'Source Sentences.txt', 'Apertium Translation.txt', and 'Reference Translation.txt' and, based on the hint level chosen by the user, shows the relevant hints and asks the user to complete the cloze test. The
    response is recorded in a separate file (userOutput.txt).
6. userOutput.txt codepad.org/mN7m8H0x <http://codepad.org/mN7m8H0x>
    contains the output of the assimilation evaluation performed on the above files. It
    contains the user input, % of holes successfully filled, % of blanks left,
    reference sentences, hint level and a few other details.

I have read the paper suggested by Prof. Forcada (Peeking through the
language barrier: the development of open-source gisting system for
Basque to English based on apertium.org <http://apertium.org>), along with H. Somers and E. Wild's
paper on 'Evaluating Machine Translation: the Cloze Procedure Revisited'.

I have also gone briefly through the Apertium documentation and the specification of the Apertium modules. I have installed Apertium and am running it for the Basque-English
and Esperanto-English pairs.

I have the following observations/ideas:
The toolkit can provide the following options:

For the masking procedure:
1. an option to select what % of words is to be masked,
2. an option for masking the words randomly or at regular intervals,
3. words may also be selected on the basis of their POS tags.
The system may provide the option to select the distribution of POS tags
used to mask the words,
e.g. 20% nouns/pronouns, 40% verbs, 40% adjectives, etc.
For this we need to integrate the part-of-speech tagger of Apertium with the
toolkit.
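
The first two masking options above could be sketched roughly as follows. This is only an illustration under assumptions, not the code in basicToolkit.py: the function name mask_words, its parameters and the "____" blank marker are all hypothetical, and POS-based selection is left out because it depends on the Apertium tagger.

```python
import random


def mask_words(words, fraction, mode="random", seed=None, blank="____"):
    """Hide a fraction of the words in a sentence.

    Hypothetical helper: returns (masked_words, answers), where answers
    maps each masked position to the word that was hidden there.
    """
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    n_mask = round(len(words) * fraction)
    if n_mask == 0:
        return list(words), {}

    if mode == "random":
        # Seeded generator so that a test run can be reproduced.
        rng = random.Random(seed)
        indices = rng.sample(range(len(words)), n_mask)
    elif mode == "regular":
        # Spread the holes evenly, one in the middle of each interval.
        step = len(words) / n_mask
        indices = [int(i * step + step / 2) for i in range(n_mask)]
    else:
        raise ValueError("mode must be 'random' or 'regular'")

    answers = {i: words[i] for i in sorted(indices)}
    masked = [blank if i in answers else w for i, w in enumerate(words)]
    return masked, answers


if __name__ == "__main__":
    sentence = "a b c d e f g h i j".split()
    masked, answers = mask_words(sentence, 0.2, mode="regular")
    print(" ".join(masked))
    print(answers)
```

A fixed seed in random mode would let two evaluators see the same holes, which may matter when comparing results across users.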

For evaluation purposes, the system can use a synonym list to look up
similar words (acceptable answers), or use binary (exact-match) evaluation.
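
The difference between the two evaluation modes could be sketched as below. Again this is a hedged illustration: score_cloze, the layout of the dictionaries and the synonym list are assumptions made for the example, not something the toolkit or Apertium provides.

```python
def score_cloze(responses, answers, synonyms=None):
    """Score one cloze test.

    responses and answers map a hole position to a word. synonyms
    (hypothetical format) maps a lower-cased expected word to a list of
    acceptable alternatives; leave it out for exact-match scoring.
    Returns the fraction of holes filled correctly and left blank.
    """
    synonyms = synonyms or {}
    total = len(answers)
    if total == 0:
        return {"correct": 0.0, "blank": 0.0}

    correct = blank = 0
    for position, expected in answers.items():
        given = responses.get(position, "").strip().lower()
        if not given:
            blank += 1
            continue
        key = expected.lower()
        accepted = {key} | {s.lower() for s in synonyms.get(key, ())}
        if given in accepted:
            correct += 1
    return {"correct": correct / total, "blank": blank / total}


if __name__ == "__main__":
    answers = {2: "big", 7: "house", 9: "Monday"}
    responses = {2: "large", 7: "house"}
    print(score_cloze(responses, answers))
    print(score_cloze(responses, answers, synonyms={"big": ["large"]}))
```

In this example the same responses score one hole out of three with exact matching and two out of three once "large" is accepted for "big", which is the kind of gap the question about acceptable answers is concerned with.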

For proper names, figures or dates, it may be difficult for the user to guess the
correct output. These fields may be handled separately, in which case a
plausible but wrong guess may be acceptable.

In the paper, H.Somers and Wild mention that "we feel confident that the exact-answer scoring method is adequate, and that allowing near synonyms and so on does not give a different result". I feel that in the case of gisting, however, using 'acceptable answers' will be significant. It is also reflected in the results obtained in 'Peeking through..
...on apertium.org <http://apertium.org>' paper. Am I correct?

I still do not have a very clear idea of what counts as a 'correct' answer, or of how we calculate an effective 'score' when comparing two machine translation systems.

I am very interested in doing this project. How should I proceed further?

My language preference for this project is Python (flexible). I want to use Python for both the text-based and web-based formats (using the web2py or Django framework), as it will allow better maintenance and easier code fixing. (I have done projects in the web2py framework.) If needed, I will be able to develop the toolkit in Ruby on Rails too. (I am familiar with Ruby.)

I tried to open the Apertium Bugs page on Bugzilla (link on the Ideas page). The page is showing an 'Internal Server Error'. Is there any other address for the Apertium
bug listing?

About Me:
My name is Binay Neekhra. I am pursuing a B.Tech + M.S. (by research) in
Computer Science and Engineering at the International Institute of Information
Technology-Hyderabad, India. I am pursuing my M.S. at the Language Technology
Research Centre, IIIT-H. My research interests are Machine Translation, Natural Language Processing, Artificial Intelligence and Theoretical Computer Science.

-Binay Neekhra
IRC nick: niks, binayneekhra



------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech


_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff


--
Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03071 Alacant, Spain
Phone: +34 96 590 9776
Fax: +34 96 590 9326
