Re: [Apertium-stuff] GSoC 19: Unsupervised weighting of automata - Implementing the supervised method of weighing automata

Hèctor Alòs i Font Tue, 21 May 2019 20:57:00 -0700

Hi Amr,

The files should have been there; leaving them out was a mistake, and I
have now added them. Each line contains one word. If anything is unclear,
the "raw" files contain the original texts without morphological analysis.


As for other manually tagged corpora, I cannot help. This is the only one we
made in the projects I have worked on. Unfortunately, we did not find the
time to create a similar one for Sardinian two years ago.

Best,
Hèctor

Message from Amr Mohamed Hosny Anwar <[email protected]> on Tue,
21 May 2019 at 19:05:

> Hi Hector,
>
> Yes, these files are certainly what I need.
> However, it seems like these files (*.tagged.txt) aren't part of the
> upstream repository:
> https://github.com/apertium/apertium-oci/tree/master/texts
>
> I am currently experimenting with the English and Italian tagged
> corpora/morphological analysers.
> The more languages we have, the better we can compare weighting
> methodologies.
> I don't have a strong background in linguistics, so I thought it would be
> better if you could recommend corpora from a diverse set of languages.
>
> Thanks,
> Amr
>
> ------------------------------
> *From:* Hèctor Alòs i Font <[email protected]>
> *Sent:* Tuesday, May 21, 2019 12:42:02 PM
> *To:* [apertium-stuff]
> *Subject:* Re: [Apertium-stuff] GSoC 19: Unsupervised weighting of
> automata - Implementing the supervised method of weighing automata
>
> Hi Amr,
>
> I'm not sure whether it will help you, but apertium-oci/texts contains
> several manually disambiguated texts in Occitan (approx. 14,000 words).
> They are:
> atom_gascon.tagged.txt
> continent.tagged.txt
> glacier.tagged.txt
> cors_aran.tagged.txt
> hlama_coming.tagged.txt
> uranus_prov.tagged.txt
>
> Best,
> Hector
>
> Message from Amr Mohamed Hosny Anwar <[email protected]> on
> Sun, 19 May 2019 at 2:59:
>
> Dear maintainers, contributors,
>
> Hope this email finds you well.
>
> This mail is a status report detailing next week's plan and a request for
> feedback and suggestions regarding the project.
> After a fruitful discussion with my mentors Nick, Flammie and Francis, we
> have agreed to implement the supervised method of weighting automata as
> follows:
>
> The command will look like: lt-weight transducer.bin corpus.tagged
>
> transducer.bin: An FST compiled using lttoolbox.
> corpus.tagged: A tagged corpus that will be used to estimate the weights.
>
> The weighting will be done by composing the main "unweighted" FST with a
> set of simple FSTs that are generated for each token.
> A simplified example: if the main FST has an edge a:b::0 and the estimated
> weight for this edge is W, then the main FST will be composed with a simple
> FST consisting of an edge b:b::W, generating a new FST with an edge a:b::W.
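
A toy sketch of the a:b::0 example above, in Python rather than HFST. It assumes that weights are costs which add up under composition (as in the tropical semiring), and it only handles single-arc transducers, so states are ignored altogether. The names Arc and compose_arcs and the value 1.5 are illustrative assumptions, not part of lttoolbox or HFST.

```python
from typing import List, NamedTuple


class Arc(NamedTuple):
    """One transducer edge, written input:output::weight in the mail."""
    input: str
    output: str
    weight: float


def compose_arcs(first: List[Arc], second: List[Arc]) -> List[Arc]:
    """Compose two stateless arc sets.

    An arc x:y::w1 followed by an arc y:z::w2 yields x:z::(w1 + w2).
    Arcs whose middle symbols do not match produce nothing.
    """
    return [
        Arc(a.input, b.output, a.weight + b.weight)
        for a in first
        for b in second
        if a.output == b.input
    ]


if __name__ == "__main__":
    W = 1.5  # hypothetical estimated weight
    main_fst = [Arc("a", "b", 0.0)]    # the unweighted edge a:b::0
    weight_fst = [Arc("b", "b", W)]    # the simple FST b:b::W
    print(compose_arcs(main_fst, weight_fst))
```

A real composition also has to pair up states, which is what hfst-compose would take care of in the proposed script.
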
>
> To achieve this, I will create a new shell script that makes use of hfst's
> compose (instead of implementing a compose function in lttoolbox). We will
> adopt this approach if the prototype proves to function as expected.
>
> The shell script will work as follows:
> 1) lt-print will be used to convert the FST to AT&T format.
> 2) The weights will be estimated from the tagged corpus by counting the
> unigram lexical forms. (A clever set of shell commands can do the job, but I
> am not an expert in shell scripting, so it will take me some time. I am
> open to suggestions, sources and examples for doing so.)
> 3) For each weighted string, hfst-str2fst (or the corresponding regex
> version) will be used to generate a simple FST.
> 4) The FSTs will be composed using hfst-compose.
> 5) The final FST will be converted to AT&T format.
> 6) lt-comp will be used to regenerate a weighted FST that is compatible
> with all the tools that rely on apertium.
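
To make step 3 more concrete, here is a minimal sketch of what one of the simple FSTs could look like in AT&T text format, written in Python instead of calling hfst-str2fst. It assumes the usual tab-separated AT&T layout (source state, target state, input symbol, output symbol, weight, plus one line for the final state and its weight) and puts the whole weight on the final state, as the mail proposes further down. The function name and the tokenisation of `<tag>` symbols are assumptions for illustration; spaces and other special symbols are not handled.

```python
import re
from typing import List

# Splits a lexical form into symbols: a whole <tag> or a single character.
SYMBOL_RE = re.compile(r"<[^<>]+>|.")


def lexical_form_to_att(form: str, weight: float) -> str:
    """Return AT&T text for an identity FST accepting exactly `form`.

    Every arc carries weight 0.0; the estimated weight sits on the
    final state only.
    """
    symbols: List[str] = SYMBOL_RE.findall(form)
    lines = [
        f"{state}\t{state + 1}\t{sym}\t{sym}\t0.0"
        for state, sym in enumerate(symbols)
    ]
    # Final-state line: state number followed by its weight.
    lines.append(f"{len(symbols)}\t{weight}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical lexical form and weight.
    print(lexical_form_to_att("cat<n><sg>", 2.0))
```
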
>
> In this version, we will just use unigram counts of the lexical forms to
> estimate the weights.
> Additionally, the weight will be assigned to the final state and won't be
> distributed among the edges (we will most probably want to change this
> later).
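
A minimal sketch of the unigram estimation described above, in Python rather than shell. Two things are assumptions and not stated in the mail: that the tagged corpus uses the Apertium stream format with one analysis per token (^surface/lexical-form$), and that a weight is the negative log of the relative frequency, with no smoothing for unseen forms. The function names are illustrative.

```python
import math
import re
from collections import Counter
from typing import Dict, Iterable

# Assumed token format: ^surface/lexical-form$
TOKEN_RE = re.compile(r"\^([^/$]*)/([^$]*)\$")


def count_lexical_forms(lines: Iterable[str]) -> Counter:
    """Count how often each lexical form (lemma plus tags) occurs."""
    counts: Counter = Counter()
    for line in lines:
        for _surface, lexical_form in TOKEN_RE.findall(line):
            counts[lexical_form] += 1
    return counts


def estimate_weights(counts: Counter) -> Dict[str, float]:
    """Turn counts into costs: frequent forms get lower weights."""
    total = sum(counts.values())
    return {form: -math.log(n / total) for form, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical two-line corpus.
    sample = ["^the/the<det>$ ^cat/cat<n><sg>$", "^the/the<det>$"]
    for form, weight in estimate_weights(count_lexical_forms(sample)).items():
        print(f"{form}\t{weight:.4f}")
```

With a real corpus, the lines would come from the corpus.tagged file passed to the proposed lt-weight command.
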
>
> Separately, I will try to improve the list of publications and ideas
> that will be used to weight automata in an unsupervised way.
> I would be grateful if you could share resources or ideas regarding
> this part.
>
> Finally, do you have recommendations for tagged corpora that can be used
> throughout the project for benchmarking?
> I am using this English tagged corpus from the apertium-eng repository:
> https://github.com/apertium/apertium-eng/blob/master/texts/eng.tagged
> It would be better if we could benchmark on corpora and FSTs of
> different sizes and complexity.
>
> Thanks and looking forward to hearing from you.
> Your suggestions, feedback and feature requests are more than welcome.
>
> Best Regards,
> Amr
>
> ------------------------------
> *From:* Amr Mohamed Hosny Anwar
> *Sent:* Sunday, May 19, 2019 12:50:52 AM
> *To:* apertium-stuff
> *Cc:* [email protected]
> *Subject:* GSoC 19: Unsupervised weighting of automata - Implementing the
> supervised method of weighing automata
>
>
> Dear maintainers, contributors,
>
>
> Hope this email finds you well.
>
> This mail is a status report detailing next week's plan and a request for
> feedback and suggestions regarding the project.
>
>
>
> Best Regards,
> Amr Keleg
> _______________________________________________
> Apertium-stuff mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>
