Hi Amr,

The files should have been there; leaving them out was a mistake. I have now added them. Each line contains one word. If something is not clear, the "raw" files have the original texts without morphological analysis.
As for other manually tagged corpora, I cannot help. This is the only one we made in the projects I have been working on. Unfortunately, we did not find the time to create a similar one for Sardinian two years ago.

Best,
Hèctor

Message from Amr Mohamed Hosny Anwar <[email protected]> on Tue, 21 May 2019 at 19:05:

> Hi Hector,
>
> Yes, these files are for sure what I need.
> However, it seems like these files (*.tagged.txt) aren't part of the
> upstream repository:
> https://github.com/apertium/apertium-oci/tree/master/texts
>
> I am currently experimenting with the English and Italian tagged
> corpora/morphological analysers.
> The more languages we have, the better we can compare weighting
> methodologies.
> I don't have a strong background in linguistics, so I thought it would be
> better if you could recommend corpora from a diverse set of languages.
>
> Thanks,
> Amr
>
> ------------------------------
> *From:* Hèctor Alòs i Font <[email protected]>
> *Sent:* Tuesday, May 21, 2019 12:42:02 PM
> *To:* [apertium-stuff]
> *Subject:* Re: [Apertium-stuff] GSoC 19: Unsupervised weighting of
> automata - Implementing the supervised method of weighing autoamata
>
> Hi Amr,
>
> I'm not sure whether it will help you, but in apertium-oci/texts there are
> several Occitan texts that were manually disambiguated, approx. 14,000
> words in total. They are:
> atom_gascon.tagged.txt
> continent.tagged.txt
> glacier.tagged.txt
> cors_aran.tagged.txt
> hlama_coming.tagged.txt
> uranus_prov.tagged.txt
>
> Best,
> Hector
>
> Message from Amr Mohamed Hosny Anwar <[email protected]> on
> Sun, 19 May 2019 at 2:59:
>
> Dear maintainers, contributors,
>
> Hope this email finds you well.
>
> This mail can be considered as a status report for detailing next week's
> plan in addition to seeking feedback/ suggestions regarding the project.
> After a fruitful discussion with my mentors Nick, Flammie and Francis, we
> have agreed on implementing the supervised way of weighting automata as
> follows:
>
> The command will look like: lt-weight transducer.bin corpus.tagged
>
> transducer.bin: An FST compiled using lttoolbox.
> corpus.tagged: A tagged corpus that will be used to estimate the weights.
>
> The weighting will be done by composing the main "unweighted" FST with a
> set of simple FSTs that are generated for each token.
> A simplified example: if the main FST has an edge a:b::0 and the estimated
> weight for this edge is W, then the main FST will be composed with a simple
> FST with an edge b:b::W, generating a new FST with an edge a:b::W.
>
> To achieve this, I will create a new shell script that makes use of hfst's
> compose (instead of implementing/adding a compose function to
> lttoolbox). We will adopt this approach if the prototype proves to
> function as expected.
>
> The shell script will work as follows:
> 1) lt-print will be used to convert the FST to AT&T format.
> 2) The weights will be estimated from the tagged corpus by counting the
> unigram lexical forms (a clever set of shell commands can do the job, but I
> am not an expert in shell scripting, so it will take me some time; I am
> open to suggestions/sources/examples for doing so).
> 3) For each weighted string, hfst-strings2fst (or the corresponding regex
> version) will be used to generate simple FSTs.
> 4) The FSTs will be composed using hfst-compose.
> 5) The final FST will be converted to AT&T format.
> 6) lt-comp will be used to regenerate a weighted FST that is compatible
> with all the tools that rely on Apertium.
>
> In this version, we will just use unigram counts for the lexical forms to
> estimate the weights.
> Additionally, the weight will be assigned to the final state and won't be
> distributed among the edges (we will most probably want to change this
> later).
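A note on step 2 of the plan quoted above (estimating weights by counting unigram lexical forms): the counting could also be sketched in a few lines of Python rather than shell. This is only a rough sketch under assumptions that are not stated in the thread: tokens are assumed to be in the Apertium stream format ^surface/lexical_form$ with one analysis per token, the weight is assumed to be the negative log of the relative frequency, and the name unigram_weights is hypothetical. Escaped characters and smoothing for unseen forms are not handled.

```python
import math
import re
from collections import Counter

# Assumption: each disambiguated token looks like ^surface/lexical_form$.
# Escaped "/" or "$" characters inside a token are not handled in this sketch.
TOKEN_RE = re.compile(r"\^([^/$]*)/([^$]*)\$")


def unigram_weights(tagged_text):
    """Map each lexical form to -log(count / total); lower means more frequent."""
    counts = Counter(m.group(2) for m in TOKEN_RE.finditer(tagged_text))
    total = sum(counts.values())
    return {form: -math.log(c / total) for form, c in counts.items()}


# Tiny made-up sample, one token per line.
sample = (
    "^the/the<det><def><sp>$\n"
    "^dog/dog<n><sg>$\n"
    "^the/the<det><def><sp>$\n"
    "^runs/run<vblex><pri><p3><sg>$\n"
)

weights = unigram_weights(sample)
for form, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    # Tab-separated "lexical form, weight" pairs, most frequent first.
    print(f"{form}\t{weight:.4f}")
```

The printed pairs would still need to be converted into whatever input format step 3 expects; that conversion is left out here.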
>
> On the other hand, I will try to improve the list of publications/ideas
> that will be used to weight automata in an unsupervised way.
> I would be grateful if you could share resources/ideas regarding
> this part.
>
> Finally, do you have recommendations for tagged corpora that can be used
> throughout the project for benchmarking?
> I am using this English tagged corpus from the apertium-eng repository
> (https://github.com/apertium/apertium-eng/blob/master/texts/eng.tagged).
> It would be better if we could benchmark on corpora and FSTs of
> different sizes and complexity.
>
> Thanks and looking forward to hearing from you.
> Your suggestions, feedback and feature requests are more than welcome.
>
> Best Regards,
> Amr
>
> ------------------------------
> *From:* Amr Mohamed Hosny Anwar
> *Sent:* Sunday, May 19, 2019 12:50:52 AM
> *To:* apertium-stuff
> *Cc:* [email protected]
> *Subject:* GSoC 19: Unsupervised weighting of automata - Implementing the
> supervised method of weighing autoamata
>
> Dear maintainers, contributors,
>
> Hope this email finds you well.
>
> This mail can be considered as a status report for detailing next week's
> plan in addition to seeking feedback/ suggestions regarding the project.
>
> Best Regards,
> Amr Keleg
> _______________________________________________
> Apertium-stuff mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
