On 2014-10-26 19:00, Mikel Forcada wrote:
> Fran, folks, here's the feedback I promised.
>
> As I said, this is a great idea, particularly to round off the work by
> a constraint grammar, and I think the breakdown in GCI tasks could
> work, perhaps excepting the integration in the current tagger.

Cool, perhaps we could split the integration into two or three tasks. It
would basically just be another C++ class that is called via the
apertium-tagger wrapper.

> In a trained corpus we could collect counts in various levels as a
> fallback:
>
> (1) Complete lexical forms: cantar.vblex.ifi.1.pl
> (2) Lemma-less counts: *.vblex.ifi.1.pl
> (3) Category only: *.vblex.*
>
> The last two levels can be determined without any need for a
> configuration file.
>
> So that for an unknown word we can use some more general counts. These
> general counts could be obtained from untagged corpora using naïve
> fractional counting, as was done in SWPOST when no context was taken
> into account.
>
> Note that for level (1) one does not really need to store counts. One
> can simply store the winning lexical form for each surface form.

Hmm, I'm not sure this is the case. What would happen if you have, e.g.,

^wound/wound<n><sg>/wind<vblex><past>/wound<vblex><pres>/wound<vblex><inf>$

and from your corpus (or fractional counts or something) you get

wound wound<n><sg> 100
wound wind<vblex><past> 20
wound wound<vblex><pres> 50
wound wound<vblex><inf> 3

so your most frequent analysis is wound<n><sg>, but your CG has removed
it and left

^wound/wind<vblex><past>/wound<vblex><pres>/wound<vblex><inf>$

Wouldn't it be good to know that the next most frequent analysis is
wound<vblex><pres>?

> Note also that for levels (2) and (3) one does not really need to
> store counts. An ordered list by decreasing frequency could
> be enough: the first form found would win.

Yes, for levels 2/3 it could definitely work.

F.

_______________________________________________
Apertium-stuff mailing list
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
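
To make the "wound" case above concrete, here is a rough Python sketch of the three-level fallback with counts kept at level (1). It is only an illustration under assumptions: the function names (`split_analysis`, `level_key`, `build_counts`, `pick`) and the in-memory tables are hypothetical and are not part of apertium-tagger, and levels (2) and (3) are simply summed up from the level (1) rows rather than estimated by fractional counting.

```python
from collections import Counter

# Counts from the example in the thread: (surface form, analysis, count).
ROWS = [
    ("wound", "wound<n><sg>", 100),
    ("wound", "wind<vblex><past>", 20),
    ("wound", "wound<vblex><pres>", 50),
    ("wound", "wound<vblex><inf>", 3),
]


def split_analysis(analysis):
    """Split 'wind<vblex><past>' into ('wind', ['vblex', 'past'])."""
    lemma, _, rest = analysis.partition("<")
    tags = rest.rstrip(">").split("><") if rest else []
    return lemma, tags


def level_key(surface, analysis, level):
    """Lookup key for one fallback level (0, 1, 2 stand for levels 1, 2, 3)."""
    _, tags = split_analysis(analysis)
    if level == 0:
        return (surface, analysis)      # complete lexical form
    if level == 1:
        return tuple(tags)              # lemma-less: all tags
    return tags[0] if tags else ""      # category only: first tag


def build_counts(rows):
    """Build one Counter per level by summing the level (1) rows."""
    counts = [Counter(), Counter(), Counter()]
    for surface, analysis, n in rows:
        for level in range(3):
            counts[level][level_key(surface, analysis, level)] += n
    return counts


def pick(surface, surviving, counts):
    """Return the most frequent surviving analysis, backing off by level."""
    for level in range(3):
        def score(a):
            return counts[level][level_key(surface, a, level)]
        best = max(surviving, key=score)
        if score(best) > 0:
            return best
    return surviving[0]  # nothing seen at any level: keep the first analysis


COUNTS = build_counts(ROWS)

# Analyses left after the CG has removed wound<n><sg>.
AFTER_CG = ["wind<vblex><past>", "wound<vblex><pres>", "wound<vblex><inf>"]
print(pick("wound", AFTER_CG, COUNTS))
```

Storing only the winning form at level (1) would give wound<n><sg> and nothing to fall back on once the CG has removed it; with the counts kept, the same lookup can rank whatever analyses survive. For a surface form that was never seen, level (1) scores are all zero and the lookup drops to the lemma-less counts.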
