Message from Aure Séguier <[email protected]> on Fri, 4 Nov 2022 at 15:00:
> Hi,
>
> I can help to make rules to know if there are enunciatives in a text or not.
>
> About recognising which variety of Occitan we are translating, we are
> currently developing a tool that can differentiate every dialect of
> Occitan, but it isn't very efficient. Between Gascon and other dialects,
> it's OK, because Gascon is so different (except for Aranese, which is a
> subdialect of Gascon). But between Languedocien and other varieties
> (Provençal, Limousin...) there are many confusions.
>
> At first, I was thinking about adding an "all dialects" dialect for the
> oc->fr direction. It would be useful when people don't know the dialect of
> a text, or for texts with many dialects (e.g. a website with articles in
> many dialects, like newspapers). Is that something that was already done
> for another language? Is it something that could be easily done?
>

This is the way it works for Catalan and Portuguese. They use the v tag
instead of the alt tag in the dictionaries. The people who initially
developed Occitan in Apertium preferred not to do so, because Occitan is
too diverse. Each variety already has a lot of very frequent homographs,
because the spelling rules have nothing to distinguish them (unlike
French, Spanish, Italian, Catalan...). When several varieties are added,
the problem becomes much bigger.

Think of the Provençal article. If we know that the text is Provençal,
disambiguation is much easier. Or if we know that it is Gascon with
enunciatives, we also know what we can expect to find, etc. I myself
immediately switch to a "Gascon" mode when I read it, because its syntax
is quite different from the rest (+ enclitics, + concordance of verb
tenses...). This information is essential for correct disambiguation.

> Thanks
> Aura Séguier, responsabla de projèctes e desvolopaira
> Lo Congrès permanent de la lenga occitana
> Ciutat - Creem !, 5-7 rue de la Fontaine, 64000 Pau
> T. +33 (0)5 32 00 00 64
> [email protected]
> www.locongres.org
> On 04/11/2022 at 09:30, Kevin Brubeck Unhammer wrote:
>
> What if you do
>
>     lt-proc oci.automorf.bin | cg-proc enondetect.rlx.bin | cg-proc oci.rlx.bin |
>     …
>
> The first CG step would output a stream variable, so that what the next
> step sees is
>
>     [<STREAMCMD:SETVARIABLE:non-enon>]
>     ^que/que<enon>/que<itg>$
>     [more text here]
>
> If the next step is CG, it's just
>
>     REMOVE:var-is-set (enon) IF (0 (VAR:non-enon)) ;
>
> i.e. remove enunciatives whenever the var is set.
>
> One can also unset it in the middle of the stream (if doing corpus
> runs), so output of the enon-detector is
>
>     [<STREAMCMD:SETVARIABLE:non-enon>]
>     ^que/que<enon>/que<itg>$
>     [more text here]
>     [<STREAMCMD:REMVARIABLE:non-enon>]
>     ^que/que<enon>/que<itg>$
>     [more text here]
>
> and the REMOVE:var-is-set rule will remove enunciatives in the first
> part, not after seeing the REMVARIABLE.
>
>
> Then the problem of looking several windows ahead is restricted to that
> first enon-detector step.
>
>
> ----
>
> Alternatively, if we assume all the input is of the same language, we
> just don't know what language it is ahead of time, then you could
> do several passes, where one is a detector pipeline like
>
>     lt-proc oci.automorf.bin | cg-proc enondetect.rlx.bin
>
> that outputs the STREAMCMD and then Apy would grep for that, and insert
> the STREAMCMD at the start of the call to the regular pipeline
>
>     lt-proc oci.automorf.bin | cg-proc oci.rlx.bin | …
>
> That won't automatically work in modes files, and won't work for corpus
> tests if the corpus has a mix, but OTOH you could use 'export
> AP_SETVAR=non-enon' to force the regular pipeline to insert the
> STREAMCMD at the start.
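
To make the variable mechanism quoted above easier to follow, here is a
toy simulation in Python. It is only a sketch of the behaviour described
in the quoted message, not how cg-proc is implemented: the function name
drop_enon_when_flagged and the regular expression are invented for this
illustration, and escaped characters in the Apertium stream format are
not handled. It assumes, as in CG, that a REMOVE rule never deletes the
last remaining reading of a cohort.

```python
import re

# Toy model of the stream: STREAMCMD blocks set or unset a variable,
# and cohorts look like ^surface/reading1/reading2$.
# (Hypothetical pattern for this sketch; stream escapes are ignored.)
TOKEN = re.compile(
    r"\[<STREAMCMD:(SETVARIABLE|REMVARIABLE):([^>\]]+)>\]|\^([^$]*)\$"
)


def drop_enon_when_flagged(stream, var="non-enon", tag="<enon>"):
    """Drop readings carrying `tag` from cohorts seen while `var` is set."""
    active = set()   # variables currently set in the stream
    out = []
    pos = 0
    for m in TOKEN.finditer(stream):
        out.append(stream[pos:m.start()])  # copy text between tokens as-is
        pos = m.end()
        if m.group(1):
            # A stream command: update the variable state, pass it through.
            if m.group(1) == "SETVARIABLE":
                active.add(m.group(2))
            else:
                active.discard(m.group(2))
            out.append(m.group(0))
        else:
            # A cohort: filter its readings only while the variable is set.
            surface, *readings = m.group(3).split("/")
            if var in active:
                kept = [r for r in readings if tag not in r]
                # Assumption: keep everything if nothing would be left.
                readings = kept or readings
            out.append("^" + "/".join([surface] + readings) + "$")
    out.append(stream[pos:])
    return "".join(out)
```

Fed the second example stream from the quoted message, this removes the
<enon> reading from the first "que" and leaves the one after the
REMVARIABLE command untouched.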
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
