Hi,

I can help write rules to detect whether or not a text contains enunciatives.

As for recognising which variety of Occitan we are translating: we are currently developing a tool that can distinguish the dialects of Occitan, but it isn't very accurate yet. Between Gascon and the other dialects it works fine, because Gascon is so different (except for Aranese, which is a subdialect of Gascon). But Languedocien is often confused with the other varieties (Provençal, Limousin...).

At first, I was thinking about adding an "all dialects" dialect for the oc->fr direction. It would be useful when people don't know the dialect of a text, or for texts that mix several dialects (e.g. a website with articles in several dialects, like a newspaper). Has something like this already been done for another language? Could it be done easily?

Thanks

Aura Séguier, project manager and developer
Lo Congrès permanent de la lenga occitana
Ciutat - Creem !, 5-7 rue de la Fontaine, 64000 Pau
T. +33 (0)5 32 00 00 64
[email protected]
http://www.locongres.org
On 04/11/2022 at 09:30, Kevin Brubeck Unhammer wrote:
What if you do

lt-proc oci.automorf.bin | cg-proc enondetect.rlx.bin | cg-proc oci.rlx.bin | …

The first CG step would output a stream variable, so that what the next
step sees is

[<STREAMCMD:SETVARIABLE:non-enon>]
^que/que<enon>/que<itg>$
[more text here]

If the next step is CG, it's just

  REMOVE:var-is-set (enon) IF (0 (VAR:non-enon)) ;

i.e. remove enunciatives whenever the variable is set.

One can also unset it in the middle of the stream (if doing corpus
runs), so the output of the enon-detector is

[<STREAMCMD:SETVARIABLE:non-enon>]
^que/que<enon>/que<itg>$
[more text here]
[<STREAMCMD:REMVARIABLE:non-enon>]
^que/que<enon>/que<itg>$
[more text here]

and the REMOVE:var-is-set rule will remove enunciatives in the first
part, but not after it has seen the REMVARIABLE.
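
As a rough illustration of the behaviour described above (not of how
cg-proc is actually implemented), here is a small Python sketch that
walks a stream line by line, tracks SETVARIABLE/REMVARIABLE commands, and
drops <enon> readings only while the variable is set. The function names
and the simplified regular expressions are assumptions made for this
example; the real stream format has escaping rules that are ignored here.

```python
import re

# Simplified patterns for the stream snippets shown above (assumption:
# one command or one run of lexical units per line, no escaping).
CMD = re.compile(r"\[<STREAMCMD:(SETVARIABLE|REMVARIABLE):([^>\]]+)>\]")
LU = re.compile(r"\^([^$]*)\$")


def strip_enon(stream: str, var: str = "non-enon") -> str:
    """Drop <enon> readings from lexical units while `var` is set."""
    active = set()
    out = []
    for line in stream.splitlines():
        cmd = CMD.fullmatch(line.strip())
        if cmd:
            # Track the variable state, and pass the command through.
            if cmd.group(1) == "SETVARIABLE":
                active.add(cmd.group(2))
            else:
                active.discard(cmd.group(2))
            out.append(line)
            continue
        if var in active:
            line = LU.sub(_remove_enon, line)
        out.append(line)
    return "\n".join(out)


def _remove_enon(match):
    surface, *readings = match.group(1).split("/")
    kept = [r for r in readings if "<enon>" not in r]
    # Mimic CG's default of never removing the last remaining reading.
    if not kept:
        kept = readings
    return "^" + "/".join([surface] + kept) + "$"
```

Fed the two-part stream above, this keeps only que<itg> in the first
part and leaves both readings in place after the REMVARIABLE.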


Then the problem of looking several windows ahead is restricted to that
first enon-detector step.


----

Alternatively, if we assume all the input is in the same language, but
we don't know which one ahead of time, you could do several passes,
where one is a detector pipeline like

lt-proc oci.automorf.bin | cg-proc enondetect.rlx.bin

that outputs the STREAMCMD; APy would then grep for it and insert
the STREAMCMD at the start of the call to the regular pipeline

lt-proc oci.automorf.bin | cg-proc oci.rlx.bin | …

That won't automatically work in modes files, and won't work for corpus
tests if the corpus has a mix, but OTOH you could use 'export
AP_SETVAR=non-enon' to force the regular pipeline to insert the
STREAMCMD at the start.
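
The "grep and insert" step could look roughly like the Python sketch
below. It is only an assumption about how such glue code might be
written, not existing APy code: with_detected_streamcmd is a
hypothetical helper that takes the text printed by the detector pipeline
and the original input, and puts the first SETVARIABLE command it finds
in front of the input before that is sent to the regular pipeline.

```python
import re

# Matches a command such as [<STREAMCMD:SETVARIABLE:non-enon>]
STREAMCMD = re.compile(r"\[<STREAMCMD:SETVARIABLE:[^>\]]+>\]")


def with_detected_streamcmd(detector_output: str, text: str) -> str:
    """Prefix `text` with the first SETVARIABLE command in detector_output.

    Hypothetical helper: running the two pipelines is left to the caller.
    """
    found = STREAMCMD.search(detector_output)
    if found is None:
        # The detector set no variable, so the input goes through unchanged.
        return text
    return found.group(0) + "\n" + text
```

Keeping the helper free of any subprocess calls means it can be tried
on saved detector output without having the pipelines installed.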



_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff