Re: [Apertium-stuff] New Occitan-French release

Aure Séguier Fri, 04 Nov 2022 04:59:58 -0700

Hi,

I can help write rules to detect whether or not a text contains enunciatives.

About recognising which variety of Occitan we are translating, we are currently developing a tool that can differentiate every dialect of Occitan, but it isn't very reliable. Between Gascon and other dialects, it's OK, because Gascon is so different (except for Aranese, which is a subdialect of Gascon). But between Languedocien and other varieties (Provençal, Limousin...) there is a lot of confusion.

At first, I was thinking about adding an "all dialects" dialect for the oc->fr direction. It would be useful when people don't know the dialect of a text, or for texts with many dialects (e.g. a website with articles in many dialects, like newspapers). Has that already been done for another language? Could it be done easily?


Thanks

Aura Séguier, project manager and developer
Lo Congrès permanent de la lenga occitana
Ciutat - Creem !, 5-7 rue de la Fontaine, 64000 Pau
T. +33 (0)5 32 00 00 64
[email protected] <mailto:[email protected]>
www.locongres.org <http://www.locongres.org>

On 04/11/2022 at 09:30, Kevin Brubeck Unhammer wrote:

What if you do

lt-proc oci.automorf.bin | cg-proc enondetect.rlx.bin | cg-proc oci.rlx.bin | …

The first CG step would output a stream variable, so that what the next
step sees is

[<STREAMCMD:SETVARIABLE:non-enon>]
^que/que<enon>/que<itg>$
[more text here]

If the next step is CG, it's just

  REMOVE:var-is-set (enon) IF (0 (VAR:non-enon)) ;

i.e. remove enunciatives whenever the variable is set.

One can also unset it in the middle of the stream (if doing corpus
runs), so output of the enon-detector is

[<STREAMCMD:SETVARIABLE:non-enon>]
^que/que<enon>/que<itg>$
[more text here]
[<STREAMCMD:REMVARIABLE:non-enon>]
^que/que<enon>/que<itg>$
[more text here]

and the REMOVE:var-is-set rule will remove enunciatives in the first
part, not after seeing the REMVARIABLE.
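
To make the set/unset behaviour concrete, here is a rough Python simulation of what the second step does with the stream. It is only a sketch and not how cg-proc is implemented: the function name drop_enon and the regular expression are made up for illustration, the parsing is naive (no handling of escaped characters), and it assumes that the last remaining reading of a word is never removed.

```python
import re

# Matches either a stream command or one lexical unit (^surface/reading/...$).
# Naive on purpose: escaped characters in the stream are not handled.
PATTERN = re.compile(
    r"\[<STREAMCMD:(SETVARIABLE|REMVARIABLE):([^>\]]+)>\]"
    r"|\^([^$]*)\$"
)


def drop_enon(stream, var="non-enon", tag="<enon>"):
    """Hypothetical helper: drop readings carrying `tag` while `var` is set."""
    active = set()  # variables currently set at this point in the stream

    def handle(match):
        command, name, unit = match.groups()
        if command == "SETVARIABLE":
            active.add(name)
        elif command == "REMVARIABLE":
            active.discard(name)
        elif var in active:
            surface, *readings = unit.split("/")
            kept = [r for r in readings if tag not in r]
            if kept:  # assumption: never remove the last reading
                return "^" + "/".join([surface] + kept) + "$"
        return match.group(0)  # commands and untouched units pass through

    # re.sub visits matches left to right, so `active` follows stream order.
    return PATTERN.sub(handle, stream)
```

Run on the example stream above, the first que keeps only its <itg> reading, while the que after the REMVARIABLE keeps both readings.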


Then the problem of looking several windows ahead is restricted to that
first enon-detector step.


----

Alternatively, if we assume all the input is in the same language, but
we don't know ahead of time which one, then you could do several
passes, where one is a detector pipeline like

lt-proc oci.automorf.bin | cg-proc enondetect.rlx.bin

that outputs the STREAMCMD and then Apy would grep for that, and insert
the STREAMCMD at the start of the call to the regular pipeline

lt-proc oci.automorf.bin | cg-proc oci.rlx.bin | …

That won't automatically work in modes files, and won't work for corpus
tests if the corpus has a mix, but OTOH you could use 'export
AP_SETVAR=non-enon' to force the regular pipeline to insert the
STREAMCMD at the start.
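
The "grep for the STREAMCMD and insert it at the start" step could look roughly like the following Python sketch. This is not existing APy code: the function name prepend_detected_command is hypothetical, and it assumes the detector prints the command in the same bracketed form as in the examples above and that the regular pipeline accepts it at the very start of its input.

```python
import re

# Assumed form of the command as printed by the detector pipeline.
STREAMCMD = re.compile(r"\[<STREAMCMD:SETVARIABLE:[^>\]]+>\]")


def prepend_detected_command(detector_output, original_input):
    """Hypothetical helper: copy the first SETVARIABLE command found in the
    detector output to the front of the text for the regular pipeline."""
    match = STREAMCMD.search(detector_output)
    if match is None:
        # Nothing detected: hand the input over unchanged.
        return original_input
    return match.group(0) + "\n" + original_input
```

The returned string would then be what gets piped into the regular pipeline in the second pass.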



_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
