Re: [Apertium-stuff] Stop merging lines

Kevin Brubeck Unhammer Tue, 06 Nov 2018 11:37:39 -0800

Francis Tyers <[email protected]> wrote:

> Yes it does. It will put a sentence boundary after every word, meaning
> that you won't get reliable tagger output. Apertium as far as I know
> has no way to treat sentences as a sequence of lines. This is because
> of how the format handling works.
>
> I think it would really be an excellent feature though. Perhaps a
> GitHub issue? I do however think it would involve messing with quite a
> bit of the pipeline.


However, we *should* treat NULs as hard separators; if we don't,
apertium-apy (and thus www.apertium.org) risks sending output meant
for person1 to person2. (I have an inkling there might still be bugs in
apertium-transfer related to this.)

Anyway, if we at least handle NULs correctly in lt-proc and cg-proc,
you could turn line breaks into NULs (first deleting any existing NULs
in the corpus) and tag with the -z option to lt-proc and cg-proc:

    cat corpus.txt                                   \
    | tr -d '\0'                                     \
    | tr '\n' '\0'                                   \
    | apertium-deshtml -n                            \
    | lt-proc -z -w 'apertium-tat/tat.automorf.bin'  \
    | cg-proc -z 'apertium-tat/tat.rlx.bin'          \
    | cg-proc -z -w -1 'apertium-tat/dev/mansur.bin' \
    | tr '\0' '\n'                                   \
    | apertium-rehtml-noent

… finally turning the NULs back into newlines.

With apertium-nob, this doesn't seem to run slower than without -z, and
doesn't merge lines in my test corpus.
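
To check the "doesn't merge lines" part on your own corpus, something
like the sketch below might help. It only compares line counts before
and after the NUL round trip. The function names are made up for
illustration, and tag_stage is a stand-in (plain cat here) that you
would replace with the real lt-proc -z / cg-proc -z stages from the
pipeline above:

```shell
#!/bin/sh
# Hypothetical stand-in for the NUL-aware tagging stages.
# Replace the body with e.g. the lt-proc -z | cg-proc -z commands.
tag_stage() {
    cat
}

# Drop stray NULs, turn each newline into a NUL, run the stage,
# then turn the NULs back into newlines.
nul_roundtrip() {
    tr -d '\0' | tr '\n' '\0' | tag_stage | tr '\0' '\n'
}

# Succeeds only if the output has exactly as many lines as the input file.
same_line_count() {
    in_lines=$(wc -l < "$1" | tr -d ' ')
    out_lines=$(nul_roundtrip < "$1" | wc -l | tr -d ' ')
    [ "$in_lines" -eq "$out_lines" ]
}
```

Usage would be along the lines of

    same_line_count corpus.txt && echo "no lines merged or split"

Matching counts don't prove the analyses are right, only that no lines
were merged or split along the way.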


