On 2017-11-08 13:56, Tommi A Pirinen wrote:
[Replies inline]
On Wed, 08 Nov 2017 13:25:59 +0100
Francis Tyers <[email protected]> wrote:
My question is: is this ever the right thing to do? I
struggle to come up with use cases for it. I'm not
sure how hard it would be to fix, but I thought I'd start
a discussion.
I've been bitten by this, most recently in the previous WMT shared task.
It seems to me that, both in the common MT material from outside and in
my own corpora, by far the most common and most useful "plain" text file
format is newline-separated sentences (with double newlines between
paragraphs and titles), usually as a parallel corpus with either
line-by-line matches or some mapping between the lines. Retaining the
line structure is crucial for operating on these files, including with
external tools, automatic evaluation and so forth.
It seems to me that the default operation was meant for real plain text
that is formatted with a roughly constant line width, like the
Gutenberg corpus or Wikipedia scrapings. That is probably fine for
what it is, but it is not what we usually work on.
I guess what is basically needed is file formats and {de,re}formatters
for these two plain text formats. I also think that the command line
should default to the hard-line-break interpretation; the second form
is more like document translation, which is more commonly done
with graphical or web tools.
I think that's probably a good compromise. Have two separate
deformatters: one for plain text, and the other for "hard newline"
plain text. I'd probably keep the default as the "soft newline" one and
make the "hard newline" one available through a -f option. The hard
newline one could also use apertium-destxt -n by default to avoid
adding extra '.'.
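
To make the two interpretations concrete, here is a minimal Python sketch.
It is not Apertium's deformatter code; the function names
soft_newline_units and hard_newline_units are made up for illustration,
and it only shows how the same input splits into units under each reading.

```python
import re


def soft_newline_units(text):
    """Soft newlines: a single newline is just whitespace.

    Only blank lines separate units, so wrapped lines are joined.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return [" ".join(p.split()) for p in paragraphs if p.strip()]


def hard_newline_units(text):
    """Hard newlines: every input line is its own unit.

    Empty lines are kept as empty units so that line N of the output
    still corresponds to line N of the input (e.g. in a parallel corpus).
    """
    return [" ".join(line.split()) for line in text.splitlines()]


sample = "First sentence .\nSecond sentence .\n\nA title\n"

print(soft_newline_units(sample))
print(hard_newline_units(sample))
```

With soft newlines the first two sentences end up in one unit; with hard
newlines the unit count equals the input line count, which is what
line-aligned corpora need.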
One possible use case is people wanting to provide their own
tokenisation to Apertium analysers. E.g. they don't want to have all
the multiword units in their output and want to control how many tokens
there are per sentence. Something like, suppose we have
^get away<n><sg>$ as a multiword, a person could make a file like:
This
is
a
get
away
.
$ cat file | apertium -f txtn eng-morph
and have ^get$ ^away$ as separate tokens.
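
A toy Python sketch of that idea follows. It is not the Apertium
analyser: MULTIWORDS is a hypothetical one-entry lexicon and tokenise is
a simple greedy matcher, used only to show why treating each line as a
hard boundary keeps "get" and "away" apart.

```python
# Hypothetical toy lexicon with a single two-word multiword entry.
MULTIWORDS = {("get", "away")}


def tokenise(words):
    """Greedy left-to-right matching of two-word multiwords."""
    tokens = []
    i = 0
    while i < len(words):
        if tuple(words[i:i + 2]) in MULTIWORDS:
            tokens.append(" ".join(words[i:i + 2]))
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens


def analyse_soft(text):
    """Soft newlines: lines are joined, so a multiword can span them."""
    return tokenise(text.split())


def analyse_hard(text):
    """Hard newlines: each line is tokenised on its own."""
    tokens = []
    for line in text.splitlines():
        tokens.extend(tokenise(line.split()))
    return tokens


one_per_line = "This\nis\na\nget\naway\n.\n"

print(analyse_soft(one_per_line))
print(analyse_hard(one_per_line))
```

Under the soft reading the toy matcher merges "get away" into one token;
under the hard reading the user's one-token-per-line segmentation is
preserved.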
Does this seem reasonable to people?
Fran
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff