Re: Accent mystery

Robert Goulding via Mon, 19 Feb 2024 09:46:56 -0800

Ahhhh, thank you so much (I needed to RTFM!) - R.

On Mon, Feb 19, 2024 at 12:44 PM G. Branden Robinson <g.branden.robin...@gmail.com> wrote:


> Hi Robert,
>
> At 2024-02-19T12:40:16-0500, Robert Goulding via wrote:
> > To answer my own question: It seems that preconv is not guessing the
> > correct encoding from the file with a single word in it.  If I specify
> > -K utf-8 everything works OK.
> >
> > preconv -v reports: GNU preconv (groff) version 1.23.0 with iconv
> > support and with uchardet support
> >
> > Is this an expected shortcoming of preconv - that if a file contains
> > just a single accented character, it won't guess it correctly? The
> > original file it failed on was a 2-page pdf, which has the word
> > kataskeuê in the middle of it.
>
> Yes.  The man page says:
>
>    Coding tags
>      Text editors that support more than a single character encoding
>      need tags within the input files to mark the file’s encoding.
>      While it is possible to guess the right input encoding with the
>      help of heuristics that produce good results for a preponderance of
>      natural language texts, they are not absolutely reliable.
>      Heuristics can fail on inputs that are too short or don’t represent
>      a natural language.
> [...]
>      The use of iconv means that characters in the input that encode
>      invalid code points for that encoding may be dropped from the
>      output stream or mapped to the Unicode replacement character
>      (U+FFFD).  Compare the following examples using the input “café”
>      (note the “e” with an acute accent), which due to its short length
>      challenges inference of the encoding used.
>             printf 'caf\351\n' | LC_ALL=en_US.UTF-8 preconv
>             printf 'caf\351\n' | preconv -e us-ascii
>             printf 'caf\351\n' | preconv -e latin-1
>      The fate of the accented “e” differs in each case.  In the first,
>      uchardet fails to detect an encoding (though the library on your
>      system may behave differently) and preconv falls back to the locale
>      settings, where octal 351 starts an incomplete UTF‐8 sequence and
>      results in the Unicode replacement character.  In the second, it is
>      not a representable character in the declared input encoding of
>      US‐ASCII and is discarded by iconv.  In the last, it is correctly
>      detected and mapped.
>
> Regards,
> Branden
>
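
For anyone who wants to see what the three cases in the man page excerpt come down to at the byte level, here is a rough sketch. It deliberately uses plain iconv and od rather than preconv, so it only illustrates the encodings involved and does not reproduce preconv's detection logic. The two function names are made up for this example, and it assumes an iconv that knows the names ISO-8859-1 and UTF-8.

```shell
# Hypothetical helper: convert "caf" + byte 0xE9 (octal 351) + newline from
# Latin-1 to UTF-8 and print the resulting bytes as hex with no separators.
latin1_cafe_as_utf8_hex() {
  printf 'caf\351\n' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n'
}

# Hypothetical helper: succeed only if the same raw bytes are rejected when
# read as UTF-8. 0xE9 opens a three-byte sequence, but a newline follows it.
raw_cafe_is_invalid_utf8() {
  ! printf 'caf\351\n' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

latin1_cafe_as_utf8_hex; echo          # prints 636166c3a90a
raw_cafe_is_invalid_utf8 && echo "raw bytes are not valid UTF-8"
```

Read as Latin-1, the byte becomes the two-byte UTF-8 sequence c3 a9, which is "é". Read as UTF-8, it is an incomplete sequence, which matches the replacement-character outcome described above. When the real encoding of a file is known, declaring it with preconv's -e option (or groff's -K, as reported earlier in this thread) avoids relying on the guess.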


-- 
Robert Goulding
Director, John J. Reilly Center for Science, Technology, and Values;
Assoc. Professor, Program of Liberal Studies,
Fellow, Medieval Institute,
University of Notre Dame.
