Ahhhh, thank you so much (I needed to RTFM!) - R.

On Mon, Feb 19, 2024 at 12:44 PM G. Branden Robinson <g.branden.robin...@gmail.com> wrote:
> Hi Robert,
>
> At 2024-02-19T12:40:16-0500, Robert Goulding via wrote:
> > To answer my own question: It seems that preconv is not guessing the
> > correct encoding from the file with a single word in it. If I specify
> > -K utf-8 everything works OK.
> >
> > preconv -v reports: GNU preconv (groff) version 1.23.0 with iconv
> > support and with uchardet support
> >
> > Is this an expected shortcoming of preconv - that if a file contains
> > just a single accented character, it won't guess it correctly? The
> > original file it failed on was a 2-page pdf, which has the word
> > kataskeuê in the middle of it.
>
> Yes. The man page says:
>
>    Coding tags
>      Text editors that support more than a single character encoding
>      need tags within the input files to mark the file’s encoding.
>      While it is possible to guess the right input encoding with the
>      help of heuristics that produce good results for a preponderance of
>      natural language texts, they are not absolutely reliable.
>      Heuristics can fail on inputs that are too short or don’t represent
>      a natural language.
> [...]
>      The use of iconv means that characters in the input that encode
>      invalid code points for that encoding may be dropped from the
>      output stream or mapped to the Unicode replacement character
>      (U+FFFD). Compare the following examples using the input “café”
>      (note the “e” with an acute accent), which due to its short length
>      challenges inference of the encoding used.
>
>          printf 'caf\351\n' | LC_ALL=en_US.UTF-8 preconv
>          printf 'caf\351\n' | preconv -e us-ascii
>          printf 'caf\351\n' | preconv -e latin-1
>
>      The fate of the accented “e” differs in each case. In the first,
>      uchardet fails to detect an encoding (though the library on your
>      system may behave differently) and preconv falls back to the locale
>      settings, where octal 351 starts an incomplete UTF‐8 sequence and
>      results in the Unicode replacement character. In the second, it is
>      not a representable character in the declared input encoding of US‐
>      ASCII and is discarded by iconv. In the last, it is correctly
>      detected and mapped.
>
> Regards,
> Branden
>

-- 
Robert Goulding
Director, John J. Reilly Center for Science, Technology, and Values;
Assoc. Professor, Program of Liberal Studies,
Fellow, Medieval Institute,
University of Notre Dame.
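
For anyone who wants to see the three outcomes from the quoted man page excerpt without groff installed, here is a minimal sketch in Python. It is an illustration only: it uses Python's built-in codecs as a stand-in for iconv and does not call preconv or uchardet, so it shows the byte-level reasoning rather than preconv's actual code path. The helper name decode_as and the choice of error policies ("replace", "ignore") are assumptions made for the example.

```python
# Illustration only: Python codecs stand in for iconv; preconv is not invoked.
# Same bytes as `printf 'caf\351\n'` (octal 351 == hex E9).
RAW = b"caf\xe9\n"


def decode_as(data: bytes, encoding: str, errors: str) -> str:
    """Decode bytes with an explicit encoding and error policy (hypothetical helper)."""
    return data.decode(encoding, errors=errors)


# Case 1: input treated as UTF-8. 0xE9 opens a multi-byte sequence that is
# never completed, so it becomes the replacement character U+FFFD.
as_utf8 = decode_as(RAW, "utf-8", "replace")

# Case 2: input declared as US-ASCII. 0xE9 is outside the 7-bit range;
# the "ignore" policy drops it, mimicking the discard described above.
as_ascii = decode_as(RAW, "ascii", "ignore")

# Case 3: input declared as Latin-1. 0xE9 maps directly to U+00E9.
as_latin1 = decode_as(RAW, "latin-1", "strict")

for label, text in (("utf-8", as_utf8), ("us-ascii", as_ascii), ("latin-1", as_latin1)):
    print(f"{label:9}{text!r}")
```

The same single byte yields a replacement character, nothing at all, or the intended letter, depending only on the encoding that is assumed; with so short an input there is little for a heuristic to go on, which is why declaring the encoding explicitly is the reliable route.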