Hi Robert,

At 2024-02-19T12:40:16-0500, Robert Goulding via wrote:
> To answer my own question: It seems that preconv is not guessing the
> correct encoding from the file with a single word in it. If I specify
> -K utf-8 everything works OK.
>
> preconv -v reports: GNU preconv (groff) version 1.23.0 with iconv
> support and with uchardet support
>
> Is this an expected shortcoming of preconv - that if a file contains
> just a single accented character, it won't guess it correctly? The
> original file it failed on was a 2-page pdf, which has the word
> kataskeuê in the middle of it.

Yes. The man page says:

  Coding tags
    Text editors that support more than a single character encoding
    need tags within the input files to mark the file’s encoding.
    While it is possible to guess the right input encoding with the
    help of heuristics that produce good results for a preponderance
    of natural language texts, they are not absolutely reliable.
    Heuristics can fail on inputs that are too short or don’t
    represent a natural language.

  [...]

    The use of iconv means that characters in the input that encode
    invalid code points for that encoding may be dropped from the
    output stream or mapped to the Unicode replacement character
    (U+FFFD). Compare the following examples using the input “café”
    (note the “e” with an acute accent), which due to its short
    length challenges inference of the encoding used.

        printf 'caf\351\n' | LC_ALL=en_US.UTF-8 preconv
        printf 'caf\351\n' | preconv -e us-ascii
        printf 'caf\351\n' | preconv -e latin-1

    The fate of the accented “e” differs in each case. In the first,
    uchardet fails to detect an encoding (though the library on your
    system may behave differently) and preconv falls back to the
    locale settings, where octal 351 starts an incomplete UTF-8
    sequence and results in the Unicode replacement character. In the
    second, it is not a representable character in the declared input
    encoding of US-ASCII and is discarded by iconv. In the last, it
    is correctly detected and mapped.

Regards,
Branden
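
The three outcomes the man page describes can be reproduced without groff. The following is a minimal sketch in Python, assuming its built-in codecs are an acceptable stand-in for iconv; it is not how preconv is implemented, and the names RAW and decode_cases are hypothetical. It only shows what happens to the single byte 0xE9 (octal 351) under each assumed encoding.

```python
# Illustration only: Python's codecs stand in for iconv here.
# RAW and decode_cases are hypothetical names, not part of groff.

RAW = b"caf\xe9\n"  # the same bytes that printf 'caf\351\n' emits


def decode_cases(raw):
    """Decode raw under the three encodings from the man page example."""
    return {
        # Fallback to a UTF-8 locale: 0xE9 opens a multi-byte sequence
        # that the following newline cannot complete, so the decoder
        # substitutes U+FFFD.
        "utf-8": raw.decode("utf-8", errors="replace"),
        # Declared US-ASCII: 0xE9 is outside the 7-bit range and is
        # dropped (errors="ignore" mimics iconv discarding it).
        "us-ascii": raw.decode("ascii", errors="ignore"),
        # Declared Latin-1: 0xE9 maps directly to U+00E9.
        "latin-1": raw.decode("latin-1"),
    }


if __name__ == "__main__":
    for name, text in decode_cases(RAW).items():
        print(f"{name:9} {text!r}")
```

Only the Latin-1 case preserves the accented character, which is why declaring the encoding explicitly is the dependable fix for very short inputs.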