On 23/07/17 12:49, Ralph Corderoy wrote:
>> The simple reality is that, if we wish to preserve groff's current
>> utility on MS-Windows, insistence on UTF-8 only as an input encoding
>> is not a viable option.
>
> I understand the Windows API has moved to UTF-16, but does that mean
> text file on disk are stored in that format?

Many of their tools do save in that format, yes.  Some also support
saving with a single-byte or DBCS encoding, compatible with the current
code page, but...

> I wouldn't have thought so else we'd hear of more problems with these
> files being unreadable by many programs that expect one byte per
> rune, or UTF-8.

...not UTF-8, (which is up to four bytes per "rune", so doesn't fit the
single byte encoding model, the DBCS model, or the UTF-16LE model).

> If groff read its input a byte at a time from a "binary" file,
> parsing the multi-byte sequences that can occur in UTF-8, would that
> not work on Windows for its plain text files?

That could work, but with caveats.  For example, if it's a "binary"
file, (opened in O_BINARY mode), then the C/C++ API will not swallow
the '\r' bytes of CRLF line endings, so groff would need to handle
them.  With a "text" file, (opened in O_TEXT mode), those '\r' bytes
are swallowed by the C/C++ API, so groff would never see them.

OTOH, UTF-8 (beyond its pure ASCII subset) is not a plain text format
on Windows, (since it conforms to neither the DBCS nor the UTF-16LE
model, nor to any of the single byte encodings), so even creating such
a file is a hurdle which not many Windows users will be willing to
jump; (sure, it's not even difficult, e.g. to convert from UTF-16LE
with iconv; it's just another obstacle to be overcome).

-- 
Regards,
Keith.
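
To illustrate the binary-mode caveat, here is a minimal sketch in C of what a reader would have to do for itself if the runtime no longer swallows the '\r' bytes, together with the lead-byte check that shows why UTF-8 needs up to four bytes per character.  Both helper names (strip_cr_before_lf, utf8_seq_len) are hypothetical; this is not groff's code, and it assumes the input has already been read into a buffer.

```c
#include <stddef.h>

/* Hypothetical helper, not part of groff: return the number of bytes
 * in the UTF-8 sequence introduced by this lead byte, or 0 if the byte
 * cannot start a sequence (continuation byte or invalid value). */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;                       /* plain ASCII */
    if (lead >= 0xC2 && lead <= 0xDF)
        return 2;
    if (lead >= 0xE0 && lead <= 0xEF)
        return 3;
    if (lead >= 0xF0 && lead <= 0xF4)
        return 4;
    return 0;
}

/* Hypothetical helper, not part of groff: collapse each CRLF pair to
 * LF, in place, as text mode would have done.  A lone '\r' is kept.
 * Returns the new length of the buffer contents. */
static size_t strip_cr_before_lf(char *buf, size_t len)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\r' && i + 1 < len && buf[i + 1] == '\n')
            continue;                   /* drop the '\r' of a CRLF pair */
        buf[out++] = buf[i];
    }
    return out;
}
```

Note that this sketch only handles a CRLF pair that lies wholly within one buffer; a reader that processes input in chunks would also need to remember a trailing '\r' across chunk boundaries.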