Re: [Groff] mom : unicode in .INCLUDE'd files

Keith Marshall Sun, 23 Jul 2017 10:11:17 -0700

On 23/07/17 12:49, Ralph Corderoy wrote:
>> The simple reality is that, if we wish to preserve groff's current
>> utility on MS-Windows, insistence on UTF-8 only as an input encoding
>> is not a viable option.
> 
> I understand the Windows API has moved to UTF-16, but does that mean
> text files on disk are stored in that format?


Many of their tools do save in that format, yes.  Some also support 
saving with a single-byte or DBCS encoding, compatible with the 
current code page, but...

> I wouldn't have thought so else we'd hear of more problems with these
> files being unreadable by many programs that expect one byte per
> rune, or UTF-8.

...not UTF-8, (which is up to four bytes per "rune", so doesn't fit 
the single byte encoding model, the DBCS model, or the UTF-16LE model).
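
To put rough numbers on the "up to four bytes" point, here is a small 
Python sketch.  Python is used only for illustration; encoded_lengths 
is a made-up helper name, not anything in groff.  It reports how many 
bytes a single character occupies in UTF-8 and in UTF-16LE.

```python
# Illustration only: byte counts per character in UTF-8 vs. UTF-16LE.
# encoded_lengths is a hypothetical helper, not part of groff.

def encoded_lengths(ch):
    """Return (UTF-8 byte count, UTF-16LE byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

samples = [
    "A",           # U+0041, plain ASCII
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, outside the Basic Multilingual Plane
]

for ch in samples:
    u8, u16 = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {u8} byte(s), UTF-16LE {u16} byte(s)")
```

The UTF-8 width varies from one to four bytes, while UTF-16LE uses 
either two or four, which is why neither a fixed single byte reader 
nor a UTF-16LE reader can consume UTF-8 input unchanged.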

> If groff read its input a byte at a time from a "binary" file,
> parsing the multi-byte sequences that can occur in UTF-8, would that
> not work on Windows for its plain text files?

That could work, but with caveats.  For example, if it's a "binary" 
file, (opened in O_BINARY mode), then the C/C++ API will not swallow 
the '\r' bytes of CRLF line endings, so groff would need to handle 
them.  With a "text" file, (opened in O_TEXT mode), then those '\r' 
bytes are swallowed by the C/C++ API, so groff would never see them.  
OTOH, UTF-8 (beyond its pure ASCII subset) is not a plain text format 
on Windows, (since it conforms to neither the DBCS nor the UTF-16LE model, nor 
to any of the single byte encodings), so even creating such a file is 
a hurdle which not many Windows users will be willing to jump; (sure, 
it's not even difficult, e.g. to convert from UTF-16LE with iconv; 
it's just another obstacle to be overcome).
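
The byte-at-a-time approach discussed above might look roughly like 
the following Python sketch.  It is an assumption-laden illustration, 
not groff's implementation: decode_binary_input is a hypothetical 
name, and the CRLF handling simply mimics what a text-mode read would 
have done on Windows.

```python
import codecs

def decode_binary_input(raw: bytes) -> str:
    """Decode bytes read in binary mode as UTF-8, then fold CRLF to LF.

    Hypothetical helper for illustration; not part of groff.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    pieces = []
    # Feed one byte at a time; the decoder buffers incomplete
    # multi-byte sequences until the final byte arrives.
    for i in range(len(raw)):
        pieces.append(decoder.decode(raw[i:i + 1]))
    # final=True raises UnicodeDecodeError on a truncated sequence.
    pieces.append(decoder.decode(b"", final=True))
    text = "".join(pieces)
    # In binary mode the '\r' of each CRLF is still present,
    # so the reader has to drop it itself.
    return text.replace("\r\n", "\n")
```

For example, decode_binary_input("caf\u00e9\r\n".encode("utf-8")) 
yields the four characters of the word followed by a single '\n'.  
Input saved in a single byte code page would typically fail with 
UnicodeDecodeError instead, which is the practical hurdle described 
above.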

-- 
Regards,
Keith.
