Hi Raul,

Part of what you say makes sense; see below for a few remarks.

But i have better news.  The itching now got bad enough that i
finally sat down and integrated preconv(1) into mandoc(1), a task
i first talked about at BSDCan 2011 in Ottawa.  I already committed
the result to the portable mandoc on mdocml.bsd.lv.  I'll polish
it up a bit and then commit it to OpenBSD, too, probably in a few
days.

  http://mdocml.bsd.lv/cgi-bin/cvsweb/?cvsroot=mdocml&sortby=date

What it does is this:  When you run mandoc(1) as you always did,
with no new options, it guesses the encoding and transforms UTF-8
or ISO-8859-1 input characters to roff(7) escape sequences on
the fly, at the input stage, so (almost) everything is going to
magically work as expected.  You can specify the input encoding
on the command line, with a BOM, or with an emacs mode line,
but usually, none of that will be needed to properly handle UTF-8.
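
To illustrate the idea only: the following is a minimal sketch in
Python, not the actual mandoc(1) code, which is written in C.  The
function name to_roff_escapes is made up for this example, and the
guessing rule (try strict UTF-8, else fall back to ISO-8859-1) is an
assumption about one reasonable heuristic.  The \[uXXXX] notation is
the Unicode escape form that groff's preconv(1) emits.

```python
def to_roff_escapes(raw: bytes) -> str:
    """Hypothetical helper: guess the encoding, escape non-ASCII input."""
    # Assumption: a leading UTF-8 BOM is dropped rather than escaped.
    if raw.startswith(b"\xef\xbb\xbf"):
        raw = raw[3:]
    try:
        # Strict decoding fails on any invalid UTF-8 sequence.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to exactly one code point.
        text = raw.decode("iso-8859-1")
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                 # ASCII passes through unchanged
        else:
            out.append("\\[u%04X]" % cp)   # roff Unicode escape
    return "".join(out)


print(to_roff_escapes(b"na\xc3\xafve"))  # UTF-8 input
print(to_roff_escapes(b"na\xefve"))      # ISO-8859-1 input
```

Both calls print na\[u00EF]ve, because the second input is not valid
UTF-8 and so falls through to the ISO-8859-1 branch.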

Oh, but still, even if it seems to work with mandoc(1) in the
future, don't start putting non-ASCII characters into manuals!

Yours,
  Ingo


P.S.
Some additional comments:

Raul Miller wrote on Fri, Oct 24, 2014 at 08:21:40PM -0400:

> I have been following this list for some time, and I feel I might have
> something to add here. Please do not hesitate to tell me to shut up if this
> just makes things worse.
> 
> --------------------------
> 
> Here is a list of things that might be spaces in an otherwise
> ascii-appearing document, according to the wikipedia page on
> non-breaking space:

It is rarely useful to study the union of several different languages.
It is almost always better for understanding to agree on one language
beforehand and then stick to it.  In the case at hand, the language
we are talking about is roff(7).  In that language, there is exactly
one notation for the unpaddable non-breaking space character (`\ ')
and exactly one notation for the paddable non-breaking space
character (`\~').  That would be 0x5c20 and 0x5c7e, neither of which
appears in your list, while none of the sequences in your list are
valid roff(7).
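
For anyone who wants to check those byte values, a quick sketch in
Python (the variable names are just for this example):

```python
# The two roff(7) non-breaking space escapes, as raw ASCII bytes.
unpaddable = b"\\ "   # backslash followed by a space
paddable = b"\\~"     # backslash followed by a tilde

print(unpaddable.hex(), paddable.hex())  # prints: 5c20 5c7e
```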

> Anyways, I guess the point here is that if you "automatically fix" some
> documents you will "automatically break" some other documents.

In this case, not really.  Right now, we just render any non-ASCII
byte as a question mark (`?').  So right now, every non-ASCII
document renders poorly.  From where we stand, it can only get
better.  In choosing our way forward, the trick is to follow
established choices as far as possible and reasonable.

> As a general rule, the best solution is the one that combines
> simplicity of documentation with simplicity of implementation,
> as well as tolerable failure modes.

Indeed.  My patch enhances functionality and removes more
than 300 lines of code and documentation from portable mandoc.

> That said, personally I tend to favor a warning message for the
> case where out-of-standard characters are encountered.

Yes, that's what i will do.  Downgrade most mandoc(1) ERRORs
in this area to WARNINGs, except when the document contradicts
itself or the command line options, in which cases ERRORs remain
appropriate.

> But I suppose it's reasonable to create a new standard.
> Perhaps "ascii with certain non-breaking spaces"?
> There's always room for more standards?

That already exists internally in mandoc; look for ASCII_* in mandoc.h.
But we certainly don't want to make that a new *public* standard.
Besides, all this is not just about spaces, that's merely one example.
