Hi Raul,

part of what you say makes sense, see below for a few remarks.

But i have better news.

The itching now got bad enough that i finally sat down and integrated
preconv(1) into mandoc(1), a task i first talked about at BSDCan 2011
in Ottawa.  I already committed the result to the portable mandoc on
mdocml.bsd.lv.  I'll polish it up a bit and then commit it to OpenBSD,
too, probably in a few days.

  http://mdocml.bsd.lv/cgi-bin/cvsweb/?cvsroot=mdocml&sortby=date

What it does is this:  When you run mandoc(1) as you always did, with
no new options, it guesses the encoding, transforms UTF-8 or ISO-8859-1
input characters to roff(7) escape sequences on the fly, at the input
stage, so (almost) everything is going to magically work as expected.

You can specify the input encoding on the command line, with a BOM,
or with an emacs mode line, but usually, none of that will be needed
to properly handle UTF-8.

Oh, but still, even if it seems to work with mandoc(1) in the future,
don't start putting non-ASCII characters into manuals!

Yours,
  Ingo


P.S.
Some additional comments:

Raul Miller wrote on Fri, Oct 24, 2014 at 08:21:40PM -0400:

> I have been following this list for some time, and I feel I might have
> something to add here. Please do not hesitate to tell me to shut up if this
> just makes things worse.
>
> --------------------------
>
> Here is a list of things that might be spaces in an otherwise
> ascii-appearing document, according to the wikipedia page on
> non-breaking space:

It is rarely useful to study the union of several different languages.
It is almost always better for understanding to agree on one language
beforehand and then stick to it.

In the case at hand, the language we are talking about is roff(7).
In that language, there is exactly one notation for the unpaddable
non-breaking space character (`\ ') and exactly one notation for
the paddable non-breaking space character (`\~').  That would be
0x5c20 and 0x5c7e, neither of which appears in your list, while
none of the sequences in your list are valid roff(7).

> Anyways, I guess the point here is that if you "automatically fix" some
> documents you will "automatically break" some other documents.

In this case, not really.  Right now, we just render any non-ASCII
byte as a question mark (`?').  So right now, every non-ASCII document
renders poorly.  From where we stand, it can only get better.
Choosing our way, the trick is to follow established choices
as far as possible and reasonable.

> As a general rule, the best solution is the one that combines
> simplicity of documentation with simplicity of implementation,
> as well as tolerable failure modes.

Indeed.  My patch enhances functionality and removes more than
300 lines of code and documentation from portable mandoc.

> That said, personally I tend to favor a warning message for the
> case where out-of-standard characters are encountered.

Yes, that's what i will do.  Downgrade most mandoc(1) ERRORs in this
area to WARNINGs, except when the document contradicts itself or the
command line options, in which cases ERRORs remain appropriate.

> But I suppose it's reasonable to create a new standard.
> Perhaps "ascii with certain non-breaking spaces"?
> There's always room for more standards?

Already exists internally in mandoc, look for ASCII_* in mandoc.h.
But we certainly don't want to make that a new *public* standard.
Besides, all this is not just about spaces, that's merely one example.
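
For illustration only, here is a rough sketch of the input-stage
conversion described at the top of this message.  It is not the
mandoc(1) code, which is written in C.  The function name to_roff
is made up for this example, the groff-style \[uXXXX] escape is
assumed as the output notation, and the guess is made once for the
whole input.  Command line options and emacs mode lines are not
handled at all; only a leading UTF-8 BOM is recognized.

```python
def to_roff(data: bytes) -> str:
    """Turn raw manual-page bytes into ASCII-only roff(7) text.

    Hypothetical helper for illustration; not part of mandoc(1).
    """
    try:
        # First guess: UTF-8.  "utf-8-sig" also drops a leading BOM.
        text = data.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Fallback guess: ISO-8859-1, where every byte is a valid character.
        text = data.decode("iso-8859-1")

    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            # ASCII passes through unchanged.
            out.append(ch)
        else:
            # Assumed target notation: \[uXXXX] with at least four hex digits.
            out.append("\\[u%04X]" % cp)
    return "".join(out)


if __name__ == "__main__":
    # The same word in both encodings yields the same ASCII-only output.
    print(to_roff("na\u00efve".encode("utf-8")))       # na\[u00EF]ve
    print(to_roff("na\u00efve".encode("iso-8859-1")))  # na\[u00EF]ve
```

Trying UTF-8 first works because ISO-8859-1 text containing non-ASCII
bytes is rarely valid UTF-8, so a failed decode is a reasonable hint
to fall back.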