On Thu, May 18, 2006 at 12:23:58AM +0300, Eugeniy Meshcheryakov wrote:
> On 17 May 2006 at 22:00 +0200, Jens Seidel wrote:
> > On Fri, May 12, 2006 at 08:19:50PM +0300, Eugeniy Meshcheryakov wrote:
> > UTF-8 support is really a nice add on, even if it should not be
> > necessary to support Ukrainian language as Russian demonstrates and
> > considering the fact that Ukrainian shares the same alphabet as Russian
> > (except of course the additional i/I character).
> ..and є/Є, and ї/Ї, and ґ/Ґ.

Thanks, I didn't know this. Even my Russian colleagues didn't know this IIRC.

> Different characters is not the biggest
> problem. Unicode makes it possible to use more characters (like em-dash
> or quotation marks) than 8-bit encoding (like KOI8-U), and next stable
> Debian release is going to be UTF-8 by default. So I think UTF-8 is a
> good choice.

Agreed.

> > The only problem I could imagine is that SGML will not or wrongly complain
> > about
> > invalid characters. I have to check this.
> >
> > > -DESCSET 128 32 UNUSED
> > > +DESCSET 128 32 32
> > > @@ -23,10 +23,7 @@
> > > SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9
> > > 10 11 12 13 14 15 16 17 18 19
> > > 20 21 22 23 24 25 26 27 28 29
> > > - 30 31 127 128 129
> > > - 130 131 132 133 134 135 136 137 138 139
> > > - 140 141 142 143 144 145 146 147 148 149
> > > - 150 151 152 153 154 155 156 157 158 159
> > > + 30 31 127
> >
> > A stupid question from my side, but could you please explain this?
> > That's Ardo's code and I'm not familiar with it.
> This part of patch fixes problem that sgml processor complains about bad
> characters in UTF-8 text (at least written in Ukrainian).

Yep.

> Second part tells sgml processor to not ignore characters in range
> 128-159.
>
> So effect of those two parts is - sgml processor handles characters with
> codes 128-159 as usual (allowed) characters.

OK. But 0-31 and 127 are still rejected, right? I assume these numbers do
not refer to UTF-8 characters but to single bytes. This makes UTF-8
characters consisting of two bytes with a second byte of this range
invalid!? Can you confirm this? On the other hand, these characters are
currently not supported at all. Any reason not to remove 0-31 and 127 as
well (except that they would then be accepted in the first byte as well,
which is bad)?

> > I wonder why you do not add a 8 bit encoding as well, but maybe it should
> It can be done, but I do not see good reason to do this.
> If someone need
> to have sgml *source* in other encoding, support for this can be added
> later.

Right.

> > > + 'pdfhyperref' => 'unicode'
> >
> > If I remember correctly this is only supported in Acrobat to properly
> > show bookmarks. xpdf and other PDF viewer just display garbage
> > (independent of the unicode option).
> As I can see that bookmarks are supported in evince too. There is

Thanks, I didn't know this.

> also a patch for xpdf but I did not try it. And you are right, without
> this option all viewers will display garbage.

Jens
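
P.S. On the byte-range question above (whether shunning 0-31 and 127 can hit the second byte of a two-byte UTF-8 character), a quick check is possible outside of SGML. The sketch below is only an illustration in Python, not part of the patch. It assumes, as I did above, that the SHUNCHAR numbers are compared against single bytes; the names `STILL_SHUNNED`, `NEWLY_ALLOWED` and `bytes_in` are made up for this example. Since every byte of a multi-byte UTF-8 sequence is 128 or higher, the remaining shunned values should only ever match ASCII control characters, while the 128-159 range is exactly where the second bytes of the extra Ukrainian letters land.

```python
# Byte values still listed under SHUNCHAR after the patch (0-31 and 127).
STILL_SHUNNED = set(range(0, 32)) | {127}

# Byte values the patch removes from SHUNCHAR (128-159).
NEWLY_ALLOWED = set(range(128, 160))


def bytes_in(text, byte_set):
    """Return the sorted distinct UTF-8 byte values of text found in byte_set."""
    return sorted(set(text.encode("utf-8")) & byte_set)


# The Ukrainian letters mentioned earlier in this thread.
letters = "іІєЄїЇґҐ"

for ch in letters:
    encoded = ch.encode("utf-8")
    print(
        ch,
        " ".join(f"{b:02x}" for b in encoded),
        "in 128-159:", bytes_in(ch, NEWLY_ALLOWED),
        "in 0-31/127:", bytes_in(ch, STILL_SHUNNED),
    )
```

If the byte-wise assumption holds, this suggests that keeping 0-31 and 127 in SHUNCHAR does no harm to UTF-8 text, and that removing 128-159 is the part that actually matters for Ukrainian.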