> > until we have real 32bit input slots
>
> I'm not sure what you expect here.  The way it's currently done is that
> characters with name "uNNNN" are used.
[Please read section `gtroff' internals for more information too.]

There are different levels.

. The first level is input characters -- a future groff version shall
  expect UTF-8, which is stored internally as 32bit values (element
  `c' in class `token', assigned in function `token::next').
  Currently, this is an `unsigned char', and all values derived from
  it (for hyphenation, for example) look the same.

. Next, still on the input level, we have the entity names GNU troff
  uses for further processing: `A:', `uNNNN', etc.  This is
  represented by the `charinfo' class, eventually collected with
  function `environment::add_char' to form the current line.  Again we
  have a bottleneck because of the simplistic `charset_table' array,
  mapping from input characters to the `charinfo' class, which expects
  256 elements.

. The topmost input level is class `token', which represents all data
  possible on the input side -- this includes both processed (for
  example, diversions) and unprocessed data.  Its job is to feed
  everything to the output at the right time.  I won't go into details
  here since no improvements are necessary.

> What is the need to use a 32-bit 'int' value for this instead
> (except for optimization - and optimizations come afterwards, after
> profiling)?

I hope I've answered your questions with the above explanations.  We
need both the named entity and its corresponding input character code.

> This first step is to make the treatment of the Unicode glyphs
> algorithmic rather than table-based.

I fully agree.

> _If_ tables are needed that the user needs to customize - the Asian
> double-width property comes to mind: it depends on the terminal
> emulator being used -

A different terminal emulator represents a different output device in
case there are different glyph widths (otherwise troff won't be able
to produce justified output).  Or do you mean something else?
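To make the `charset_table' bottleneck concrete, here is a minimal
sketch of how the fixed 256-entry array could give way to a sparse map
keyed by 32bit code points.  All names and signatures here are
invented for illustration; the real groff classes look different.

```cpp
// Sketch: a sparse replacement for the 256-element charset_table,
// keyed by a full 32-bit code point.  Illustrative names only.
#include <cstdint>
#include <map>
#include <string>

struct charinfo {            // stand-in for groff's charinfo class
  std::string name;          // entity name, e.g. "uNNNN" or "A:"
};

class charset_table {
  std::map<std::uint32_t, charinfo> table;  // sparse: no 256 limit
public:
  void define(std::uint32_t c, const charinfo &ci) { table[c] = ci; }
  charinfo *lookup(std::uint32_t c) {
    std::map<std::uint32_t, charinfo>::iterator it = table.find(c);
    return it == table.end() ? 0 : &it->second;
  }
};
```

A balanced tree is just one option; the point is only that lookup is
keyed by the full code point instead of an `unsigned char' index.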
> it should IMO be done through a specialized representation that is
> economic both in space in the font file format and in memory, rather
> than a representation that enumerates character after character.

This is what I mean with `classes', something like this in a font
description file, using two new sections:

  classes
  <Alike> = A :A 'A `A ... ;
  <CJKpunct> = U+3000 - U+303F;
  <Hiragana> = U+3040 - U+309F;
  ...
  <CJK> = <CJKpunct> <Hiragana> ... ;

  properties
  <CJK> width 24
  ...
  <Alike> kern V -3

I've no idea how to store such information efficiently within memory.
Maybe something similar to the `sparse arrays' as used in Emacs...
Suggestions?

> Also, do you think these glyph classes depend on the font, or only
> on the device to which the font belongs?

Glyph classes are a property of the font only.  Maybe it is useful to
provide a generic `glyphclass' file which provides default classes, to
be overridden in the particular font, but this is a refinement which
we can ignore now.

> Thanks for your agreement.  Then this will be the next step, after
> the patches that I've already submitted.

Excellent.

> Up to now I didn't even know that these were three different data
> types; I was only looking at the font class.

Aah, this explains the difficulties I have answering your questions in
a simple way.

> I assume, an element of a font - often called "index" - is a glyph.
> What is an "output character" then?

Just sloppy wording by me :-)  Well, TTY devices basically convert
troff glyphs back to output characters, but this is just nit-picking.


    Werner

_______________________________________________
Groff mailing list
Groff@gnu.org
http://lists.gnu.org/mailman/listinfo/groff
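P.S.  One economical in-memory shape for the ranged classes above
(<CJKpunct> = U+3000 - U+303F, etc.) might be sorted, non-overlapping
code point intervals with binary-search membership tests -- similar in
spirit to Emacs' sparse arrays.  A sketch, with all names invented:

```cpp
// Sketch: a glyph class stored as sorted, non-overlapping [lo, hi]
// code point ranges, tested by binary search.  Illustrative only.
#include <cstdint>
#include <vector>
#include <algorithm>

struct cp_range {
  std::uint32_t lo, hi;           // inclusive code point interval
};

class glyph_class {
  std::vector<cp_range> ranges;   // sorted by lo, non-overlapping
  static bool by_lo(const cp_range &a, const cp_range &b) {
    return a.lo < b.lo;
  }
public:
  void add_range(std::uint32_t lo, std::uint32_t hi) {
    cp_range r = { lo, hi };
    std::vector<cp_range>::iterator pos =
      std::lower_bound(ranges.begin(), ranges.end(), r, by_lo);
    ranges.insert(pos, r);        // caller guarantees no overlap
  }
  bool contains(std::uint32_t c) const {
    // first range with lo > c, then check its predecessor
    cp_range key = { c, c };
    std::vector<cp_range>::const_iterator it =
      std::upper_bound(ranges.begin(), ranges.end(), key, by_lo);
    if (it == ranges.begin())
      return false;
    --it;
    return c >= it->lo && c <= it->hi;
  }
};
```

Storage is then two 32bit values per range instead of one entry per
character, which is what "economic in memory" would buy here.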