> > until we have real 32bit input slots
>
> I'm not sure what you expect here.  The way it's currently done is that
> characters with name "uNNNN" are used.
[Please read section `gtroff' internals for more information too.]

There are different levels.

. The first level is input characters -- a future groff version shall
  expect UTF-8, which is stored internally as 32bit values (element
  `c' in class `token', assigned in function `token::next').
  Currently, this is an `unsigned char', and all values derived from
  it (for hyphenation, for example) look the same.

. Next, still on the input level, we have the entity names GNU troff
  uses for further processing: `A:', `uNNNN', etc.  This is
  represented by the `charinfo' class, eventually collected with
  function `environment::add_char' to form the current line.  Again we
  have a bottleneck because of the simplistic `charset_table' array,
  mapping from input characters to the `charinfo' class, which expects
  256 elements.

. The topmost input level is class `token', which represents all data
  possible on the input side -- this includes both processed (for
  example, diversions) and unprocessed data.  Its job is to feed
  everything to the output at the right time.  I won't go into details
  here since no improvements are necessary.

> What is the need to use a 32-bit 'int' value for this instead
> (except for optimization - and optimizations come afterwards, after
> profiling)?

I hope I've answered your questions with the above explanations.  We
need both the named entity and its corresponding input character code.

> This first step is to make the treatment of the Unicode glyphs
> algorithmic rather than table-based.

I fully agree.

> _If_ tables are needed that the user needs to customize - the Asian
> double-width property comes to mind: it depends on the terminal
> emulator being used -

A different terminal emulator represents a different output device in
case there are different glyph widths (otherwise troff won't be able
to produce justified output).  Or do you mean something else?
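To make the `charset_table' bottleneck concrete, here is a minimal
sketch of how the fixed 256-entry array could give way to a sparse map
keyed by 32bit code points.  All names and signatures here are
invented for illustration; the real groff classes look different.

```cpp
// Sketch: a sparse replacement for the 256-element charset_table,
// keyed by a full 32-bit code point.  Illustrative names only.
#include <cstdint>
#include <map>
#include <string>

struct charinfo {            // stand-in for groff's charinfo class
  std::string name;          // entity name, e.g. "uNNNN" or "A:"
};

class charset_table {
  std::map<std::uint32_t, charinfo> table;  // sparse: no 256 limit
public:
  void define(std::uint32_t c, const charinfo &ci) { table[c] = ci; }
  charinfo *lookup(std::uint32_t c) {
    std::map<std::uint32_t, charinfo>::iterator it = table.find(c);
    return it == table.end() ? 0 : &it->second;
  }
};
```

A balanced tree is just one option; the point is only that lookup is
keyed by the full code point instead of an `unsigned char' index.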
> it should IMO be done through a specialized representation that is
> economic both in space in the font file format and in memory, rather
> than a representation that enumerates character after character.

This is what I mean with `classes', something like this in a font
description file, using two new sections:

  classes
  <Alike> = A :A 'A `A ... ;
  <CJKpunct> = U+3000 - U+303F;
  <Hiragana> = U+3040 - U+309F;
  ...
  <CJK> = <CJKpunct> <Hiragana> ... ;

  properties
  <CJK> width 24
  ...
  <Alike> kern V -3

I've no idea how to store such information efficiently within memory.
Maybe something similar to the `sparse arrays' as used in Emacs...
Suggestions?

> Also, do you think these glyph classes depend on the font, or only
> on the device to which the font belongs?

Glyph classes are a property of the font only.  Maybe it is useful to
provide a generic `glyphclass' file which provides default classes, to
be overridden in the particular font, but this is a refinement which
we can ignore now.

> Thanks for your agreement.  Then this will be the next step, after
> the patches that I've already submitted.

Excellent.

> Up to now I didn't even know that these were three different data
> types; I was only looking at the font class.

Aah, this explains the difficulties I have answering your questions in
a simple way.

> I assume, an element of a font - often called "index" - is a glyph.
> What is an "output character" then?

Just sloppy wording by me :-)  Well, TTY devices basically convert
troff glyphs back to output characters, but this is just nit-picking.


    Werner

_______________________________________________
Groff mailing list
Groff@gnu.org
http://lists.gnu.org/mailman/listinfo/groff
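P.S.  One economical in-memory shape for the ranged classes above
(<CJKpunct> = U+3000 - U+303F, etc.) might be sorted, non-overlapping
code point intervals with binary-search membership tests -- similar in
spirit to Emacs' sparse arrays.  A sketch, with all names invented:

```cpp
// Sketch: a glyph class stored as sorted, non-overlapping [lo, hi]
// code point ranges, tested by binary search.  Illustrative only.
#include <cstdint>
#include <vector>
#include <algorithm>

struct cp_range {
  std::uint32_t lo, hi;           // inclusive code point interval
};

class glyph_class {
  std::vector<cp_range> ranges;   // sorted by lo, non-overlapping
  static bool by_lo(const cp_range &a, const cp_range &b) {
    return a.lo < b.lo;
  }
public:
  void add_range(std::uint32_t lo, std::uint32_t hi) {
    cp_range r = { lo, hi };
    std::vector<cp_range>::iterator pos =
      std::lower_bound(ranges.begin(), ranges.end(), r, by_lo);
    ranges.insert(pos, r);        // caller guarantees no overlap
  }
  bool contains(std::uint32_t c) const {
    // first range with lo > c, then check its predecessor
    cp_range key = { c, c };
    std::vector<cp_range>::const_iterator it =
      std::upper_bound(ranges.begin(), ranges.end(), key, by_lo);
    if (it == ranges.begin())
      return false;
    --it;
    return c >= it->lo && c <= it->hi;
  }
};
```

Storage is then two 32bit values per range instead of one entry per
character, which is what "economic in memory" would buy here.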