Re: [Groff] .hcode request with german umlauts inside utf8 input file

Ralph Corderoy Mon, 28 Jul 2014 06:27:08 -0700

Hi Carsten,

> Now the error message is "hyphenation code must be ordinary
> character". So I understand that the only correct file encoding for
> .hcode with umlauts is latin1 (ISO 8859-1)? Or is there any chance to
> use 7-bit input like \[uXXXX]?
> 
> $ printf ".hcode ä ä"|preconv -e utf-8|troff
> 
> Prints error "hyphenation code must be ordinary character"


No, it looks like you're right.  `info groff' says

    -- Request: .hcode c1 code1 [c2 code2 ...]

    Set the hyphenation code of character C1 to CODE1, that of C2 to
    CODE2, etc.  A hyphenation code must be a single input character
    (not a special character) other than a digit or a space.

    To make hyphenation work, hyphenation codes must be set up.  At
    start-up, groff only assigns hyphenation codes to the letters
    `a'-`z' (mapped to themselves) and to the letters `A'-`Z' (mapped to
    `a'-`z'); all other hyphenation codes are set to zero.  Normally,
    hyphenation patterns contain only lowercase letters which should be
    applied regardless of case.  In other words, the words `FOO' and
    `Foo' should be hyphenated exactly the same way as the word `foo' is
    hyphenated, and this is what `hcode' is good for.  Words which
    contain other letters won't be hyphenated properly if the
    corresponding hyphenation patterns actually do contain them.  For
    example, the following `hcode' requests are necessary to assign
    hyphenation codes to the letters `ÄäÖöÜüß' (this is needed for
    German):

        .hcode ä ä  Ä ä
        .hcode ö ö  Ö ö
        .hcode ü ü  Ü ü
        .hcode ß ß

    Without those assignments, groff treats German words like
    `Kindergärten' (the plural form of `kindergarten') as two substrings
    `kinderg' and `rten' because the hyphenation code of the umlaut a is
    zero by default.  There is a German hyphenation pattern which covers
    `kinder', so groff finds the hyphenation `kin-der'.  The other two
    hyphenation points (`kin-der-gär-ten') are missed.

    This request is ignored if it has no parameter.

So .hcode isn't happy with the \[uXXXX] special characters that preconv
is producing.

    $ echo .hcode ä ä | preconv -e utf-8
    .lf 1 -
    .hcode \[u00E4] \[u00E4]
    $

Werner, is it a preconv bug that it doesn't produce ISO-8859-1 (latin1)
output where possible, e.g. ä rather than \[u00E4], given that's groff's
default input encoding?  As it stands, preconv's output can't be used
with .hcode.

One could post-process preconv's output, provided \[u00..] never occurs
other than to mean a byte of that value.

    $ echo .hcode ä ä |
    > preconv -e utf-8 |
    > perl -pe 's/\\\[u00([\dABCDEF]{2})]/chr hex $1/ge' |
    > recode iso-8859-1..dump
    UCS2   Mne   Description

    002E   .     full stop
    006C   l     latin small letter l
    0066   f     latin small letter f
    0020   SP    space
    0031   1     digit one
    0020   SP    space
    002D   -     hyphen-minus
    000A   LF    line feed (lf)
    002E   .     full stop
    0068   h     latin small letter h
    0063   c     latin small letter c
    006F   o     latin small letter o
    0064   d     latin small letter d
    0065   e     latin small letter e
    0020   SP    space
    00E4   a:    latin small letter a with diaeresis
    0020   SP    space
    00E4   a:    latin small letter a with diaeresis
    000A   LF    line feed (lf)
    $
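
For anyone without Perl to hand, the same post-processing could be
sketched in Python.  This is only an illustration under the same
assumption as above: every \[u00XX] in the stream stands for the latin1
byte of that value.  The function name escapes_to_latin1 is made up for
this example, and unlike the Perl one-liner it only touches 80..FF,
since preconv appears to pass plain ASCII through unescaped.

```python
import re

# preconv-style escapes for U+0080..U+00FF, e.g. \[u00E4].
# Longer escapes such as \[u20AC] have no latin1 byte and are left alone.
_LATIN1_ESCAPE = re.compile(rb"\\\[u00([89A-F][0-9A-F])\]")


def escapes_to_latin1(data: bytes) -> bytes:
    # Replace each matching escape with the single byte it names.
    return _LATIN1_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), data)


# The line preconv produced for ".hcode ä ä" in the transcript above.
print(escapes_to_latin1(b".hcode \\[u00E4] \\[u00E4]\n"))
# prints b'.hcode \xe4 \xe4\n'
```

The result is latin1 bytes, so it would go straight to troff rather than
through preconv a second time.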

Cheers, Ralph.
