Torsten Bögershausen <[email protected]> writes:
> The function git_wcwidth() returns for a given unicode code point the
> width on the display:
> -1 for control characters,
> 0 for combining or other non-visible code points
> 1 for e.g. ASCII
> 2 for double-width code points.
>
> This table had been originally been extracted for one Unicode version,
> probably 3.2.
>
> Make it possible to update the to a later version of Unicode:
>
> Factor out the table from utf8.c into unicode_width.h and
> add the script update_unicode.sh to update the table based on the latest
> Unicode specification files.
>
> Thanks to
> Peter Krefting <[email protected]> and
> Kevin Bracey <[email protected]>
> for helping with their Unicode knowledge
>
> Signed-off-by: Torsten Bögershausen <[email protected]>
> ---
I would say this is a good idea, but a few nitpicks.
> diff --git a/unicode_width.h b/unicode_width.h
> new file mode 100644
> index 0000000..4db7803
> --- /dev/null
> +++ b/unicode_width.h
> @@ -0,0 +1,288 @@
Please update the script (and the resulting file) to caution against
misuse/mismanagement of this file by adding a comment to at least
state:
- This is a generated file and you are not supposed to modify it; and
- This is to be included only once from one place in the code and
that is why this does not use the usual #ifndef X/#define X/#endif
double-inclusion guards.
An obvious and viable alternative to the second would be to do the
usual double-inclusion guard. I do not have much preference either
way.
> +static const struct interval zero_width[] = {
> ...
> +};
> +static const struct interval double_width[] = {
> ...
> +};
> diff --git a/update_unicode.sh b/update_unicode.sh
> new file mode 100755
> index 0000000..000b937
> --- /dev/null
> +++ b/update_unicode.sh
> @@ -0,0 +1,37 @@
> +#!/bin/sh
> +#See http://www.unicode.org/reports/tr44/
> +#
> +#Me Enclosing_Mark an enclosing combining mark
> +#Mn Nonspacing_Mark a nonspacing combining mark (zero advance width)
> +#Cf Format a format control character
Please have a SP after # in these comments to make them readable?
> +#
> +UNICODEWIDTH_H=../unicode_width.h
> +if ! test -d unicode; then
> + mkdir unicode
> +fi &&
Style:
if ! test -d unicode
then
...
fi
> +( cd unicode &&
> + if ! test -f UnicodeData.txt; then
> + wget
> http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
> + fi &&
> + if ! test -f EastAsianWidth.txt; then
> + wget
> http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt
> + fi &&
> + if ! test -d uniset; then
> + git clone https://github.com/depp/uniset.git
> + fi &&
> + (
> + cd uniset &&
> + if ! test -x uniset; then
> + autoreconf -i &&
> + ./configure --enable-warnings=-Werror CFLAGS='-O0 -ggdb'
What are these "-O0 -ggdb" about?
> + fi &&
> + make
> + ) &&
> + echo "static const struct interval zero_width[] = {" >$UNICODEWIDTH_H &&
> + UNICODE_DIR=. ./uniset/uniset --32 cat:Me,Mn,Cf + U+1160..U+11FF -
> U+00AD |
> + grep -v plane >>$UNICODEWIDTH_H &&
> + echo "};" >>$UNICODEWIDTH_H &&
> + echo "static const struct interval double_width[] = {"
> >>$UNICODEWIDTH_H &&
> + UNICODE_DIR=. ./uniset/uniset --32 eaw:F,W >>$UNICODEWIDTH_H &&
> + echo "};" >>$UNICODEWIDTH_H
> +)
> @@ -147,7 +90,7 @@ static int git_wcwidth(ucs_char_t ch)
> return -1;
>
> /* binary search in table of non-spacing characters */
> - if (bisearch(ch, combining, sizeof(combining)
> + if (bisearch(ch, zero_width, sizeof(zero_width)
> / sizeof(struct interval) - 1))
To my eyes, that line looks folded at a funny place. I think it is
more conventional to fold after the operator, i.e.
if (bisearch(ch, zero_width, sizeof(zero_width) /
sizeof(struct interval) - 1))
but
if (bisearch(ch, zero_width,
sizeof(zero_width) / sizeof(struct interval) - 1))
may probably be a lot more logical and readable. Maybe it is just
me.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html