BCP 47 is an IETF specification for locale names. Since <https://en.wikipedia.org/wiki/IETF_language_tag> says that this specification is "used by computing standards such as HTTP, HTML, XML and PNG", it makes sense for Gnulib to support it.
Here comes a module 'bcp47' that provides support for it, in the form of conversion functions to and from the XPG syntax (generally used by glibc).

2024-10-03  Bruno Haible  <br...@clisp.org>

	bcp47: Add tests.
	* tests/test-bcp47.c: New file.
	* modules/bcp47-tests: New file.

	bcp47: New module.
	* lib/bcp47.h: New file.
	* lib/bcp47.c: New file.
	* modules/bcp47: New file.
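To give an idea of what such a conversion involves in the simplest case, here is a minimal standalone sketch. It is not the module's code: the function name `simple_xpg_to_bcp47` and its size parameter are hypothetical, and it only handles names for which no script needs to be inferred (it drops the codeset and the modifier and joins language and territory with a hyphen).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the patch: converts an XPG name of the
   form language[_territory][.codeset][@modifier] to "language[-territory]".
   The codeset and the modifier are dropped, and no script is inferred.
   Returns 0 on success, -1 if the language is empty or the result does not
   fit into OUT_SIZE bytes.  */
static int
simple_xpg_to_bcp47 (char *out, size_t out_size, const char *xpg)
{
  assert (out != NULL && xpg != NULL);

  /* The language ends at the first '_', '.', '@', or at the end.  */
  size_t language_len = strcspn (xpg, "_.@");
  if (language_len == 0)
    return -1;

  /* The territory, if any, follows '_' and ends at '.', '@', or the end.  */
  const char *territory = NULL;
  size_t territory_len = 0;
  if (xpg[language_len] == '_')
    {
      territory = xpg + language_len + 1;
      territory_len = strcspn (territory, ".@");
    }

  /* Make sure the result, including the terminating NUL, fits.  */
  size_t needed = language_len + (territory_len > 0 ? 1 + territory_len : 0);
  if (needed >= out_size)
    return -1;

  char *q = out;
  memcpy (q, xpg, language_len);
  q += language_len;
  if (territory_len > 0)
    {
      *q++ = '-';
      memcpy (q, territory, territory_len);
      q += territory_len;
    }
  *q = '\0';
  return 0;
}
```

With this sketch, "de_DE.UTF-8" becomes "de-DE". Names such as "zh_CN" or "sr_RS@latin" need the script handling that the actual module implements.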
From b1648c71c33eaf25bf346871950bd25373734da4 Mon Sep 17 00:00:00 2001
From: Bruno Haible <br...@clisp.org>
Date: Thu, 3 Oct 2024 20:45:08 +0200
Subject: [PATCH 1/2] bcp47: New module.

* lib/bcp47.h: New file.
* lib/bcp47.c: New file.
* modules/bcp47: New file.
---
 ChangeLog     |   7 +
 lib/bcp47.c   | 626 ++++++++++++++++++++++++++++++++++++++++++++++++++
 lib/bcp47.h   |  75 ++++++
 modules/bcp47 |  24 ++
 4 files changed, 732 insertions(+)
 create mode 100644 lib/bcp47.c
 create mode 100644 lib/bcp47.h
 create mode 100644 modules/bcp47

diff --git a/ChangeLog b/ChangeLog
index 72a40a6ceb..bbf15b7064 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,10 @@
+2024-10-03  Bruno Haible  <br...@clisp.org>
+
+	bcp47: New module.
+	* lib/bcp47.h: New file.
+	* lib/bcp47.c: New file.
+	* modules/bcp47: New file.
+
 2024-10-02  Collin Funk  <collin.fu...@gmail.com>
 
 	error, verror: Don't call va_end twice.
diff --git a/lib/bcp47.c b/lib/bcp47.c
new file mode 100644
index 0000000000..8008ac030d
--- /dev/null
+++ b/lib/bcp47.c
@@ -0,0 +1,626 @@
+/* Support for locale names in BCP 47 syntax.
+   Copyright (C) 2024 Free Software Foundation, Inc.
+
+   This file is free software: you can redistribute it and/or modify
+   it under the terms of the GNU Lesser General Public License as
+   published by the Free Software Foundation, either version 3 of the
+   License, or (at your option) any later version.
+
+   This file is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public License
+   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */
+
+/* Written by Bruno Haible <br...@clisp.org>, 2024.  */
+
+#include <config.h>
+
+/* Specification.  */
+#include "bcp47.h"
+
+#include <string.h>
+
+#include "c-ctype.h"
+
+/* The set of XPG locale names is historically grown and emphasizes the region
+   over the script.  In fact, it uses the script only to disambiguate locales
+   with the same region.
+   The BCP 47 locale names, on the other hand, emphasize the script over the
+   region.
+
+   Therefore we add special treatment of all languages that can be written
+   using different scripts:
+     - During XPG to BCP 47 conversion, we add the script if not present,
+       inferring it from the region.
+     - During BCP 47 to XPG conversion, when a region is provided, we remove
+       the script if doing so produces a known locale name (i.e. a locale name
+       present in glibc, since glibc has the most complete set of locales).
+
+   This affects the following languages:
+     - Azerbaijani (az): Latin in Azerbaijan, Arabic in Iran.
+       <https://en.wikipedia.org/wiki/Azerbaijani_language>
+     - Belarusian (be): Assume Cyrillic by default, but Latin exists as well.
+       <https://en.wikipedia.org/wiki/Belarusian_language#Alphabet>
+     - Tamazight / Berber (ber): Assume Latin by default, but Arabic exists
+       as well.
+       <https://en.wikipedia.org/wiki/Berber_languages>
+       <https://en.wikipedia.org/wiki/Berber_Latin_alphabet>
+       <https://en.wikipedia.org/wiki/Tifinagh>
+     - Bosnian (bs): Assume Latin by default, but Cyrillic exists as well.
+       <https://en.wikipedia.org/wiki/Bosnian_language>
+     - Hausa (ha): Assume Latin by default, but Arabic exists as well.
+       <https://en.wikipedia.org/wiki/Hausa_language>
+       <https://en.wikipedia.org/wiki/Boko_alphabet>
+     - Inuktitut (iu): Assume Inuktitut syllabics by default, but Latin
+       exists as well.
+       <https://en.wikipedia.org/wiki/Inuktitut#Writing>
+       <https://en.wikipedia.org/wiki/Inuktitut_syllabics>
+     - Kazakh (kk): Currently (2024) Cyrillic by default, but migrating to
+       Latin.
+       <https://en.wikipedia.org/wiki/Kazakh_language>
+     - Kashmiri (ks): Assume Arabic by default, but Devanagari exists as well.
+       <https://en.wikipedia.org/wiki/Kashmiri_language>
+     - Kurdish (ku): Latin in Türkiye and Syria, Arabic in Iraq and Iran.
+       <https://en.wikipedia.org/wiki/Kurdish_language>
+     - Mongolian (mn): Currently (2024) mainly Cyrillic, but the vertically
+       written Mongolian script is also in use.
+       <https://en.wikipedia.org/wiki/Mongolian_language>
+     - Min Nan Chinese (nan): Assume Traditional Chinese by default, but Latin
+       exists as well.
+       <https://en.wikipedia.org/wiki/Southern_Min>
+     - Punjabi (pa): Arabic in Pakistan, Gurmukhi in India.
+       <https://en.wikipedia.org/wiki/Punjabi_language>
+     - Sindhi (sd): Arabic in Pakistan, assume Arabic in India as well, but
+       Devanagari exists in India too.
+       <https://en.wikipedia.org/wiki/Sindhi_language#Writing_systems>
+     - Serbian (sr): Assume Cyrillic by default, but Latin exists as well.
+       <https://en.wikipedia.org/wiki/Serbian_language>
+     - Uzbek (uz): Assume Latin by default, but Cyrillic exists as well.
+       <https://en.wikipedia.org/wiki/Uzbek_language>
+     - Yiddish (yi): Assume Hebrew by default, but Latin exists as well.
+       <https://en.wikipedia.org/wiki/Yiddish>
+     - Chinese (zh): Simplified Chinese in PRC and Singapore,
+       Traditional Chinese elsewhere.
+       <https://en.wikipedia.org/wiki/Chinese_language>
+ */
+
+
+struct script
+{
+  char name[12]; /* Script name, lowercased, NUL-terminated */
+  char code[4];  /* Script code, not NUL-terminated */
+};
+
+/* Table of script names and four-letter script codes.
+   The codes are taken from <https://en.wikipedia.org/wiki/ISO_15924> or
+   <https://unicode.org/iso15924/iso15924-codes.html>.  */
+static const struct script scripts[] =
+{
+  { "latin",      "Latn" },
+  { "cyrillic",   "Cyrl" },
+  { "hebrew",     "Hebr" },
+  { "arabic",     "Arab" },
+  { "devanagari", "Deva" },
+  { "gurmukhi",   "Guru" },
+  { "mongolian",  "Mong" }
+};
+#define NUM_SCRIPTS (sizeof (scripts) / sizeof (scripts[0]))
+
+
+void
+xpg_to_bcp47 (char *bcp47, const char *xpg)
+{
+  /* Special cases.  */
+  if (strcmp (xpg, "") == 0)
+    fail:
+    {
+      strcpy (bcp47, "und");
+      return;
+    }
+  if ((xpg[0] == 'C' && (xpg[1] == '\0' || xpg[1] == '.'))
+      || strcmp (xpg, "POSIX") == 0)
+    {
+      /* The "C" (or "C.UTF-8") and "POSIX" locales most closely resemble the
+         "en_US" locale.  */
+      strcpy (bcp47, "en-US");
+      return;
+    }
+
+  /* Parse XPG as language[_territory][.codeset][@modifier].  */
+  const char *language_start = NULL;
+  size_t language_len = 0;
+  const char *territory_start = NULL;
+  size_t territory_len = 0;
+  const char *modifier_start = NULL;
+  size_t modifier_len = 0;
+
+  {
+    const char *p;
+
+    p = xpg;
+    language_start = p;
+    while (*p != '\0' && *p != '_' && *p != '.' && *p != '@')
+      p++;
+    language_len = p - language_start;
+    if (*p == '_')
+      {
+        p++;
+        territory_start = p;
+        while (*p != '\0' && *p != '.' && *p != '@')
+          p++;
+        territory_len = p - territory_start;
+      }
+    if (*p == '.')
+      {
+        p++;
+        while (*p != '\0' && *p != '@')
+          p++;
+      }
+    if (*p == '@')
+      {
+        p++;
+        modifier_start = p;
+        while (*p != '\0')
+          p++;
+        modifier_len = p - modifier_start;
+      }
+  }
+
+  if (language_len == 0)
+    /* No language -> fail.  */
+    goto fail;
+
+  /* Canonicalize the language.  */
+  /* For Quechua, Microsoft uses the ISO 639-3 code "quz" instead of the
+     ISO 639-1 code "qu".  */
+  if (language_len == 3 && memcmp (language_start, "quz", 3) == 0)
+    {
+      language_start = "qu";
+      language_len = 2;
+    }
+  /* For Tamazight, Microsoft uses the ISO 639-3 code "tzm" instead of the
+     ISO 639-2 code "ber".  */
+  else if (language_len == 3 && memcmp (language_start, "tzm", 3) == 0)
+    {
+      language_start = "ber";
+      language_len = 3;
+    }
+
+  const char *script_subtag = NULL;
+
+  /* Determine script from the modifier.  */
+  if (modifier_len > 0)
+    {
+      size_t i;
+      for (i = 0; i < NUM_SCRIPTS; i++)
+        if (strlen (scripts[i].name) == modifier_len
+            && memcmp (scripts[i].name, modifier_start, modifier_len) == 0)
+          script_subtag = scripts[i].code;
+    }
+
+  /* Determine script from the language and possibly the territory.  */
+  if (language_len > 0 && script_subtag == NULL)
+    {
+      /* Languages with a script that depends on the territory.  */
+      if (territory_len > 0)
+        {
+          if (language_len == 2)
+            {
+              if (memcmp (language_start, "az", 2) == 0)
+                {
+                  if (territory_len == 2)
+                    {
+                      if (memcmp (territory_start, "AZ", 2) == 0)
+                        script_subtag = "Latn";
+                      else if (memcmp (territory_start, "IR", 2) == 0)
+                        script_subtag = "Arab";
+                    }
+                }
+              else if (memcmp (language_start, "ku", 2) == 0)
+                {
+                  if (territory_len == 2)
+                    {
+                      if (memcmp (territory_start, "IQ", 2) == 0
+                          || memcmp (territory_start, "IR", 2) == 0)
+                        script_subtag = "Arab";
+                      else if (memcmp (territory_start, "SY", 2) == 0
+                               || memcmp (territory_start, "TR", 2) == 0)
+                        script_subtag = "Latn";
+                    }
+                }
+              else if (memcmp (language_start, "pa", 2) == 0)
+                {
+                  if (territory_len == 2)
+                    {
+                      if (memcmp (territory_start, "PK", 2) == 0)
+                        script_subtag = "Arab";
+                      else if (memcmp (territory_start, "IN", 2) == 0)
+                        script_subtag = "Guru";
+                    }
+                }
+              else if (memcmp (language_start, "zh", 2) == 0)
+                {
+                  if (territory_len == 2)
+                    {
+                      if (memcmp (territory_start, "CN", 2) == 0
+                          || memcmp (territory_start, "SG", 2) == 0)
+                        script_subtag = "Hans";
+                      else
+                        script_subtag = "Hant";
+                    }
+                }
+            }
+        }
+      /* Languages with a main script and one or more alternate scripts.  */
+      if (language_len == 2)
+        {
+          if (memcmp (language_start, "be", 2) == 0)
+            script_subtag = "Cyrl";
+          else if (memcmp (language_start, "bs", 2) == 0)
+            script_subtag = "Latn";
+          else if (memcmp (language_start, "ha", 2) == 0)
+            script_subtag = "Latn";
+          else if (memcmp (language_start, "iu", 2) == 0)
+            script_subtag = "Cans";
+          else if (memcmp (language_start, "kk", 2) == 0)
+            script_subtag = "Cyrl";
+          else if (memcmp (language_start, "ks", 2) == 0)
+            script_subtag = "Arab";
+          else if (memcmp (language_start, "mn", 2) == 0)
+            script_subtag = "Cyrl";
+          else if (memcmp (language_start, "sd", 2) == 0)
+            script_subtag = "Arab";
+          else if (memcmp (language_start, "sr", 2) == 0)
+            script_subtag = "Cyrl";
+          else if (memcmp (language_start, "uz", 2) == 0)
+            script_subtag = "Latn";
+          else if (memcmp (language_start, "yi", 2) == 0)
+            script_subtag = "Hebr";
+        }
+      else if (language_len == 3)
+        {
+          if (memcmp (language_start, "ber", 3) == 0)
+            script_subtag = "Latn";
+          else if (memcmp (language_start, "nan", 3) == 0)
+            script_subtag = "Hant";
+        }
+    }
+
+  /* Construct the result: language[-script][-territory].  */
+  if (language_len
+      + (script_subtag != NULL ? 1 + 4 : 0)
+      + (territory_len > 0 ? 1 + territory_len : 0)
+      < BCP47_MAX)
+    {
+      char *q = bcp47;
+      memcpy (q, language_start, language_len);
+      q += language_len;
+      if (script_subtag != NULL)
+        {
+          *q++ = '-';
+          memcpy (q, script_subtag, 4);
+          q += 4;
+        }
+      if (territory_len > 0)
+        {
+          *q++ = '-';
+          memcpy (q, territory_start, territory_len);
+          q += territory_len;
+        }
+      *q = '\0';
+      return;
+    }
+  else
+    goto fail;
+}
+
+void
+bcp47_to_xpg (char *xpg, const char *bcp47, const char *codeset)
+{
+  /* Special cases.  */
+  if (strcmp (bcp47, "") == 0)
+    fail:
+    {
+      strcpy (xpg, "");
+      return;
+    }
+
+  /* Parse BCP47 as
+     language{-extlang}*[-script][-region]{-variant}*{-extension}*.  */
+  const char *language_start = NULL;
+  size_t language_len = 0;
+  const char *script_start = NULL;
+  size_t script_len = 0;
+  const char *region_start = NULL;
+  size_t region_len = 0;
+
+  {
+    bool past_script = false;
+    bool past_region = false;
+    const char *p;
+
+    p = bcp47;
+    language_start = p;
+    while (*p != '\0' && *p != '-')
+      p++;
+    language_len = p - language_start;
+    while (*p != '\0')
+      {
+        if (*p == '-')
+          {
+            p++;
+            const char *subtag_start = p;
+            while (*p != '\0' && *p != '-')
+              p++;
+            size_t subtag_len = p - subtag_start;
+
+            if (!past_script && subtag_len == 4)
+              {
+                /* Parsed -script.  */
+                script_start = subtag_start;
+                script_len = subtag_len;
+                past_script = true;
+              }
+            else if (!past_region
+                     && (subtag_len == 2
+                         || (subtag_len == 3
+                             && subtag_start[0] >= '0' && subtag_start[0] <= '9'
+                             && subtag_start[1] >= '0' && subtag_start[1] <= '9'
+                             && subtag_start[2] >= '0' && subtag_start[2] <= '9')))
+              {
+                /* Parsed -region.  */
+                region_start = subtag_start;
+                region_len = subtag_len;
+                past_region = true;
+                past_script = true;
+              }
+            else
+              {
+                /* Is it -extlang or -variant or -extension?  */
+                if (!past_script && subtag_len == 3)
+                  {
+                    /* It is -extlang.  */
+                  }
+                else
+                  {
+                    /* It must be -variant or -extension.  */
+                    past_script = true;
+                    past_region = true;
+                  }
+              }
+          }
+      }
+  }
+
+  if (language_len == 0 || language_len >= BCP47_MAX)
+    /* No language or too long -> fail.  */
+    goto fail;
+
+  /* Copy the language to the result buffer, converting it to lower case.  */
+  {
+    size_t i;
+    for (i = 0; i < language_len; i++)
+      xpg[i] = c_tolower (language_start[i]);
+  }
+
+  /* Canonicalize the language.  */
+  /* For Quechua, Microsoft uses the ISO 639-3 code "quz" instead of the
+     ISO 639-1 code "qu".  */
+  if (language_len == 3 && memcmp (xpg, "quz", 3) == 0)
+    {
+      language_len = 2;
+      memcpy (xpg, "qu", language_len);
+    }
+  /* For Tamazight, Microsoft uses the ISO 639-3 code "tzm" instead of the
+     ISO 639-2 code "ber".  */
+  else if (language_len == 3 && memcmp (xpg, "tzm", 3) == 0)
+    {
+      language_len = 3;
+      memcpy (xpg, "ber", language_len);
+    }
+
+  /* Copy the region to a temporary buffer, converting it to upper case.  */
+  char territory[3];
+  size_t territory_len = region_len; /* == 2 or 3 */
+  {
+    size_t i;
+    for (i = 0; i < region_len; i++)
+      territory[i] = c_toupper (region_start[i]);
+  }
+
+  /* Determine script from the script subtag.  */
+  const char *script = NULL;
+
+  if (script_len > 0)
+    {
+      /* Here script_len == 4.  */
+      size_t i;
+      for (i = 0; i < NUM_SCRIPTS; i++)
+        if (c_toupper (script_start[0]) == scripts[i].code[0]
+            && c_tolower (script_start[1]) == scripts[i].code[1]
+            && c_tolower (script_start[2]) == scripts[i].code[2]
+            && c_tolower (script_start[3]) == scripts[i].code[3])
+          script = scripts[i].name;
+    }
+
+  /* Possibly strip away the script, depending on the language and possibly
+     the territory.  */
+  if (script != NULL)
+    {
+      /* Languages with a script that depends on the territory.  */
+      if (territory_len > 0)
+        {
+          if (language_len == 2)
+            {
+              if (memcmp (xpg, "az", 2) == 0)
+                {
+                  if (territory_len == 2)
+                    {
+                      if (memcmp (territory, "AZ", 2) == 0)
+                        {
+                          if (strcmp (script, "latin") == 0)
+                            script = NULL;
+                        }
+                      else if (memcmp (territory, "IR", 2) == 0)
+                        {
+                          if (strcmp (script, "arabic") == 0)
+                            script = NULL;
+                        }
+                    }
+                }
+              else if (memcmp (xpg, "ku", 2) == 0)
+                {
+                  if (territory_len == 2)
+                    {
+                      if (memcmp (territory, "IQ", 2) == 0
+                          || memcmp (territory, "IR", 2) == 0)
+                        {
+                          if (strcmp (script, "arabic") == 0)
+                            script = NULL;
+                        }
+                      else if (memcmp (territory, "SY", 2) == 0
+                               || memcmp (territory, "TR", 2) == 0)
+                        {
+                          if (strcmp (script, "latin") == 0)
+                            script = NULL;
+                        }
+                    }
+                }
+              else if (memcmp (xpg, "pa", 2) == 0)
+                {
+                  if (territory_len == 2)
+                    {
+                      if (memcmp (territory, "PK", 2) == 0)
+                        {
+                          if (strcmp (script, "arabic") == 0)
+                            script = NULL;
+                        }
+                      else if (memcmp (territory, "IN", 2) == 0)
+                        {
+                          if (strcmp (script, "gurmukhi") == 0)
+                            script = NULL;
+                        }
+                    }
+                }
+              else if (memcmp (xpg, "zh", 2) == 0)
+                {
+                  /* "Hans" and "Hant" are not present in the scripts[] table,
+                     therefore nothing to do here.  */
+                }
+            }
+        }
+      /* Languages with a main script and one or more alternate scripts.  */
+      if (language_len == 2)
+        {
+          if (memcmp (xpg, "be", 2) == 0)
+            {
+              if (strcmp (script, "cyrillic") == 0)
+                script = NULL;
+            }
+          else if (memcmp (xpg, "bs", 2) == 0)
+            {
+              if (strcmp (script, "latin") == 0)
+                script = NULL;
+            }
+          else if (memcmp (xpg, "ha", 2) == 0)
+            {
+              if (strcmp (script, "latin") == 0)
+                script = NULL;
+            }
+          else if (memcmp (xpg, "iu", 2) == 0)
+            {
+              /* "Cans" is not present in the scripts[] table,
+                 therefore nothing to do here.  */
+            }
+          else if (memcmp (xpg, "kk", 2) == 0)
+            {
+              if (strcmp (script, "cyrillic") == 0)
+                script = NULL;
+            }
+          else if (memcmp (xpg, "ks", 2) == 0)
+            {
+              if (strcmp (script, "arabic") == 0)
+                script = NULL;
+            }
+          else if (memcmp (xpg, "mn", 2) == 0)
+            {
+              if (strcmp (script, "cyrillic") == 0)
+                script = NULL;
+            }
+          else if (memcmp (xpg, "sd", 2) == 0)
+            {
+              if (strcmp (script, "arabic") == 0)
+                script = NULL;
+            }
+          else if (memcmp (xpg, "sr", 2) == 0)
+            {
+              if (strcmp (script, "cyrillic") == 0)
+                script = NULL;
+            }
+          else if (memcmp (xpg, "uz", 2) == 0)
+            {
+              if (strcmp (script, "latin") == 0)
+                script = NULL;
+            }
+          else if (memcmp (xpg, "yi", 2) == 0)
+            {
+              if (strcmp (script, "hebrew") == 0)
+                script = NULL;
+            }
+        }
+      else if (language_len == 3)
+        {
+          if (memcmp (xpg, "ber", 3) == 0)
+            {
+              if (strcmp (script, "latin") == 0)
+                script = NULL;
+            }
+          else if (memcmp (xpg, "nan", 3) == 0)
+            {
+              /* "Hant" is not present in the scripts[] table,
+                 therefore nothing to do here.  */
+            }
+        }
+    }
+
+  /* The modifier is the script.  */
+  const char *modifier = script;
+
+  /* Construct the result: language[_territory][.codeset][@modifier].  */
+  size_t codeset_len = (codeset != NULL ? strlen (codeset) : 0);
+  size_t modifier_len = (modifier != NULL ? strlen (modifier) : 0);
+  if (language_len
+      + (territory_len > 0 ? 1 + territory_len : 0)
+      + (codeset != NULL ? 1 + codeset_len : 0)
+      + (modifier != NULL ? 1 + modifier_len : 0)
+      < BCP47_MAX)
+    {
+      char *q = xpg;
+      q += language_len;
+      if (territory_len > 0)
+        {
+          *q++ = '_';
+          memcpy (q, territory, territory_len);
+          q += territory_len;
+        }
+      if (codeset != NULL)
+        {
+          *q++ = '.';
+          memcpy (q, codeset, codeset_len);
+          q += codeset_len;
+        }
+      if (modifier != NULL)
+        {
+          *q++ = '@';
+          memcpy (q, modifier, modifier_len);
+          q += modifier_len;
+        }
+      *q = '\0';
+      return;
+    }
+  else
+    goto fail;
+}
diff --git a/lib/bcp47.h b/lib/bcp47.h
new file mode 100644
index 0000000000..3a044dfd29
--- /dev/null
+++ b/lib/bcp47.h
@@ -0,0 +1,75 @@
+/* Support for locale names in BCP 47 syntax.
+   Copyright (C) 2024 Free Software Foundation, Inc.
+
+   This file is free software: you can redistribute it and/or modify
+   it under the terms of the GNU Lesser General Public License as
+   published by the Free Software Foundation, either version 3 of the
+   License, or (at your option) any later version.
+
+   This file is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public License
+   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */
+
+/* Written by Bruno Haible <br...@clisp.org>, 2024.  */
+
+#ifndef _BCP47_H
+#define _BCP47_H
+
+/* A locale name can exist in three possible forms:
+
+   * The XPG syntax
+       language[_territory][.codeset][@modifier]
+     where
+       - The language is an ISO 639 (two-letter) language code.
+       - The territory is an ISO 3166 (two-letter) country code.
+       - The codeset is typically UTF-8.
+       - The supported @modifiers are usually something like
+           @euro
+           a script indicator, such as: @latin, @cyrillic, @devanagari
+
+   * The locale name understood by setlocale().
+     On glibc and many other Unix-like systems, this is the XPG syntax.
+     On native Windows, it is similar to XPG syntax, with English names
+     (instead of ISO codes) for the language and territory and with a
+     number for the codeset (e.g. 65001 for UTF-8).
+
+   * The BCP 47 syntax
+       language[-script][-region]{-variant}*{-extension}*
+     defined in
+       <https://www.ietf.org/rfc/bcp/bcp47.html>
+       = <https://www.rfc-editor.org/rfc/bcp/bcp47.txt>
+     which consists of RFC 5646 and RFC 4647.
+     See also <https://en.wikipedia.org/wiki/IETF_language_tag>.
+     Note: The BCP 47 syntax does not include a codeset.
+
+   This file provides conversions between the XPG syntax and the BCP 47
+   syntax.  */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/* Required size of buffer for a locale name.  */
+#define BCP47_MAX 100
+
+/* Converts a locale name in XPG syntax to a locale name in BCP 47 syntax.
+   Returns the result in bcp47, which must be at least BCP47_MAX bytes
+   large.  */
+extern void xpg_to_bcp47 (char *bcp47, const char *xpg);
+
+/* Converts a locale name in BCP 47 syntax (optionally with a codeset)
+   to a locale name in XPG syntax.
+   The specified codeset may be NULL.
+   Returns the result in xpg, which must be at least BCP47_MAX bytes
+   large.  */
+extern void bcp47_to_xpg (char *xpg, const char *bcp47, const char *codeset);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _BCP47_H */
diff --git a/modules/bcp47 b/modules/bcp47
new file mode 100644
index 0000000000..ca0ac88287
--- /dev/null
+++ b/modules/bcp47
@@ -0,0 +1,24 @@
+Description:
+Support for locale names in BCP 47 syntax.
+
+Files:
+lib/bcp47.h
+lib/bcp47.c
+
+Depends-on:
+stdbool
+c-ctype
+
+configure.ac:
+
+Makefile.am:
+lib_SOURCES += bcp47.c
+
+Include:
+"bcp47.h"
+
+License:
+LGPL
+
+Maintainer:
+all
-- 
2.34.1
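The BCP 47 parsing step in bcp47_to_xpg recognizes subtags mostly by their length. The following standalone sketch shows that idea in isolation. The names `classify_subtag` and `find_script_and_region` are hypothetical, and the sketch is deliberately simpler than the patch: it does not distinguish extlang, variant, and extension subtags, so it is only meant for tags of the form language[-script][-region].

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration; not part of the patch.  */
enum subtag_kind { SUBTAG_SCRIPT, SUBTAG_REGION, SUBTAG_OTHER };

/* Classifies a subtag that follows the language, purely by its shape:
   4 characters -> script, 2 characters or 3 digits -> region.  */
static enum subtag_kind
classify_subtag (const char *subtag, size_t len)
{
  assert (subtag != NULL);
  if (len == 4)
    return SUBTAG_SCRIPT;
  if (len == 2)
    return SUBTAG_REGION;
  if (len == 3
      && subtag[0] >= '0' && subtag[0] <= '9'
      && subtag[1] >= '0' && subtag[1] <= '9'
      && subtag[2] >= '0' && subtag[2] <= '9')
    return SUBTAG_REGION;
  return SUBTAG_OTHER;
}

/* Finds the first script and the first region subtag in TAG.
   Stores pointers into TAG (NULL if absent) and the subtag lengths.
   A script is only accepted if it comes before the region.  */
static void
find_script_and_region (const char *tag,
                        const char **script, size_t *script_len,
                        const char **region, size_t *region_len)
{
  *script = NULL;
  *script_len = 0;
  *region = NULL;
  *region_len = 0;

  /* Skip the language subtag.  */
  const char *p = tag + strcspn (tag, "-");
  while (*p == '-')
    {
      const char *start = p + 1;
      size_t len = strcspn (start, "-");
      switch (classify_subtag (start, len))
        {
        case SUBTAG_SCRIPT:
          if (*script == NULL && *region == NULL)
            {
              *script = start;
              *script_len = len;
            }
          break;
        case SUBTAG_REGION:
          if (*region == NULL)
            {
              *region = start;
              *region_len = len;
            }
          break;
        case SUBTAG_OTHER:
          break;
        }
      p = start + len;
    }
}
```

For "az-Cyrl-AZ" this yields the script "Cyrl" and the region "AZ"; for "es-419" it yields no script and the numeric region "419".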
From 588b8d0c0cfe366c4fa24a367767b385edb62fec Mon Sep 17 00:00:00 2001
From: Bruno Haible <br...@clisp.org>
Date: Thu, 3 Oct 2024 20:45:59 +0200
Subject: [PATCH 2/2] bcp47: Add tests.

* tests/test-bcp47.c: New file.
* modules/bcp47-tests: New file.
---
 ChangeLog           |   4 +
 modules/bcp47-tests |  11 +++
 tests/test-bcp47.c  | 200 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 215 insertions(+)
 create mode 100644 modules/bcp47-tests
 create mode 100644 tests/test-bcp47.c

diff --git a/ChangeLog b/ChangeLog
index bbf15b7064..22a40c3d58 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,9 @@
 2024-10-03  Bruno Haible  <br...@clisp.org>
 
+	bcp47: Add tests.
+	* tests/test-bcp47.c: New file.
+	* modules/bcp47-tests: New file.
+
 	bcp47: New module.
 	* lib/bcp47.h: New file.
 	* lib/bcp47.c: New file.
diff --git a/modules/bcp47-tests b/modules/bcp47-tests
new file mode 100644
index 0000000000..9abd149a93
--- /dev/null
+++ b/modules/bcp47-tests
@@ -0,0 +1,11 @@
+Files:
+tests/test-bcp47.c
+tests/macros.h
+
+Depends-on:
+
+configure.ac:
+
+Makefile.am:
+TESTS += test-bcp47
+check_PROGRAMS += test-bcp47
diff --git a/tests/test-bcp47.c b/tests/test-bcp47.c
new file mode 100644
index 0000000000..c0efeea3a7
--- /dev/null
+++ b/tests/test-bcp47.c
@@ -0,0 +1,200 @@
+/* Test support for locale names in BCP 47 syntax.
+   Copyright (C) 2024 Free Software Foundation, Inc.
+
+   This file is free software: you can redistribute it and/or modify
+   it under the terms of the GNU Lesser General Public License as
+   published by the Free Software Foundation, either version 3 of the
+   License, or (at your option) any later version.
+
+   This file is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public License
+   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */
+
+/* Written by Bruno Haible <br...@clisp.org>, 2024.  */
+
+#include <config.h>
+
+#include "bcp47.h"
+
+#include <string.h>
+
+#include "macros.h"
+
+static void
+test_correspondence (const char *xpg, const char *bcp47)
+{
+  /* Test xpg_to_bcp47.  */
+  {
+    char buf[BCP47_MAX];
+    memset (buf, 0x77, BCP47_MAX);
+
+    xpg_to_bcp47 (buf, xpg);
+    ASSERT (strcmp (buf, bcp47) == 0);
+  }
+
+  /* Test bcp47_to_xpg.  */
+  {
+    char buf[BCP47_MAX];
+    memset (buf, 0x77, BCP47_MAX);
+
+    bcp47_to_xpg (buf, bcp47, NULL);
+    ASSERT (strcmp (buf, xpg) == 0);
+  }
+}
+
+int
+main ()
+{
+  /* Languages with a single script.  */
+
+  test_correspondence ("de", "de");
+  test_correspondence ("de_DE", "de-DE");
+  test_correspondence ("de_AT", "de-AT");
+
+  /* Languages with a script that depends on the territory.  */
+
+  test_correspondence ("az_AZ", "az-Latn-AZ");
+  test_correspondence ("az_AZ@cyrillic", "az-Cyrl-AZ");
+  test_correspondence ("az_IR", "az-Arab-IR");
+
+  test_correspondence ("ku_IQ", "ku-Arab-IQ");
+  test_correspondence ("ku_IR", "ku-Arab-IR");
+  test_correspondence ("ku_SY", "ku-Latn-SY");
+  test_correspondence ("ku_TR", "ku-Latn-TR");
+
+  test_correspondence ("pa_PK", "pa-Arab-PK");
+  test_correspondence ("pa_IN", "pa-Guru-IN");
+
+  test_correspondence ("zh_CN", "zh-Hans-CN");
+  test_correspondence ("zh_HK", "zh-Hant-HK");
+  test_correspondence ("zh_MO", "zh-Hant-MO");
+  test_correspondence ("zh_SG", "zh-Hans-SG");
+  test_correspondence ("zh_TW", "zh-Hant-TW");
+
+  /* Languages with a main script and one or more alternate scripts.  */
+
+  test_correspondence ("be_BY", "be-Cyrl-BY");
+  test_correspondence ("be_BY@latin", "be-Latn-BY");
+
+  test_correspondence ("ber@arabic", "ber-Arab");
+  test_correspondence ("ber", "ber-Latn");
+  test_correspondence ("ber_DZ", "ber-Latn-DZ");
+  test_correspondence ("ber_MA", "ber-Latn-MA");
+
+  test_correspondence ("bs_BA", "bs-Latn-BA");
+  test_correspondence ("bs_BA@cyrillic", "bs-Cyrl-BA");
+
+  test_correspondence ("ha_NG", "ha-Latn-NG");
+  test_correspondence ("ha_NG@arabic", "ha-Arab-NG");
+
+  test_correspondence ("iu_CA", "iu-Cans-CA");
+  test_correspondence ("iu_CA@latin", "iu-Latn-CA");
+
+  test_correspondence ("kk_KZ", "kk-Cyrl-KZ");
+  test_correspondence ("kk_KZ@latin", "kk-Latn-KZ");
+
+  test_correspondence ("ks_IN", "ks-Arab-IN");
+  test_correspondence ("ks_IN@devanagari", "ks-Deva-IN");
+
+  test_correspondence ("mn_MN", "mn-Cyrl-MN");
+  test_correspondence ("mn_MN@mongolian", "mn-Mong-MN");
+
+  test_correspondence ("nan_TW", "nan-Hant-TW");
+  test_correspondence ("nan_TW@latin", "nan-Latn-TW");
+
+  test_correspondence ("sd_PK", "sd-Arab-PK");
+  test_correspondence ("sd_IN", "sd-Arab-IN");
+  test_correspondence ("sd_IN@devanagari", "sd-Deva-IN");
+
+  test_correspondence ("sr_BA@latin", "sr-Latn-BA");
+  test_correspondence ("sr_BA", "sr-Cyrl-BA");
+  test_correspondence ("sr_RS", "sr-Cyrl-RS");
+  test_correspondence ("sr_RS@latin", "sr-Latn-RS");
+
+  test_correspondence ("uz_UZ", "uz-Latn-UZ");
+  test_correspondence ("uz_UZ@cyrillic", "uz-Cyrl-UZ");
+
+  test_correspondence ("yi_US", "yi-Hebr-US");
+  test_correspondence ("yi_US@latin", "yi-Latn-US");
+
+  /* For Quechua, Microsoft uses the ISO 639-3 code "quz" instead of the
+     ISO 639-1 code "qu".  */
+  {
+    char buf[BCP47_MAX];
+    memset (buf, 0x77, BCP47_MAX);
+
+    bcp47_to_xpg (buf, "quz-PE", NULL);
+    ASSERT (strcmp (buf, "qu_PE") == 0);
+  }
+
+  /* For Tamazight, Microsoft uses the ISO 639-3 code "tzm" instead of the
+     ISO 639-2 code "ber".  */
+  {
+    char buf[BCP47_MAX];
+    memset (buf, 0x77, BCP47_MAX);
+
+    bcp47_to_xpg (buf, "tzm-MA", NULL);
+    ASSERT (strcmp (buf, "ber_MA") == 0);
+  }
+
+  /* Test xpg_to_bcp47 with an encoding.  */
+  {
+    char buf[BCP47_MAX];
+    memset (buf, 0x77, BCP47_MAX);
+
+    xpg_to_bcp47 (buf, "en_US.UTF-8");
+    ASSERT (strcmp (buf, "en-US") == 0);
+  }
+  {
+    char buf[BCP47_MAX];
+    memset (buf, 0x77, BCP47_MAX);
+
+    xpg_to_bcp47 (buf, "az_AZ.UTF-8@cyrillic");
+    ASSERT (strcmp (buf, "az-Cyrl-AZ") == 0);
+  }
+
+  /* Test bcp47_to_xpg with an encoding.  */
+  {
+    char buf[BCP47_MAX];
+    memset (buf, 0x77, BCP47_MAX);
+
+    bcp47_to_xpg (buf, "en-US", "UTF-8");
+    ASSERT (strcmp (buf, "en_US.UTF-8") == 0);
+  }
+  {
+    char buf[BCP47_MAX];
+    memset (buf, 0x77, BCP47_MAX);
+
+    bcp47_to_xpg (buf, "az-Cyrl-AZ", "UTF-8");
+    ASSERT (strcmp (buf, "az_AZ.UTF-8@cyrillic") == 0);
+  }
+
+  /* Test case mapping done by bcp47_to_xpg.  */
+  {
+    char buf[BCP47_MAX];
+    memset (buf, 0x77, BCP47_MAX);
+
+    bcp47_to_xpg (buf, "EN-US", "UTF-8");
+    ASSERT (strcmp (buf, "en_US.UTF-8") == 0);
+  }
+  {
+    char buf[BCP47_MAX];
+    memset (buf, 0x77, BCP47_MAX);
+
+    bcp47_to_xpg (buf, "en-us", "UTF-8");
+    ASSERT (strcmp (buf, "en_US.UTF-8") == 0);
+  }
+  {
+    char buf[BCP47_MAX];
+    memset (buf, 0x77, BCP47_MAX);
+
+    bcp47_to_xpg (buf, "Zh-hANs-cN", "UTF-8");
+    ASSERT (strcmp (buf, "zh_CN.UTF-8") == 0);
+  }
+
+  return test_exit_status;
+}
-- 
2.34.1
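The case-mapping tests at the end of the test file feed mixed-case tags such as "Zh-hANs-cN" to bcp47_to_xpg. As a standalone illustration of the conventional casing (lowercase language, titlecase script, uppercase region), here is a minimal sketch. The helper `canonicalize_tag_case` and the two ASCII mapping functions are hypothetical and not part of the patch, which relies on Gnulib's c-ctype functions; the sketch decides by subtag length only and does not validate the tag.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* ASCII-only case mapping, independent of the current locale.  */
static char
ascii_tolower (char c)
{
  return (c >= 'A' && c <= 'Z') ? (char) (c - 'A' + 'a') : c;
}

static char
ascii_toupper (char c)
{
  return (c >= 'a' && c <= 'z') ? (char) (c - 'a' + 'A') : c;
}

/* Hypothetical helper, not part of the patch: rewrites TAG in place so that
   the first subtag (the language) is lowercase, a later 4-character subtag
   is titlecase (script), and a later 2-character subtag is uppercase
   (region).  All other subtags are lowercased.  */
static void
canonicalize_tag_case (char *tag)
{
  assert (tag != NULL);
  char *p = tag;
  int first = 1;
  while (*p != '\0')
    {
      size_t len = strcspn (p, "-");
      for (size_t i = 0; i < len; i++)
        {
          if (!first && len == 2)
            p[i] = ascii_toupper (p[i]);
          else if (!first && len == 4 && i == 0)
            p[i] = ascii_toupper (p[i]);
          else
            p[i] = ascii_tolower (p[i]);
        }
      p += len;
      if (*p == '-')
        p++;
      first = 0;
    }
}
```

Applied to a writable copy of "Zh-hANs-cN", this produces "zh-Hans-CN".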