new module 'bcp47'

Bruno Haible Thu, 03 Oct 2024 13:18:05 -0700

BCP 47 is an IETF specification for locale names. Since
<https://en.wikipedia.org/wiki/IETF_language_tag> says that this specification
is "used by computing standards such as HTTP, HTML, XML and PNG", it makes
sense for Gnulib to support it.


Here is a module 'bcp47' that provides support for it, in the form of
conversion functions to and from the XPG syntax (generally used by glibc).
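
To illustrate the basic mapping between the two syntaxes, here is a minimal
standalone sketch. It is not the code from the patch: the function name
sketch_xpg_to_tag and its size parameter are hypothetical, and it does no
script inference and no language canonicalization. It only splits
language[_territory][.codeset][@modifier] and emits language[-territory].

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the patch: converts an XPG locale name
   language[_territory][.codeset][@modifier] to "language[-territory]".
   It drops the codeset and the modifier and does no script inference.
   Returns 0 on success, -1 if there is no language or RESULT is too small.  */
int
sketch_xpg_to_tag (char *result, size_t size, const char *xpg)
{
  assert (result != NULL && xpg != NULL && size > 0);

  /* The language extends up to the first '_', '.', or '@'.  */
  size_t language_len = strcspn (xpg, "_.@");
  const char *territory = NULL;
  size_t territory_len = 0;

  if (xpg[language_len] == '_')
    {
      /* The territory extends up to the first '.' or '@'.  */
      territory = xpg + language_len + 1;
      territory_len = strcspn (territory, ".@");
    }

  /* Number of bytes needed, not counting the terminating NUL.  */
  size_t needed = language_len + (territory_len > 0 ? 1 + territory_len : 0);
  if (language_len == 0 || needed >= size)
    {
      result[0] = '\0';
      return -1;
    }

  char *q = result;
  memcpy (q, xpg, language_len);
  q += language_len;
  if (territory_len > 0)
    {
      *q++ = '-';
      memcpy (q, territory, territory_len);
      q += territory_len;
    }
  *q = '\0';
  return 0;
}
```

With this sketch, "de_DE.UTF-8" becomes "de-DE" and "sr_RS@latin" becomes
"sr-RS". The module in the patch goes further: it keeps the script
information, so that "sr_RS@latin" maps to "sr-Latn-RS".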


2024-10-03  Bruno Haible  <br...@clisp.org>

        bcp47: Add tests.
        * tests/test-bcp47.c: New file.
        * modules/bcp47-tests: New file.

        bcp47: New module.
        * lib/bcp47.h: New file.
        * lib/bcp47.c: New file.
        * modules/bcp47: New file.

From b1648c71c33eaf25bf346871950bd25373734da4 Mon Sep 17 00:00:00 2001
From: Bruno Haible <br...@clisp.org>
Date: Thu, 3 Oct 2024 20:45:08 +0200
Subject: [PATCH 1/2] bcp47: New module.

* lib/bcp47.h: New file.
* lib/bcp47.c: New file.
* modules/bcp47: New file.
---
 ChangeLog     |   7 +
 lib/bcp47.c   | 626 ++++++++++++++++++++++++++++++++++++++++++++++++++
 lib/bcp47.h   |  75 ++++++
 modules/bcp47 |  24 ++
 4 files changed, 732 insertions(+)
 create mode 100644 lib/bcp47.c
 create mode 100644 lib/bcp47.h
 create mode 100644 modules/bcp47

diff --git a/ChangeLog b/ChangeLog
index 72a40a6ceb..bbf15b7064 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,10 @@
+2024-10-03  Bruno Haible  <br...@clisp.org>
+
+	bcp47: New module.
+	* lib/bcp47.h: New file.
+	* lib/bcp47.c: New file.
+	* modules/bcp47: New file.
+
 2024-10-02  Collin Funk  <collin.fu...@gmail.com>
 
 	error, verror: Don't call va_end twice.
diff --git a/lib/bcp47.c b/lib/bcp47.c
new file mode 100644
index 0000000000..8008ac030d
--- /dev/null
+++ b/lib/bcp47.c
@@ -0,0 +1,626 @@
+/* Support for locale names in BCP 47 syntax.
+   Copyright (C) 2024 Free Software Foundation, Inc.
+
+   This file is free software: you can redistribute it and/or modify
+   it under the terms of the GNU Lesser General Public License as
+   published by the Free Software Foundation, either version 3 of the
+   License, or (at your option) any later version.
+
+   This file is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public License
+   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */
+
+/* Written by Bruno Haible <br...@clisp.org>, 2024.  */
+
+#include <config.h>
+
+/* Specification.  */
+#include "bcp47.h"
+
+#include <string.h>
+
+#include "c-ctype.h"
+
+/* The set of XPG locale names is historically grown and emphasizes the region
+   over the script.  In fact, it uses the script only to disambiguate locales
+   with the same region.
+   The BCP 47 locale names, on the other hand, emphasize the script over the
+   region.
+
+   Therefore we add special treatment of all languages that can be written
+   using different scripts:
+     - During XPG to BCP 47 conversion, we add the script if not present,
+       inferring it from the region.
+     - During BCP 47 to XPG conversion, when a region is provided, we remove
+       the script if doing so produces a known locale name (i.e. a locale name
+       present in glibc, since glibc has the most complete set of locales).
+
+   This affects the following languages:
+     - Azerbaijani (az): Latin in Azerbaijan, Arabic in Iran.
+       <https://en.wikipedia.org/wiki/Azerbaijani_language>
+     - Belarusian (be): Assume Cyrillic by default, but Latin exists as well.
+       <https://en.wikipedia.org/wiki/Belarusian_language#Alphabet>
+     - Tamazight / Berber (ber): Assume Latin by default, but Arabic exists
+       as well.
+       <https://en.wikipedia.org/wiki/Berber_languages>
+       <https://en.wikipedia.org/wiki/Berber_Latin_alphabet>
+       <https://en.wikipedia.org/wiki/Tifinagh>
+     - Bosnian (bs): Assume Latin by default, but Cyrillic exists as well.
+       <https://en.wikipedia.org/wiki/Bosnian_language>
+     - Hausa (ha): Assume Latin by default, but Arabic exists as well.
+       <https://en.wikipedia.org/wiki/Hausa_language>
+       <https://en.wikipedia.org/wiki/Boko_alphabet>
+     - Inuktitut (iu): Assume Inuktitut syllabics by default, but Latin
+       exists as well.
+       <https://en.wikipedia.org/wiki/Inuktitut#Writing>
+       <https://en.wikipedia.org/wiki/Inuktitut_syllabics>
+     - Kazakh (kk): Currently (2024) Cyrillic by default, but migrating to
+       Latin.
+       <https://en.wikipedia.org/wiki/Kazakh_language>
+     - Kashmiri (ks): Assume Arabic by default, but Devanagari exists as well.
+       <https://en.wikipedia.org/wiki/Kashmiri_language>
+     - Kurdish (ku): Latin in Türkiye and Syria, Arabic in Iraq and Iran.
+       <https://en.wikipedia.org/wiki/Kurdish_language>
+     - Mongolian (mn): Currently (2024) mainly Cyrillic, but the vertically
+       written Mongolian script is also in use.
+       <https://en.wikipedia.org/wiki/Mongolian_language>
+     - Min Nan Chinese (nan): Assume Traditional Chinese by default, but Latin
+       exists as well.
+       <https://en.wikipedia.org/wiki/Southern_Min>
+     - Punjabi (pa): Arabic in Pakistan, Gurmukhi in India.
+       <https://en.wikipedia.org/wiki/Punjabi_language>
+     - Sindhi (sd): Arabic in Pakistan, assume Arabic in India as well, but
+       Devanagari exists in India too.
+       <https://en.wikipedia.org/wiki/Sindhi_language#Writing_systems>
+     - Serbian (sr): Assume Cyrillic by default, but Latin exists as well.
+       <https://en.wikipedia.org/wiki/Serbian_language>
+     - Uzbek (uz): Assume Latin by default, but Cyrillic exists as well.
+       <https://en.wikipedia.org/wiki/Uzbek_language>
+     - Yiddish (yi): Assume Hebrew by default, but Latin exists as well.
+       <https://en.wikipedia.org/wiki/Yiddish>
+     - Chinese (zh): Simplified Chinese in PRC and Singapore,
+       Traditional Chinese elsewhere.
+       <https://en.wikipedia.org/wiki/Chinese_language>
+ */
+
+
+struct script
+{
+  char name[12]; /* Script name, lowercased, NUL-terminated */
+  char code[4];  /* Script code, not NUL-terminated */
+};
+
+/* Table of script names and four-letter script codes.
+   The codes are taken from <https://en.wikipedia.org/wiki/ISO_15924> or
+   <https://unicode.org/iso15924/iso15924-codes.html>.  */
+static const struct script scripts[] =
+{
+  { "latin",      "Latn" },
+  { "cyrillic",   "Cyrl" },
+  { "hebrew",     "Hebr" },
+  { "arabic",     "Arab" },
+  { "devanagari", "Deva" },
+  { "gurmukhi",   "Guru" },
+  { "mongolian",  "Mong" }
+};
+#define NUM_SCRIPTS (sizeof (scripts) / sizeof (scripts[0]))
+
+
+void
+xpg_to_bcp47 (char *bcp47, const char *xpg)
+{
+  /* Special cases.  */
+  if (strcmp (xpg, "") == 0)
+   fail:
+    {
+      strcpy (bcp47, "und");
+      return;
+    }
+  if ((xpg[0] == 'C' && (xpg[1] == '\0' || xpg[1] == '.'))
+      || strcmp (xpg, "POSIX") == 0)
+    {
+      /* The "C" (or "C.UTF-8") and "POSIX" locales most closely resemble the
+         "en_US" locale.  */
+      strcpy (bcp47, "en-US");
+      return;
+    }
+
+  /* Parse XPG as language[_territory][.codeset][@modifier].  */
+  const char *language_start = NULL;
+  size_t language_len = 0;
+  const char *territory_start = NULL;
+  size_t territory_len = 0;
+  const char *modifier_start = NULL;
+  size_t modifier_len = 0;
+
+  {
+    const char *p;
+
+    p = xpg;
+    language_start = p;
+    while (*p != '\0' && *p != '_' && *p != '.' && *p != '@')
+      p++;
+    language_len = p - language_start;
+    if (*p == '_')
+      {
+        p++;
+        territory_start = p;
+        while (*p != '\0' && *p != '.' && *p != '@')
+          p++;
+        territory_len = p - territory_start;
+      }
+    if (*p == '.')
+      {
+        p++;
+        while (*p != '\0' && *p != '@')
+          p++;
+      }
+    if (*p == '@')
+      {
+        p++;
+        modifier_start = p;
+        while (*p != '\0')
+          p++;
+        modifier_len = p - modifier_start;
+      }
+  }
+
+  if (language_len == 0)
+    /* No language -> fail.  */
+    goto fail;
+
+  /* Canonicalize the language.  */
+  /* For Quechua, Microsoft uses the ISO 639-3 code "quz" instead of the
+     ISO 639-1 code "qu".  */
+  if (language_len == 3 && memcmp (language_start, "quz", 3) == 0)
+    {
+      language_start = "qu";
+      language_len = 2;
+    }
+  /* For Tamazight, Microsoft uses the ISO 639-3 code "tzm" instead of the
+     ISO 639-2 code "ber".  */
+  else if (language_len == 3 && memcmp (language_start, "tzm", 3) == 0)
+    {
+      language_start = "ber";
+      language_len = 3;
+    }
+
+  const char *script_subtag = NULL;
+
+  /* Determine script from the modifier.  */
+  if (modifier_len > 0)
+    {
+      size_t i;
+      for (i = 0; i < NUM_SCRIPTS; i++)
+        if (strlen (scripts[i].name) == modifier_len
+            && memcmp (scripts[i].name, modifier_start, modifier_len) == 0)
+          script_subtag = scripts[i].code;
+    }
+
+  /* Determine script from the language and possibly the territory.  */
+  if (language_len > 0 && script_subtag == NULL)
+    {
+      /* Languages with a script that depends on the territory.  */
+      if (territory_len > 0)
+        {
+          if (language_len == 2)
+            {
+              if (memcmp (language_start, "az", 2) == 0)
+                {
+                  if (territory_len == 2)
+                    {
+                      if (memcmp (territory_start, "AZ", 2) == 0)
+                        script_subtag = "Latn";
+                      else if (memcmp (territory_start, "IR", 2) == 0)
+                        script_subtag = "Arab";
+                    }
+                }
+              else if (memcmp (language_start, "ku", 2) == 0)
+                {
+                  if (territory_len == 2)
+                    {
+                      if (memcmp (territory_start, "IQ", 2) == 0
+                          || memcmp (territory_start, "IR", 2) == 0)
+                        script_subtag = "Arab";
+                      else if (memcmp (territory_start, "SY", 2) == 0
+                               || memcmp (territory_start, "TR", 2) == 0)
+                        script_subtag = "Latn";
+                    }
+                }
+              else if (memcmp (language_start, "pa", 2) == 0)
+                {
+                  if (territory_len == 2)
+                    {
+                      if (memcmp (territory_start, "PK", 2) == 0)
+                        script_subtag = "Arab";
+                      else if (memcmp (territory_start, "IN", 2) == 0)
+                        script_subtag = "Guru";
+                    }
+                }
+              else if (memcmp (language_start, "zh", 2) == 0)
+                {
+                  if (territory_len == 2)
+                    {
+                      if (memcmp (territory_start, "CN", 2) == 0
+                          || memcmp (territory_start, "SG", 2) == 0)
+                        script_subtag = "Hans";
+                      else
+                        script_subtag = "Hant";
+                    }
+                }
+            }
+        }
+      /* Languages with a main script and one or more alternate scripts.  */
+      if (language_len == 2)
+        {
+          if (memcmp (language_start, "be", 2) == 0)
+            script_subtag = "Cyrl";
+          else if (memcmp (language_start, "bs", 2) == 0)
+            script_subtag = "Latn";
+          else if (memcmp (language_start, "ha", 2) == 0)
+            script_subtag = "Latn";
+          else if (memcmp (language_start, "iu", 2) == 0)
+            script_subtag = "Cans";
+          else if (memcmp (language_start, "kk", 2) == 0)
+            script_subtag = "Cyrl";
+          else if (memcmp (language_start, "ks", 2) == 0)
+            script_subtag = "Arab";
+          else if (memcmp (language_start, "mn", 2) == 0)
+            script_subtag = "Cyrl";
+          else if (memcmp (language_start, "sd", 2) == 0)
+            script_subtag = "Arab";
+          else if (memcmp (language_start, "sr", 2) == 0)
+            script_subtag = "Cyrl";
+          else if (memcmp (language_start, "uz", 2) == 0)
+            script_subtag = "Latn";
+          else if (memcmp (language_start, "yi", 2) == 0)
+            script_subtag = "Hebr";
+        }
+      else if (language_len == 3)
+        {
+          if (memcmp (language_start, "ber", 3) == 0)
+            script_subtag = "Latn";
+          else if (memcmp (language_start, "nan", 3) == 0)
+            script_subtag = "Hant";
+        }
+    }
+
+  /* Construct the result: language[-script][-territory].  */
+  if (language_len
+      + (script_subtag != NULL ? 1 + 4 : 0)
+      + (territory_len > 0 ? 1 + territory_len : 0)
+      < BCP47_MAX)
+    {
+      char *q = bcp47;
+      memcpy (q, language_start, language_len);
+      q += language_len;
+      if (script_subtag != NULL)
+        {
+          *q++ = '-';
+          memcpy (q, script_subtag, 4);
+          q += 4;
+        }
+      if (territory_len > 0)
+        {
+          *q++ = '-';
+          memcpy (q, territory_start, territory_len);
+          q += territory_len;
+        }
+      *q = '\0';
+      return;
+    }
+  else
+    goto fail;
+}
+
+void
+bcp47_to_xpg (char *xpg, const char *bcp47, const char *codeset)
+{
+  /* Special cases.  */
+  if (strcmp (bcp47, "") == 0)
+   fail:
+    {
+      strcpy (xpg, "");
+      return;
+    }
+
+  /* Parse BCP47 as
+     language{-extlang}*[-script][-region]{-variant}*{-extension}*.  */
+  const char *language_start = NULL;
+  size_t language_len = 0;
+  const char *script_start = NULL;
+  size_t script_len = 0;
+  const char *region_start = NULL;
+  size_t region_len = 0;
+
+  {
+    bool past_script = false;
+    bool past_region = false;
+    const char *p;
+
+    p = bcp47;
+    language_start = p;
+    while (*p != '\0' && *p != '-')
+      p++;
+    language_len = p - language_start;
+    while (*p != '\0')
+      {
+        if (*p == '-')
+          {
+            p++;
+            const char *subtag_start = p;
+            while (*p != '\0' && *p != '-')
+              p++;
+            size_t subtag_len = p - subtag_start;
+
+            if (!past_script && subtag_len == 4)
+              {
+                /* Parsed -script.  */
+                script_start = subtag_start;
+                script_len = subtag_len;
+                past_script = true;
+              }
+            else if (!past_region
+                     && (subtag_len == 2
+                         || (subtag_len == 3
+                             && subtag_start[0] >= '0' && subtag_start[0] <= '9'
+                             && subtag_start[1] >= '0' && subtag_start[1] <= '9'
+                             && subtag_start[2] >= '0' && subtag_start[2] <= '9')))
+              {
+                /* Parsed -region.  */
+                region_start = subtag_start;
+                region_len = subtag_len;
+                past_region = true;
+                past_script = true;
+              }
+            else
+              {
+                /* Is it -extlang or -variant or -extension?  */
+                if (!past_script && subtag_len == 3)
+                  {
+                    /* It is -extlang.  */
+                  }
+                else
+                  {
+                    /* It must be -variant or -extension.  */
+                    past_script = true;
+                    past_region = true;
+                  }
+              }
+          }
+      }
+  }
+
+  if (language_len == 0 || language_len >= BCP47_MAX)
+    /* No language or too long -> fail.  */
+    goto fail;
+
+  /* Copy the language to the result buffer, converting it to lower case.  */
+  {
+    size_t i;
+    for (i = 0; i < language_len; i++)
+      xpg[i] = c_tolower (language_start[i]);
+  }
+
+  /* Canonicalize the language.  */
+  /* For Quechua, Microsoft uses the ISO 639-3 code "quz" instead of the
+     ISO 639-1 code "qu".  */
+  if (language_len == 3 && memcmp (xpg, "quz", 3) == 0)
+    {
+      language_len = 2;
+      memcpy (xpg, "qu", language_len);
+    }
+  /* For Tamazight, Microsoft uses the ISO 639-3 code "tzm" instead of the
+     ISO 639-2 code "ber".  */
+  else if (language_len == 3 && memcmp (xpg, "tzm", 3) == 0)
+    {
+      language_len = 3;
+      memcpy (xpg, "ber", language_len);
+    }
+
+  /* Copy the region to a temporary buffer, converting it to upper case.  */
+  char territory[3];
+  size_t territory_len = region_len; /* == 0, 2, or 3 */
+  {
+    size_t i;
+    for (i = 0; i < region_len; i++)
+      territory[i] = c_toupper (region_start[i]);
+  }
+
+  /* Determine script from the script subtag.  */
+  const char *script = NULL;
+
+  if (script_len > 0)
+    {
+      /* Here script_len == 4.  */
+      size_t i;
+      for (i = 0; i < NUM_SCRIPTS; i++)
+        if (c_toupper (script_start[0]) == scripts[i].code[0]
+            && c_tolower (script_start[1]) == scripts[i].code[1]
+            && c_tolower (script_start[2]) == scripts[i].code[2]
+            && c_tolower (script_start[3]) == scripts[i].code[3])
+          script = scripts[i].name;
+    }
+
+  /* Possibly strip away the script, depending on the language and possibly
+     the territory.  */
+  if (script != NULL)
+    {
+      /* Languages with a script that depends on the territory.  */
+      if (territory_len > 0)
+        {
+          if (language_len == 2)
+            {
+              if (memcmp (xpg, "az", 2) == 0)
+                {
+                  if (territory_len == 2)
+                    {
+                      if (memcmp (territory, "AZ", 2) == 0)
+                        {
+                          if (strcmp (script, "latin") == 0)
+                            script = NULL;
+                        }
+                      else if (memcmp (territory, "IR", 2) == 0)
+                        {
+                          if (strcmp (script, "arabic") == 0)
+                            script = NULL;
+                        }
+                    }
+                }
+              else if (memcmp (xpg, "ku", 2) == 0)
+                {
+                  if (territory_len == 2)
+                    {
+                      if (memcmp (territory, "IQ", 2) == 0
+                          || memcmp (territory, "IR", 2) == 0)
+                        {
+                          if (strcmp (script, "arabic") == 0)
+                            script = NULL;
+                        }
+                      else if (memcmp (territory, "SY", 2) == 0
+                               || memcmp (territory, "TR", 2) == 0)
+                        {
+                          if (strcmp (script, "latin") == 0)
+                            script = NULL;
+                        }
+                    }
+                }
+              else if (memcmp (xpg, "pa", 2) == 0)
+                {
+                  if (territory_len == 2)
+                    {
+                      if (memcmp (territory, "PK", 2) == 0)
+                        {
+                          if (strcmp (script, "arabic") == 0)
+                            script = NULL;
+                        }
+                      else if (memcmp (territory, "IN", 2) == 0)
+                        {
+                          if (strcmp (script, "gurmukhi") == 0)
+                            script = NULL;
+                        }
+                    }
+                }
+              else if (memcmp (xpg, "zh", 2) == 0)
+                {
+                  /* "Hans" and "Hant" are not present in the scripts[] table,
+                     therefore nothing to do here.  */
+                }
+            }
+        }
+      /* Languages with a main script and one or more alternate scripts.  */
+      if (language_len == 2)
+        {
+          if (memcmp (xpg, "be", 2) == 0)
+            {
+              if (strcmp (script, "cyrillic") == 0)
+                script = NULL;
+            }
+          else if (memcmp (xpg, "bs", 2) == 0)
+            {
+              if (strcmp (script, "latin") == 0)
+                script = NULL;
+            }
+          else if (memcmp (xpg, "ha", 2) == 0)
+            {
+              if (strcmp (script, "latin") == 0)
+                script = NULL;
+            }
+          else if (memcmp (xpg, "iu", 2) == 0)
+            {
+              /* "Cans" is not present in the scripts[] table,
+                 therefore nothing to do here.  */
+            }
+          else if (memcmp (xpg, "kk", 2) == 0)
+            {
+              if (strcmp (script, "cyrillic") == 0)
+                script = NULL;
+            }
+          else if (memcmp (xpg, "ks", 2) == 0)
+            {
+              if (strcmp (script, "arabic") == 0)
+                script = NULL;
+            }
+          else if (memcmp (xpg, "mn", 2) == 0)
+            {
+              if (strcmp (script, "cyrillic") == 0)
+                script = NULL;
+            }
+          else if (memcmp (xpg, "sd", 2) == 0)
+            {
+              if (strcmp (script, "arabic") == 0)
+                script = NULL;
+            }
+          else if (memcmp (xpg, "sr", 2) == 0)
+            {
+              if (strcmp (script, "cyrillic") == 0)
+                script = NULL;
+            }
+          else if (memcmp (xpg, "uz", 2) == 0)
+            {
+              if (strcmp (script, "latin") == 0)
+                script = NULL;
+            }
+          else if (memcmp (xpg, "yi", 2) == 0)
+            {
+              if (strcmp (script, "hebrew") == 0)
+                script = NULL;
+            }
+        }
+      else if (language_len == 3)
+        {
+          if (memcmp (xpg, "ber", 3) == 0)
+            {
+              if (strcmp (script, "latin") == 0)
+                script = NULL;
+            }
+          else if (memcmp (xpg, "nan", 3) == 0)
+            {
+              /* "Hant" is not present in the scripts[] table,
+                 therefore nothing to do here.  */
+            }
+        }
+    }
+
+  /* The modifier is the script.  */
+  const char *modifier = script;
+
+  /* Construct the result: language[_territory][.codeset][@modifier].  */
+  size_t codeset_len = (codeset != NULL ? strlen (codeset) : 0);
+  size_t modifier_len = (modifier != NULL ? strlen (modifier) : 0);
+  if (language_len
+      + (territory_len > 0 ? 1 + territory_len : 0)
+      + (codeset != NULL ? 1 + codeset_len : 0)
+      + (modifier != NULL ? 1 + modifier_len : 0)
+      < BCP47_MAX)
+    {
+      char *q = xpg;
+      q += language_len;
+      if (territory_len > 0)
+        {
+          *q++ = '_';
+          memcpy (q, territory, territory_len);
+          q += territory_len;
+        }
+      if (codeset != NULL)
+        {
+          *q++ = '.';
+          memcpy (q, codeset, codeset_len);
+          q += codeset_len;
+        }
+      if (modifier != NULL)
+        {
+          *q++ = '@';
+          memcpy (q, modifier, modifier_len);
+          q += modifier_len;
+        }
+      *q = '\0';
+      return;
+    }
+  else
+    goto fail;
+}
diff --git a/lib/bcp47.h b/lib/bcp47.h
new file mode 100644
index 0000000000..3a044dfd29
--- /dev/null
+++ b/lib/bcp47.h
@@ -0,0 +1,75 @@
+/* Support for locale names in BCP 47 syntax.
+   Copyright (C) 2024 Free Software Foundation, Inc.
+
+   This file is free software: you can redistribute it and/or modify
+   it under the terms of the GNU Lesser General Public License as
+   published by the Free Software Foundation, either version 3 of the
+   License, or (at your option) any later version.
+
+   This file is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public License
+   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */
+
+/* Written by Bruno Haible <br...@clisp.org>, 2024.  */
+
+#ifndef _BCP47_H
+#define _BCP47_H
+
+/* A locale name can exist in three possible forms:
+
+     * The XPG syntax
+         language[_territory][.codeset][@modifier]
+       where
+         - The language is an ISO 639 (two- or three-letter) language code.
+         - The territory is an ISO 3166 (two-letter) country code.
+         - The codeset is typically UTF-8.
+         - The supported @modifiers are usually something like
+             @euro
+             a script indicator, such as: @latin, @cyrillic, @devanagari
+
+     * The locale name understood by setlocale().
+       On glibc and many other Unix-like systems, this is the XPG syntax.
+       On native Windows, it is similar to XPG syntax, with English names
+       (instead of ISO codes) for the language and territory and with a
+       number for the codeset (e.g. 65001 for UTF-8).
+
+     * The BCP 47 syntax
+         language[-script][-region]{-variant}*{-extension}*
+       defined in
+         <https://www.ietf.org/rfc/bcp/bcp47.html>
+         = <https://www.rfc-editor.org/rfc/bcp/bcp47.txt>
+       which consists of RFC 5646 and RFC 4647.
+       See also <https://en.wikipedia.org/wiki/IETF_language_tag>.
+       Note: The BCP 47 syntax does not include a codeset.
+
+   This file provides conversions between the XPG syntax and the BCP 47
+   syntax.  */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/* Required size of buffer for a locale name.  */
+#define BCP47_MAX 100
+
+/* Converts a locale name in XPG syntax to a locale name in BCP 47 syntax.
+   Returns the result in bcp47, which must be at least BCP47_MAX bytes
+   large.  */
+extern void xpg_to_bcp47 (char *bcp47, const char *xpg);
+
+/* Converts a locale name in BCP 47 syntax (optionally with a codeset)
+   to a locale name in XPG syntax.
+   The specified codeset may be NULL.
+   Returns the result in xpg, which must be at least BCP47_MAX bytes
+   large.  */
+extern void bcp47_to_xpg (char *xpg, const char *bcp47, const char *codeset);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _BCP47_H */
diff --git a/modules/bcp47 b/modules/bcp47
new file mode 100644
index 0000000000..ca0ac88287
--- /dev/null
+++ b/modules/bcp47
@@ -0,0 +1,24 @@
+Description:
+Support for locale names in BCP 47 syntax.
+
+Files:
+lib/bcp47.h
+lib/bcp47.c
+
+Depends-on:
+stdbool
+c-ctype
+
+configure.ac:
+
+Makefile.am:
+lib_SOURCES += bcp47.c
+
+Include:
+"bcp47.h"
+
+License:
+LGPL
+
+Maintainer:
+all
-- 
2.34.1

From 588b8d0c0cfe366c4fa24a367767b385edb62fec Mon Sep 17 00:00:00 2001
From: Bruno Haible <br...@clisp.org>
Date: Thu, 3 Oct 2024 20:45:59 +0200
Subject: [PATCH 2/2] bcp47: Add tests.

* tests/test-bcp47.c: New file.
* modules/bcp47-tests: New file.
---
 ChangeLog           |   4 +
 modules/bcp47-tests |  11 +++
 tests/test-bcp47.c  | 200 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 215 insertions(+)
 create mode 100644 modules/bcp47-tests
 create mode 100644 tests/test-bcp47.c

diff --git a/ChangeLog b/ChangeLog
index bbf15b7064..22a40c3d58 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,9 @@
 2024-10-03  Bruno Haible  <br...@clisp.org>
 
+	bcp47: Add tests.
+	* tests/test-bcp47.c: New file.
+	* modules/bcp47-tests: New file.
+
 	bcp47: New module.
 	* lib/bcp47.h: New file.
 	* lib/bcp47.c: New file.
diff --git a/modules/bcp47-tests b/modules/bcp47-tests
new file mode 100644
index 0000000000..9abd149a93
--- /dev/null
+++ b/modules/bcp47-tests
@@ -0,0 +1,11 @@
+Files:
+tests/test-bcp47.c
+tests/macros.h
+
+Depends-on:
+
+configure.ac:
+
+Makefile.am:
+TESTS += test-bcp47
+check_PROGRAMS += test-bcp47
diff --git a/tests/test-bcp47.c b/tests/test-bcp47.c
new file mode 100644
index 0000000000..c0efeea3a7
--- /dev/null
+++ b/tests/test-bcp47.c
@@ -0,0 +1,200 @@
+/* Test support for locale names in BCP 47 syntax.
+   Copyright (C) 2024 Free Software Foundation, Inc.
+
+   This file is free software: you can redistribute it and/or modify
+   it under the terms of the GNU Lesser General Public License as
+   published by the Free Software Foundation, either version 3 of the
+   License, or (at your option) any later version.
+
+   This file is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public License
+   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */
+
+/* Written by Bruno Haible <br...@clisp.org>, 2024.  */
+
+#include <config.h>
+
+#include "bcp47.h"
+
+#include <string.h>
+
+#include "macros.h"
+
+static void
+test_correspondence (const char *xpg, const char *bcp47)
+{
+  /* Test xpg_to_bcp47.  */
+  {
+    char buf[BCP47_MAX];
+    memset (buf, 0x77, BCP47_MAX);
+
+    xpg_to_bcp47 (buf, xpg);
+    ASSERT (strcmp (buf, bcp47) == 0);
+  }
+
+  /* Test bcp47_to_xpg.  */
+  {
+    char buf[BCP47_MAX];
+    memset (buf, 0x77, BCP47_MAX);
+
+    bcp47_to_xpg (buf, bcp47, NULL);
+    ASSERT (strcmp (buf, xpg) == 0);
+  }
+}
+
+int
+main ()
+{
+  /* Languages with a single script.  */
+
+  test_correspondence ("de", "de");
+  test_correspondence ("de_DE", "de-DE");
+  test_correspondence ("de_AT", "de-AT");
+
+  /* Languages with a script that depends on the territory.  */
+
+  test_correspondence ("az_AZ", "az-Latn-AZ");
+  test_correspondence ("az_AZ@cyrillic", "az-Cyrl-AZ");
+  test_correspondence ("az_IR", "az-Arab-IR");
+
+  test_correspondence ("ku_IQ", "ku-Arab-IQ");
+  test_correspondence ("ku_IR", "ku-Arab-IR");
+  test_correspondence ("ku_SY", "ku-Latn-SY");
+  test_correspondence ("ku_TR", "ku-Latn-TR");
+
+  test_correspondence ("pa_PK", "pa-Arab-PK");
+  test_correspondence ("pa_IN", "pa-Guru-IN");
+
+  test_correspondence ("zh_CN", "zh-Hans-CN");
+  test_correspondence ("zh_HK", "zh-Hant-HK");
+  test_correspondence ("zh_MO", "zh-Hant-MO");
+  test_correspondence ("zh_SG", "zh-Hans-SG");
+  test_correspondence ("zh_TW", "zh-Hant-TW");
+
+  /* Languages with a main script and one or more alternate scripts.  */
+
+  test_correspondence ("be_BY", "be-Cyrl-BY");
+  test_correspondence ("be_BY@latin", "be-Latn-BY");
+
+  test_correspondence ("ber@arabic", "ber-Arab");
+  test_correspondence ("ber", "ber-Latn");
+  test_correspondence ("ber_DZ", "ber-Latn-DZ");
+  test_correspondence ("ber_MA", "ber-Latn-MA");
+
+  test_correspondence ("bs_BA", "bs-Latn-BA");
+  test_correspondence ("bs_BA@cyrillic", "bs-Cyrl-BA");
+
+  test_correspondence ("ha_NG", "ha-Latn-NG");
+  test_correspondence ("ha_NG@arabic", "ha-Arab-NG");
+
+  test_correspondence ("iu_CA", "iu-Cans-CA");
+  test_correspondence ("iu_CA@latin", "iu-Latn-CA");
+
+  test_correspondence ("kk_KZ", "kk-Cyrl-KZ");
+  test_correspondence ("kk_KZ@latin", "kk-Latn-KZ");
+
+  test_correspondence ("ks_IN", "ks-Arab-IN");
+  test_correspondence ("ks_IN@devanagari", "ks-Deva-IN");
+
+  test_correspondence ("mn_MN", "mn-Cyrl-MN");
+  test_correspondence ("mn_MN@mongolian", "mn-Mong-MN");
+
+  test_correspondence ("nan_TW", "nan-Hant-TW");
+  test_correspondence ("nan_TW@latin", "nan-Latn-TW");
+
+  test_correspondence ("sd_PK", "sd-Arab-PK");
+  test_correspondence ("sd_IN", "sd-Arab-IN");
+  test_correspondence ("sd_IN@devanagari", "sd-Deva-IN");
+
+  test_correspondence ("sr_BA@latin", "sr-Latn-BA");
+  test_correspondence ("sr_BA", "sr-Cyrl-BA");
+  test_correspondence ("sr_RS", "sr-Cyrl-RS");
+  test_correspondence ("sr_RS@latin", "sr-Latn-RS");
+
+  test_correspondence ("uz_UZ", "uz-Latn-UZ");
+  test_correspondence ("uz_UZ@cyrillic", "uz-Cyrl-UZ");
+
+  test_correspondence ("yi_US", "yi-Hebr-US");
+  test_correspondence ("yi_US@latin", "yi-Latn-US");
+
+  /* For Quechua, Microsoft uses the ISO 639-3 code "quz" instead of the
+     ISO 639-1 code "qu".  */
+  {
+    char buf[BCP47_MAX];
+    memset (buf, 0x77, BCP47_MAX);
+
+    bcp47_to_xpg (buf, "quz-PE", NULL);
+    ASSERT (strcmp (buf, "qu_PE") == 0);
+  }
+
+  /* For Tamazight, Microsoft uses the ISO 639-3 code "tzm" instead of the
+     ISO 639-2 code "ber".  */
+  {
+    char buf[BCP47_MAX];
+    memset (buf, 0x77, BCP47_MAX);
+
+    bcp47_to_xpg (buf, "tzm-MA", NULL);
+    ASSERT (strcmp (buf, "ber_MA") == 0);
+  }
+
+  /* Test xpg_to_bcp47 with an encoding.  */
+  {
+    char buf[BCP47_MAX];
+    memset (buf, 0x77, BCP47_MAX);
+
+    xpg_to_bcp47 (buf, "en_US.UTF-8");
+    ASSERT (strcmp (buf, "en-US") == 0);
+  }
+  {
+    char buf[BCP47_MAX];
+    memset (buf, 0x77, BCP47_MAX);
+
+    xpg_to_bcp47 (buf, "az_AZ.UTF-8@cyrillic");
+    ASSERT (strcmp (buf, "az-Cyrl-AZ") == 0);
+  }
+
+  /* Test bcp47_to_xpg with an encoding.  */
+  {
+    char buf[BCP47_MAX];
+    memset (buf, 0x77, BCP47_MAX);
+
+    bcp47_to_xpg (buf, "en-US", "UTF-8");
+    ASSERT (strcmp (buf, "en_US.UTF-8") == 0);
+  }
+  {
+    char buf[BCP47_MAX];
+    memset (buf, 0x77, BCP47_MAX);
+
+    bcp47_to_xpg (buf, "az-Cyrl-AZ", "UTF-8");
+    ASSERT (strcmp (buf, "az_AZ.UTF-8@cyrillic") == 0);
+  }
+
+  /* Test case mapping done by bcp47_to_xpg.  */
+  {
+    char buf[BCP47_MAX];
+    memset (buf, 0x77, BCP47_MAX);
+
+    bcp47_to_xpg (buf, "EN-US", "UTF-8");
+    ASSERT (strcmp (buf, "en_US.UTF-8") == 0);
+  }
+  {
+    char buf[BCP47_MAX];
+    memset (buf, 0x77, BCP47_MAX);
+
+    bcp47_to_xpg (buf, "en-us", "UTF-8");
+    ASSERT (strcmp (buf, "en_US.UTF-8") == 0);
+  }
+  {
+    char buf[BCP47_MAX];
+    memset (buf, 0x77, BCP47_MAX);
+
+    bcp47_to_xpg (buf, "Zh-hANs-cN", "UTF-8");
+    ASSERT (strcmp (buf, "zh_CN.UTF-8") == 0);
+  }
+
+  return test_exit_status;
+}
-- 
2.34.1
