raulcd opened a new issue, #47451:
URL: https://github.com/apache/arrow/issues/47451

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   TL;DR: Some timezones seem to be missing from the new Docker image with the updated glibc.
   
   Our manylinux jobs are failing with several errors about locating timezone data, for example `Cannot locate timezone 'CET': CET not found in timezone database`.
   
   This started happening on the 13th of August.
   
   I've been able to reproduce this locally via archery:
   ```bash
   $ archery docker run python-wheel-manylinux-test-imports
   ```
   
   If I run the container interactively, install the test requirements, and install the already released wheel (not only the newly generated one), I can still reproduce the failure, which suggests the problem is with the Dockerfile rather than with the wheel.
   
   ```bash
   $ docker compose run --rm -it python-wheel-manylinux-test-unittests /bin/bash
   WARN[0000] The "R_UPDATE_CLANG" variable is not set. Defaulting to a blank string. 
   root@5d05c57b2ff1:/# python -m pip install -U -r /arrow/python/requirements-wheel-test.txt
   ...
   root@5d05c57b2ff1:/# pip install pyarrow
   Collecting pyarrow
     Downloading pyarrow-21.0.0-cp39-cp39-manylinux_2_28_x86_64.whl (42.7 MB)
        ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.7/42.7 MB 19.7 MB/s eta 0:00:00
   Installing collected packages: pyarrow
   Successfully installed pyarrow-21.0.0
   WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
   
   [notice] A new release of pip is available: 23.0.1 -> 25.2
   [notice] To update, run: pip install --upgrade pip
   root@5d05c57b2ff1:/# python
   Python 3.9.23 (main, Aug 12 2025, 23:06:01) 
   [GCC 14.2.0] on linux
   Type "help", "copyright", "credits" or "license" for more information.
   >>> import pandas as pd
   >>> import pyarrow as pa
   >>> import pyarrow.compute as pc
   >>> 
   >>> times = ["2018-03-10 09:00", "2038-01-31 12:23", None]
   >>> timezones = ["CET", "UTC", "Europe/Ljubljana"]
   >>> formats = ["%a", "%A", "%w", "%d", "%b", "%B", "%m", "%y", "%Y", "%H", "%I", "%p", "%M", "%z", "%Z", "%j", "%U", "%W", "%%", "%G", "%V", "%u"]
   >>> 
   >>> for timezone in timezones:
   ...   ts = pd.to_datetime(times).tz_localize(timezone)
   ...   for unit in ["s", "ms", "us", "ns"]:
   ...     tsa = pa.array(ts, type=pa.timestamp(unit, timezone))
   ...     for fmt in formats:
   ...        options = pc.StrftimeOptions(fmt)
   ...        result = pc.strftime(tsa, options=options)
   ... 
   Traceback (most recent call last):
     File "<stdin>", line 7, in <module>
     File "/usr/local/lib/python3.9/site-packages/pyarrow/compute.py", line 269, in wrapper
       return func.call(args, options, memory_pool)
     File "pyarrow/_compute.pyx", line 399, in pyarrow._compute.Function.call
     File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: Cannot locate timezone 'CET': CET not found in timezone database
   >>> 
   ```
   
   I've found that around that date the upstream Python Docker image updated from `deb / debian/glibc / 2.36-9+deb12u10` to `deb / debian/glibc / 2.41-12`, and I can see that some timezones are indeed missing (for example, there is no `CET` file and no `US` directory under `/usr/share/zoneinfo`):
   ```
   root@5d05c57b2ff1:/# ldd --version
   ldd (Debian GLIBC 2.41-12) 2.41
   Copyright (C) 2024 Free Software Foundation, Inc.
   This is free software; see the source for copying conditions.  There is NO
   warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
   Written by Roland McGrath and Ulrich Drepper.
   root@5d05c57b2ff1:/# ls /usr/share/zoneinfo
   Africa   Antarctica  Asia      Australia  Europe   GMT     Pacific  iso3166.tab        leapseconds  posixrules  zone.tab      zonenow.tab
   America  Arctic      Atlantic  Etc        Factory  Indian  UTC      leap-seconds.list  localtime    tzdata.zi   zone1970.tab
   ```
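
To confirm which of the names used by the failing tests are absent, one option is to check for their files directly. The following is only a minimal sketch: `missing_zones` is a hypothetical helper, and it assumes the database lives in `$TZDIR` or, failing that, in `/usr/share/zoneinfo`, as the error messages in this report suggest. It does not reproduce how Arrow or ORC resolve zones internally.

```python
import os
from pathlib import Path


def missing_zones(names, tzdir=None):
    """Return the zone names that have no file under the tz database directory."""
    # Assumption: the database is found via $TZDIR, falling back to
    # /usr/share/zoneinfo (the path that appears in the errors in this report).
    root = Path(tzdir or os.environ.get("TZDIR", "/usr/share/zoneinfo"))
    return [name for name in names if not (root / name).is_file()]


if __name__ == "__main__":
    # Zone names taken from the failing tests in this report.
    wanted = ["CET", "UTC", "Europe/Ljubljana", "US/Pacific"]
    print("missing:", missing_zones(wanted))
```

Running this inside both the old and the new image would show whether the set of missing names changed with the base image update.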
   
   [Full log](https://github.com/ursacomputing/crossbow/actions/runs/17261674193/job/48984354574):
   ```python
   ________________________________ test_strftime _________________________________
   
       @pytest.mark.pandas
       @pytest.mark.timezone_data
       def test_strftime():
           times = ["2018-03-10 09:00", "2038-01-31 12:23", None]
           timezones = ["CET", "UTC", "Europe/Ljubljana"]
       
           formats = ["%a", "%A", "%w", "%d", "%b", "%B", "%m", "%y", "%Y", "%H", "%I",
                      "%p", "%M", "%z", "%Z", "%j", "%U", "%W", "%%", "%G", "%V", "%u"]
           if sys.platform != "win32":
               # Locale-dependent formats don't match on Windows
               formats.extend(["%c", "%x", "%X"])
       
           for timezone in timezones:
               ts = pd.to_datetime(times).tz_localize(timezone)
               for unit in ["s", "ms", "us", "ns"]:
                   tsa = pa.array(ts, type=pa.timestamp(unit, timezone))
                   for fmt in formats:
                       options = pc.StrftimeOptions(fmt)
   >                   result = pc.strftime(tsa, options=options)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   
   usr/local/lib/python3.13/site-packages/pyarrow/tests/test_compute.py:2297: 
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   usr/local/lib/python3.13/site-packages/pyarrow/compute.py:269: in wrapper
       return func.call(args, options, memory_pool)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   pyarrow/_compute.pyx:399: in pyarrow._compute.Function.call
       ???
   pyarrow/error.pxi:155: in pyarrow.lib.pyarrow_internal_check_status
       ???
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   
   >   ???
   E   pyarrow.lib.ArrowInvalid: Cannot locate timezone 'CET': CET not found in timezone database
   
   pyarrow/error.pxi:92: ArrowInvalid
   ________________ test_example_using_json[TestOrcFile.test1.orc] ________________
   
   filename = 'TestOrcFile.test1.orc'
   datadir = PosixPath('/usr/local/lib/python3.13/site-packages/pyarrow/tests/data/orc')
   
       @pytest.mark.pandas
       @pytest.mark.parametrize('filename', [
           'TestOrcFile.test1.orc',
           'TestOrcFile.testDate1900.orc',
           'decimal.orc'
       ])
       def test_example_using_json(filename, datadir):
           """
           Check a ORC file example against the equivalent JSON file, as given
           in the Apache ORC repository (the JSON file has one JSON object per
           line, corresponding to one row in the ORC file).
           """
           # Read JSON file
           path = datadir / filename
           table = pd.read_json(str(path.with_suffix('.jsn.gz')), lines=True)
   >       check_example_file(path, table, need_fix=True)
   
   usr/local/lib/python3.13/site-packages/pyarrow/tests/test_orc.py:145: 
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   usr/local/lib/python3.13/site-packages/pyarrow/tests/test_orc.py:101: in check_example_file
       table = orc_file.read()
               ^^^^^^^^^^^^^^^
   usr/local/lib/python3.13/site-packages/pyarrow/orc.py:187: in read
       return self.reader.read(columns=columns)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   pyarrow/_orc.pyx:374: in pyarrow._orc.ORCReader.read
       ???
   pyarrow/error.pxi:155: in pyarrow.lib.pyarrow_internal_check_status
       ???
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   
   >   ???
   E   pyarrow.lib.ArrowException: Unknown error: Time zone file /usr/share/zoneinfo/US/Pacific does not exist. Please install IANA time zone database and set TZDIR env.
   
   pyarrow/error.pxi:92: ArrowException
   ____________ test_example_using_json[TestOrcFile.testDate1900.orc] _____________
   
   filename = 'TestOrcFile.testDate1900.orc'
   datadir = PosixPath('/usr/local/lib/python3.13/site-packages/pyarrow/tests/data/orc')
   
       @pytest.mark.pandas
       @pytest.mark.parametrize('filename', [
           'TestOrcFile.test1.orc',
           'TestOrcFile.testDate1900.orc',
           'decimal.orc'
       ])
       def test_example_using_json(filename, datadir):
           """
           Check a ORC file example against the equivalent JSON file, as given
           in the Apache ORC repository (the JSON file has one JSON object per
           line, corresponding to one row in the ORC file).
           """
           # Read JSON file
           path = datadir / filename
           table = pd.read_json(str(path.with_suffix('.jsn.gz')), lines=True)
   >       check_example_file(path, table, need_fix=True)
   
   usr/local/lib/python3.13/site-packages/pyarrow/tests/test_orc.py:145: 
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   usr/local/lib/python3.13/site-packages/pyarrow/tests/test_orc.py:101: in check_example_file
       table = orc_file.read()
               ^^^^^^^^^^^^^^^
   usr/local/lib/python3.13/site-packages/pyarrow/orc.py:187: in read
       return self.reader.read(columns=columns)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   pyarrow/_orc.pyx:374: in pyarrow._orc.ORCReader.read
       ???
   pyarrow/error.pxi:155: in pyarrow.lib.pyarrow_internal_check_status
       ???
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   
   >   ???
   E   pyarrow.lib.ArrowException: Unknown error: Time zone file /usr/share/zoneinfo/US/Pacific does not exist. Please install IANA time zone database and set TZDIR env.
   ```
   
   ### Component(s)
   
   Python, Continuous Integration

