raulcd opened a new issue, #47451:
URL: https://github.com/apache/arrow/issues/47451
### Describe the bug, including details regarding any error messages, version, and platform.
TL;DR: Timezone data seems to be missing from the new Docker image with the updated glibc.
Our manylinux jobs are failing with several errors locating timezone data, for example
`Cannot locate timezone 'CET': CET not found in timezone database`.
This started happening on the 13th of August.
I've been able to reproduce it locally via archery:
```bash
$ archery docker run python-wheel-manylinux-test-imports
```
If I start the container interactively, install the test requirements and
install the already released wheel (not the newly generated one), I can still
reproduce the failure, meaning the problem is with the Dockerfile rather than
with the new wheel.
```bash
$ docker compose run --rm -it python-wheel-manylinux-test-unittests /bin/bash
WARN[0000] The "R_UPDATE_CLANG" variable is not set. Defaulting to a blank string.
root@5d05c57b2ff1:/# python -m pip install -U -r /arrow/python/requirements-wheel-test.txt
...
root@5d05c57b2ff1:/# pip install pyarrow
Collecting pyarrow
Downloading pyarrow-21.0.0-cp39-cp39-manylinux_2_28_x86_64.whl (42.7 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.7/42.7 MB 19.7 MB/s eta
0:00:00
Installing collected packages: pyarrow
Successfully installed pyarrow-21.0.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[notice] A new release of pip is available: 23.0.1 -> 25.2
[notice] To update, run: pip install --upgrade pip
root@5d05c57b2ff1:/# python
Python 3.9.23 (main, Aug 12 2025, 23:06:01)
[GCC 14.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>>
>>> times = ["2018-03-10 09:00", "2038-01-31 12:23", None]
>>> timezones = ["CET", "UTC", "Europe/Ljubljana"]
>>> formats = ["%a", "%A", "%w", "%d", "%b", "%B", "%m", "%y", "%Y", "%H",
...            "%I", "%p", "%M", "%z", "%Z", "%j", "%U", "%W", "%%", "%G", "%V", "%u"]
>>>
>>> for timezone in timezones:
...     ts = pd.to_datetime(times).tz_localize(timezone)
...     for unit in ["s", "ms", "us", "ns"]:
...         tsa = pa.array(ts, type=pa.timestamp(unit, timezone))
...         for fmt in formats:
...             options = pc.StrftimeOptions(fmt)
...             result = pc.strftime(tsa, options=options)
...
Traceback (most recent call last):
  File "<stdin>", line 7, in <module>
  File "/usr/local/lib/python3.9/site-packages/pyarrow/compute.py", line 269, in wrapper
    return func.call(args, options, memory_pool)
  File "pyarrow/_compute.pyx", line 399, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Cannot locate timezone 'CET': CET not found in timezone database
>>>
```
I've found that around that date the upstream Python Docker image was updated
from `deb / debian/glibc / 2.36-9+deb12u10` to `deb / debian/glibc / 2.41-12`,
and I can see that some timezones do indeed seem to be missing (the listing
below has no `CET` entry and no `US` directory):
```
root@5d05c57b2ff1:/# ldd --version
ldd (Debian GLIBC 2.41-12) 2.41
Copyright (C) 2024 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.
root@5d05c57b2ff1:/# ls /usr/share/zoneinfo
Africa   Antarctica  Asia      Australia  Europe   GMT     Pacific  iso3166.tab        leapseconds  posixrules  zone.tab      zonenow.tab
America  Arctic      Atlantic  Etc        Factory  Indian  UTC      leap-seconds.list  localtime    tzdata.zi   zone1970.tab
```
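The same check can be scripted instead of reading the `ls` output by eye. The following is only a minimal sketch: `missing_zones` is a hypothetical helper, not part of pyarrow or archery, and it assumes that the tz database lives under `/usr/share/zoneinfo` unless `TZDIR` is set (as the ORC error message in the log further down suggests).

```python
import os
from pathlib import Path


def missing_zones(names, tzdir=None):
    """Return the zone names that have no file under the tz database directory."""
    # Assumption: TZDIR overrides the default location, mirroring the hint
    # in the ORC error message; this is not how Arrow itself resolves zones.
    root = Path(tzdir or os.environ.get("TZDIR", "/usr/share/zoneinfo"))
    # is_file() follows symlinks, so link-style zone entries count as present.
    return [name for name in names if not (root / name).is_file()]


if __name__ == "__main__":
    # Zone names taken from the failing tests in this report.
    print(missing_zones(["CET", "UTC", "Europe/Ljubljana", "US/Pacific"]))
```

Run inside the container, this should list exactly the names whose files are absent from the image, which makes it easy to compare the old and new base images.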
[Full log](https://github.com/ursacomputing/crossbow/actions/runs/17261674193/job/48984354574):
```python
________________________________ test_strftime _________________________________
    @pytest.mark.pandas
    @pytest.mark.timezone_data
    def test_strftime():
        times = ["2018-03-10 09:00", "2038-01-31 12:23", None]
        timezones = ["CET", "UTC", "Europe/Ljubljana"]
        formats = ["%a", "%A", "%w", "%d", "%b", "%B", "%m", "%y", "%Y", "%H", "%I",
                   "%p", "%M", "%z", "%Z", "%j", "%U", "%W", "%%", "%G", "%V", "%u"]
        if sys.platform != "win32":
            # Locale-dependent formats don't match on Windows
            formats.extend(["%c", "%x", "%X"])
        for timezone in timezones:
            ts = pd.to_datetime(times).tz_localize(timezone)
            for unit in ["s", "ms", "us", "ns"]:
                tsa = pa.array(ts, type=pa.timestamp(unit, timezone))
                for fmt in formats:
                    options = pc.StrftimeOptions(fmt)
>                   result = pc.strftime(tsa, options=options)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
usr/local/lib/python3.13/site-packages/pyarrow/tests/test_compute.py:2297:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
usr/local/lib/python3.13/site-packages/pyarrow/compute.py:269: in wrapper
    return func.call(args, options, memory_pool)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pyarrow/_compute.pyx:399: in pyarrow._compute.Function.call
    ???
pyarrow/error.pxi:155: in pyarrow.lib.pyarrow_internal_check_status
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>   ???
E   pyarrow.lib.ArrowInvalid: Cannot locate timezone 'CET': CET not found in timezone database
pyarrow/error.pxi:92: ArrowInvalid
________________ test_example_using_json[TestOrcFile.test1.orc] ________________
filename = 'TestOrcFile.test1.orc'
datadir = PosixPath('/usr/local/lib/python3.13/site-packages/pyarrow/tests/data/orc')
    @pytest.mark.pandas
    @pytest.mark.parametrize('filename', [
        'TestOrcFile.test1.orc',
        'TestOrcFile.testDate1900.orc',
        'decimal.orc'
    ])
    def test_example_using_json(filename, datadir):
        """
        Check a ORC file example against the equivalent JSON file, as given
        in the Apache ORC repository (the JSON file has one JSON object per
        line, corresponding to one row in the ORC file).
        """
        # Read JSON file
        path = datadir / filename
        table = pd.read_json(str(path.with_suffix('.jsn.gz')), lines=True)
>       check_example_file(path, table, need_fix=True)
usr/local/lib/python3.13/site-packages/pyarrow/tests/test_orc.py:145:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
usr/local/lib/python3.13/site-packages/pyarrow/tests/test_orc.py:101: in check_example_file
    table = orc_file.read()
            ^^^^^^^^^^^^^^^
usr/local/lib/python3.13/site-packages/pyarrow/orc.py:187: in read
    return self.reader.read(columns=columns)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pyarrow/_orc.pyx:374: in pyarrow._orc.ORCReader.read
    ???
pyarrow/error.pxi:155: in pyarrow.lib.pyarrow_internal_check_status
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>   ???
E   pyarrow.lib.ArrowException: Unknown error: Time zone file /usr/share/zoneinfo/US/Pacific does not exist. Please install IANA time zone database and set TZDIR env.
pyarrow/error.pxi:92: ArrowException
____________ test_example_using_json[TestOrcFile.testDate1900.orc] _____________
filename = 'TestOrcFile.testDate1900.orc'
datadir = PosixPath('/usr/local/lib/python3.13/site-packages/pyarrow/tests/data/orc')
    @pytest.mark.pandas
    @pytest.mark.parametrize('filename', [
        'TestOrcFile.test1.orc',
        'TestOrcFile.testDate1900.orc',
        'decimal.orc'
    ])
    def test_example_using_json(filename, datadir):
        """
        Check a ORC file example against the equivalent JSON file, as given
        in the Apache ORC repository (the JSON file has one JSON object per
        line, corresponding to one row in the ORC file).
        """
        # Read JSON file
        path = datadir / filename
        table = pd.read_json(str(path.with_suffix('.jsn.gz')), lines=True)
>       check_example_file(path, table, need_fix=True)
usr/local/lib/python3.13/site-packages/pyarrow/tests/test_orc.py:145:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
usr/local/lib/python3.13/site-packages/pyarrow/tests/test_orc.py:101: in check_example_file
    table = orc_file.read()
            ^^^^^^^^^^^^^^^
usr/local/lib/python3.13/site-packages/pyarrow/orc.py:187: in read
    return self.reader.read(columns=columns)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pyarrow/_orc.pyx:374: in pyarrow._orc.ORCReader.read
    ???
pyarrow/error.pxi:155: in pyarrow.lib.pyarrow_internal_check_status
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>   ???
E   pyarrow.lib.ArrowException: Unknown error: Time zone file /usr/share/zoneinfo/US/Pacific does not exist. Please install IANA time zone database and set TZDIR env.
```
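To triage a long CI log like this one, it can help to pull out just the zone names that the errors mention. This is a rough sketch, assuming the two message shapes quoted above are the only ones of interest; `zones_reported_missing` and the two pattern names are hypothetical and exist only for this illustration.

```python
import re

# Patterns modelled on the two error messages quoted in this report.
ARROW_PATTERN = re.compile(r"Cannot locate timezone '([^']+)'")
# \s+ tolerates the hard line wraps that appear in mailed or copied logs.
ORC_PATTERN = re.compile(r"Time zone file\s+(\S+)\s+does not exist")


def zones_reported_missing(log_text):
    """Collect the distinct zone names or zone file paths a log reports as missing."""
    found = ARROW_PATTERN.findall(log_text) + ORC_PATTERN.findall(log_text)
    return sorted(set(found))
```

Feeding the full job log through this function should give a short list of names to check against the image's `/usr/share/zoneinfo` contents.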
### Component(s)
Python, Continuous Integration