q-aaronzolnailucas opened a new issue, #44543: URL: https://github.com/apache/arrow/issues/44543
### Describe the enhancement requested

## Problem

When using pyarrow's HDFS implementation, a few environment variables are required ([documented](https://arrow.apache.org/docs/python/filesystems.html#hadoop-distributed-file-system-hdfs)). If `CLASSPATH` is set incorrectly, for example because of a faulty or partial Hadoop installation, the user can get a puzzling error message, especially if they are not a Java developer:

```
could not find method getRootCauseMessage from class (null) with signature (Ljava/lang/Throwable;)Ljava/lang/String;
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f239f385fb0, pid=33, tid=33
#
# JRE version: OpenJDK Runtime Environment (11.0.25+9) (build 11.0.25+9-post-Debian-1deb11u1)
# Java VM: OpenJDK 64-Bit Server VM (11.0.25+9-post-Debian-1deb11u1, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0x571fb0] AccessInternal::PostRuntimeDispatch<G1BarrierSet::AccessBarrier<1097844ul, G1BarrierSet>, (AccessInternal::BarrierType)2, 1097844ul>::oop_access_barrier(void*)+0x0
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# //hs_err_pid33.log
#
# If you would like to submit a bug report, please visit:
#   https://bugs.debian.org/openjdk-11
#
Aborted
```

I'd like this failure mode to be made clearer, especially because Hadoop distributions are large and it is common to try to slim them down, which can introduce exactly this kind of bug. The error messages are much clearer when `CLASSPATH` is not set at all, or when libhdfs/libjvm cannot be found (although even those cases would ideally capture the C error messages and attach them to the Python exceptions), and they can be handled with a Python `except` clause since they are raised as Python exceptions.

Furthermore, the following call evaluates to `True`, so it cannot be used to check the classpath:

```python
>>> import pyarrow
>>> pyarrow.lib.have_libhdfs()
True
```

So, all in all, it is really hard to fail fast or handle this error; a rough interim workaround is sketched at the end of this issue.

Here is a Docker image that reproduces the above (warning: it skips certificate checking):

```Dockerfile
FROM python:3.10-slim-bullseye

ARG HADOOP_VERSION="3.3.4"

RUN apt update && apt install -y openjdk-11-jdk wget
RUN python -m pip install fsspec[hdfs] universal-pathlib typing_extensions

RUN wget -nc --no-check-certificate \
    https://archive.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz

RUN mkdir -p /opt/hadoop \
    && tar \
        --strip-components 1 \
        --skip-old-files \
        -xzf /hadoop-$HADOOP_VERSION.tar.gz \
        -C /opt/hadoop/ \
    && rm /hadoop-$HADOOP_VERSION.tar.gz

ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ENV HADOOP_HOME=/opt/hadoop
ENV CLASSPATH="<WRONG_CLASSPATH>"

CMD ["python", "-c", "import pyarrow._hdfs; pyarrow._hdfs.HadoopFileSystem('')"]
```

## Proposed Solution

* Have this failure raise an `OSError` instead, to be consistent with other errors.
* (Possibly) let `pyarrow.lib.have_libhdfs` take an optional keyword argument `check_classpath=True` that fails fast when the required jars are not present on `CLASSPATH`. This may require changes to the C API too (a usage sketch follows at the end of this issue).

### Related issues

* A different env var error for HDFS: https://github.com/apache/arrow/issues/40305
* Regarding required classpath elements: https://github.com/apache/arrow/issues/30903

### Component(s)

Python
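To make the second proposal concrete, usage might look roughly like this. This is only a sketch of the *proposed* interface; `check_classpath` does not exist in pyarrow today:

```python
import pyarrow

# Proposed behaviour (not implemented): with check_classpath=True, the check
# would also verify that the required Hadoop jars are resolvable via CLASSPATH,
# instead of only verifying that libhdfs itself can be loaded.
if not pyarrow.lib.have_libhdfs(check_classpath=True):
    raise OSError("libhdfs is unusable: required Hadoop jars are missing from CLASSPATH")
```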
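In the meantime, callers can approximate a fail-fast check themselves. The sketch below is only a heuristic and is not part of pyarrow: the helper name and the choice of looking for a `hadoop-common` jar are assumptions on my part.

```python
import glob
import os

import pyarrow
from pyarrow import fs


def classpath_has_hadoop_jars() -> bool:
    """Heuristic check that CLASSPATH points at a usable Hadoop install.

    Expands wildcard entries (e.g. /opt/hadoop/share/hadoop/common/*) the way
    the JVM does and looks for a hadoop-common jar, whose absence is one
    typical cause of the SIGSEGV above.
    """
    jars = []
    for entry in os.environ.get("CLASSPATH", "").split(os.pathsep):
        if entry.endswith("*"):
            jars.extend(glob.glob(entry + ".jar"))
        elif entry:
            jars.append(entry)
    return any("hadoop-common" in os.path.basename(jar) for jar in jars)


if pyarrow.lib.have_libhdfs() and classpath_has_hadoop_jars():
    hdfs = fs.HadoopFileSystem("default")  # resolves the namenode from fs.defaultFS
else:
    raise OSError("libhdfs or the Hadoop jars are unavailable; not creating HadoopFileSystem")
```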