q-aaronzolnailucas opened a new issue, #44543: URL: https://github.com/apache/arrow/issues/44543
### Describe the enhancement requested

## Problem

When using pyarrow's HDFS implementation, a few environment variables are required ([documented](https://arrow.apache.org/docs/python/filesystems.html#hadoop-distributed-file-system-hdfs)). If `CLASSPATH` is set incorrectly, for example because of a faulty or partial Hadoop installation, the user can get a puzzling error message, especially if they are not a Java developer:

```
could not find method getRootCauseMessage from class (null) with signature (Ljava/lang/Throwable;)Ljava/lang/String;
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f239f385fb0, pid=33, tid=33
#
# JRE version: OpenJDK Runtime Environment (11.0.25+9) (build 11.0.25+9-post-Debian-1deb11u1)
# Java VM: OpenJDK 64-Bit Server VM (11.0.25+9-post-Debian-1deb11u1, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0x571fb0] AccessInternal::PostRuntimeDispatch<G1BarrierSet::AccessBarrier<1097844ul, G1BarrierSet>, (AccessInternal::BarrierType)2, 1097844ul>::oop_access_barrier(void*)+0x0
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# //hs_err_pid33.log
#
# If you would like to submit a bug report, please visit:
#   https://bugs.debian.org/openjdk-11
#
Aborted
```

I'd like this failure mode to be made clearer, especially because Hadoop distributions are large and it is common to try to slim them down, which can introduce exactly this kind of bug. The error messages are much clearer when `CLASSPATH` is not set at all, or when libhdfs/libjvm cannot be found (although even those cases would ideally capture the C error messages and attach them to the Python exceptions), and they can be handled with a Python `except` clause since they are raised as Python exceptions.

Furthermore, the following call evaluates to `True`, so it cannot be used to check the classpath:

```python
>>> import pyarrow
>>> pyarrow.lib.have_libhdfs()
True
```

So, all in all, it is really hard to fail fast or handle this error; a rough interim workaround is sketched at the end of this issue.

Here is a Docker image that reproduces the above (warning: it skips certificate checking):

```Dockerfile
FROM python:3.10-slim-bullseye

ARG HADOOP_VERSION="3.3.4"

RUN apt update && apt install -y openjdk-11-jdk wget
RUN python -m pip install fsspec[hdfs] universal-pathlib typing_extensions

RUN wget -nc --no-check-certificate \
    https://archive.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz

RUN mkdir -p /opt/hadoop \
    && tar \
        --strip-components 1 \
        --skip-old-files \
        -xzf /hadoop-$HADOOP_VERSION.tar.gz \
        -C /opt/hadoop/ \
    && rm /hadoop-$HADOOP_VERSION.tar.gz

ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ENV HADOOP_HOME=/opt/hadoop
ENV CLASSPATH="<WRONG_CLASSPATH>"

CMD ["python", "-c", "import pyarrow._hdfs; pyarrow._hdfs.HadoopFileSystem('')"]
```

## Proposed Solution

* Have this failure raise an `OSError` instead, to be consistent with other errors.
* (Possibly) let `pyarrow.lib.have_libhdfs` take an optional keyword argument `check_classpath=True` that fails fast when the required jars are not present on `CLASSPATH`. This may require changes to the C API too (a usage sketch follows at the end of this issue).

### Related issues

* A different env var error for HDFS: https://github.com/apache/arrow/issues/40305
* Regarding required classpath elements: https://github.com/apache/arrow/issues/30903

### Component(s)

Python
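To make the second proposal concrete, usage might look roughly like this. This is only a sketch of the *proposed* interface; `check_classpath` does not exist in pyarrow today:

```python
import pyarrow

# Proposed behaviour (not implemented): with check_classpath=True, the check
# would also verify that the required Hadoop jars are resolvable via CLASSPATH,
# instead of only verifying that libhdfs itself can be loaded.
if not pyarrow.lib.have_libhdfs(check_classpath=True):
    raise OSError("libhdfs is unusable: required Hadoop jars are missing from CLASSPATH")
```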
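In the meantime, callers can approximate a fail-fast check themselves. The sketch below is only a heuristic and is not part of pyarrow: the helper name and the choice of looking for a `hadoop-common` jar are assumptions on my part.

```python
import glob
import os

import pyarrow
from pyarrow import fs


def classpath_has_hadoop_jars() -> bool:
    """Heuristic check that CLASSPATH points at a usable Hadoop install.

    Expands wildcard entries (e.g. /opt/hadoop/share/hadoop/common/*) the way
    the JVM does and looks for a hadoop-common jar, whose absence is one
    typical cause of the SIGSEGV above.
    """
    jars = []
    for entry in os.environ.get("CLASSPATH", "").split(os.pathsep):
        if entry.endswith("*"):
            jars.extend(glob.glob(entry + ".jar"))
        elif entry:
            jars.append(entry)
    return any("hadoop-common" in os.path.basename(jar) for jar in jars)


if pyarrow.lib.have_libhdfs() and classpath_has_hadoop_jars():
    hdfs = fs.HadoopFileSystem("default")  # resolves the namenode from fs.defaultFS
else:
    raise OSError("libhdfs or the Hadoop jars are unavailable; not creating HadoopFileSystem")
```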