kevinjqliu commented on PR #2291: URL: https://github.com/apache/iceberg-python/pull/2291#issuecomment-3184992641
Summarizing my thoughts a bit. For context, heres the way to parse url today when interacting with pyarrow fileio. 1. A [`location` string is passed to the file io](https://github.com/apache/iceberg-python/blob/640c592b705db01199020db8e5f2b6e2c67f06bf/pyiceberg/io/pyarrow.py#L609-L615). this is assumed to be an absolute path as required by the iceberg spec. however, "absolute" can mean different things depending on the file system 2. the `location` string is then parsed by the [PyarrowFileIO's `parse_location` function](https://github.com/apache/iceberg-python/blob/640c592b705db01199020db8e5f2b6e2c67f06bf/pyiceberg/io/pyarrow.py#L393-L401). This functions returns the `scheme`, `netloc`, and `path` of the `location` string. `scheme`, and `netloc` is used to determine the FS implementation. [`path` is used by the FS implementation](https://github.com/apache/iceberg-python/blob/640c592b705db01199020db8e5f2b6e2c67f06bf/pyiceberg/io/pyarrow.py#L292) to access the file. What would be helpful here is to figure out what is the expected format for the `location` string for HadoopFileSystem. If it includes the hdfs host name and port, i.e. `hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000/user/tmccormi`, `parse_location` will return the `/user/tmccormi` as the path. ``` from urllib.parse import urlparse urlparse("hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000/user/tmccormi") ParseResult(scheme='hdfs', netloc='ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000', path='/user/tmccormi', params='', query='', fragment='') ``` If it does not include the hdfs host name and port, i.e. `/user/tmccormi`, `parse_location` will still return `/user/tmccormi` as the path. But the scheme is empty. ``` urlparse("/user/tmccormi") ParseResult(scheme='', netloc='', path='/user/tmccormi', params='', query='', fragment='') ``` Right now there are 2 places in the codebase where we can make changes to the fileio behavior. 1. `parse_location` 2. Overload the Filesystem implementation I would like to avoid adding more switch cases in `parse_location` for ease of maintenance. Perhaps as an alternative, we can let the specific FileSystem implementation handle how it would like to `parse_location` I would also like to keep the current way of configuring fileio with just properties, not properties and configs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
