kevinjqliu commented on PR #2291:
URL: https://github.com/apache/iceberg-python/pull/2291#issuecomment-3184992641

   Summarizing my thoughts a bit. 
   
   For context, heres the way to parse url today when interacting with pyarrow 
fileio. 
   1. A [`location` string is passed to the file 
io](https://github.com/apache/iceberg-python/blob/640c592b705db01199020db8e5f2b6e2c67f06bf/pyiceberg/io/pyarrow.py#L609-L615).
 this is assumed to be an absolute path as required by the iceberg spec. 
however, "absolute" can mean different things depending on the file system
   2. the `location` string is then parsed by the [PyarrowFileIO's 
`parse_location` 
function](https://github.com/apache/iceberg-python/blob/640c592b705db01199020db8e5f2b6e2c67f06bf/pyiceberg/io/pyarrow.py#L393-L401).
 This functions returns the `scheme`, `netloc`, and `path` of the `location` 
string. `scheme`, and `netloc` is used to determine the FS implementation. 
[`path` is used by the FS 
implementation](https://github.com/apache/iceberg-python/blob/640c592b705db01199020db8e5f2b6e2c67f06bf/pyiceberg/io/pyarrow.py#L292)
 to access the file. 
   
   What would be helpful here is to figure out what is the expected format for 
the `location` string for HadoopFileSystem. 
   
   If it includes the hdfs host name and port, i.e. 
`hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000/user/tmccormi`,
 `parse_location` will return the `/user/tmccormi` as the path. 
   ```
   from urllib.parse import urlparse
   
urlparse("hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000/user/tmccormi")
   
   ParseResult(scheme='hdfs', 
netloc='ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000', 
path='/user/tmccormi', params='', query='', fragment='')
   ```
   
   If it does not include the hdfs host name and port, i.e. `/user/tmccormi`, 
`parse_location` will still return `/user/tmccormi` as the path. But the scheme 
is empty. 
   ```
   urlparse("/user/tmccormi")
   
   ParseResult(scheme='', netloc='', path='/user/tmccormi', params='', 
query='', fragment='')
   ```
   
   Right now there are 2 places in the codebase where we can make changes to 
the fileio behavior. 
   1. `parse_location`
   2. Overload the Filesystem implementation
   
   I would like to avoid adding more switch cases in `parse_location` for ease 
of maintenance. Perhaps as an alternative, we can let the specific FileSystem 
implementation handle how it would like to `parse_location`
   I would also like to keep the current way of configuring fileio with just 
properties, not properties and configs. 
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to