andygrove commented on issue #21943:
URL: https://github.com/apache/datafusion/issues/21943#issuecomment-4348656415

   While auditing Comet's `parse_url` integration in 
https://github.com/apache/datafusion-comet/pull/4152 we surfaced two more 
divergences from Spark beyond the ones in the original issue body:
   
   ### 3. PATH on a URL with a bare trailing slash
   
   ```text
   parse_url('http://example.com/', 'PATH')
   ```
   
   - Spark (`URI.getRawPath`) returns `"/"`.
   - DataFusion (`spark_parse_url` in `datafusion-spark`) explicitly maps `path == "/"` to `""`. See `parse_url.rs::ParseUrl::parse`; the unit test `test_parse_path_root_is_empty_string` asserts this behavior.
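
   The Spark side of this case can be checked with the JDK alone. This is a minimal sketch, not Spark's code: `RawPathProbe` and `rawPath` are hypothetical names, and it uses `URI.create` for brevity rather than whatever constructor Spark calls. Only `java.net.URI` and `getRawPath` are real APIs here.

   ```java
   import java.net.URI;

   // Hypothetical helper for illustration; only java.net.URI is a real API.
   final class RawPathProbe {
       // Returns the undecoded path component, as URI.getRawPath reports it.
       static String rawPath(String url) {
           return URI.create(url).getRawPath();
       }

       public static void main(String[] args) {
           // Bare trailing slash: the path is "/", not the empty string.
           System.out.println("[" + rawPath("http://example.com/") + "]"); // prints [/]
           // No path at all: the path is the empty string.
           System.out.println("[" + rawPath("http://example.com") + "]");  // prints []
       }
   }
   ```

   The two inputs show that `URI` distinguishes a root path from an absent one, which is the distinction lost when `"/"` is mapped to `""`.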
   
   ### 4. 3-arg `parse_url(_, 'QUERY', key)` decodes percent-encoded values
   
   ```text
   parse_url('https://use%20r:pas%[email protected]/x?query=x%20y', 'QUERY', 
'query')
   ```
   
   - Spark returns the raw value `x%20y` (it applies a regex to `getRawQuery`).
   - DataFusion calls `Url::query_pairs()`, which form-decodes values, and returns `x y`.
   
   This is a subtler divergence than the others because the 2-arg `parse_url(_, 
'QUERY')` does match Spark (both return the raw query string), but the 3-arg 
form with a key diverges.
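
   The difference between the two lookups can be sketched in plain Java. This is an illustration under assumptions, not either project's implementation: `QueryKeyProbe`, `rawValue`, and `decodedValue` are hypothetical names, the regex shape is modeled on the "regex over `getRawQuery`" description above, `Pattern.quote` is my own addition, and the URL is a simplified one without userinfo.

   ```java
   import java.net.URI;
   import java.net.URLDecoder;
   import java.nio.charset.StandardCharsets;
   import java.util.regex.Matcher;
   import java.util.regex.Pattern;

   // Hypothetical helper for illustration; not Spark's or DataFusion's code.
   final class QueryKeyProbe {
       // Raw lookup: match key=value in the undecoded query string.
       static String rawValue(String url, String key) {
           String rawQuery = URI.create(url).getRawQuery();
           if (rawQuery == null) {
               return null;
           }
           Pattern p = Pattern.compile("(&|^)" + Pattern.quote(key) + "=([^&]*)");
           Matcher m = p.matcher(rawQuery);
           return m.find() ? m.group(2) : null;
       }

       // Decoded lookup: same match, then form-decode the value
       // (percent-escapes are decoded and '+' becomes a space).
       static String decodedValue(String url, String key) {
           String raw = rawValue(url, key);
           return raw == null ? null : URLDecoder.decode(raw, StandardCharsets.UTF_8);
       }

       public static void main(String[] args) {
           String url = "http://example.com/x?query=x%20y&q2=2";
           System.out.println(rawValue(url, "query"));     // prints x%20y
           System.out.println(decodedValue(url, "query")); // prints x y
       }
   }
   ```

   Under these assumptions, `rawValue` corresponds to the Spark result and `decodedValue` to the form-decoded result described for DataFusion.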
   
   For reference, Spark's own `UrlFunctionsSuite` exercises both behaviors in a 
single row 
(`https://use%20r:pas%[email protected]/dir%20/pa%20th.HTML?query=x%20y&q2=2#Ref%20two`
 expects `x%20y` as the QUERY-with-key answer).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

