andygrove commented on issue #21943: URL: https://github.com/apache/datafusion/issues/21943#issuecomment-4348656415
While auditing Comet's `parse_url` integration in https://github.com/apache/datafusion-comet/pull/4152 we surfaced two more divergences from Spark beyond the ones in the original issue body:

### 3. PATH on a URL with a bare trailing slash

```text
parse_url('http://example.com/', 'PATH')
```

- Spark (`URI.getRawPath`) returns `"/"`.
- DataFusion (`spark_parse_url` in `datafusion-spark`) explicitly maps `path == "/"` to `""` (see `parse_url.rs::ParseUrl::parse`, the unit test `test_parse_path_root_is_empty_string` asserts this behavior).

### 4. 3-arg `parse_url(_, 'QUERY', key)` decodes percent-encoded values

```text
parse_url('https://use%20r:pas%[email protected]/x?query=x%20y', 'QUERY', 'query')
```

- Spark returns the raw value `x%20y` (uses regex over `getRawQuery`).
- DataFusion calls `Url::query_pairs()` which form-decodes values, returning `x y`.

This is a subtler divergence than the others because the 2-arg `parse_url(_, 'QUERY')` does match Spark (both return the raw query string), but the 3-arg form with a key diverges.

For reference, Spark's own `UrlFunctionsSuite` exercises both behaviors in a single row (`https://use%20r:pas%[email protected]/dir%20/pa%20th.HTML?query=x%20y&q2=2#Ref%20two` expects `x%20y` as the QUERY-with-key answer).
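To make the two behaviors concrete, here is a minimal Python sketch of the distinction. It is not Spark or DataFusion code: the helper names (`raw_path`, `raw_query_value`, `decoded_query_value`) are hypothetical, the regex only approximates the "regex over the raw query" approach described above, and the sample URL omits the userinfo part for simplicity.

```python
import re
from urllib.parse import parse_qs, urlsplit


def raw_path(url: str) -> str:
    """Return the path exactly as written in the URL (no normalization)."""
    return urlsplit(url).path


def raw_query_value(url: str, key: str):
    """Look up `key` in the raw query string without percent-decoding.

    Assumption: a pattern of the form (^|&)key=([^&]*) approximates the
    regex-based lookup attributed to Spark in this comment.
    """
    query = urlsplit(url).query  # still percent-encoded
    match = re.search(r"(?:^|&)" + re.escape(key) + r"=([^&]*)", query)
    return match.group(1) if match else None


def decoded_query_value(url: str, key: str):
    """Look up `key` after form-decoding, analogous to a query_pairs-style API."""
    values = parse_qs(urlsplit(url).query).get(key)
    return values[0] if values else None


url = "https://example.com/x?query=x%20y&q2=2"

print(repr(raw_path("http://example.com/")))     # root path is kept as-is
print(repr(raw_query_value(url, "query")))       # raw, percent-encoded value
print(repr(decoded_query_value(url, "query")))   # form-decoded value
```

Under these assumptions the raw lookup yields `x%20y` while the decoded lookup yields `x y`, which mirrors the divergence in item 4; the root path stays `/`, which mirrors the Spark side of item 3.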
