andygrove opened a new issue, #4156: URL: https://github.com/apache/datafusion-comet/issues/4156
### Background

`parse_url` was added in #4152. The native implementation (`datafusion-spark`'s `spark_parse_url`) diverges from Spark on several edge cases, so `CometParseUrl` is marked `Incompatible` and falls back to Spark by default. Users opt into the native path with `spark.comet.expression.ParseUrl.allowIncompatible=true`.

This issue tracks the work to bring the native implementation closer to Spark and reduce the surface that requires opt-in.

### Known divergences

Tracked upstream at https://github.com/apache/datafusion/issues/21943.

1. **Empty-string URL** returns NULL where Spark returns `""`. Hits every part key.
2. **`FILE` on a URL without an explicit path** inserts a leading `/` (e.g. `/?foo=bar`) where Spark returns the bare query (`?foo=bar`).
3. **`PATH` on a URL with a bare trailing slash** returns `""` where Spark returns `/`.
4. **3-arg `parse_url(_, 'QUERY', key)`** form-decodes the value (`x%20y` → `x y`) where Spark preserves the raw percent-encoding. The 2-arg `parse_url(_, 'QUERY')` does match Spark; only the 3-arg form with a key diverges.

### Suggested work plan

1. Drive fixes upstream in DataFusion for #1-#4 above so we can pick them up via a `datafusion-spark` bump.
2. Once #1-#4 are resolved, demote `CometParseUrl.getSupportLevel` from `Incompatible` to `Compatible` and remove the `expect_fallback` queries in `spark/src/test/resources/sql-tests/expressions/url/parse_url.sql`.
3. Expand the native test (`parse_url_native.sql`) to cover the URL shapes currently in the fallback file (e.g. trailing-slash PATH, percent-encoded query value) so we have positive coverage once each divergence is fixed.
4. Consider extending `parse_url` test coverage to invalid URLs in non-ANSI mode (Spark returns NULL for everything).
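
For readers unfamiliar with the distinction behind divergences 2 and 4 above, the following is a minimal Python sketch of the two behaviors. It is illustrative only and is neither Comet nor Spark code: the helper names (`query_value_raw`, `query_value_form_decoded`, `file_part`), the `force_leading_slash` flag, and the `example.com` URLs are all hypothetical, and Python's `urllib.parse` merely stands in for the real URL parsers.

```python
import re
from urllib.parse import unquote_plus, urlsplit


def query_value_raw(url, key):
    """Return the raw (still percent-encoded) value of `key`, or None."""
    query = urlsplit(url).query
    # Match `key=value` at the start of the query or right after an `&`.
    match = re.search(r"(?:^|&)" + re.escape(key) + r"=([^&]*)", query)
    return match.group(1) if match else None


def query_value_form_decoded(url, key):
    """Return the value of `key` after form-decoding, or None."""
    raw = query_value_raw(url, key)
    return None if raw is None else unquote_plus(raw)


def file_part(url, force_leading_slash=False):
    """Join path and query; optionally substitute "/" for an empty path.

    `force_leading_slash` is a hypothetical switch used only to contrast
    the two outputs described in divergence 2.
    """
    parts = urlsplit(url)
    path = parts.path or ("/" if force_leading_slash else "")
    return path + ("?" + parts.query if parts.query else "")
```

With these helpers, `query_value_raw("http://example.com/a?q=x%20y", "q")` yields `x%20y` (the raw form the issue attributes to Spark), while `query_value_form_decoded` on the same input yields `x y` (the form-decoded result described for the native path). Likewise, `file_part("http://example.com?foo=bar")` yields `?foo=bar`, and passing `force_leading_slash=True` yields `/?foo=bar`.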
### Related

- PR #4152 (initial implementation, this issue is the follow-up)
- Upstream tracking: https://github.com/apache/datafusion/issues/21943
