andygrove opened a new issue, #4156:
URL: https://github.com/apache/datafusion-comet/issues/4156

   ### Background
   
   `parse_url` was added in #4152. The native implementation 
(`datafusion-spark`'s `spark_parse_url`) diverges from Spark on several edge 
cases, so `CometParseUrl` is marked `Incompatible` and falls back to Spark by 
default. Users opt into the native path with 
`spark.comet.expression.ParseUrl.allowIncompatible=true`.
   
   This issue tracks the work to bring the native implementation closer to 
Spark and reduce the surface that requires opt-in.
   
   ### Known divergences
   
   Tracked upstream at https://github.com/apache/datafusion/issues/21943.
   
   1. **Empty-string URL** returns NULL where Spark returns `""`. This affects 
every part key.
   2. **`FILE` on a URL without an explicit path** inserts a leading `/` 
(e.g. `/?foo=bar`) where Spark returns the bare query (`?foo=bar`).
   3. **`PATH` on a URL with a bare trailing slash** returns `""` where 
Spark returns `/`.
   4. **3-arg `parse_url(_, 'QUERY', key)`** form-decodes the value 
(`x%20y` → `x y`) where Spark preserves the raw percent-encoding.
   
   The 2-arg `parse_url(_, 'QUERY')` does match Spark; only the 3-arg form 
with a key diverges.
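
   To make divergences 2-4 concrete, here is a minimal Python sketch of the 
Spark-side behavior described in the list above. It is only an illustration 
built on the standard library's `urllib.parse`; it is not Spark's or 
`datafusion-spark`'s code. The helper names and the `example.com` URLs are 
hypothetical, and the empty-string case (divergence 1) is not modeled.

```python
import re
from urllib.parse import parse_qs, urlsplit


def path_part(url):
    """PATH: the path exactly as written, so a bare trailing slash stays "/"."""
    return urlsplit(url).path


def file_part(url):
    """FILE: path plus "?query", with no "/" inserted when the path is empty."""
    parts = urlsplit(url)
    return parts.path + ("?" + parts.query if parts.query else "")


def query_value_raw(url, key):
    """3-arg QUERY: the value for `key` with its percent-encoding left intact."""
    query = urlsplit(url).query
    match = re.search(r"(?:^|&)" + re.escape(key) + r"=([^&]*)", query)
    return match.group(1) if match else None


def query_value_decoded(url, key):
    """Form-decoded lookup: the behavior the list above reports for the native path."""
    values = parse_qs(urlsplit(url).query).get(key)
    return values[0] if values else None


# Hypothetical inputs chosen to match the shapes named in the list above.
print(path_part("http://example.com/"))                        # /
print(file_part("http://example.com?foo=bar"))                 # ?foo=bar
print(query_value_raw("http://example.com/p?q=x%20y", "q"))    # x%20y
print(query_value_decoded("http://example.com/p?q=x%20y", "q"))  # x y
```

   The last two calls show the difference in divergence 4: the same URL and 
key yield `x%20y` when the raw query text is searched and `x y` when the query 
is form-decoded first.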
   
   ### Suggested work plan
   
   1. Drive fixes upstream in DataFusion for divergences 1-4 above so we can 
pick them up via a `datafusion-spark` bump.
   2. Once divergences 1-4 are resolved, change `CometParseUrl.getSupportLevel` 
from `Incompatible` to `Compatible` and remove the `expect_fallback` queries 
in `spark/src/test/resources/sql-tests/expressions/url/parse_url.sql`.
   3. Expand the native test (`parse_url_native.sql`) to cover the URL shapes 
currently in the fallback file (e.g. trailing-slash PATH, percent-encoded query 
value) so we have positive coverage once each divergence is fixed.
   4. Consider extending `parse_url` test coverage to invalid URLs in 
non-ANSI mode (Spark returns NULL for everything).
   
   ### Related
   
   - PR #4152 (initial implementation, this issue is the follow-up)
   - Upstream tracking: https://github.com/apache/datafusion/issues/21943


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

