alamb opened a new issue, #21301: URL: https://github.com/apache/datafusion/issues/21301
# Introduction

With the rise in importance of semi-structured data processing, I hear from more and more DataFusion users that they would like better support for JSON and [Parquet Variant](https://parquet.apache.org/docs/file-format/types/variantencoding/) functions (see the [blog](https://parquet.apache.org/blog/2026/02/27/variant-type-in-apache-parquet-for-semi-structured-data/) for more details).

Today, the DataFusion function libraries for these two types live in [datafusion-contrib](https://github.com/datafusion-contrib) rather than in the [main Apache DataFusion repository](https://github.com/apache/datafusion):

- [datafusion-variant](https://github.com/datafusion-contrib/datafusion-variant)
- [datafusion-functions-json](https://github.com/datafusion-contrib/datafusion-functions-json)

Keeping them outside the main repository has benefits, such as faster iteration and not being tied to releases. However, it also means they are:

1. Outside ASF governance and release processes
2. Less discoverable to users
3. Harder to integrate downstream (need to wait for the next major DataFusion release)
4. Perceived as more experimental, even when they solve common problems
5. (Maybe) less likely to attract outside maintenance investment

# Proposal

Bring these two crates into the main DataFusion repository, similar to what we have done for datafusion-spark. The crates would be optional (not part of `datafusion` or a feature flag).

I think we would need buy-in from the maintainers/authors (largely Pydantic, @adriangb, @friendlymatthew, and others).

We previously did this for Spark-compatible functions by bringing [datafusion-functions-spark](https://github.com/apache/datafusion/tree/main/datafusion/spark) into the core DataFusion repo, because the functionality was widely useful and maintaining it in one place made contribution and coordination easier.

There was also a recent discussion on the mailing list about using datafusion-json in the Python bindings, where this also came up: https://lists.apache.org/thread/f591qmhx97wsl7h5xfoh7sfhv2gh9t2k

# Alternatives you've considered

1. Keep these crates in `datafusion-contrib` indefinitely. This keeps the core repo smaller and preserves flexibility, but leaves the crates outside the main project’s release and governance process.
2. Keep them in `datafusion-contrib`, but improve discoverability and documentation. This helps users find them, but does not address governance, release coordination, or long-term maintenance.
3. Bring in only one library at a time, starting with the most mature or most widely used. This is likely the lowest-risk path if there is agreement in principle but uncertainty about scope.
