alamb opened a new issue, #21301:
URL: https://github.com/apache/datafusion/issues/21301

   # Introduction
   
   As semi-structured data processing grows in importance, I hear 
from more and more DataFusion users that they would like better support for 
JSON and [Parquet 
Variant](https://parquet.apache.org/docs/file-format/types/variantencoding/) 
functions (see this 
[blog](https://parquet.apache.org/blog/2026/02/27/variant-type-in-apache-parquet-for-semi-structured-data/)
 for more details).
   
   Today, the DataFusion function libraries for these two types live in 
[datafusion-contrib](https://github.com/datafusion-contrib) rather than in the 
[main Apache DataFusion repository](https://github.com/apache/datafusion):
   
     - 
[datafusion-variant](https://github.com/datafusion-contrib/datafusion-variant)
     - 
[datafusion-functions-json](https://github.com/datafusion-contrib/datafusion-functions-json)
   
   Keeping them outside the main repository has benefits, such as faster 
iteration time and not being tied to releases.
   
   However, it also means they:
   1. Are outside ASF governance and release processes
   2. Are less discoverable to users
   3. Are harder to integrate downstream (need to wait for the next major 
DataFusion release)
   4. Feel more experimental, even when they solve common problems
   5. (Maybe) have a harder time attracting outside maintenance investment
   
   # Proposal
   Bring these two crates into the main DataFusion repository, similar to what 
we did for datafusion-spark. The crates would be optional (not part of 
`datafusion` and not behind a feature flag).
   
   I think we would need buy-in from the maintainers/authors (largely 
Pydantic, @adriangb, @friendlymatthew, and others).
   
   We previously did this for Spark-compatible functions by bringing 
[datafusion-functions-spark](https://github.com/apache/datafusion/tree/main/datafusion/spark)
 into the core DataFusion repo because the functionality was widely useful and 
maintaining it in one place made contribution and coordination easier.
   
     There was also a recent discussion on the mailing list about using 
datafusion-json in the Python bindings, where this also came up: 
https://lists.apache.org/thread/f591qmhx97wsl7h5xfoh7sfhv2gh9t2k
   
     # Alternatives you've considered
   
     1. Keep these crates in `datafusion-contrib` indefinitely.
        This keeps the core repo smaller and preserves flexibility, but leaves 
the crates outside the main project’s release and governance process.
     2. Keep them in datafusion-contrib, but improve discoverability and 
documentation.
        This helps users find them, but does not address governance, release 
coordination, or long-term maintenance.
     3. Bring in only one library at a time, starting with the most mature or 
most widely used.
        This is likely the lowest-risk path if there is agreement in principle 
but uncertainty about scope.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


