[ https://issues.apache.org/jira/browse/SPARK-53504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Milicevic updated SPARK-53504:
------------------------------------
    Attachment: TYPES_FRAMEWORK_DESIGN_V3.md

> Types framework
> ---------------
>
>                 Key: SPARK-53504
>                 URL: https://issues.apache.org/jira/browse/SPARK-53504
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.2.0
>            Reporter: Max Gekk
>            Assignee: Max Gekk
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: TYPES_FRAMEWORK_DESIGN_V3.md
>
>
> *Summary:*
> Introduce a Types Framework to centralize scattered type-specific pattern 
> matching
> *Description:*
> Adding a new data type to Spark currently requires modifying 50+ files with 
> scattered type-specific logic that uses diverse patterns (for the TIME type 
> alone: '_: TimeType', 'TimeNanoVector', '.hasTime()', 'LocalTimeEncoder', 
> 'instanceof TimeType', etc.). There is no compiler assistance to ensure 
> completeness, and integration points are easy to miss.
> This effort introduces a *Types Framework* that centralizes type-specific 
> infrastructure operations in Ops interface classes. Instead of modifying 
> dozens of files, a developer implementing a new type creates two Ops classes 
> and registers them in factory objects. The compiler enforces interface 
> completeness.
> *Concrete example - TimeType:*
> TimeType (the proof-of-concept for this framework) has integration points 
> spread across 50+ files for infrastructure concerns like physical type 
> mapping, literals, type converters, encoders, formatters, Arrow SerDe, proto 
> conversion, JDBC, Python, Thrift, and storage formats. With the framework, 
> these are consolidated into two Ops classes (~240 lines total). A developer 
> adding a new type with similar complexity would create two analogous files 
> instead of touching 50+ files.
> *Architecture:*
> The framework defines two mandatory Ops traits and optional extension traits:
> {code:java}
> Mandatory (every type must implement):
> +-- TypeOps (sql/catalyst) - Physical type, literals, external type conversion
> +-- TypeApiOps (sql/api) - String formatting, row encoding 
> Optional (added per type as needed, in later phases):
> +-- ProtoTypeOps - Spark Connect proto serialization
> +-- ClientTypeOps - JDBC, Arrow, Python, Thrift integration
> +-- StorageTypeOps - Storage formats (Parquet, ORC, etc.){code}
> {{TypeOps.apply(dt)}} returns {{Option[TypeOps]}}, serving as both lookup 
> and existence check. Integration points use {{getOrElse}} to fall through to 
> legacy handling:
> {code:java}
> def someOperation(dt: DataType) =
>   TypeOps(dt).map(_.someMethod()).getOrElse {
>     dt match {
>       case DateType => ... // legacy types unchanged
>     }
>   }{code}
> TimeType serves as the proof-of-concept implementation. Once the framework is 
> validated, additional types can be migrated incrementally.
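> For readers skimming the issue, the create-and-register flow described above 
> can be sketched as follows. This is only an illustrative, self-contained 
> sketch, not the actual framework API: the method names (physicalTypeName, 
> literalSql) and the stubbed types are hypothetical stand-ins for the real 
> Spark internals; see the attached design doc for the true interfaces.
> {code:java}
> // Self-contained sketch; real Spark types (DataType, literals, etc.) are
> // stubbed so the shape of the framework is visible in isolation.
> sealed trait DataType
> case object DateType extends DataType
> case object MyType extends DataType          // hypothetical new type
>
> // Hypothetical minimal shape of the mandatory catalyst-side trait.
> trait TypeOps {
>   def physicalTypeName: String               // physical type mapping
>   def literalSql(value: Any): String         // literal creation
> }
>
> // The developer adding MyType writes one implementation...
> class MyTypeOps extends TypeOps {
>   override def physicalTypeName: String = "Long"
>   override def literalSql(value: Any): String = s"MY_TYPE'$value'"
> }
>
> // ...and registers it in the factory object. TypeOps(dt) returns
> // Option[TypeOps], acting as both lookup and existence check.
> object TypeOps {
>   def apply(dt: DataType): Option[TypeOps] = dt match {
>     case MyType => Some(new MyTypeOps)
>     case _      => None   // legacy types fall through to old code paths
>   }
> }
>
> // An integration point falls back to legacy handling when no Ops exists:
> def literalSql(dt: DataType, value: Any): String =
>   TypeOps(dt).map(_.literalSql(value)).getOrElse {
>     dt match {
>       case DateType => s"DATE'$value'"       // legacy types unchanged
>       case other    => sys.error(s"unsupported type: $other")
>     }
>   }{code}
> The compiler enforces the trait's completeness: a new Ops class that omits a 
> required method simply does not compile, which replaces the current "easy to 
> miss an integration point" failure mode.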
> *Scope:*
>  * In scope: Physical type representation, literal creation, type conversion, 
> string formatting, row encoding, proto serialization, client integration 
> (JDBC, Arrow, Python, Thrift), storage formats, testing infrastructure, 
> documentation
>  * Out of scope: Type-specific expressions and arithmetic, SQL parser 
> changes. The framework provides primitives that expressions use, not the 
> expressions themselves.
> *Note:*
> [2026/02/09] For now, only the basic set of sub-tasks has been created. It 
> is an estimate of what needs to be covered and will most likely be updated 
> as development of the feature proceeds. The scope of the changes might also 
> be extended along the way, for example if a missed topic surfaces or once a 
> proper way to abstract the expressions is figured out.
> *Design doc:*
> See the attached design document for full details.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
