[
https://issues.apache.org/jira/browse/SPARK-51162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Max Gekk updated SPARK-51162:
-----------------------------
Summary: SPIP: Add the TIME data type (was: [WIP] SPIP: Add the TIME data
type)
> SPIP: Add the TIME data type
> ----------------------------
>
> Key: SPARK-51162
> URL: https://issues.apache.org/jira/browse/SPARK-51162
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Max Gekk
> Assignee: Max Gekk
> Priority: Major
> Labels: SPIP
>
> *Q1. What are you trying to do? Articulate your objectives using absolutely
> no jargon.*
> Add a new data type *TIME* to Spark SQL that represents a time of day with
> the fields hour, minute, and second, with precision up to microseconds. All
> operations over the type are performed without taking any time zone into
> account. The new data type should conform to the type *TIME\(n\) WITHOUT TIME ZONE* defined by the SQL standard.
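> As an illustration of the intended precision, a TIME value can be modeled as
> microseconds elapsed since midnight. The sketch below uses plain Python with
> hypothetical helper names; it is only an illustration of the semantics, not
> part of any Spark API.

```python
from datetime import time

MICROS_PER_SECOND = 1_000_000

def time_to_micros(t: time) -> int:
    """Encode a time of day (hour, minute, second, microsecond)
    as microseconds since midnight. Illustrative helper, not a Spark API."""
    seconds = t.hour * 3600 + t.minute * 60 + t.second
    return seconds * MICROS_PER_SECOND + t.microsecond

def micros_to_time(micros: int) -> time:
    """Decode microseconds since midnight back into a time of day."""
    seconds, microsecond = divmod(micros, MICROS_PER_SECOND)
    minutes, second = divmod(seconds, 60)
    hour, minute = divmod(minutes, 60)
    return time(hour, minute, second, microsecond)
```

> The two helpers round-trip any time of day without loss up to microsecond
> precision, which is the precision stated above.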
> *Q2. What problem is this proposal NOT designed to solve?*
> This proposal does not cover the TIME type with time zone defined by the SQL
> standard: *TIME\(n\) WITH TIME ZONE*.
> *Q3. How is it done today, and what are the limits of current practice?*
> The TIME type can be emulated via the TIMESTAMP_NTZ data type by setting the
> date part to some constant value such as 1970-01-01, 0001-01-01, or
> 0000-00-00 (though the last is outside the supported range of dates).
> Although the type can be emulated via TIMESTAMP_NTZ, Spark SQL cannot
> recognize it in data sources, and for instance cannot load TIME values
> from parquet files.
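> The emulation described above can be sketched in plain Python (the helper
> names are illustrative, not Spark APIs): the date part is pinned to a
> constant such as 1970-01-01 so that only the time-of-day fields carry
> information.

```python
from datetime import datetime, time

# Constant date part used for the emulation (one of the choices above).
EPOCH_DATE = (1970, 1, 1)

def emulate_time(t: time) -> datetime:
    """Wrap a time of day into a TIMESTAMP_NTZ-like datetime whose
    date part is pinned to 1970-01-01. Illustrative helper only."""
    return datetime(*EPOCH_DATE, t.hour, t.minute, t.second, t.microsecond)

def extract_time(ts: datetime) -> time:
    """Recover the original time of day from the emulated timestamp."""
    return ts.time()
```

> The round trip preserves the time fields, but as noted above, a data source
> reading such values sees only a timestamp, not a TIME value.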
> *Q4. What is new in your approach and why do you think it will be successful?*
> The approach is not new, and we have a clear picture of how to split the work
> into sub-tasks based on our experience of adding the new types ANSI intervals
> and TIMESTAMP_NTZ.
> *Q5. Who cares? If you are successful, what difference will it make?*
> The new type simplifies migrations to Spark SQL from other DBMSs such as
> PostgreSQL, Snowflake, Google SQL, Amazon Redshift, Teradata, and DB2. Such
> users don't have to rewrite their SQL code to emulate the TIME type. The new
> functionality also benefits existing Spark SQL users who need to load data
> with TIME values that were stored by other systems.
> *Q6. What are the risks?*
> Additional handling of the new type in operators, expressions, and data
> sources can cause performance regressions. This risk can be mitigated by
> developing time benchmarks in parallel with supporting the new type in
> different places in Spark SQL.
>
> *Q7. How long will it take?*
> In total it might take around *9 months*. The estimate is based on similar
> tasks: ANSI intervals
> ([SPARK-27790|https://issues.apache.org/jira/browse/SPARK-27790]) and
> TIMESTAMP_NTZ
> ([SPARK-35662|https://issues.apache.org/jira/browse/SPARK-35662]). We can
> split the work into the following functional blocks:
> # Base functionality - *3 weeks*
> Add new type TimeType, forming/parsing time literals, type constructor, and
> external types.
> # Persistence - *3.5 months*
> The ability to create tables of the TIME type, read/write from/to Parquet and
> other built-in data sources, partitioning, stats, and predicate push down.
> # Time operators - *2 months*
> Arithmetic ops, field extract, sorting, and aggregations.
> # Clients support - *1 month*
> JDBC, Hive, Thrift server, Spark Connect
> # PySpark integration - *1 month*
> DataFrame support, pandas API, Python UDFs, Arrow column vectors
> # Docs + testing/benchmarking - *1 month*
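> For the "Time operators" block, one plausible semantics for arithmetic on a
> time-zone-free time of day is wrap-around at midnight. The sketch below is a
> hedged illustration in plain Python (illustrative names, not a committed
> design or a Spark API):

```python
from datetime import time

MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000

def add_micros(t: time, delta_micros: int) -> time:
    """Add an interval (in microseconds, possibly negative) to a time of
    day, wrapping around midnight. Illustrative semantics only."""
    total = (t.hour * 3600 + t.minute * 60 + t.second) * 1_000_000 + t.microsecond
    # Python's % always yields a non-negative result, so negative
    # intervals wrap backwards past midnight correctly.
    total = (total + delta_micros) % MICROS_PER_DAY
    seconds, microsecond = divmod(total, 1_000_000)
    minutes, second = divmod(seconds, 60)
    hour, minute = divmod(minutes, 60)
    return time(hour, minute, second, microsecond)
```

> For example, adding one hour to 23:30 yields 00:30 under this semantics;
> whether Spark adopts wrap-around or raises an overflow error is a design
> decision for the sub-tasks above.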
> *Q8. What are the mid-term and final “exams” to check for success?*
> The mid-term is in 4 months: basic functionality, reading/writing the new
> type from/to built-in data sources, and basic time operations such as
> arithmetic ops and casting. The final "exam" is to support the same
> functionality as the other date-time types: TIMESTAMP_NTZ, DATE, TIMESTAMP.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)