jorisvandenbossche opened a new issue, #41691:
URL: https://github.com/apache/arrow/issues/41691

   In several places in the Arrow specification and documentation we use the 
term "logical types", although we don't use it consistently and we don't 
actually have physical types (only physical layouts) to contrast it with.
   
   ### Current usage
   
   The Columnar Format doc page has a section called "Logical Types" 
(https://github.com/apache/arrow/pull/41685) to contrast those types with the 
physical layouts:
   
   > The 
[Schema.fbs](https://github.com/apache/arrow/blob/main/format/Schema.fbs) 
defines built-in logical types supported by the Arrow columnar format. Each 
logical type uses one of the above physical layouts. Nested logical types may 
have different physical layouts depending on the particular realization of the 
type.
   
   It explains an Array as having a logical data type, where _"Each logical 
data type has a well-defined physical layout."_
   
   The authoritative Schema.fbs also uses the term:
   
   
https://github.com/apache/arrow/blob/07a30d9a5784852187d100660325b8c12b4ff6c8/format/Schema.fbs#L18
   
   although it also uses the term in a "correct" way (but one that is 
incorrect under our current definition of the term):
   
   
https://github.com/apache/arrow/blob/07a30d9a5784852187d100660325b8c12b4ff6c8/format/Schema.fbs#L101-L105
   
   The Python docs 
(https://arrow.apache.org/docs/15.0/python/data.html#type-metadata):
   
   > We use the name **logical type** because the **physical** storage may be 
the same for one or more types. For example, ``int64``, ``float64``, and 
``timestamp[ms]`` all occupy 64 bits per value.
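
   To make the quoted point concrete, here is a minimal sketch using only the 
Python standard library (not pyarrow). It assumes a little-endian 64-bit slot 
and an arbitrarily chosen value; the variable names are illustrative only. The 
same 8 bytes are read back as three different types:

```python
import struct
from datetime import datetime, timezone

# One 64-bit slot, as it might sit in a fixed-size primitive layout.
# The value is arbitrary and chosen only for illustration.
raw = struct.pack("<q", 1_700_000_000_000)

# Same bytes, three interpretations.
as_int64 = struct.unpack("<q", raw)[0]
as_float64 = struct.unpack("<d", raw)[0]
as_timestamp_ms = datetime.fromtimestamp(as_int64 / 1000, tz=timezone.utc)

print(len(raw) * 8)                 # width of the slot in bits
print(as_int64)                     # read as int64
print(as_float64)                   # read as float64 (same bits, unrelated value)
print(as_timestamp_ms.isoformat())  # read as timestamp[ms] since the Unix epoch
```

   The storage is identical in all three cases; only the type attached to it 
changes how the bits are read.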
   
   The term is also used in various implementations.
   
   In the Terminology section of the Columnar Format docs 
(https://arrow.apache.org/docs/15.0/format/Columnar.html#terminology), we 
define it as:
   
   > **Logical type**: An application-facing semantic value type that is 
implemented using some physical layout. For example, Decimal values are stored 
as 16 bytes in a fixed-size binary layout. Similarly, strings can be stored as 
``List<1-byte>``. A timestamp may be stored as 64-bit fixed-size layout.
   
   which is mostly consistent with our current usage ("using some physical 
layout"), but it is confusing that it explains strings as ``List<1-byte>``, 
since we use a different physical layout for strings.
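
   For reference, a rough sketch of the layout strings actually use 
(variable-size binary: an offsets buffer plus a contiguous data buffer), 
written in plain Python. The helper names `to_variable_size_binary` and 
`get_string` are hypothetical and not part of any Arrow library, and the 
validity bitmap is omitted for brevity:

```python
import struct

def to_variable_size_binary(values):
    """Encode strings as (offsets buffer, data buffer); validity bitmap omitted."""
    offsets = [0]
    data = bytearray()
    for value in values:
        data += value.encode("utf-8")
        offsets.append(len(data))  # end of this value, start of the next
    # Little-endian int32 offsets: one more entry than there are values.
    offsets_buffer = struct.pack(f"<{len(offsets)}i", *offsets)
    return offsets_buffer, bytes(data)

def get_string(offsets_buffer, data, i):
    """Read value i by slicing the data buffer between two adjacent offsets."""
    start, end = struct.unpack_from("<2i", offsets_buffer, 4 * i)
    return data[start:end].decode("utf-8")

offsets_buffer, data = to_variable_size_binary(["joe", "", "mark"])
print(struct.unpack("<4i", offsets_buffer))  # the four int32 offsets
print(data)                                  # all values back to back
print(get_string(offsets_buffer, data, 2))   # third value
```

   Unlike a nested ``List<1-byte>``, there is no child array here: the bytes 
live directly in a data buffer of the string array itself.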
   
   ### Previous discussion
   
   Generally we use the term relatively consistently to contrast "logical 
types" with the "physical layouts", but confusion around the terminology has 
come up regularly (what are "physical types" then? And extension types are 
essentially "logical types" too, but ones that annotate our own logical 
types). This was specifically discussed in 
https://github.com/apache/arrow/issues/14752.
   
   @amoeba proposed 
(https://github.com/apache/arrow/issues/14752#issuecomment-1550157549):
   
   > Still some discussion to be had about avoiding "logical" vs. "physical" in 
favor of "types" and "layouts" and possibly updating the format docs 
comprehensively
   

