[GitHub] [pinot] xiangfu0 commented on pull request #11117: [multistage] bridge v2 query engine for leaf stage v1 multi-value column

via GitHub Fri, 21 Jul 2023 12:48:14 -0700


xiangfu0 commented on PR #11117:
URL: https://github.com/apache/pinot/pull/11117#issuecomment-1646168883


   > the grand question is whether we continue to support MV as a data type or 
collapse into ARRAY. if the answer to the above question is NO, this PR is not 
needed. if the answer is YES. there are several concerns. i thought about it a 
bit last night. the problem is 2 fold.
   > 
   > 1. how do we support it conforming with ANSI syntax
   > 2. how do we make sure v1 perform the same as before.
   > 
   > for Q1: the answer is probably MULTI_SET
   > 
   > * it is a SET with no ordering but can contain duplicates, suitable for 
our use case for GROUP-BY and FILTER, which they are considered as 
expanded/unnested during eval.
   > * but not every use cases is considered as MULTI_SET, for example 
selection will consider MV as an ARRAY
   >   therefore,
   > 
   > 1. an explicit `USE_AS_MULTISET(mv)` is desired to bridge the syntatic gap
   > 2. explicit `USE_AS_ARRAY(mv)` is the default (considering select * 
there's no reason to ask user explicitly put this in
   > 
   > for Q2: the problem is these USE_AS_*** methods are considered transform, 
which shouldn't be the case there's a simply way to solve this --> in 
CalciteRexExpressionUtils, we can simply ignored the functionCall and directly 
return the operand as a reference in V1 b/c V1 is SqlNode and bares no type 
info, this means directly putting the operand input reference without the 
USE_AS_*** type conversion will work naturally with the V1 context.
   > 
   > so in short my proposal is
   > 
   > 1. have `USE_AS_ARRAY` and `USE_AS_MULTISET` as scalarFunction wrappers 
(unimplemented) to bridge the gap of the syntactic problem on calcite
   > 2. in PhysicalPlanner explicitly exclude these syntatitic functions and 
directly drop to the operand;
   > 3. in the mid term we will have MV, MULTI_SET, ARRAY as 3 "co-existing" 
types, but in V1 there's only MV; in V2 MV is considered by default as ARRAY, 
unless `USE_AS_MULTISET` is used.
   > 4. in the long term, once we fully found all alternatives in standard SQL 
syntax for MV, we can safely consider MV as ARRAY. and only do MULTISET when 
necessary (i can only see it being useful in `MV > 5` situation, where there's 
no array equivalent, only multiset equivalent)
   > 
   > WDYT? Please let me know
   
   MV column in V2 is modeled as ARRAY by default.
   So in terms of supporting v1 format, we need to:
   1. use `ARRAY_TO_MV` or `USE_AS_MV` as the bridge to ensure Calcite 
understand the type consistency/conversion.
   2. (TODO) Implement MV GROUP BY in v2 intermediate stage to complete the 
story.
   3. (TODO) Implement ARRAY GROUP BY later on


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

[GitHub] [pinot] xiangfu0 commented on pull request #11117: [multistage] bridge v2 query engine for leaf stage v1 multi-value column

Reply via email to