[GitHub] [incubator-pinot] npawar commented on issue #5509: Derived columns

GitBox Wed, 14 Oct 2020 16:07:10 -0700


npawar commented on issue #5509:
URL: 
https://github.com/apache/incubator-pinot/issues/5509#issuecomment-708706434



   **Challenges**
   Although this looks exactly like transform functions, there are some 
differences that prevent us from handling it solely with regular transform 
functions.
   
   Say we have columns `a, b, c` in the raw data source.
   Say we have columns `a, c, x, y` in the Pinot schema, such that
   ```
   a -> a
   c -> c
   x -> f(b)
   y -> f(a, c)
   ```
   1) As transform functions are designed right now, their arguments can only 
be source columns, i.e. `a, b, c`. If we wanted to add `z -> f(x, y)`, this 
would not be supported.
   2) Transform functions are only evaluated during ingestion. For derived 
columns, we also want to support adding them after segment creation, i.e. add 
some derived columns to an existing schema, and see the new values in the 
segments after a reload.
   
   **Changes**
   Challenge 1 is easy to fix, and can be done by simply enhancing the support 
for transform configs. Some changes needed:
   - Remove the validations which prevent adding transform functions of the 
derived kind (i.e. `y = f(z)` and `x = f(y)` is blocked in Table config 
validations right now; this check can be easily removed).
   - In the ExpressionTransformer, identify the derived columns and evaluate 
them after the non-derived fields.
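
   A minimal sketch of that evaluation order, written in Python for brevity 
rather than as Pinot code. The `TRANSFORMS` map, the lambda bodies and 
`transform_row` are hypothetical names for illustration only. The idea is that 
a column is computed only after any of its arguments that are themselves 
transform outputs:

```python
from graphlib import TopologicalSorter

# Hypothetical transform configs: destination column -> (argument columns, function).
TRANSFORMS = {
    "x": (["b"], lambda b: b * 2),
    "y": (["a", "c"], lambda a, c: a + c),
    "z": (["x", "y"], lambda x, y: x - y),  # derived: arguments are not source columns
}


def transform_row(row, transforms):
    """Evaluate transforms so every column is computed after its arguments."""
    # Only arguments that are themselves transform outputs constrain the order;
    # a self-reference such as y -> y reads the source value and is ignored here.
    graph = {
        column: {arg for arg in args if arg in transforms and arg != column}
        for column, (args, _) in transforms.items()
    }
    result = dict(row)
    for column in TopologicalSorter(graph).static_order():
        args, function = transforms[column]
        result[column] = function(*(result[arg] for arg in args))
    return result
```

   With a source row `{"a": 1, "b": 2, "c": 3}`, `x` and `y` are computed from 
the source values first, and `z` is then computed from `x` and `y`.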
   
   Handling challenge 2 will need some more changes. We need to start 
computing the derived fields' transform functions during segment reload. For 
this, we could piggyback on the `DefaultColumnHandler`. Similar to 
`BaseDefaultColumnHandler#updateDefaultColumn`, we can introduce a 
`BaseDefaultColumnHandler#updateDerivedColumns`, which would be called instead 
of `updateDefaultColumn` if the column is a derived field.
   
   **Identifying derived fields**
   We also need a flag on the FieldSpec called `derived`. This flag will help 
us distinguish between derived fields and regular fields. Here's an example of 
why we need this:
   You may have `y = f(z) and x = f(y)`. Here `x` is obviously derived, as it 
uses `y` as an argument, and `y` is not in the source data.
   You may also have `y = y and x = f(y)`. Here it is not obvious whether `x` 
is a derived field, because `y` is both source and destination. 
   In both examples, `x` will be evaluated during segment creation. The 
deciding factor for whether a field is derived is whether the user wants 
Pinot to generate the values during a segment reload, if the column was not 
already present.
   If the user marks `x` as a derived column and reloads segments, all 
segments missing `x` should evaluate `x = f(y)` using the value of `y` already 
in the segment.
   If the user does not mark `x` as a derived column, then during reload all 
segments missing `x` should simply add the default value for `x`.
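
   The reload behaviour described above can be sketched as follows. This is an 
illustration in Python, not Pinot's implementation: the `FieldSpec` dataclass 
here is a hypothetical stand-in, a segment is modelled as a plain dict of 
column name to list of values, and `reload_segment` is an assumed name.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class FieldSpec:
    """Hypothetical stand-in for a schema field; not Pinot's actual class."""
    name: str
    default_value: object = None
    derived: bool = False
    arguments: Sequence[str] = ()
    function: Optional[Callable] = None


def reload_segment(segment, schema):
    """Add columns that are in the schema but missing from the segment."""
    num_docs = len(next(iter(segment.values()))) if segment else 0
    for spec in schema:
        if spec.name in segment:
            continue  # column already present; reload leaves it untouched
        if spec.derived:
            # Derived: compute from argument values already in the segment.
            inputs = [segment[arg] for arg in spec.arguments]
            segment[spec.name] = [spec.function(*values) for values in zip(*inputs)]
        else:
            # Not derived: fill the column with its default value.
            segment[spec.name] = [spec.default_value] * num_docs
    return segment
```

   With `y = [1, 2, 3]` already in the segment, marking `x` as derived with 
`x = f(y)` fills `x` from `y` on reload, while leaving the flag unset fills 
`x` with its default value.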
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



