[I] String to bytes autocasting [pinot]

via GitHub Wed, 21 Feb 2024 02:07:47 -0800


gortiz opened a new issue, #12457:
URL: https://github.com/apache/pinot/issues/12457


   # The problem
   
   It is not well defined and not well implemented how to cast Strings to 
bytes. Sometimes we cast them as using `hexToBinary` function. Sometimes we 
fail. And in future changes in Calcite we may also sometimes cast them 
following the Postgres not documented (and not recommended) semantics.
   
   There are some PRs that change the behavior, but I think it is important to 
first understand the problem in deep and decide what semantics we want to 
follow and how that is going to affect the backward compatibility between V1 
and V2 (aka multistage engine).
   
   Note: The text of this issue is very long. In order to make it easier to 
read I used detail sections like the following:
   <details>
   <summary>Remember you can click on rows like this</summary>
   To open the details that would normally make very difficult to read the 
whole text
   </details>
   
   # Solutions
   ## To the casting issues specific to V2
   1. Keep V1 compatibility and cast Strings to Bytes using `hexToBinary`. 
   <details>
   <summary>For example:</summary>
   <ul>
   <li>
   `select * from starbucksStores where location_st_point = 
'80c062bc98021f94f1404e9bda0f6b0202'` would be equivalent to 
   </li>
   <li>
   `select * from starbucksStores where location_st_point = 
hexToBytes('80c062bc98021f94f1404e9bda0f6b0202')`
   </li>
   <li>
   `select * from starbucksStores where location_st_point = 
bytesToHex(location_st_point)`  would be equivalent to `select * from 
starbucksStores where location_st_point = 
hexToBytes(bytesToHex(location_st_point))`
   </li>
   <li>
   `select * from starbucksStores where location_st_point = address` would be 
equivalent to `select * from starbucksStores where location_st_point = 
hexToBytes(address)`
   </li>
   <li>
   `select * from starbucksStores where location_st_point = 
x'80c062bc98021f94f1404e9bda0f6b0202'` would be legal. Here the `x` indicates 
this is a byte literal in `hex` notation.
   </li>
   </ul>
    </details>
   
   2. Keep _standard_ compatibility and behave like Postgres. As far as we know 
there is no standard definition here, but both Postgres and MySQL seem to 
behave similarly.
   <details>
   <summary>For example:</summary>
   <ul>
   <li>
    `select * from starbucksStores where location_st_point = 
'80c062bc98021f94f1404e9bda0f6b0202'` would be equivalent to 
   </li>
   <li>
   `select * from starbucksStores where location_st_point = 
asciiToBytes('80c062bc98021f94f1404e9bda0f6b0202')`. It will return different 
values than in V1. In this case it will return an empty set.
   </li>
   <li>
   `select * from starbucksStores where location_st_point = 
bytesToHex(location_st_point)`  would be equivalent to `select * from 
starbucksStores where location_st_point = 
asciiToBytes(bytesToHex(location_st_point))`. It will return different values 
than in V1. It will always return an empty set.
   'select * from starbucksStores where location_st_point = address` would be 
equivalent to `select * from starbucksStores where location_st_point = 
asciiToBytes(address)`. It will return different values than in V1.
   </li>
   <li>
   `select * from starbucksStores where location_st_point = 
x'80c062bc98021f94f1404e9bda0f6b0202'` would be legal. Here the `x` indicates 
this is a byte literal in `hex` notation.
   </li>
   </ul>
    </details>
   
   3. Consider this cast illegal. In order to compare bytes with Strings 
customers would need to specify the conversion function. 
   <details>
   <summary>For example:</summary>
   <ul>
   <li>
      `select * from starbucksStores where location_st_point = 
'80c062bc98021f94f1404e9bda0f6b0202'` would be illegal. Bytes and Strings 
cannot be compared.
    </li>
    <li>
      `select * from starbucksStores where location_st_point = 
bytesToHex(location_st_point)` would be illegal. Bytes and Strings cannot be 
compared.
    </li>
    <li>
      `select * from starbucksStores where location_st_point = address` would 
be illegal. Bytes and Strings. Bytes and Strings cannot be compared.
    </li>
    <li>
      `select * from starbucksStores where location_st_point = 
hexToBytes(address)` would be legal. Here we are specifying how to convert the 
String to bytes.
    </li>
    <li>
      `select * from starbucksStores where location_st_point = 
hexToBytes('80c062bc98021f94f1404e9bda0f6b0202')` would be legal. Here we are 
specifying how to convert the String to bytes.
    </li>
    <li>
      `select * from starbucksStores where location_st_point = 
x'80c062bc98021f94f1404e9bda0f6b0202'` would be legal. Here the `x` indicates 
this is a byte literal in `hex` notation.
    </li>
   </ul>
    </details>
   
   ## To the casting issues specific to V1
   Probably keep them as they are right now.
   
   ## To the casting issues specific to leaf stage
   1. Eventually modify leaf stage to do not relay on V1 given the know issues 
we have there.
   
   # Example
   
   Using GenericQuickstart:
   * The expression 
   ```select * from starbucksStores where location_st_point = 
'80c062bc98021f94f1404e9bda0f6b0202' limit 10``` is valid using the V1 but not 
in V2. 
   * The expression
   ```select * from starbucksStores where location_st_point = 
X'80c062bc98021f94f1404e9bda0f6b0202' limit 10``` parses in both V1 and V2, but 
in V1 returns 0 rows while in V2 returns the row whose bytes are equal to 
`hexToBytes('80c062bc98021f94f1404e9bda0f6b0202')`.
   * The expression ```select * from starbucksStores where location_st_point = 
bytesToHex(location_st_point) limit 10``` fails in V1 with
   ```java.lang.IllegalArgumentException at 
org.apache.pinot.segment.spi.index.reader.ForwardIndexReader.readValuesSV(ForwardIndexReader.java:322)```
   while in V2 fails with:
   ```
   Error Code: 200
   
   QueryExecutionError:
   Received error query execution result block: {200=QueryExecutionError:
   org.apache.pinot.spi.exception.BadQueryRequestException: Caught exception 
while initializing transform function: cast
        at 
org.apache.pinot.core.operator.transform.function.TransformFunctionFactory.get(TransformFunctionFactory.java:332)
        at 
org.apache.pinot.core.operator.transform.function.TransformFunctionFactory.get(TransformFunctionFactory.java:327)
        at 
org.apache.pinot.core.operator.filter.ExpressionFilterOperator.<init>(ExpressionFilterOperator.java:69)
        at 
org.apache.pinot.core.plan.FilterPlanNode.constructPhysicalOperator(FilterPlanNode.java:246)
   ...
   Caused by: java.lang.IllegalArgumentException: Unable to cast expression to 
type - BYTES
        at 
org.apache.pinot.core.operator.transform.function.CastTransformFunction.init(CastTransformFunction.java:112)
        at 
org.apache.pinot.core.operator.transform.function.BaseTransformFunction.init(BaseTransformFunction.java:120)
   ```
   * The expression ```select location_st_point = 
'80c062bc98021f94f1404e9bda0f6b0202' from starbucksStores limit 10``` fail in 
both V1 and V2.
   * The expression ```select * from starbucksStores where location_st_point = 
CAST(address as BYTES) limit 10```` fails in V1 saying Strings cannot be cast 
to Bytes. In V2 the equivalent expression is ```select * from starbucksStores 
where location_st_point = CAST(address as BINARY) limit 10```, which fails in 
the same way.
   
   # How other databases convert Strings to Bytes
   
   Postgres defines different ways to declare `bytea` (their bytes type) 
literals 
([link](https://www.postgresql.org/docs/current/datatype-binary.html)). 
   * The `hex` format is the closest to what it seems we try to do. In order to 
use that, in Postgres you declare a String literal starting with `\x` followed 
by a even number of hex digits.
   * The `escape` format, which is a bit more complex and not recommended.
   
   Although it is not documented (or at least neither @Jackie-Jiang or I found 
a reference in the documentation) is that strings that do not match with the 
previous formats are automatically converted to `bytea` as the UTF8 
representation of the String. 
   
   <details>
     <summary>Examples</summary>
   
   ```postgres
   postgres=# create table asd(id int, bytes bytea);
   CREATE TABLE
   postgres=# insert into asd(id, bytes) VALUES (1, '\x1234');
   INSERT 0 1
   postgres=# select * from asd;
    id | bytes  
   ----+--------
     1 | \x1234
   (1 row)
   
   postgres=# select * from asd where bytes = '\x1234';
    id | bytes  
   ----+--------
     1 | \x1234
   (1 row)
   
   postgres=# select * from asd where bytes = '\x1234';
    id | bytes  
   ----+--------
     1 | \x1234
   (1 row)
   
   postgres=# explain select * from asd where bytes = '\x1234';
                        QUERY PLAN                      
   -----------------------------------------------------
    Seq Scan on asd  (cost=0.00..25.88 rows=6 width=36)
      Filter: (bytes = '\x1234'::bytea)
   (2 rows)
   
   postgres=# explain select * from asd where bytes = '1234';
                        QUERY PLAN                      
   -----------------------------------------------------
    Seq Scan on asd  (cost=0.00..25.88 rows=6 width=36)
      Filter: (bytes = '\x31323334'::bytea)
   (2 rows)
   ```
   
   The last two sentences (using explain) show how the literal is interpreted. 
Postgres uses the `hex` format to represent back the binaries and we can see 
that `'1234'` was transformed into `'\x31323334'`.
   </details>
   
   # What Calcite does
   Calcite recognizes different literals as well. For example `x'1234'` is 
interpreted as Postgres `'\x1234'. In the current version values like 
`80c062bc98021f94f1404e9bda0f6b0202` produce a failure, but 
https://github.com/apache/calcite/pull/3635 will change the behavior to behave 
as Postgres in this case as well.
   
   # What Pinot does in V2
   In V2 Pinot delegates validation to Calcite, so expressions all expressions 
like `location_st_point = <something of type string>` are validated into 
something like `location_st_point = CAST(<something of type string> as 
BINARY)`. The tricky thing here is that the behavior depends on whether 
`<something of type string>` is simplifiable or not.
   
   Having different semantics depending on whether we compare with a literal or 
a column is unexpected, difficult to understand and error prone and therefore 
not acceptable. This is one of the problems V1 has that we try to fix in V2. 
There is PR (https://github.com/apache/pinot/pull/12401) that modified V2 to 
always use `hexToBinary` in this case, but we may decide to always cast using 
Postgres semantic instead. This PR doesn't solve all the issues with this 
conversion. It just removes the dependency on whether the expression is 
simplifiable or not, but then the expression is sent to V1 and most problems in 
V1 still affects V2.
   
   <details>
     <summary>Details explaining why the behavior depends on whether the String 
is simplifiable or not</summary>
   
   ## When String expression is not simplifiable
   For example, the expression `where location_st_point = 
binaryToHex(location_st_point)` is translated to `filter location_st_point = 
CAST(bytesToHex(location_st_point) as BINARY)`. Here the simplifier cannot 
simplify `CAST(bytesToHex(location_st_point) as BINARY)` because it depends on 
a column (`location_st_point`). 
   
   Then V2 transforms that expression to something the leaf stage (which is 
basically V1) can understand. Specifically, `BINARY` is not a valid V1 type, so 
it is transformed to Bytes and the expression executed in the leaf stage is 
`CAST(bytesToHex(location_st_point) as BYTES)`. 
   
   The problem comes when the leaf stage evaluates the expression. Here we 
receive the most funny error saying that the query fails because String cannot 
be cast to bytes. This is in fact correct. As shown in the examples above, 
query ```select * from starbucksStores where location_st_point = CAST(address 
as BYTES) limit 10``` fails in the same way in V1.
   
   With [the patch](https://github.com/apache/pinot/pull/12401) this expression 
doesn't fail because the leaf stage doesn't actually sees the `CAST` function 
but `hexToBinary` instead.
   
   ## When String expression is simplifiable
   For example, `location_st_point = '80c062bc98021f94f1404e9bda0f6b0202'`. 
Then it is translated to `location_st_point = 
CAST('80c062bc98021f94f1404e9bda0f6b0202' as BINARY). Given we ask Calcite to 
simplify that and that CAST is a pure function with a single argument which is 
a literal, then Calcite simplifies the expression to the actual result of the 
casting. As explained above, right now that produces a failure in Calcite, but 
once (and if) https://github.com/apache/calcite/pull/3635 is merged, the 
expression will be transformed to `location_st_point = <postgres interpretation 
of '80c062bc98021f94f1404e9bda0f6b0202'>`.
   
   </details>
   
   # What Pinot does in V1
   
   Given the relax typing nature of V1, there are no implicit casts. Instead we 
have operations that may or may not be legal depending on contextual factors.
   
   ## Compare bytes column with String literal
   <details>
   <summary>As a where predicate</summary>
   For example, if we compare a bytes column with a String literal in a where 
clause, V1 will decide to transform the String literal into bytes in a way that 
is similar to `hexToBinary` function. Therefore 
   ```select * from starbucksStores where location_st_point = 
'80c062bc98021f94f1404e9bda0f6b0202' limit 10```
   Is equivalent to
   ```select * from starbucksStores where location_st_point = 
hexToBytes('80c062bc98021f94f1404e9bda0f6b0202') limit 10```
   
   This works because we end up doing a full scan in which the literal in the 
right side of the expression is transformed with `hexToBytes`.
   </details>
   
   <details>
   <summary>As a select value</summary>
   Contrary to the where case, here we enforce a type check. Specifically, in a 
query like ```select location_st_point = address from starbucksStores limit 
10``` we fail in `BinaryOperatorTransformationFunction.init` because left type 
is BYTES but right type is not:
   ```java
      if (_leftStoredType == DataType.BYTES || _rightStoredType == 
DataType.BYTES) {
         Preconditions.checkState(_leftStoredType == _rightStoredType, 
String.format(
             "Unsupported data type for comparison: [Left Transform Function 
[%s] result type is [%s], Right Transform "
                 + "Function [%s] result type is [%s]]", 
_leftTransformFunction.getName(), _leftStoredType,
             _rightTransformFunction.getName(), _rightStoredType));
       }
   ```
   
   There is a plot twist here. Although we should fail with `Unsupported data 
type for comparision.....` in fact we fail with 
   ```
   Caused by: java.lang.UnsupportedOperationException
        at 
org.apache.pinot.core.operator.transform.function.IdentifierTransformFunction.getName(IdentifierTransformFunction.java:56)
   ```
   because `IdentifierTransformFunction.getName` is not override, so the error 
show to the user is quite bad.
   
   In order to view the correct error we can trick the expression as something 
like: ```select hexToBytes(bytesToHex(location_st_point)) = 
substr(location_st_point, 1, -1) from starbucksStores limit 10``` which fails 
with the _expected_:
   ```
   Caused by: java.lang.IllegalStateException: Unsupported data type for 
comparison: [Left Transform Function [hexToBytes] result type is [BYTES], Right 
Transform Function [substr] result type is [STRING]]
        at 
com.google.common.base.Preconditions.checkState(Preconditions.java:512)
   ```
   </details>
   ## Compare bytes column with a not literal String
   <details>
   <summary>As where predicate</summary>
   For example `select * from starbucksStores where location_st_point = address 
limit 10` or `select * from starbucksStores where location_st_point = 
bytesToHex(location_st_point) limit 10`
   
   In this case V1 fails in a very strange way. Specifically, query ```select * 
from starbucksStores where location_st_point = address limit 10``` fails with:
   
   ```
   java.lang.IllegalArgumentException
        at 
org.apache.pinot.segment.spi.index.reader.ForwardIndexReader.readValuesSV(ForwardIndexReader.java:322)
        at 
org.apache.pinot.segment.local.segment.index.readers.forward.BaseChunkForwardIndexReader.readValuesSV(BaseChunkForwardIndexReader.java:444)
        at 
org.apache.pinot.segment.local.segment.index.readers.forward.BaseChunkForwardIndexReader.readValuesSV(BaseChunkForwardIndexReader.java:43)
        at 
org.apache.pinot.core.common.DataFetcher$ColumnValueReader.readDoubleValues(DataFetcher.java:537)
   ```
   
   The reason is that V1 does not add implicit casts, so when it tries to 
execute `where location_st_point = address` it sees a comparison between 
different types and applies the function that compares doubles. Therefore it 
tries to convert `location_st_point` to double and fails because 
`ForwardIndexReader.readValuesSV(int[] docIds, int length, double[] values, T 
context)` doesn't know how to do that.
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

[I] String to bytes autocasting [pinot]

Reply via email to