LucasEby opened a new pull request, #17377:
URL: https://github.com/apache/pinot/pull/17377

   Fixes #17376
   
   ### Motivation
   
   Multiple tests in `JsonIngestionFromAvroQueriesTest` exhibit 
non-deterministic failures due to a combination of issues related to HashMap 
iteration order and incorrect dictionary lookup behavior:
   
   1. Dictionary lookup silently returns incorrect values from row 0 on 
unsuccessful key match instead of failing. 
[SegmentDictionaryCreator](https://github.com/apache/pinot/blob/master/pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentDictionaryCreator.java)
 uses FastUtil's 
[Double2IntOpenHashMap](https://javadoc.io/doc/it.unimi.dsi/fastutil/latest/it/unimi/dsi/fastutil/doubles/Double2IntFunction.html#defaultReturnValue()),
 
[Float2IntOpenHashMap](https://javadoc.io/doc/it.unimi.dsi/fastutil/latest/it/unimi/dsi/fastutil/floats/Float2IntMap.html#defaultReturnValue()),
 
[Int2IntOpenHashMap](https://javadoc.io/doc/it.unimi.dsi/fastutil/latest/it/unimi/dsi/fastutil/ints/Int2IntFunction.html#defaultReturnValue()),
 
[Long2IntOpenHashMap](https://javadoc.io/doc/it.unimi.dsi/fastutil/latest/it/unimi/dsi/fastutil/longs/Long2IntFunction.html#defaultReturnValue()),
 and 
[Object2IntOpenHashMap](https://javadoc.io/doc/it.unimi.dsi/fastutil/latest/it/unimi/d
 si/fastutil/objects/Object2IntFunction.html#defaultReturnValue()) for 
dictionary lookups. According to FastUtil documentation, these maps return a 
default value of 0 when a key is not found via get() or getInt(). This causes 
data corruption: when a lookup fails, the code silently uses dictionary ID 0, 
causing row 0's value to incorrectly appear in other rows. 
   
   2. Non-deterministic HashMap iteration in Avro deserialization.
   Apache Avro's `GenericDatumReader` uses HashMap internally (via 
`GenericData.newMap()`), which provides no iteration order guarantees per the 
HashMap specification. Serializing and deserializing the Avro records 
containing Maps may result in different key orders across ingestion passes (on 
storage and indexing). 
   
   3. JSON key-value pair ordering is not semantically significant.
   Per the [JSON specification](https://www.json.org/json-en.html), "an object 
is an unordered set of name/value pairs". String comparison of JSON objects 
with different key orderings incorrectly treats `{"a":"1","b":"2"}` and 
`{"b":"2","a":"1"}` as distinct values, when they are semantically equivalent.
   
   The ordering for each of these order-related problems can change due to 
different environments producing the contents in different orders despite the 
logical contents being the same. This harmless re-ordering can surface as test 
failures and allow incorrect dictionary values to be silently used. 
   
   ### Modifications
   #### SegmentDictionaryCreator.java
   The FastUtil dictionary maps have been explicitly configured to return -1 on 
missing keys to prevent silent corruption. Dictionary lookups have been wrapped 
in `checkIdx` which returns the dictionary ID when found, attempts JSON-aware 
normalization for `string` values, and throws a descriptive 
`IllegalStateException` if no match exists. JSON-aware comparison utilities 
using Jackson have also been added which perform structural comparison via 
`JsonNode.equals()` to ignore object key order and to preserve array ordering 
semantics. 
   
   #### JsonIngestionFromAvroQueriesTest
   Added `assertJsonEquals()` to compare JSON values structurally rather than 
lexically. `testSimpleSelectOnJsonColumn()` and 
`testComplexSelectOnJsoncolumn()` were refactored to apply column structural 
comparison only to JSON columns and to use column-level assertions (if needed). 
   
   In essence, these changes keep the spirit of the original code while 
eliminating silent corruptions of data and failures caused by allowed (but 
previously unexpected) reordering. 
   
   ### Verifying this change
   
   - [x] Make sure that the change passes all checks by running `mvn clean 
install -Pbin-dist` locally.
   
   Fixed tests
   - `JsonIngestionFromAvroQueriesTest.testJsonPathSelectOnJsonColumn`
   - `JsonIngestionFromAvroQueriesTest.testComplexSelectOnJsonColumn`
   - `JsonIngestionFromAvroQueriesTest.testSimpleSelectOnJsonColumn`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to