LucasEby opened a new pull request, #17377: URL: https://github.com/apache/pinot/pull/17377
Fixes #17376 ### Motivation Multiple tests in `JsonIngestionFromAvroQueriesTest` exhibit non-deterministic failures due to a combination of issues related to HashMap iteration order and incorrect dictionary lookup behavior: 1. Dictionary lookup silently returns incorrect values from row 0 on unsuccessful key match instead of failing. [SegmentDictionaryCreator](https://github.com/apache/pinot/blob/master/pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentDictionaryCreator.java) uses FastUtil's [Double2IntOpenHashMap](https://javadoc.io/doc/it.unimi.dsi/fastutil/latest/it/unimi/dsi/fastutil/doubles/Double2IntFunction.html#defaultReturnValue()), [Float2IntOpenHashMap](https://javadoc.io/doc/it.unimi.dsi/fastutil/latest/it/unimi/dsi/fastutil/floats/Float2IntMap.html#defaultReturnValue()), [Int2IntOpenHashMap](https://javadoc.io/doc/it.unimi.dsi/fastutil/latest/it/unimi/dsi/fastutil/ints/Int2IntFunction.html#defaultReturnValue()), [Long2IntOpenHashMap](https://javadoc.io/doc/it.unimi.dsi/fastutil/latest/it/unimi/dsi/fastutil/longs/Long2IntFunction.html#defaultReturnValue()), and [Object2IntOpenHashMap](https://javadoc.io/doc/it.unimi.dsi/fastutil/latest/it/unimi/d si/fastutil/objects/Object2IntFunction.html#defaultReturnValue()) for dictionary lookups. According to FastUtil documentation, these maps return a default value of 0 when a key is not found via get() or getInt(). This causes data corruption: when a lookup fails, the code silently uses dictionary ID 0, causing row 0's value to incorrectly appear in other rows. 2. Non-deterministic HashMap iteration in Avro deserialization. Apache Avro's `GenericDatumReader` uses HashMap internally (via `GenericData.newMap()`), which provides no iteration order guarantees per the HashMap specification. Serializing and deserializing the Avro records containing Maps may result in different key orders across ingestion passes (on storage and indexing). 3. JSON key-value pair ordering is not semantically significant. Per the [JSON specification](https://www.json.org/json-en.html), "an object is an unordered set of name/value pairs". String comparison of JSON objects with different key orderings incorrectly treats `{"a":"1","b":"2"}` and `{"b":"2","a":"1"}` as distinct values, when they are semantically equivalent. The ordering for each of these order-related problems can change due to different environments producing the contents in different orders despite the logical contents being the same. This harmless re-ordering can surface as test failures and allow incorrect dictionary values to be silently used. ### Modifications #### SegmentDictionaryCreator.java The FastUtil dictionary maps have been explicitly configured to return -1 on missing keys to prevent silent corruption. Dictionary lookups have been wrapped in `checkIdx` which returns the dictionary ID when found, attempts JSON-aware normalization for `string` values, and throws a descriptive `IllegalStateException` if no match exists. JSON-aware comparison utilities using Jackson have also been added which perform structural comparison via `JsonNode.equals()` to ignore object key order and to preserve array ordering semantics. #### JsonIngestionFromAvroQueriesTest Added `assertJsonEquals()` to compare JSON values structurally rather than lexically. `testSimpleSelectOnJsonColumn()` and `testComplexSelectOnJsoncolumn()` were refactored to apply column structural comparison only to JSON columns and to use column-level assertions (if needed). In essence, these changes keep the spirit of the original code while eliminating silent corruptions of data and failures caused by allowed (but previously unexpected) reordering. ### Verifying this change - [x] Make sure that the change passes all checks by running `mvn clean install -Pbin-dist` locally. Fixed tests - `JsonIngestionFromAvroQueriesTest.testJsonPathSelectOnJsonColumn` - `JsonIngestionFromAvroQueriesTest.testComplexSelectOnJsonColumn` - `JsonIngestionFromAvroQueriesTest.testSimpleSelectOnJsonColumn` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
