Copilot commented on code in PR #17186:
URL: https://github.com/apache/pinot/pull/17186#discussion_r2517101707


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/NoDictColumnStatisticsCollector.java:
##########
@@ -52,6 +54,10 @@ public class NoDictColumnStatisticsCollector extends 
AbstractColumnStatisticsCol
   private boolean _sealed = false;
   // HLL Plus generally returns approximate cardinality >= actual cardinality 
which is desired
   private final HyperLogLogPlus _hllPlus;
+  // Track exact uniques up to a threshold to avoid small-N underestimation 
and test flakiness
+  private static final int EXACT_UNIQUE_TRACKING_THRESHOLD = 2048;
+  // Volatile because this reference can be nulled out once the threshold is 
exceeded
+  private volatile Set<Object> _exactUniques = ConcurrentHashMap.newKeySet();

Review Comment:
   Using `volatile` for `_exactUniques` is insufficient for thread-safe access. 
The `getCardinality()` method reads this field and calls `size()` on the set, 
but if another thread nulls out `_exactUniques` between the null check and the 
`size()` call, a `NullPointerException` could occur. Consider using 
`AtomicReference<Set<Object>>` or synchronized blocks to ensure thread-safe 
access to both the reference and the set operations.
   ```suggestion
     // Use a lock object to synchronize access to _exactUniques for thread 
safety
     private final Object _exactUniquesLock = new Object();
     private Set<Object> _exactUniques = ConcurrentHashMap.newKeySet();
   ```



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/NoDictColumnStatisticsCollector.java:
##########
@@ -223,10 +239,35 @@ public void seal() {
 
   private void updateHllPlus(Object value) {
     if (value instanceof BigDecimal) {
-      // Canonicalize BigDecimal as string to avoid scale-related equality 
issues
+      // Canonicalize BigDecimal as string to avoid scale-related equality 
issues:
+      // BigDecimals with different scales (e.g., 1.0 vs 1.00) are not equal 
by default,
+      // but their string representations normalize the value for cardinality 
tracking.
       _hllPlus.offer(((BigDecimal) value).toString());
     } else {
       _hllPlus.offer(value);
     }
   }
+
+  private void trackExactUnique(Object value) {
+    Set<Object> exact = _exactUniques; // local snapshot to avoid 
check-then-act race
+    if (exact == null) {
+      return;
+    }
+    Object key;
+    if (value instanceof byte[]) {
+      key = new ByteArray((byte[]) value);
+    } else if (value instanceof BigDecimal) {
+      // Use string representation to avoid scale-related equality issues:
+      // BigDecimals with different scales (e.g., 1.0 vs 1.00) are not equal 
by default,
+      // but their string representations normalize the value for cardinality 
tracking.
+      key = ((BigDecimal) value).toString();
+    } else {
+      key = value;
+    }
+    exact.add(key);
+    if (exact.size() > EXACT_UNIQUE_TRACKING_THRESHOLD) {
+      // Drop exact tracking once the threshold is exceeded to avoid unbounded 
memory
+      _exactUniques = null;
+    }

Review Comment:
   Race condition: multiple threads can simultaneously check `exact.size() > 
EXACT_UNIQUE_TRACKING_THRESHOLD` and all attempt to null out `_exactUniques`. 
Additionally, other threads might still be adding to the set after it's been 
nulled, leading to continued memory growth beyond the threshold. Use 
`AtomicReference.compareAndSet()` or synchronization to ensure exactly one 
thread performs the nullification and that no further additions occur after the 
threshold is exceeded.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to