[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #5451: Refactor DistinctTable to use PriorityQueue based algorithm

GitBox Fri, 29 May 2020 12:02:14 -0700


siddharthteotia commented on a change in pull request #5451:
URL: https://github.com/apache/incubator-pinot/pull/5451#discussion_r432678859




##########
File path: 
pinot-core/src/main/java/org/apache/pinot/core/query/aggregation/function/DistinctAggregationFunction.java
##########
@@ -123,21 +120,20 @@ public void aggregate(int length, AggregationResultHolder 
aggregationResultHolde
         columnDataTypes[i] = 
ColumnDataType.fromDataTypeSV(blockValSetMap.get(_inputExpressions.get(i)).getValueType());
       }
       DataSchema dataSchema = new DataSchema(_columns, columnDataTypes);
-      distinctTable = new DistinctTable(dataSchema, _orderBy, _capacity);
+      distinctTable = new DistinctTable(dataSchema, _orderBy, _limit);
       aggregationResultHolder.setValue(distinctTable);
+    } else if (distinctTable.shouldNotAddMore()) {
+      return;
     }
 
-    // TODO: Follow up PR will make few changes to start using 
DictionaryBasedAggregationOperator
-    // for DISTINCT queries without filter.
+    // TODO: Follow up PR will make few changes to start using 
DictionaryBasedAggregationOperator for DISTINCT queries
+    //       without filter.
     RowBasedBlockValueFetcher blockValueFetcher = new 
RowBasedBlockValueFetcher(blockValSets);
 
-    // TODO: Do early termination in the operator itself which should
-    // not call aggregate function at all if the limit has reached
-    // that will require the interface change since this function
-    // has to communicate back that required number of records have
-    // been collected
     for (int i = 0; i < length; i++) {
-      distinctTable.upsert(new Record(blockValueFetcher.getRow(i)));
+      if (!distinctTable.add(new Record(blockValueFetcher.getRow(i)))) {

Review comment:
       I think this for loop should be written separately for order by and non 
order by.
   
   For order by, there is no early termination so if check can be avoided since 
the return value will always be true.
   For non order, after adding every record, check the return value to see if 
limit has been reached.

##########
File path: 
pinot-core/src/main/java/org/apache/pinot/core/query/aggregation/function/customobject/DistinctTable.java
##########
@@ -0,0 +1,297 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.core.query.aggregation.function.customobject;
+
+import it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap;
+import it.unimi.dsi.fastutil.objects.ObjectOpenHashSet;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.PriorityQueue;
+import java.util.Set;
+import javax.annotation.Nullable;
+import org.apache.pinot.common.request.SelectionSort;
+import org.apache.pinot.common.utils.DataSchema;
+import org.apache.pinot.common.utils.DataTable;
+import org.apache.pinot.core.common.datatable.DataTableBuilder;
+import org.apache.pinot.core.common.datatable.DataTableFactory;
+import org.apache.pinot.core.data.table.Record;
+import org.apache.pinot.spi.utils.ByteArray;
+
+
+/**
+ * The {@code DistinctTable} class serves as the intermediate result of {@code 
DistinctAggregationFunction}.
+ */
+@SuppressWarnings({"rawtypes", "unchecked"})
+public class DistinctTable {
+  private static final int MAX_INITIAL_CAPACITY = 10000;
+
+  private final DataSchema _dataSchema;
+  private final int _limit;
+  private final Set<Record> _uniqueRecords;
+  private final PriorityQueue<Record> _sortedRecords;
+  private final List<Record> _records;

Review comment:
       A comment that "list is not used on the server and/or is only for broker 
deserialized and reduce" would be good to have.

##########
File path: 
pinot-core/src/main/java/org/apache/pinot/core/query/aggregation/function/customobject/DistinctTable.java
##########
@@ -0,0 +1,297 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.core.query.aggregation.function.customobject;
+
+import it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap;
+import it.unimi.dsi.fastutil.objects.ObjectOpenHashSet;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.PriorityQueue;
+import java.util.Set;
+import javax.annotation.Nullable;
+import org.apache.pinot.common.request.SelectionSort;
+import org.apache.pinot.common.utils.DataSchema;
+import org.apache.pinot.common.utils.DataTable;
+import org.apache.pinot.core.common.datatable.DataTableBuilder;
+import org.apache.pinot.core.common.datatable.DataTableFactory;
+import org.apache.pinot.core.data.table.Record;
+import org.apache.pinot.spi.utils.ByteArray;
+
+
+/**
+ * The {@code DistinctTable} class serves as the intermediate result of {@code 
DistinctAggregationFunction}.
+ */
+@SuppressWarnings({"rawtypes", "unchecked"})
+public class DistinctTable {
+  private static final int MAX_INITIAL_CAPACITY = 10000;
+
+  private final DataSchema _dataSchema;
+  private final int _limit;
+  private final Set<Record> _uniqueRecords;
+  private final PriorityQueue<Record> _sortedRecords;
+  private final List<Record> _records;
+
+  /**
+   * Constructor of the main {@code DistinctTable} which can be used to add 
records and merge other
+   * {@code DistinctTable}s.
+   */
+  public DistinctTable(DataSchema dataSchema, @Nullable List<SelectionSort> 
orderBy, int limit) {
+    _dataSchema = dataSchema;
+    _limit = limit;
+
+    // TODO: see if 10k is the right max initial capacity to use
+    // NOTE: When LIMIT is smaller than or equal to the MAX_INITIAL_CAPACITY, 
no resize is required.
+    int initialCapacity = Math.min(limit, MAX_INITIAL_CAPACITY);
+    _uniqueRecords = new ObjectOpenHashSet<>(initialCapacity);
+    if (orderBy != null) {
+      String[] columns = dataSchema.getColumnNames();
+      int numColumns = columns.length;
+      Object2IntOpenHashMap<String> columnIndexMap = new 
Object2IntOpenHashMap<>(numColumns);
+      for (int i = 0; i < numColumns; i++) {
+        columnIndexMap.put(columns[i], i);
+      }
+      int numOrderByColumns = orderBy.size();
+      int[] orderByColumnIndexes = new int[numOrderByColumns];
+      boolean[] orderByAsc = new boolean[numOrderByColumns];
+      for (int i = 0; i < numOrderByColumns; i++) {
+        SelectionSort selectionSort = orderBy.get(i);
+        orderByColumnIndexes[i] = 
columnIndexMap.getInt(selectionSort.getColumn());
+        orderByAsc[i] = selectionSort.isIsAsc();
+      }
+      _sortedRecords = new PriorityQueue<>(initialCapacity, (record1, record2) 
-> {
+        Object[] values1 = record1.getValues();
+        Object[] values2 = record2.getValues();
+        for (int i = 0; i < numOrderByColumns; i++) {
+          Comparable valueToCompare1 = (Comparable) 
values1[orderByColumnIndexes[i]];
+          Comparable valueToCompare2 = (Comparable) 
values2[orderByColumnIndexes[i]];
+          int result =
+              orderByAsc[i] ? valueToCompare2.compareTo(valueToCompare1) : 
valueToCompare1.compareTo(valueToCompare2);
+          if (result != 0) {
+            return result;
+          }
+        }
+        return 0;
+      });
+    } else {
+      _sortedRecords = null;
+    }
+    _records = null;
+  }
+
+  /**
+   * Returns the {@code DataSchema} of the {@code DistinctTable}.
+   */
+  public DataSchema getDataSchema() {
+    return _dataSchema;
+  }
+
+  /**
+   * Returns the number of unique records within the {@code DistinctTable}.
+   */
+  public int size() {
+    if (_uniqueRecords != null) {
+      // Server-side
+      return _uniqueRecords.size();
+    } else {
+      // Broker-side
+      return _records.size();
+    }
+  }
+
+  /**
+   * Adds a record into the DistinctTable and returns whether more records 
should be added. No more records should be
+   * added iff there is no order-by column and enough unique records have been 
collected.
+   */
+  public boolean add(Record record) {
+    if (_uniqueRecords.contains(record)) {

Review comment:
       We should not be checking for contains() on every add(). This will hurt 
performance since the add() method on the hashset() later on will anyway do a 
contains internally and will accordingly decide to add or ignore the new item. 
It will also return true or false if the item was added or not. At least, the 
java hashset works like this. 
   
   I don't see the purpose of this if-check in lines 126-128. There are two 
cases:
   
   - **No order by:** simply do.add() and return if the limit has been reached 
or not
   - **Order by:** If limit hasn't reached, add to both set and PQ. Else poll 
from PQ, delete from set and add to both. Always return true in this case.
   
   Lines 129-145 handle both the cases.
   

##########
File path: 
pinot-core/src/main/java/org/apache/pinot/core/query/aggregation/function/customobject/DistinctTable.java
##########
@@ -0,0 +1,297 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.core.query.aggregation.function.customobject;
+
+import it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap;
+import it.unimi.dsi.fastutil.objects.ObjectOpenHashSet;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.PriorityQueue;
+import java.util.Set;
+import javax.annotation.Nullable;
+import org.apache.pinot.common.request.SelectionSort;
+import org.apache.pinot.common.utils.DataSchema;
+import org.apache.pinot.common.utils.DataTable;
+import org.apache.pinot.core.common.datatable.DataTableBuilder;
+import org.apache.pinot.core.common.datatable.DataTableFactory;
+import org.apache.pinot.core.data.table.Record;
+import org.apache.pinot.spi.utils.ByteArray;
+
+
+/**
+ * The {@code DistinctTable} class serves as the intermediate result of {@code 
DistinctAggregationFunction}.
+ */
+@SuppressWarnings({"rawtypes", "unchecked"})
+public class DistinctTable {
+  private static final int MAX_INITIAL_CAPACITY = 10000;
+
+  private final DataSchema _dataSchema;
+  private final int _limit;
+  private final Set<Record> _uniqueRecords;
+  private final PriorityQueue<Record> _sortedRecords;
+  private final List<Record> _records;
+
+  /**
+   * Constructor of the main {@code DistinctTable} which can be used to add 
records and merge other
+   * {@code DistinctTable}s.
+   */
+  public DistinctTable(DataSchema dataSchema, @Nullable List<SelectionSort> 
orderBy, int limit) {
+    _dataSchema = dataSchema;
+    _limit = limit;
+
+    // TODO: see if 10k is the right max initial capacity to use
+    // NOTE: When LIMIT is smaller than or equal to the MAX_INITIAL_CAPACITY, 
no resize is required.
+    int initialCapacity = Math.min(limit, MAX_INITIAL_CAPACITY);
+    _uniqueRecords = new ObjectOpenHashSet<>(initialCapacity);
+    if (orderBy != null) {
+      String[] columns = dataSchema.getColumnNames();
+      int numColumns = columns.length;
+      Object2IntOpenHashMap<String> columnIndexMap = new 
Object2IntOpenHashMap<>(numColumns);
+      for (int i = 0; i < numColumns; i++) {
+        columnIndexMap.put(columns[i], i);
+      }
+      int numOrderByColumns = orderBy.size();
+      int[] orderByColumnIndexes = new int[numOrderByColumns];
+      boolean[] orderByAsc = new boolean[numOrderByColumns];
+      for (int i = 0; i < numOrderByColumns; i++) {
+        SelectionSort selectionSort = orderBy.get(i);
+        orderByColumnIndexes[i] = 
columnIndexMap.getInt(selectionSort.getColumn());
+        orderByAsc[i] = selectionSort.isIsAsc();
+      }
+      _sortedRecords = new PriorityQueue<>(initialCapacity, (record1, record2) 
-> {
+        Object[] values1 = record1.getValues();
+        Object[] values2 = record2.getValues();
+        for (int i = 0; i < numOrderByColumns; i++) {
+          Comparable valueToCompare1 = (Comparable) 
values1[orderByColumnIndexes[i]];
+          Comparable valueToCompare2 = (Comparable) 
values2[orderByColumnIndexes[i]];
+          int result =
+              orderByAsc[i] ? valueToCompare2.compareTo(valueToCompare1) : 
valueToCompare1.compareTo(valueToCompare2);
+          if (result != 0) {
+            return result;
+          }
+        }
+        return 0;
+      });
+    } else {
+      _sortedRecords = null;
+    }
+    _records = null;
+  }
+
+  /**
+   * Returns the {@code DataSchema} of the {@code DistinctTable}.
+   */
+  public DataSchema getDataSchema() {
+    return _dataSchema;
+  }
+
+  /**
+   * Returns the number of unique records within the {@code DistinctTable}.
+   */
+  public int size() {
+    if (_uniqueRecords != null) {
+      // Server-side
+      return _uniqueRecords.size();
+    } else {
+      // Broker-side
+      return _records.size();
+    }
+  }
+
+  /**
+   * Adds a record into the DistinctTable and returns whether more records 
should be added. No more records should be
+   * added iff there is no order-by column and enough unique records have been 
collected.
+   */
+  public boolean add(Record record) {
+    if (_uniqueRecords.contains(record)) {
+      return true;
+    }
+    if (_sortedRecords != null) {
+      if (_sortedRecords.size() < _limit) {

Review comment:
       A comment stating "order by" would be nice. Similarly "no order by" 
before line 144.

##########
File path: 
pinot-core/src/main/java/org/apache/pinot/core/query/aggregation/function/customobject/DistinctTable.java
##########
@@ -0,0 +1,297 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.core.query.aggregation.function.customobject;
+
+import it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap;
+import it.unimi.dsi.fastutil.objects.ObjectOpenHashSet;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.PriorityQueue;
+import java.util.Set;
+import javax.annotation.Nullable;
+import org.apache.pinot.common.request.SelectionSort;
+import org.apache.pinot.common.utils.DataSchema;
+import org.apache.pinot.common.utils.DataTable;
+import org.apache.pinot.core.common.datatable.DataTableBuilder;
+import org.apache.pinot.core.common.datatable.DataTableFactory;
+import org.apache.pinot.core.data.table.Record;
+import org.apache.pinot.spi.utils.ByteArray;
+
+
+/**
+ * The {@code DistinctTable} class serves as the intermediate result of {@code 
DistinctAggregationFunction}.
+ */
+@SuppressWarnings({"rawtypes", "unchecked"})
+public class DistinctTable {
+  private static final int MAX_INITIAL_CAPACITY = 10000;
+
+  private final DataSchema _dataSchema;
+  private final int _limit;
+  private final Set<Record> _uniqueRecords;
+  private final PriorityQueue<Record> _sortedRecords;
+  private final List<Record> _records;
+
+  /**
+   * Constructor of the main {@code DistinctTable} which can be used to add 
records and merge other
+   * {@code DistinctTable}s.
+   */
+  public DistinctTable(DataSchema dataSchema, @Nullable List<SelectionSort> 
orderBy, int limit) {
+    _dataSchema = dataSchema;
+    _limit = limit;
+
+    // TODO: see if 10k is the right max initial capacity to use
+    // NOTE: When LIMIT is smaller than or equal to the MAX_INITIAL_CAPACITY, 
no resize is required.
+    int initialCapacity = Math.min(limit, MAX_INITIAL_CAPACITY);
+    _uniqueRecords = new ObjectOpenHashSet<>(initialCapacity);
+    if (orderBy != null) {
+      String[] columns = dataSchema.getColumnNames();
+      int numColumns = columns.length;
+      Object2IntOpenHashMap<String> columnIndexMap = new 
Object2IntOpenHashMap<>(numColumns);
+      for (int i = 0; i < numColumns; i++) {
+        columnIndexMap.put(columns[i], i);
+      }
+      int numOrderByColumns = orderBy.size();
+      int[] orderByColumnIndexes = new int[numOrderByColumns];
+      boolean[] orderByAsc = new boolean[numOrderByColumns];
+      for (int i = 0; i < numOrderByColumns; i++) {
+        SelectionSort selectionSort = orderBy.get(i);
+        orderByColumnIndexes[i] = 
columnIndexMap.getInt(selectionSort.getColumn());
+        orderByAsc[i] = selectionSort.isIsAsc();
+      }
+      _sortedRecords = new PriorityQueue<>(initialCapacity, (record1, record2) 
-> {
+        Object[] values1 = record1.getValues();
+        Object[] values2 = record2.getValues();
+        for (int i = 0; i < numOrderByColumns; i++) {
+          Comparable valueToCompare1 = (Comparable) 
values1[orderByColumnIndexes[i]];
+          Comparable valueToCompare2 = (Comparable) 
values2[orderByColumnIndexes[i]];
+          int result =
+              orderByAsc[i] ? valueToCompare2.compareTo(valueToCompare1) : 
valueToCompare1.compareTo(valueToCompare2);
+          if (result != 0) {
+            return result;
+          }
+        }
+        return 0;
+      });
+    } else {
+      _sortedRecords = null;
+    }
+    _records = null;
+  }
+
+  /**
+   * Returns the {@code DataSchema} of the {@code DistinctTable}.
+   */
+  public DataSchema getDataSchema() {
+    return _dataSchema;
+  }
+
+  /**
+   * Returns the number of unique records within the {@code DistinctTable}.
+   */
+  public int size() {
+    if (_uniqueRecords != null) {
+      // Server-side
+      return _uniqueRecords.size();
+    } else {
+      // Broker-side
+      return _records.size();
+    }
+  }
+
+  /**
+   * Adds a record into the DistinctTable and returns whether more records 
should be added. No more records should be
+   * added iff there is no order-by column and enough unique records have been 
collected.
+   */
+  public boolean add(Record record) {
+    if (_uniqueRecords.contains(record)) {
+      return true;
+    }
+    if (_sortedRecords != null) {
+      if (_sortedRecords.size() < _limit) {
+        _uniqueRecords.add(record);
+        _sortedRecords.offer(record);
+      } else {
+        Record leastRecord = _sortedRecords.peek();
+        if (_sortedRecords.comparator().compare(record, leastRecord) > 0) {
+          _uniqueRecords.remove(leastRecord);
+          _uniqueRecords.add(record);
+          _sortedRecords.poll();
+          _sortedRecords.offer(record);
+        }
+      }
+      return true;
+    } else {
+      _uniqueRecords.add(record);
+      return _uniqueRecords.size() < _limit;
+    }
+  }
+
+  /**
+   * Returns {@code true} if no more records should be added, {@code false 
otherwise}. No more records should be added
+   * iff there is no order-by columns and enough unique records have been 
collected.
+   */
+  public boolean shouldNotAddMore() {

Review comment:
       So this can be useful for early termination for DISTINCT queries without 
ORDER BY?

##########
File path: 
pinot-core/src/main/java/org/apache/pinot/core/query/aggregation/function/customobject/DistinctTable.java
##########
@@ -0,0 +1,297 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.core.query.aggregation.function.customobject;
+
+import it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap;
+import it.unimi.dsi.fastutil.objects.ObjectOpenHashSet;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.PriorityQueue;
+import java.util.Set;
+import javax.annotation.Nullable;
+import org.apache.pinot.common.request.SelectionSort;
+import org.apache.pinot.common.utils.DataSchema;
+import org.apache.pinot.common.utils.DataTable;
+import org.apache.pinot.core.common.datatable.DataTableBuilder;
+import org.apache.pinot.core.common.datatable.DataTableFactory;
+import org.apache.pinot.core.data.table.Record;
+import org.apache.pinot.spi.utils.ByteArray;
+
+
+/**
+ * The {@code DistinctTable} class serves as the intermediate result of {@code 
DistinctAggregationFunction}.
+ */
+@SuppressWarnings({"rawtypes", "unchecked"})
+public class DistinctTable {
+  private static final int MAX_INITIAL_CAPACITY = 10000;
+
+  private final DataSchema _dataSchema;
+  private final int _limit;
+  private final Set<Record> _uniqueRecords;
+  private final PriorityQueue<Record> _sortedRecords;
+  private final List<Record> _records;
+
+  /**
+   * Constructor of the main {@code DistinctTable} which can be used to add 
records and merge other
+   * {@code DistinctTable}s.
+   */
+  public DistinctTable(DataSchema dataSchema, @Nullable List<SelectionSort> 
orderBy, int limit) {
+    _dataSchema = dataSchema;
+    _limit = limit;
+
+    // TODO: see if 10k is the right max initial capacity to use
+    // NOTE: When LIMIT is smaller than or equal to the MAX_INITIAL_CAPACITY, 
no resize is required.
+    int initialCapacity = Math.min(limit, MAX_INITIAL_CAPACITY);
+    _uniqueRecords = new ObjectOpenHashSet<>(initialCapacity);
+    if (orderBy != null) {
+      String[] columns = dataSchema.getColumnNames();
+      int numColumns = columns.length;
+      Object2IntOpenHashMap<String> columnIndexMap = new 
Object2IntOpenHashMap<>(numColumns);
+      for (int i = 0; i < numColumns; i++) {
+        columnIndexMap.put(columns[i], i);
+      }
+      int numOrderByColumns = orderBy.size();
+      int[] orderByColumnIndexes = new int[numOrderByColumns];
+      boolean[] orderByAsc = new boolean[numOrderByColumns];
+      for (int i = 0; i < numOrderByColumns; i++) {
+        SelectionSort selectionSort = orderBy.get(i);
+        orderByColumnIndexes[i] = 
columnIndexMap.getInt(selectionSort.getColumn());
+        orderByAsc[i] = selectionSort.isIsAsc();
+      }
+      _sortedRecords = new PriorityQueue<>(initialCapacity, (record1, record2) 
-> {
+        Object[] values1 = record1.getValues();
+        Object[] values2 = record2.getValues();
+        for (int i = 0; i < numOrderByColumns; i++) {
+          Comparable valueToCompare1 = (Comparable) 
values1[orderByColumnIndexes[i]];
+          Comparable valueToCompare2 = (Comparable) 
values2[orderByColumnIndexes[i]];
+          int result =
+              orderByAsc[i] ? valueToCompare2.compareTo(valueToCompare1) : 
valueToCompare1.compareTo(valueToCompare2);
+          if (result != 0) {
+            return result;
+          }
+        }
+        return 0;
+      });
+    } else {
+      _sortedRecords = null;
+    }
+    _records = null;
+  }
+
+  /**
+   * Returns the {@code DataSchema} of the {@code DistinctTable}.
+   */
+  public DataSchema getDataSchema() {
+    return _dataSchema;
+  }
+
+  /**
+   * Returns the number of unique records within the {@code DistinctTable}.
+   */
+  public int size() {
+    if (_uniqueRecords != null) {
+      // Server-side
+      return _uniqueRecords.size();
+    } else {
+      // Broker-side
+      return _records.size();
+    }
+  }
+
+  /**
+   * Adds a record into the DistinctTable and returns whether more records 
should be added. No more records should be
+   * added iff there is no order-by column and enough unique records have been 
collected.
+   */
+  public boolean add(Record record) {
+    if (_uniqueRecords.contains(record)) {
+      return true;
+    }
+    if (_sortedRecords != null) {
+      if (_sortedRecords.size() < _limit) {
+        _uniqueRecords.add(record);
+        _sortedRecords.offer(record);
+      } else {
+        Record leastRecord = _sortedRecords.peek();
+        if (_sortedRecords.comparator().compare(record, leastRecord) > 0) {
+          _uniqueRecords.remove(leastRecord);
+          _uniqueRecords.add(record);
+          _sortedRecords.poll();
+          _sortedRecords.offer(record);
+        }
+      }
+      return true;
+    } else {
+      _uniqueRecords.add(record);
+      return _uniqueRecords.size() < _limit;
+    }
+  }
+
+  /**
+   * Returns {@code true} if no more records should be added, {@code false 
otherwise}. No more records should be added
+   * iff there is no order-by columns and enough unique records have been 
collected.
+   */
+  public boolean shouldNotAddMore() {
+    return _sortedRecords == null && _uniqueRecords.size() == _limit;
+  }
+
+  /*
+   * SERVER ONLY METHODS
+   */
+
+  /**
+   * (Server-side) Merges another {@code DistinctTable} into the current one.
+   * <p>NOTE: {@code DistinctTable} on Server-side has non-null {@code 
_uniqueRecords}.
+   */
+  public void serverSideMerge(DistinctTable distinctTable) {
+    if (shouldNotAddMore()) {
+      return;
+    }
+    for (Record record : distinctTable._uniqueRecords) {
+      if (!add(record)) {
+        return;
+      }

Review comment:
       Here we seem to be doing extra check on the return value of add() for 
each record. If there is order by, add() will never return false right? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #5451: Refactor DistinctTable to use PriorityQueue based algorithm

Reply via email to