Re: [PR] Fix Field.java documentation to refer to new IntField/FloatField/LongField/DoubleField #12125 [lucene]

2024-01-15 Thread via GitHub


SreehariG73 commented on PR #12821:
URL: https://github.com/apache/lucene/pull/12821#issuecomment-1891530573

   Hello Adrien,
   
   I have made changes to the PR. Please review the same and let me know if it
   looks good
   
   regards,
   Sreehari Guruprasad
   
   On Mon, Jan 8, 2024 at 6:08 AM Adrien Grand ***@***.***>
   wrote:
   
   > @SreehariG73  Do you plan on removing the
   > unintended changes?
   >
   > —
   > Reply to this email directly, view it on GitHub
   > , or
   > unsubscribe
   > 

   > .
   > You are receiving this because you were mentioned.Message ID:
   > ***@***.***>
   >
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Fix Field.java documentation to refer to new IntField/FloatField/LongField/DoubleField [lucene]

2024-01-15 Thread via GitHub


jpountz closed issue #12125: Fix Field.java documentation to refer to new 
IntField/FloatField/LongField/DoubleField
URL: https://github.com/apache/lucene/issues/12125


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fixed Field.java documentation to refer to new IntField/FloatField/Lo… [lucene]

2024-01-15 Thread via GitHub


jpountz merged PR #13012:
URL: https://github.com/apache/lucene/pull/13012


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix Field.java documentation to refer to new IntField/FloatField/LongField/DoubleField #12125 [lucene]

2024-01-15 Thread via GitHub


jpountz closed pull request #12821: Fix Field.java documentation to refer to 
new IntField/FloatField/LongField/DoubleField #12125
URL: https://github.com/apache/lucene/pull/12821


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix Field.java documentation to refer to new IntField/FloatField/LongField/DoubleField #12125 [lucene]

2024-01-15 Thread via GitHub


jpountz commented on PR #12821:
URL: https://github.com/apache/lucene/pull/12821#issuecomment-1891539895

   Thank you! I'll close this one then.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-15 Thread via GitHub


jimczi commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1452096401


##
lucene/core/src/java/org/apache/lucene/index/LeafReader.java:
##
@@ -240,11 +241,19 @@ public final PostingsEnum postings(Term term) throws 
IOException {
* @param acceptDocs {@link Bits} that represents the allowed documents to 
match, or {@code null}
* if they are all allowed to match.
* @param visitedLimit the maximum number of nodes that the search is 
allowed to visit
+   * @param globalScoreQueue the global score queue used to track the top 
scores collected across
+   * all leaves
* @return the k nearest neighbor documents, along with their 
(searchStrategy-specific) scores.
* @lucene.experimental
*/
   public final TopDocs searchNearestVectors(
-  String field, float[] target, int k, Bits acceptDocs, int visitedLimit) 
throws IOException {
+  String field,
+  float[] target,
+  int k,
+  Bits acceptDocs,
+  int visitedLimit,
+  BlockingFloatHeap globalScoreQueue)

Review Comment:
   I wonder if we could introduce a `CollectorManager` similar to what we have 
for `TopDocs`? This global queue should be an implementation detail of the 
collector imo. 



##
lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:
##
@@ -27,25 +29,72 @@
  */
 public final class TopKnnCollector extends AbstractKnnCollector {
 
+  // greediness of globally non-competitive search: [0,1]
+  private static final float DEFAULT_GREEDINESS = 0.9f;
   private final NeighborQueue queue;
+  private final float greediness;
+  private final FloatHeap nonCompetitiveQueue;
+  private final FloatHeap updatesQueue;
+  private final int interval = 0x3ff; // 1023

Review Comment:
   This interval is important since we need to ensure that we don't use the 
global queue too early. Did you check with different values? I'd expect that a 
smaller interval could affect recall.



##
lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:
##
@@ -27,25 +29,72 @@
  */
 public final class TopKnnCollector extends AbstractKnnCollector {
 
+  // greediness of globally non-competitive search: [0,1]
+  private static final float DEFAULT_GREEDINESS = 0.9f;
   private final NeighborQueue queue;
+  private final float greediness;
+  private final FloatHeap nonCompetitiveQueue;
+  private final FloatHeap updatesQueue;
+  private final int interval = 0x3ff; // 1023

Review Comment:
   It's also important to check the order of execution. For instance what 
happens if all segments are executed serially (rather than in parallel), does 
it changes the recall?



##
lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:
##
@@ -27,25 +29,72 @@
  */
 public final class TopKnnCollector extends AbstractKnnCollector {
 
+  // greediness of globally non-competitive search: [0,1]
+  private static final float DEFAULT_GREEDINESS = 0.9f;
   private final NeighborQueue queue;
+  private final float greediness;
+  private final FloatHeap nonCompetitiveQueue;
+  private final FloatHeap updatesQueue;
+  private final int interval = 0x3ff; // 1023
+  private final BlockingFloatHeap globalSimilarityQueue;
+  private boolean kResultsCollected = false;
+  private float cachedGlobalMinSim = Float.NEGATIVE_INFINITY;
 
   /**
* @param k the number of neighbors to collect
* @param visitLimit how many vector nodes the results are allowed to visit
*/
-  public TopKnnCollector(int k, int visitLimit) {
+  public TopKnnCollector(int k, int visitLimit, BlockingFloatHeap 
globalSimilarityQueue) {
 super(k, visitLimit);
+this.greediness = DEFAULT_GREEDINESS;
 this.queue = new NeighborQueue(k, false);
+this.globalSimilarityQueue = globalSimilarityQueue;
+
+if (globalSimilarityQueue == null) {
+  this.nonCompetitiveQueue = null;
+  this.updatesQueue = null;
+} else {
+  this.nonCompetitiveQueue = new FloatHeap(Math.max(1, Math.round((1 - 
greediness) * k)));
+  this.updatesQueue = new FloatHeap(k);
+}
   }
 
   @Override
   public boolean collect(int docId, float similarity) {
-return queue.insertWithOverflow(docId, similarity);
+boolean localSimUpdated = queue.insertWithOverflow(docId, similarity);
+boolean firstKResultsCollected = (kResultsCollected == false && 
queue.size() == k());
+if (firstKResultsCollected) {
+  kResultsCollected = true;
+}
+
+boolean globalSimUpdated = false;
+if (globalSimilarityQueue != null) {
+  updatesQueue.offer(similarity);
+  globalSimUpdated = nonCompetitiveQueue.offer(similarity);
+
+  if (kResultsCollected) {
+// as we've collected k results, we can start do periodic updates with 
the global queue
+if (firstKResultsCollected || (visitedCount & interval) == 0) {
+  cachedGlobalMinSim = 
globalSimilarityQueue.offer(updatesQueue.getHeap());

Review Comment:
   This 

Re: [I] Port PR management bot from Apache Beam [lucene]

2024-01-15 Thread via GitHub


stefanvodita commented on issue #12796:
URL: https://github.com/apache/lucene/issues/12796#issuecomment-1892098969

   The stale bot has been online for a week now and I like the effect it's had. 
It labeled almost 100 PRs as stale in total and we closed or un-staled 30 of 
those.
   
   Another small step we could take from here might be to label PRs based on 
the packages/modules they touch. Then we can filter for e.g. "module:facet" PRs 
or "module:spatial" PRs. These labels already exist, but I haven't seen them 
used.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Backport #12829 to 9.x [lucene]

2024-01-15 Thread via GitHub


s1monw commented on PR #13013:
URL: https://github.com/apache/lucene/pull/13013#issuecomment-1892154344

   ups I made a bad assumpion. I need to revisit this...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Reduce duplication in taxonomy facets; always do counts [lucene]

2024-01-15 Thread via GitHub


Shradha26 commented on code in PR #12966:
URL: https://github.com/apache/lucene/pull/12966#discussion_r1449034518


##
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacets.java:
##
@@ -17,20 +17,44 @@
 
 package org.apache.lucene.facet.taxonomy;
 
+import com.carrotsearch.hppc.IntArrayList;
+import com.carrotsearch.hppc.IntIntHashMap;
+import com.carrotsearch.hppc.cursors.IntIntCursor;
 import java.io.IOException;
 import java.util.ArrayList;
+import java.util.Arrays;
 import java.util.Collections;
 import java.util.Comparator;
+import java.util.HashMap;
 import java.util.List;
 import java.util.Locale;
+import java.util.Map;
 import org.apache.lucene.facet.FacetResult;
 import org.apache.lucene.facet.Facets;
 import org.apache.lucene.facet.FacetsCollector;
 import org.apache.lucene.facet.FacetsConfig;
 import org.apache.lucene.facet.FacetsConfig.DimConfig;
+import org.apache.lucene.facet.LabelAndValue;
+import org.apache.lucene.facet.TopOrdAndIntQueue;
+import org.apache.lucene.facet.TopOrdAndNumberQueue;
+import org.apache.lucene.util.PriorityQueue;
 
 /** Base class for all taxonomy-based facets impls. */
 abstract class TaxonomyFacets extends Facets {
+  /** Intermediate result to store top children for a given path before 
resolving labels, etc. */
+  record TopChildrenForPath(Number pathValue, int childCount, 
TopOrdAndNumberQueue childQueue) {}
+
+  private static class DimValue {

Review Comment:
   [nit] should we call this just `Dim` and `String dimPath` instead of `String 
dim`? I see later that we've used `int dimValue` and this is getting quickly 
overloaded?



##
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacets.java:
##
@@ -76,6 +111,78 @@ public int compare(FacetResult a, FacetResult b) {
 this.config = config;
 this.fc = fc;
 parents = taxoReader.getParallelTaxonomyArrays().parents();
+valueComparator = Comparator.comparingInt((x) -> (int) x);
+  }
+
+  /** Return true if a sparse hash table should be used for counting, instead 
of a dense int[]. */
+  private boolean useHashTable(FacetsCollector fc, TaxonomyReader taxoReader) {
+if (taxoReader.getSize() < 1024) {
+  // small number of unique values: use an array
+  return false;
+}
+
+if (fc == null) {
+  // counting all docs: use an array
+  return false;
+}
+
+int maxDoc = 0;
+int sumTotalHits = 0;
+for (FacetsCollector.MatchingDocs docs : fc.getMatchingDocs()) {
+  sumTotalHits += docs.totalHits;
+  maxDoc += docs.context.reader().maxDoc();
+}
+
+// if our result set is < 10% of the index, we collect sparsely (use hash 
map):
+return sumTotalHits < maxDoc / 10;
+  }
+
+  protected void initializeValueCounters() {
+if (initialized) {
+  return;
+}
+initialized = true;
+assert sparseCounts == null && counts == null;
+if (useHashTable(fc, taxoReader)) {
+  sparseCounts = new IntIntHashMap();
+} else {
+  counts = new int[taxoReader.getSize()];
+}
+  }
+
+  /** Set the count for this ordinal to {@code newValue}. */
+  protected void setCount(int ordinal, int newValue) {
+if (sparseCounts != null) {
+  sparseCounts.put(ordinal, newValue);
+} else {
+  counts[ordinal] = newValue;
+}
+  }
+
+  /** Get the count for this ordinal. */
+  protected int getCount(int ordinal) {
+if (sparseCounts != null) {
+  return sparseCounts.get(ordinal);
+} else {
+  return counts[ordinal];
+}
+  }
+
+  /** Get the aggregation value for this ordinal. */
+  protected Number getAggregationValue(int ordinal) {
+// By default, this is just the count.
+return getCount(ordinal);
+  }
+
+  /** Apply an aggregation to the two values and return the result. */
+  protected Number aggregate(Number existingVal, Number newVal) {
+// By default, we are computing counts, so the values are interpreted as 
integers and summed.
+return (int) existingVal + (int) newVal;

Review Comment:
   Can we use the concept of an aggregation function while combining in this 
method. (In line with my previous comment about making the logic for 
`IntTaxonomyFacets` and `FloatTaxonomyFacets` the default)



##
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacets.java:
##
@@ -76,6 +111,78 @@ public int compare(FacetResult a, FacetResult b) {
 this.config = config;
 this.fc = fc;
 parents = taxoReader.getParallelTaxonomyArrays().parents();
+valueComparator = Comparator.comparingInt((x) -> (int) x);
+  }
+
+  /** Return true if a sparse hash table should be used for counting, instead 
of a dense int[]. */
+  private boolean useHashTable(FacetsCollector fc, TaxonomyReader taxoReader) {
+if (taxoReader.getSize() < 1024) {
+  // small number of unique values: use an array
+  return false;
+}
+
+if (fc == null) {
+  // counting all docs: use an array
+  return false;
+}
+
+

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-15 Thread via GitHub


mayya-sharipova commented on PR #12962:
URL: https://github.com/apache/lucene/pull/12962#issuecomment-1892646998

   @jimczi Thanks for your feedback.
   
   > Some of the recalls for the single segment baseline seem seem quite off 
(0.477?). Are you sure that there was no issue during the testing?
   
   Sorry, I made a mistake. I've updated the results. 
   
   I will work on more experiments to address your other comments/


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

2024-01-15 Thread via GitHub


mayya-sharipova commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1452703248


##
lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:
##
@@ -27,25 +29,72 @@
  */
 public final class TopKnnCollector extends AbstractKnnCollector {
 
+  // greediness of globally non-competitive search: [0,1]
+  private static final float DEFAULT_GREEDINESS = 0.9f;
   private final NeighborQueue queue;
+  private final float greediness;
+  private final FloatHeap nonCompetitiveQueue;
+  private final FloatHeap updatesQueue;
+  private final int interval = 0x3ff; // 1023
+  private final BlockingFloatHeap globalSimilarityQueue;
+  private boolean kResultsCollected = false;
+  private float cachedGlobalMinSim = Float.NEGATIVE_INFINITY;
 
   /**
* @param k the number of neighbors to collect
* @param visitLimit how many vector nodes the results are allowed to visit
*/
-  public TopKnnCollector(int k, int visitLimit) {
+  public TopKnnCollector(int k, int visitLimit, BlockingFloatHeap 
globalSimilarityQueue) {
 super(k, visitLimit);
+this.greediness = DEFAULT_GREEDINESS;
 this.queue = new NeighborQueue(k, false);
+this.globalSimilarityQueue = globalSimilarityQueue;
+
+if (globalSimilarityQueue == null) {
+  this.nonCompetitiveQueue = null;
+  this.updatesQueue = null;
+} else {
+  this.nonCompetitiveQueue = new FloatHeap(Math.max(1, Math.round((1 - 
greediness) * k)));
+  this.updatesQueue = new FloatHeap(k);
+}
   }
 
   @Override
   public boolean collect(int docId, float similarity) {
-return queue.insertWithOverflow(docId, similarity);
+boolean localSimUpdated = queue.insertWithOverflow(docId, similarity);
+boolean firstKResultsCollected = (kResultsCollected == false && 
queue.size() == k());
+if (firstKResultsCollected) {
+  kResultsCollected = true;
+}
+
+boolean globalSimUpdated = false;
+if (globalSimilarityQueue != null) {
+  updatesQueue.offer(similarity);
+  globalSimUpdated = nonCompetitiveQueue.offer(similarity);
+
+  if (kResultsCollected) {
+// as we've collected k results, we can start do periodic updates with 
the global queue
+if (firstKResultsCollected || (visitedCount & interval) == 0) {
+  cachedGlobalMinSim = 
globalSimilarityQueue.offer(updatesQueue.getHeap());

Review Comment:
   @jimczi Sorry I did not understand this comment, can you please clarify it. 
   
   What does it mean `k` is close to the size of queue?  The 
`globalSimilarityQueue` is of size `k`. 
   
   Also I am not clear how you derive the value of `24`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Rollback the tmp storage of BytesRefHash to -1 after sort [lucene]

2024-01-15 Thread via GitHub


gf2121 opened a new pull request, #13014:
URL: https://github.com/apache/lucene/pull/13014

   In #12784 we cache buckets into extra slots in `BytesRefHash` to speed up 
`BytesRefHash#sort`. This causes an 
[AssertError](https://lists.apache.org/thread/lxk2lm88twfzdl549974vov1984kj995) 
for 
[`TermsQuery`](https://github.com/apache/lucene/blob/c746bea233fd14fe81d9805633fcce0f7b9681b6/lucene/join/src/java/org/apache/lucene/search/join/TermsQuery.java#L68)
 because it could call `sort` more than once on a `BytesRefHash` instance. Not 
sure if any other users were relying on this but it would be great to keep 
consistent with before.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org