Re: [PR] Aggregate files from the same segment into a single Arena [lucene]

2024-07-23 Thread via GitHub
ChrisHegarty commented on code in PR #13570: URL: https://github.com/apache/lucene/pull/13570#discussion_r1687664305 ## lucene/core/src/java21/org/apache/lucene/store/RefCountedSharedArena.java: ## @@ -0,0 +1,138 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

Re: [PR] Aggregate files from the same segment into a single Arena [lucene]

2024-07-23 Thread via GitHub
ChrisHegarty commented on code in PR #13570: URL: https://github.com/apache/lucene/pull/13570#discussion_r1687665071 ## lucene/core/src/java21/org/apache/lucene/store/RefCountedSharedArena.java: ## @@ -0,0 +1,138 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

Re: [PR] Aggregate files from the same segment into a single Arena [lucene]

2024-07-23 Thread via GitHub
ChrisHegarty commented on code in PR #13570: URL: https://github.com/apache/lucene/pull/13570#discussion_r1687726079 ## lucene/core/src/java21/org/apache/lucene/store/RefCountedSharedArena.java: ## @@ -0,0 +1,138 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

Re: [I] Try applying bipartite graph reordering to KNN graph node ids [lucene]

2024-07-23 Thread via GitHub
tteofili commented on issue #13565: URL: https://github.com/apache/lucene/issues/13565#issuecomment-2244713847 this [paper](https://dl.acm.org/doi/abs/10.1145/3626772.3657906) from SIGIR'24 seems to do exactly this as a first step in their Block Max Pruning technique. -- This is an autom

Re: [PR] Take advantage of the doc value skipper when it is primary sort [lucene]

2024-07-23 Thread via GitHub
iverase commented on PR #13592: URL: https://github.com/apache/lucene/pull/13592#issuecomment-2244734279 I tried to find a generic method that could help here but I think the logic relies too much on the fact that the index is sorted. For example, your mental model somewhat breaks if

Re: [PR] Compute facets while collecting [lucene]

2024-07-23 Thread via GitHub
epotyom commented on code in PR #13568: URL: https://github.com/apache/lucene/pull/13568#discussion_r1687770185 ## lucene/demo/src/java/org/apache/lucene/demo/facet/SandboxFacetsExample.java: ## @@ -0,0 +1,714 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one

[PR] Update TestTopDocsCollector to no longer rely on search(Query, Collector) [lucene]

2024-07-23 Thread via GitHub
javanna opened a new pull request, #13600: URL: https://github.com/apache/lucene/pull/13600 IndexSearcher#search(Query, Collector) is deprecated and leftover usages should be removed. This addresses one usage in TestTopDocsCollector. -- This is an automated message from the Apache Git Ser
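
For readers unfamiliar with the migration described above, here is a minimal sketch of the collector-manager shape: one collector per slice, then a single reduce step. It is plain Java and does not use Lucene; `CountingCollector`, `CountingManager`, and the `search` helper are hypothetical stand-ins chosen for illustration, and only the `newCollector()`/`reduce()` naming is meant to echo Lucene's `CollectorManager`.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

final class CollectorManagerSketch {

  /** Hypothetical per-slice collector that only counts hits; not a Lucene class. */
  static final class CountingCollector {
    private int count;

    void collect(int docId) {
      count++;
    }

    int count() {
      return count;
    }
  }

  /** Hypothetical manager: creates one collector per slice and merges their results. */
  static final class CountingManager {
    CountingCollector newCollector() {
      return new CountingCollector();
    }

    int reduce(Collection<CountingCollector> collectors) {
      int total = 0;
      for (CountingCollector collector : collectors) {
        total += collector.count();
      }
      return total;
    }
  }

  /** Feeds each slice of doc IDs to its own collector, then reduces once at the end. */
  static int search(List<int[]> slices, CountingManager manager) {
    List<CountingCollector> collectors = new ArrayList<>();
    for (int[] slice : slices) {
      CountingCollector collector = manager.newCollector();
      for (int docId : slice) {
        collector.collect(docId);
      }
      collectors.add(collector);
    }
    return manager.reduce(collectors);
  }

  public static void main(String[] args) {
    List<int[]> slices = List.of(new int[] {0, 1, 2}, new int[] {3, 4});
    System.out.println(search(slices, new CountingManager()));
  }
}
```

Because no collector is shared between slices, the slices could be processed on separate threads without synchronizing inside `collect`; that is the design point of the pattern, shown here sequentially for brevity.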

[PR] Update TestTopDocsMerge to not rely on search(Query, Collector) [lucene]

2024-07-23 Thread via GitHub
javanna opened a new pull request, #13601: URL: https://github.com/apache/lucene/pull/13601 IndexSearcher#search(Query, Collector) is deprecated and leftover usages should be removed. This addresses one usage in TestTopDocsMerge. -- This is an automated message from the Apache Git

Re: [I] Try applying bipartite graph reordering to KNN graph node ids [lucene]

2024-07-23 Thread via GitHub
mikemccand commented on issue #13565: URL: https://github.com/apache/lucene/issues/13565#issuecomment-2244785292 > This is what got me to thinking of BP for HNSW search: intuitively, it could help a lot when the dataset size exceeds the size of the page cache? I think that gains might

[PR] Update ReadTask to not rely on search(Query, Collector) [lucene]

2024-07-23 Thread via GitHub
javanna opened a new pull request, #13602: URL: https://github.com/apache/lucene/pull/13602 This commit modifies ReadTask to no longer call the deprecated search(Query, Collector). Instead, it creates a collector manager and calls search(Query, CollectorManager). The existing protect

Re: [PR] Update ReadTask to not rely on search(Query, Collector) [lucene]

2024-07-23 Thread via GitHub
javanna commented on code in PR #13602: URL: https://github.com/apache/lucene/pull/13602#discussion_r1687812812 ## lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/SearchWithCollectorTask.java: ## @@ -46,17 +46,17 @@ public boolean withCollector() { } @

[PR] Introduce IndexSearcher#searchLeaf(LeafReaderContext, Weight, Collector) method [lucene]

2024-07-23 Thread via GitHub
javanna opened a new pull request, #13603: URL: https://github.com/apache/lucene/pull/13603 There's a couple of places in the codebase where we extend IndexSearcher to customize per leaf behaviour, and in order to do that, we need to override the entire search method that loops through the
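
The motivation above is the classic case for a per-leaf hook. A minimal sketch, assuming nothing about the PR beyond what the snippet says: the base class owns the loop over leaves and exposes one overridable method, so subclasses no longer copy the loop. `Searcher`, `LoggingSearcher`, and the string "leaves" are hypothetical; this is not Lucene's `IndexSearcher`.

```java
import java.util.ArrayList;
import java.util.List;

final class SearchLeafSketch {

  /** Hypothetical searcher: search() owns the loop, searchLeaf() is the per-leaf hook. */
  static class Searcher {
    final List<String> log = new ArrayList<>();

    final void search(List<String> leaves) {
      for (String leaf : leaves) {
        searchLeaf(leaf);
      }
    }

    protected void searchLeaf(String leaf) {
      log.add("search " + leaf);
    }
  }

  /** Customizes per-leaf behaviour by overriding only the hook, not the loop. */
  static class LoggingSearcher extends Searcher {
    @Override
    protected void searchLeaf(String leaf) {
      log.add("before " + leaf);
      super.searchLeaf(leaf);
    }
  }

  public static void main(String[] args) {
    LoggingSearcher searcher = new LoggingSearcher();
    searcher.search(List.of("leaf0", "leaf1"));
    System.out.println(searcher.log);
  }
}
```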

Re: [PR] Introduce IndexSearcher#searchLeaf(LeafReaderContext, Weight, Collector) method [lucene]

2024-07-23 Thread via GitHub
javanna commented on code in PR #13603: URL: https://github.com/apache/lucene/pull/13603#discussion_r1687837747 ## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ## @@ -694,40 +695,56 @@ protected void search(List leaves, Weight weight, Collector c // th

Re: [PR] Move synonym map off-heap for SynonymGraphFilter [lucene]

2024-07-23 Thread via GitHub
mikemccand commented on code in PR #13054: URL: https://github.com/apache/lucene/pull/13054#discussion_r1687872625 ## lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymMapDirectory.java: ## @@ -0,0 +1,191 @@ +/* + * Licensed to the Apache Software Foundat

Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-07-23 Thread via GitHub
benwtrent commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2244972179 @vigyasharma > do we have any existing benchmarks for ParentJoin queries in knn? No, we do not. I ended up writing a bunch of throw away code to benchmark latency and rec

Re: [PR] Aggregate files from the same segment into a single Arena [lucene]

2024-07-23 Thread via GitHub
ChrisHegarty commented on code in PR #13570: URL: https://github.com/apache/lucene/pull/13570#discussion_r1687920309 ## lucene/core/src/java21/org/apache/lucene/store/RefCountedSharedArena.java: ## @@ -0,0 +1,138 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

Re: [PR] Aggregate files from the same segment into a single Arena [lucene]

2024-07-23 Thread via GitHub
uschindler commented on code in PR #13570: URL: https://github.com/apache/lucene/pull/13570#discussion_r1687970721 ## lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java: ## @@ -83,6 +93,26 @@ public class MMapDirectory extends FSDirectory { */ public static f

[PR] KMeans clustering algorithm [lucene]

2024-07-23 Thread via GitHub
mayya-sharipova opened a new pull request, #13604: URL: https://github.com/apache/lucene/pull/13604 Implement Kmeans clustering algorithm for vectors. Knn algorithms that further reduce memory usage of vectors (such as Product Quantization, RaBitQ etc) require clustering of vectors
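
As background for this thread, here is a minimal sketch of plain Lloyd-style k-means over float vectors. It is not the implementation in the PR: the class and method names are hypothetical, the initial centroids are supplied by the caller, and it omits the sampling and restarts mentioned later in the discussion.

```java
import java.util.Arrays;

final class KMeansSketch {

  /** Squared Euclidean distance between two vectors of equal length. */
  static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }

  /** Index of the centroid closest to the given vector. */
  static int nearest(float[] vector, float[][] centroids) {
    int best = 0;
    float bestDistance = Float.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++) {
      float distance = squaredDistance(vector, centroids[c]);
      if (distance < bestDistance) {
        bestDistance = distance;
        best = c;
      }
    }
    return best;
  }

  /**
   * Runs a fixed number of iterations, updating centroids in place.
   * Returns the assignments computed in the last iteration.
   */
  static int[] cluster(float[][] vectors, float[][] centroids, int iterations) {
    int dim = vectors[0].length;
    int[] assignment = new int[vectors.length];
    for (int it = 0; it < iterations; it++) {
      float[][] sums = new float[centroids.length][dim];
      int[] counts = new int[centroids.length];
      // Assignment step: attach every vector to its nearest centroid.
      for (int v = 0; v < vectors.length; v++) {
        int c = nearest(vectors[v], centroids);
        assignment[v] = c;
        counts[c]++;
        for (int d = 0; d < dim; d++) {
          sums[c][d] += vectors[v][d];
        }
      }
      // Update step: move each non-empty centroid to the mean of its members.
      for (int c = 0; c < centroids.length; c++) {
        if (counts[c] == 0) {
          continue; // an empty cluster keeps its previous centroid
        }
        for (int d = 0; d < dim; d++) {
          centroids[c][d] = sums[c][d] / counts[c];
        }
      }
    }
    return assignment;
  }

  public static void main(String[] args) {
    float[][] vectors = {{0f, 0f}, {0f, 1f}, {10f, 10f}, {10f, 11f}};
    float[][] centroids = {{0f, 0f}, {10f, 10f}};
    System.out.println(Arrays.toString(cluster(vectors, centroids, 5)));
  }
}
```

Each iteration costs roughly (number of vectors) × (number of clusters) × (dimension) operations, which is why the iteration count, sample size, and cluster count dominate the run time.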

Re: [PR] KMeans clustering algorithm [lucene]

2024-07-23 Thread via GitHub
mayya-sharipova commented on PR #13604: URL: https://github.com/apache/lucene/pull/13604#issuecomment-2245155916 I did [benchmarking](https://github.com/mayya-sharipova/kmeans-test) on MNIST dataset and compared the accuracy with [KMeans algorithm](https://github.com/mayya-sharipova/kmeans

Re: [PR] KMeans clustering algorithm [lucene]

2024-07-23 Thread via GitHub
mikemccand commented on PR #13604: URL: https://github.com/apache/lucene/pull/13604#issuecomment-2245202448 Whoa, cool! What is (roughly) the run-time of KMeans as a function of number of vectors? Do you tell it how many clusters to create, or, do you ask it to keep splitting into more cl

Re: [PR] Aggregate files from the same segment into a single Arena [lucene]

2024-07-23 Thread via GitHub
uschindler commented on code in PR #13570: URL: https://github.com/apache/lucene/pull/13570#discussion_r1688093358 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInputProvider.java: ## @@ -125,4 +135,77 @@ private final MemorySegment[] map( } retur

Re: [I] Remove all deprecated IndexSearcher#search(Query, Collector) usage / methods in the next major release [lucene]

2024-07-23 Thread via GitHub
javanna commented on issue #12892: URL: https://github.com/apache/lucene/issues/12892#issuecomment-2245377653 I have been looking into this, there are unfortunately ~85 leftover usages of this method. Would be great to clean this up for Lucene 10. I opened a few small PRs around this.

Re: [PR] KMeans clustering algorithm [lucene]

2024-07-23 Thread via GitHub
benwtrent commented on code in PR #13604: URL: https://github.com/apache/lucene/pull/13604#discussion_r1688239085 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/quantization/KMeans.java: ## @@ -0,0 +1,344 @@ +/* + * Licensed to the Apache Software Foundation (ASF) u

Re: [PR] KMeans clustering algorithm [lucene]

2024-07-23 Thread via GitHub
benwtrent commented on PR #13604: URL: https://github.com/apache/lucene/pull/13604#issuecomment-2245534266 @mikemccand The runtime depends on the configured number of iterations, restarts, sample size, and cluster count. But, it can be very fast. I will leave Mayya to talk about some s

Re: [PR] WIP: draft of intra segment concurrency [lucene]

2024-07-23 Thread via GitHub
javanna commented on PR #13542: URL: https://github.com/apache/lucene/pull/13542#issuecomment-2245595764 Pinging @gsmiller as well around the challenges adjusting the facets code I mentioned [above](https://github.com/apache/lucene/pull/13542#issuecomment-2243620253). I've seen collector m

Re: [PR] Delegating the matches in PointRangeQuery weight to relate method [lucene]

2024-07-23 Thread via GitHub
harshavamsi commented on PR #13599: URL: https://github.com/apache/lucene/pull/13599#issuecomment-2245815063 > I would expect this change to result in a slow down for this type of queries. > > You are proposing to replace the current implementation with a slower one that computes the

Re: [I] Deprecate `COSINE` before Lucene 10 release [lucene]

2024-07-23 Thread via GitHub
benwtrent commented on issue #13281: URL: https://github.com/apache/lucene/issues/13281#issuecomment-2245953147 @msokolov @jmazanec15 I don't know of many `int8` models/datasets out there that require cosine. But, I did a benchmark with Cohere's int8 embeddings here: https://hugging

Re: [PR] Inline skip data into postings lists [lucene]

2024-07-23 Thread via GitHub
mikemccand commented on code in PR #13585: URL: https://github.com/apache/lucene/pull/13585#discussion_r1688514148 ## lucene/core/src/java/org/apache/lucene/codecs/lucene912/ForUtil.java: ## @@ -0,0 +1,1148 @@ +// This file has been automatically generated, DO NOT EDIT + +/* + *

Re: [PR] Inline skip data into postings lists [lucene]

2024-07-23 Thread via GitHub
jpountz commented on code in PR #13585: URL: https://github.com/apache/lucene/pull/13585#discussion_r1688593960 ## lucene/core/src/java/org/apache/lucene/codecs/lucene912/ForUtil.java: ## @@ -0,0 +1,1148 @@ +// This file has been automatically generated, DO NOT EDIT + +/* + * Li

Re: [PR] Inline skip data into postings lists [lucene]

2024-07-23 Thread via GitHub
jpountz commented on PR #13585: URL: https://github.com/apache/lucene/pull/13585#issuecomment-2246112137 Skip data at level 0 now stores pointers into pos/pay files instead of incrementing posPendingCount by the total term freq of the block. This seems to slow down term queries marginally a

Re: [PR] KMeans clustering algorithm [lucene]

2024-07-23 Thread via GitHub
mayya-sharipova commented on code in PR #13604: URL: https://github.com/apache/lucene/pull/13604#discussion_r1688624775 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/quantization/KMeans.java: ## @@ -0,0 +1,344 @@ +/* + * Licensed to the Apache Software Foundation (

Re: [PR] Inline skip data into postings lists [lucene]

2024-07-23 Thread via GitHub
mikemccand commented on PR #13585: URL: https://github.com/apache/lucene/pull/13585#issuecomment-2246348961 > Also I noticed we would sometimes decode the same block of positions multiple times when it's shared by two doc blocks (because when moving to the next doc block we reset the positi

Re: [PR] Inline skip data into postings lists [lucene]

2024-07-23 Thread via GitHub
mikemccand commented on PR #13585: URL: https://github.com/apache/lucene/pull/13585#issuecomment-2246352717 Do you have any measure of how many bytes in a big posting is spent on skip data vs doc/freq blocks? The gains on the last benchy look awesome! It's surprising `CountOrHighHig

Re: [PR] Inline skip data into postings lists [lucene]

2024-07-23 Thread via GitHub
mikemccand commented on code in PR #13585: URL: https://github.com/apache/lucene/pull/13585#discussion_r1688752555 ## lucene/core/src/java/org/apache/lucene/codecs/lucene912/ForUtil.java: ## @@ -0,0 +1,1148 @@ +// This file has been automatically generated, DO NOT EDIT + +/* + *

Re: [PR] Inline skip data into postings lists [lucene]

2024-07-23 Thread via GitHub
mikemccand commented on code in PR #13585: URL: https://github.com/apache/lucene/pull/13585#discussion_r1688761798 ## lucene/core/src/java/org/apache/lucene/codecs/lucene912/Lucene912PostingsWriter.java: ## @@ -0,0 +1,597 @@ +/* + * Licensed to the Apache Software Foundation (AS

Re: [PR] Compute facets while collecting [lucene]

2024-07-23 Thread via GitHub
epotyom commented on code in PR #13568: URL: https://github.com/apache/lucene/pull/13568#discussion_r1688796043 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/ranges/LongRangeFacetCutter.java: ## @@ -0,0 +1,431 @@ +/* + * Licensed to the Apache Software Foundation (A

Re: [PR] Compute facets while collecting [lucene]

2024-07-23 Thread via GitHub
epotyom commented on code in PR #13568: URL: https://github.com/apache/lucene/pull/13568#discussion_r1688796852 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/ranges/LongRangeFacetCutter.java: ## @@ -0,0 +1,431 @@ +/* + * Licensed to the Apache Software Foundation (A

Re: [PR] Compute facets while collecting [lucene]

2024-07-23 Thread via GitHub
epotyom commented on code in PR #13568: URL: https://github.com/apache/lucene/pull/13568#discussion_r1688952591 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/ranges/IntervalTracker.java: ## @@ -0,0 +1,159 @@ +/* + * Licensed to the Apache Software Foundation (ASF) u

Re: [PR] Binary search all terms. [lucene]

2024-07-23 Thread via GitHub
github-actions[bot] commented on PR #13192: URL: https://github.com/apache/lucene/pull/13192#issuecomment-2246620817 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contributi

Re: [PR] Compute facets while collecting [lucene]

2024-07-23 Thread via GitHub
epotyom commented on code in PR #13568: URL: https://github.com/apache/lucene/pull/13568#discussion_r1688962446 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/facet/ranges/RangeOrdLabelBiMap.java: ## @@ -0,0 +1,66 @@ +/* + * Licensed to the Apache Software Foundation (ASF)

Re: [PR] Take advantage of the doc value skipper when it is primary sort [lucene]

2024-07-23 Thread via GitHub
iverase commented on PR #13592: URL: https://github.com/apache/lucene/pull/13592#issuecomment-2247035303 I introduced the method DocValuesSkipper#advance(long,long) to advance the iterator to a matching range. wdyt? -- This is an automated message from the Apache Git Service. To respond t
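
To illustrate what "advance the iterator to a matching range" could mean, here is a minimal sketch, assuming a skipper that keeps a minimum and maximum value per block of documents. The `Block` type and the `advance` helper are hypothetical and operate on a plain array; they do not reproduce the real `DocValuesSkipper` API or its multi-level structure.

```java
final class RangeSkipSketch {

  /** Hypothetical block summary: a doc ID interval plus the min/max value seen in it. */
  static final class Block {
    final int minDocId;
    final int maxDocId;
    final long minValue;
    final long maxValue;

    Block(int minDocId, int maxDocId, long minValue, long maxValue) {
      this.minDocId = minDocId;
      this.maxDocId = maxDocId;
      this.minValue = minValue;
      this.maxValue = maxValue;
    }
  }

  /**
   * Returns the index of the first block at or after {@code from} whose value range
   * intersects [minValue, maxValue], or -1 if no such block remains.
   */
  static int advance(Block[] blocks, int from, long minValue, long maxValue) {
    for (int i = from; i < blocks.length; i++) {
      Block block = blocks[i];
      if (block.maxValue >= minValue && block.minValue <= maxValue) {
        return i;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    Block[] blocks = {
      new Block(0, 99, 1, 5), new Block(100, 199, 6, 10), new Block(200, 299, 11, 20)
    };
    int index = advance(blocks, 0, 7, 8);
    System.out.println(index + " starts at doc " + blocks[index].minDocId);
  }
}
```

Blocks whose value range misses the query range are skipped without looking at individual documents, which is the benefit a range-aware advance is after.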

Re: [PR] Delegating the matches in PointRangeQuery weight to relate method [lucene]

2024-07-23 Thread via GitHub
iverase commented on PR #13599: URL: https://github.com/apache/lucene/pull/13599#issuecomment-2247049079 If we look at the current implementation of matches and relates, they both iterate over the dimensions and they both check if the dimension is disjoint. If that is true, then they bail o
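
The per-dimension structure described above can be sketched as follows. This is a simplified illustration on `long` arrays rather than packed byte values; the class, the `Relation` enum, and both method signatures are hypothetical and are not the code of `PointRangeQuery`.

```java
final class RangeRelateSketch {

  /** Hypothetical relation of a cell to the query range. */
  enum Relation {
    INSIDE,
    OUTSIDE,
    CROSSES
  }

  /** True if the point lies within [lower, upper] in every dimension. */
  static boolean matches(long[] point, long[] lower, long[] upper) {
    for (int d = 0; d < point.length; d++) {
      if (point[d] < lower[d] || point[d] > upper[d]) {
        return false; // disjoint in this dimension: bail out early
      }
    }
    return true;
  }

  /** Relates the cell [cellMin, cellMax] to the query range [lower, upper]. */
  static Relation relate(long[] cellMin, long[] cellMax, long[] lower, long[] upper) {
    boolean crosses = false;
    for (int d = 0; d < cellMin.length; d++) {
      if (cellMin[d] > upper[d] || cellMax[d] < lower[d]) {
        return Relation.OUTSIDE; // disjoint in this dimension: bail out early
      }
      // The cell sticks out of the query range in this dimension.
      crosses |= cellMin[d] < lower[d] || cellMax[d] > upper[d];
    }
    return crosses ? Relation.CROSSES : Relation.INSIDE;
  }

  public static void main(String[] args) {
    long[] lower = {0, 0};
    long[] upper = {10, 10};
    long[] point = {5, 5};
    System.out.println(matches(point, lower, upper));
    System.out.println(relate(point, point, lower, upper));
  }
}
```

In this sketch a point treated as a degenerate cell (`cellMin == cellMax`) is `INSIDE` exactly when `matches` returns true, so the two methods agree on results; `relate` just performs the extra crossing comparisons on the way.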