Re: [PR] Remove halt() call in TestSimpleServer (part of TestStressNRTReplication [lucene]

2024-03-13 Thread via GitHub


dweiss merged PR #13177:
URL: https://github.com/apache/lucene/pull/13177





Re: [PR] Support disabling IndexSearcher.maxClauseCount with a value of -1 [lucene]

2024-03-13 Thread via GitHub


jpountz commented on PR #13178:
URL: https://github.com/apache/lucene/pull/13178#issuecomment-1993797886

   I'm curious what problem you are trying to address. It looks like you're 
trying to avoid the overhead of checking the number of clauses, but intuitively 
this wouldn't help much, as we have other operations that need to run in linear 
time with the number of clauses in the query anyway, such as creating the query 
in the first place, rewriting it, creating a weight, etc.
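
   For reference, a minimal sketch (my own illustration, not code from this PR) 
of the linear-time point: just building a large disjunction already walks every 
clause, and rewriting it and creating a Weight do too, so skipping the clause 
count check alone cannot change the overall complexity.

   ```
   import java.util.List;
   import org.apache.lucene.index.Term;
   import org.apache.lucene.search.BooleanClause;
   import org.apache.lucene.search.BooleanQuery;
   import org.apache.lucene.search.Query;
   import org.apache.lucene.search.TermQuery;

   final class BigDisjunction {
     // Building the query is already O(#clauses), independently of any
     // maxClauseCount check.
     static Query of(String field, List<String> terms) {
       BooleanQuery.Builder builder = new BooleanQuery.Builder();
       for (String term : terms) { // one add() per clause
         builder.add(new TermQuery(new Term(field, term)), BooleanClause.Occur.SHOULD);
       }
       return builder.build(); // rewriting and Weight creation also visit every clause
     }
   }
   ```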





Re: [PR] Make BP work on indexes that have blocks. [lucene]

2024-03-13 Thread via GitHub


jpountz merged PR #13125:
URL: https://github.com/apache/lucene/pull/13125





[I] Improve Lucene's I/O concurrency [lucene]

2024-03-13 Thread via GitHub


jpountz opened a new issue, #13179:
URL: https://github.com/apache/lucene/issues/13179

   ### Description
   
   Currently, Lucene's I/O concurrency is bound by the search concurrency. If 
`IndexSearcher` runs on N threads, then Lucene will never perform more than N 
I/Os concurrently. Unless you significantly overprovision your search thread 
pool - which is bad for other reasons - Lucene will bottleneck on I/O latency 
without even maxing out the IOPS of the host.
   
   I don't think that Lucene should fully embrace asynchrony in its APIs, or 
query evaluation would become overly complicated. But I still expect that we 
have a lot of room for improvement in allowing each search thread to perform 
multiple I/Os concurrently under the hood when needed.
   
   Some examples:
- When running a query on two terms, e.g. `apache OR lucene`, could the I/O 
lookups in the `tim` file (terms dictionary) be performed concurrently for both 
terms?
- When running a query on two terms and start offsets in the `doc` file 
(postings) have been resolved, could we start loading the first bytes from 
these postings lists from disk concurrently?
- When fetching the top N=100 stored documents that match a query, could we 
load bytes from the `fdt` file (stored fields) for all these documents 
concurrently?
   
   This would require API changes in our `Directory` APIs, and some low-level 
`IndexReader` APIs (`TermsEnum`, `StoredFieldsReader`?).
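
   To make this concrete, here is a small illustration (my own, at the JDK 
level; it is not an API proposed in this issue) of what "several I/Os in flight 
from a single search thread" means. Whatever `Directory`-level API we end up 
with would hide this kind of machinery; the file and offsets below are made up.

   ```
   import java.nio.ByteBuffer;
   import java.nio.channels.AsynchronousFileChannel;
   import java.nio.file.Path;
   import java.nio.file.StandardOpenOption;
   import java.util.concurrent.Future;

   public class ConcurrentReadsSketch {
     public static void main(String[] args) throws Exception {
       Path postings = Path.of(args[0]); // e.g. a .doc file
       try (AsynchronousFileChannel channel =
           AsynchronousFileChannel.open(postings, StandardOpenOption.READ)) {
         // Hypothetical start offsets of the postings lists for `apache` and `lucene`.
         long apacheOffset = 0L, luceneOffset = 1 << 20;
         ByteBuffer apacheBuf = ByteBuffer.allocate(4096);
         ByteBuffer luceneBuf = ByteBuffer.allocate(4096);
         // Both reads are issued before either completes: two I/Os in flight, one thread.
         Future<Integer> apacheRead = channel.read(apacheBuf, apacheOffset);
         Future<Integer> luceneRead = channel.read(luceneBuf, luceneOffset);
         System.out.println("read " + apacheRead.get() + " and " + luceneRead.get() + " bytes");
       }
     }
   }
   ```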





Re: [PR] gh-13147: use dense bit-encoding for frequent terms [lucene]

2024-03-13 Thread via GitHub


jpountz commented on PR #13153:
URL: https://github.com/apache/lucene/pull/13153#issuecomment-1994160216

   Have you seen interesting performance numbers with this change?





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-13 Thread via GitHub


benwtrent commented on code in PR #13124:
URL: https://github.com/apache/lucene/pull/13124#discussion_r1523142360


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsFormat.java:
##
@@ -152,7 +154,25 @@ public Lucene99HnswVectorsFormat() {
   * @param beamWidth the size of the queue maintained during graph construction.
   */
  public Lucene99HnswVectorsFormat(int maxConn, int beamWidth) {
-    this(maxConn, beamWidth, DEFAULT_NUM_MERGE_WORKER, null);
+    super("Lucene99HnswVectorsFormat");
+    if (maxConn <= 0 || maxConn > MAXIMUM_MAX_CONN) {
+      throw new IllegalArgumentException(
+          "maxConn must be positive and less than or equal to "
+              + MAXIMUM_MAX_CONN
+              + "; maxConn="
+              + maxConn);
+    }
+    if (beamWidth <= 0 || beamWidth > MAXIMUM_BEAM_WIDTH) {
+      throw new IllegalArgumentException(
+          "beamWidth must be positive and less than or equal to "
+              + MAXIMUM_BEAM_WIDTH
+              + "; beamWidth="
+              + beamWidth);
+    }
+    this.maxConn = maxConn;
+    this.beamWidth = beamWidth;
+    this.mergeExec = null;
+    this.numMergeWorkers = 1;

Review Comment:
   > Oh I see, you haven't yet put in the change for auto figure out number of 
workers
   
   I want to do that separately. It will require its own benchmarking, etc., 
and this change is good even without choosing an automatic split for the HNSW 
graph merge.






Re: [PR] Introduce IORunnable to fix failure in TestIndexWriterOnDiskFull.testAddIndexOnDiskFull [lucene]

2024-03-13 Thread via GitHub


easyice commented on PR #13172:
URL: https://github.com/apache/lucene/pull/13172#issuecomment-1994345997

   Thanks @dweiss, I found a seed `837FF885325AC743` that can reproduce this 
failure on the main branch, but it's not stable; it fails roughly once in 10 
runs. With this patch, I ran it 100 times and every run passed.
   
   `-Dtests.seed=837FF885325AC743 -Dtests.nightly=true`





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-13 Thread via GitHub


benwtrent commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-1994372130

   @zhaih @jpountz I am going to create a separate issue around making HNSW 
worker slicing automatic. It will require a bunch of its own benchmarking and 
work, and honestly seems orthogonal to this particular change, as HNSW merging 
still works fine when statically setting the number of workers & using the CMS 
thread pool.





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-13 Thread via GitHub


jpountz commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-1994373820

   Thanks for sharing performance numbers @benwtrent, very interesting. Also 
double-checking whether you saw my above comment: 
https://github.com/apache/lucene/pull/13124#pullrequestreview-1930418114 - I'd 
have more confidence in the way merge concurrency is exposed if we took 
advantage of it in more than one place. If there's some difficulty with it, I'm 
happy to look into it in a follow-up.





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-13 Thread via GitHub


benwtrent commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-1994399018

   @jpountz would you prefer something like the original patch from @dweiss? I 
can submit the merging actions independently to the intra-merge executor.
   
   Anything more (like figuring out how to make the postings merge truly 
parallel) will be quite a lift for me.





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-13 Thread via GitHub


jpountz commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-1994458100

   Yes exactly, something very simple, mostly to exercise intra-merge 
concurrency with more than just vectors.





[PR] Add new VectorScorer interface to vector value iterators [lucene]

2024-03-13 Thread via GitHub


benwtrent opened a new pull request, #13181:
URL: https://github.com/apache/lucene/pull/13181

   With quantized vectors, and with current vectors, we separate "scoring" from 
"iteration", requiring the user to always iterate the raw vectors and provide 
their own similarity function.
   
   While this is flexible, it creates frustration in:
   
- Just iterating and scoring, especially since the field already has a 
similarity function stored... Why can't we just know which one to use and use it!
- Iterating and scoring quantized vectors. By default it would be good to 
be able to iterate and score quantized vectors (e.g. without going through the 
HNSW graph).
   
   I see two options for providing this:
   
- A new top-level thing on the LeafReader (getVectorScorer or something).
- Extend the vector value iterators to be able to return a scorer given 
some vector value (what this PR demonstrates; a rough sketch follows below).
   
   This is a POC; not all interfaces are supplied. I am opening what I have for 
discussion.
   
   Folks who might be interested: @jpountz @msokolov @mccullocht 
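
   To illustrate the second option, here is a rough sketch of "vector values 
that can hand back a scorer"; every name below is invented for discussion and 
is not one of the PR's actual interfaces:

   ```
   import java.io.IOException;
   import org.apache.lucene.search.DocIdSetIterator;

   // Hypothetical shape only: the iterator knows the field's similarity, so the
   // caller can score the current document without picking a function itself.
   abstract class ScoringFloatVectorValues extends DocIdSetIterator {
     /** The vector of the current document. */
     abstract float[] vectorValue() throws IOException;

     /** Returns a scorer over these values, bound to the given query vector. */
     abstract VectorScorer scorer(float[] query) throws IOException;

     interface VectorScorer {
       /** Similarity score of the iterator's current document against the query. */
       float score() throws IOException;
     }
   }
   ```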





Re: [I] Regarding the frequency used for scoring sloppy phrase queries. [lucene]

2024-03-13 Thread via GitHub


jpountz commented on issue #13152:
URL: https://github.com/apache/lucene/issues/13152#issuecomment-1994488165

   I agree that the penalty feels too high. We have challenges with queries 
like this one because there is no good theoretical basis for the right way to 
score such queries (at least to my knowledge), which forces us to make guesses. 
The one that is implemented at the moment is certainly not great, but at least 
it's parameter-free. I would imagine that something like `k / (k + 
matchLength)` would work better for you, but then it requires us to expose a 
configuration option, which isn't great either.
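
   For concreteness, the two shapes being compared look roughly like this (my 
sketch; the exact parameter-free form used by the phrase scorer may differ in 
detail):

   ```
   final class SloppyWeights {
     // Parameter-free form along the lines of what is implemented today:
     // longer matches contribute less, dropping off quickly.
     static float current(int matchLength) {
       return 1f / (1f + matchLength);
     }

     // Saturation with a tunable k, as suggested above; k = 1 is close to the
     // form above, and a larger k softens the penalty for sloppy matches.
     static float saturated(float k, int matchLength) {
       return k / (k + matchLength);
     }
   }
   ```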





Re: [PR] Use group-varint encode the positions [lucene]

2024-03-13 Thread via GitHub


jpountz commented on PR #12842:
URL: https://github.com/apache/lucene/pull/12842#issuecomment-1994545075

   It looks like `writeGroupVInt` has room for improvement. Can we improve it 
by making it look a bit more like the read logic?
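
   For context, group-varint packs four integers behind a single tag byte whose 
2-bit fields record how many bytes each value occupies. A conceptual sketch of 
the write side (this is not Lucene's `writeGroupVInt`, just the general scheme):

   ```
   final class GroupVarintSketch {
     /** Encodes values[0..3] into dest at destOff; returns the number of bytes written. */
     static int encodeGroup(int[] values, byte[] dest, int destOff) {
       int tagPos = destOff; // the tag byte is written last, once all lengths are known
       int pos = destOff + 1;
       int tag = 0;
       for (int i = 0; i < 4; i++) {
         int v = values[i];
         int numBytes = v == 0 ? 1 : (32 - Integer.numberOfLeadingZeros(v) + 7) / 8;
         tag |= (numBytes - 1) << (i * 2); // 2 bits per value: 1-4 bytes
         for (int b = 0; b < numBytes; b++) {
           dest[pos++] = (byte) (v >>> (b * 8)); // little-endian payload bytes
         }
       }
       dest[tagPos] = (byte) tag;
       return pos - destOff;
     }
   }
   ```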





Re: [PR] Made DocIdsWriter use DISI when reading documents with an IntersectVisitor [lucene]

2024-03-13 Thread via GitHub


jpountz commented on PR #13149:
URL: https://github.com/apache/lucene/pull/13149#issuecomment-1994666062

   Would we be subject to the same issue if/when 3+ different implementations 
of `DocIdSetIterator` get used in `IntersectVisitor#visit`?





[I] Making vector comparisons pluggable [lucene]

2024-03-13 Thread via GitHub


benwtrent opened a new issue, #13182:
URL: https://github.com/apache/lucene/issues/13182

   ### Description
   
   Opening an issue to continue discussion originating here: 
https://github.com/apache/lucene/pull/13076#issuecomment-1930363479
   
   Making vector similarities pluggable via SPI will enable users to provide 
their own specialized similarities without the additional burden of Lucene core 
having to provide BWC for all the various similarity functions (e.g. hamming, 
jaccard, cosine).
   
   It is probably best that the plug-and-play aspect is placed on `FieldInfo`, 
though this would be a bit of work as `FieldInfo` isn't currently pluggable. 
Attaching it directly to a particular vector format would place an undue burden 
on users, requiring a new format for any field that needs a separate similarity.
   
   While I am not 100% sure how to add it to FieldInfo, I do want to try and 
figure out the API for such a change. 
   
   When used within a particular vector format, the following scenarios would 
be useful:
- Indexing, comparing on-heap vectors accessible via ordinals
- Merging, comparing off-heap vectors, possibly reading them directly 
on-heap via ordinals
- During search, an on-heap user provided vector being compared with 
off-heap vectors via ordinals (potentially reading them on-heap).
   
   Some of the optimizations discussed in 
https://github.com/apache/lucene/pull/12703 show significant gains from simply 
having a memory segment and an ordinal offset. While we are not there yet in 
Lucene, it indicates that we shouldn’t force the API to read off-heap vectors 
into `float[]` or `byte[]` arrays.
   
   I was thinking of something *like* this. This is 100% not finalized; I just 
want to start the discussion.
   
   
   
   public abstract class VectorSimilarityInterface implements NamedSPILoader.NamedSPI {

     private static final class Holder {
       private static final NamedSPILoader<VectorSimilarityInterface> LOADER =
           new NamedSPILoader<>(VectorSimilarityInterface.class);

       private Holder() {}

       static NamedSPILoader<VectorSimilarityInterface> getLoader() {
         if (LOADER == null) {
           throw new IllegalStateException(
               "You tried to lookup a VectorSimilarityInterface name before all formats could be initialized. "
                   + "This likely happens if you call VectorSimilarityInterface#forName from a VectorSimilarityInterface's ctor.");
         }
         return LOADER;
       }
     }

     public static VectorSimilarityInterface forName(String name) {
       return VectorSimilarityInterface.Holder.getLoader().lookup(name);
     }

     private final String name;

     protected VectorSimilarityInterface(String name) {
       NamedSPILoader.checkServiceName(name);
       this.name = name;
     }

     @Override
     public String getName() {
       return name;
     }

     // Comparing an "on heap" query with vectorValues that may or may not be on-heap.
     // Maybe we don't need this and the `byte[]` version as we could hide the "on-heap query"
     // in an "IdentityRandomAccessVectorValues" which only returns the query vector...
     public abstract VectorScorer getVectorScorer(
         RandomAccessVectorValues<float[]> vectorValues, float[] target) throws Exception;

     public abstract VectorComparator getFloatVectorComparator(
         RandomAccessVectorValues<float[]> vectorValues) throws Exception;

     public abstract VectorScorer getVectorScorer(
         RandomAccessVectorValues<byte[]> vectorValues, byte[] target) throws Exception;

     public abstract VectorComparator getByteVectorComparator(
         RandomAccessVectorValues<byte[]> vectorValues) throws Exception;

     static interface VectorScorer extends Closeable {
       float score(int targetOrd);
     }

     static interface VectorComparator {
       float compare(int vectorOrd1, int vectorOrd2);
     }
   }
   
   
   
   
   It looks like the SPI injection could occur in `FieldInfosFormat#read` & 
`FieldInfosFormat#write` (though a new one would have to be built 
`Lucene911FieldInfosFormat` or something).
   
   This would also include a new codec as the field format will change.
   
   I am not 100% sold on how this API looks myself. I don't think 
`RandomAccessVectorValues` is 100% the correct API as it either exposes too 
much (e.g. `ordToDoc`) or too little (for off-heap, we don't get access to the 
MemorySegment nor files...).





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-13 Thread via GitHub


benwtrent commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-1994921965

   > Yes exactly, something very simple, mostly to exercise intra-merge 
concurrency with more than just vectors.
   
   The latest commit adds `TaskExecutor` actions to the merge so that the 
different field types can be merged in parallel. 
   
   I needed to use `TaskExecutor` as it abstracts out the exception handling 
(all these methods throw `IOException`).
   
   I still need to benchmark it with luceneutil.
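
   Roughly, the shape of that change looks like the sketch below (my own 
simplification, assuming `TaskExecutor`'s public constructor and `invokeAll`; 
the per-field-type merge methods are stand-ins, not the PR's actual code):

   ```
   import java.io.IOException;
   import java.util.List;
   import java.util.concurrent.Callable;
   import java.util.concurrent.Executor;
   import org.apache.lucene.search.TaskExecutor;

   final class ParallelMergeSketch {
     void mergeAll(Executor intraMergeExecutor) throws IOException {
       TaskExecutor taskExecutor = new TaskExecutor(intraMergeExecutor);
       List<Callable<Void>> tasks =
           List.of(
               () -> { mergeStoredFields(); return null; },
               () -> { mergeDocValues(); return null; },
               () -> { mergePoints(); return null; },
               () -> { mergeVectors(); return null; });
       taskExecutor.invokeAll(tasks); // waits for all tasks and rethrows IOExceptions
     }

     // Stand-ins for the real per-field-type merge calls.
     private void mergeStoredFields() {}
     private void mergeDocValues() {}
     private void mergePoints() {}
     private void mergeVectors() {}
   }
   ```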





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-13 Thread via GitHub


zhaih commented on code in PR #13124:
URL: https://github.com/apache/lucene/pull/13124#discussion_r1523620861


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsFormat.java:
##
@@ -152,7 +154,25 @@ public Lucene99HnswVectorsFormat() {
   * @param beamWidth the size of the queue maintained during graph construction.
   */
  public Lucene99HnswVectorsFormat(int maxConn, int beamWidth) {
-    this(maxConn, beamWidth, DEFAULT_NUM_MERGE_WORKER, null);
+    super("Lucene99HnswVectorsFormat");
+    if (maxConn <= 0 || maxConn > MAXIMUM_MAX_CONN) {
+      throw new IllegalArgumentException(
+          "maxConn must be positive and less than or equal to "
+              + MAXIMUM_MAX_CONN
+              + "; maxConn="
+              + maxConn);
+    }
+    if (beamWidth <= 0 || beamWidth > MAXIMUM_BEAM_WIDTH) {
+      throw new IllegalArgumentException(
+          "beamWidth must be positive and less than or equal to "
+              + MAXIMUM_BEAM_WIDTH
+              + "; beamWidth="
+              + beamWidth);
+    }
+    this.maxConn = maxConn;
+    this.beamWidth = beamWidth;
+    this.mergeExec = null;
+    this.numMergeWorkers = 1;

Review Comment:
   Yeah I agree, so for now we'll just fix it at 8?






Re: [PR] Made DocIdsWriter use DISI when reading documents with an IntersectVisitor [lucene]

2024-03-13 Thread via GitHub


antonha commented on PR #13149:
URL: https://github.com/apache/lucene/pull/13149#issuecomment-1995011184

   > Would we be subject to the same issue if/when 3+ different implementations 
of `DocIdSetIterator` get used in `IntersectVisitor#visit`?
   
   Yes.
   
   Your question makes me think that maybe we should not add a new 
`DocIdSetIterator` for this PR - maybe that was your thought as well? 





Re: [PR] Made DocIdsWriter use DISI when reading documents with an IntersectVisitor [lucene]

2024-03-13 Thread via GitHub


jpountz commented on PR #13149:
URL: https://github.com/apache/lucene/pull/13149#issuecomment-1995156682

   Relatedly indeed, I was wondering if the API should expose an `IntsRef` or 
something like that, so that there is a single virtual call per block of doc 
IDs anyway (`IntsRef` cannot be extended; it's a single impl).
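
   Something like the following hypothetical extension (illustration only, the 
name is invented) is what that would amount to: a single virtual call per block 
of doc IDs, with a trivial per-doc fallback:

   ```
   import java.io.IOException;
   import org.apache.lucene.index.PointValues.IntersectVisitor;
   import org.apache.lucene.util.IntsRef;

   interface BulkIntersectVisitor extends IntersectVisitor {
     /** One virtual dispatch per block of doc IDs instead of one per doc. */
     default void visit(IntsRef docIDs) throws IOException {
       for (int i = docIDs.offset; i < docIDs.offset + docIDs.length; ++i) {
         visit(docIDs.ints[i]);
       }
     }
   }
   ```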





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-13 Thread via GitHub


benwtrent commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-1995183835

   @jpountz ok, that naive attempt failed, as norms & terms apparently need to 
be merged in order (one or the other would fail due to missing files...). 
   
   I am not sure whether this is true for other things, but when I adjusted for 
this (latest commit) I did see a speed improvement.
   
   My test was over IndexGeoNames; I did no segment merging and flushed every 
12MB to ensure I got many segments. I also set CMS merge threads to `8`. 
   
   Baseline forcemerge(1): 17793 ms
   Candidate forcemerge(1): 13739 ms





Re: [PR] Change BP reordering logic to help support document blocks later on. [lucene]

2024-03-13 Thread via GitHub


rishabhmaurya commented on code in PR #13123:
URL: https://github.com/apache/lucene/pull/13123#discussion_r1523704731


##
lucene/misc/src/java/org/apache/lucene/misc/index/BPIndexReorderer.java:
##
@@ -341,116 +344,94 @@ protected void compute() {
   */
  private boolean shuffle(
      ForwardIndex forwardIndex,
-      IntsRef left,
-      IntsRef right,
+      IntsRef docIDs,
+      int midPoint,
      int[] leftDocFreqs,
      int[] rightDocFreqs,
-      float[] gains,
+      float[] biases,
      int iter)
      throws IOException {
-    assert left.ints == right.ints;
-    assert left.offset + left.length == right.offset;

-    // Computing gains is typically a bottleneck, because each iteration needs to iterate over all
-    // postings to recompute gains, and the total number of postings is usually one order of
+    // Computing biases is typically a bottleneck, because each iteration needs to iterate over
+    // all postings to recompute biases, and the total number of postings is usually one order of
     // magnitude or more larger than the number of docs. So we try to parallelize it.
-    ComputeGainsTask leftGainsTask =
-        new ComputeGainsTask(
-            left.ints,
-            gains,
-            left.offset,
-            left.offset + left.length,
+    new ComputeBiasTask(
+            docIDs.ints,
+            biases,
+            docIDs.offset,
+            docIDs.offset + docIDs.length,
            leftDocFreqs,
            rightDocFreqs,
            threadLocal,
-            depth);
-    ComputeGainsTask rightGainsTask =
-        new ComputeGainsTask(
-            right.ints,
-            gains,
-            right.offset,
-            right.offset + right.length,
-            rightDocFreqs,
-            leftDocFreqs,
-            threadLocal,
-            depth);
-    if (shouldFork(docIDs.length, docIDs.ints.length)) {
-      invokeAll(leftGainsTask, rightGainsTask);
-    } else {
-      leftGainsTask.compute();
-      rightGainsTask.compute();
+            depth)
+        .compute();
+
+    float maxLeftBias = Float.NEGATIVE_INFINITY;
+    for (int i = docIDs.offset; i < midPoint; ++i) {
+      maxLeftBias = Math.max(maxLeftBias, biases[i]);
+    }
+    float minRightBias = Float.POSITIVE_INFINITY;
+    for (int i = midPoint, end = docIDs.offset + docIDs.length; i < end; ++i) {
+      minRightBias = Math.min(minRightBias, biases[i]);
+    }
+    float gain = maxLeftBias - minRightBias;
+    // This uses the simulated annealing proposed by Mackenzie et al in "Tradeoff Options for
+    // Bipartite Graph Partitioning" by comparing the gain of swapping the doc from the left side
+    // that is most attracted to the right and the doc from the right side that is most attracted
+    // to the left against `iter` rather than zero.
+    if (gain <= iter) {
+      return false;
     }

-    class ByDescendingGainSorter extends IntroSorter {
+    new IntroSelector() {

      int pivotDoc;
-      float pivotGain;
+      float pivotBias;

      @Override
      protected void setPivot(int i) {
-        pivotDoc = left.ints[i];
-        pivotGain = gains[i];
+        pivotDoc = docIDs.ints[i];
+        pivotBias = biases[i];
      }

      @Override
      protected int comparePivot(int j) {
-        // Compare in reverse order to get a descending sort
-        int cmp = Float.compare(gains[j], pivotGain);
+        int cmp = Float.compare(pivotBias, biases[j]);

Review Comment:
   @jpountz let me know your thoughts on it, happy to contribute and improve it 
further. Thanks






Re: [PR] Made DocIdsWriter use DISI when reading documents with an IntersectVisitor [lucene]

2024-03-13 Thread via GitHub


antonha commented on PR #13149:
URL: https://github.com/apache/lucene/pull/13149#issuecomment-1996041622

   @jpountz I had a quick look at the code, and it seems to me that, with this 
PR, only two `DocIdSetIterator` implementations are used for the 
`IntersectVisitor#visit` method - which is good in the sense that it would 
probably be fine to merge it now, but bad in the sense that adding a third 
would hurt performance.
   
   Do we believe that we can merge this PR and then continue with changing the 
BKD visit API in a later change, or should we try to change the abstraction in 
this PR?





Re: [PR] Remove halt() call in TestSimpleServer (part of TestStressNRTReplication [lucene]

2024-03-13 Thread via GitHub


rmuir commented on PR #13177:
URL: https://github.com/apache/lucene/pull/13177#issuecomment-1996150774

   Thanks for doing this. I was disappointed that java doesn't allow this:
   ```
   jshell> ProcessHandle.current().destroyForcibly()
   |  Exception java.lang.IllegalStateException: destroy of current process not 
allowed
   |at ProcessHandleImpl.destroyProcess (ProcessHandleImpl.java:369)
   |at ProcessHandleImpl.destroyForcibly (ProcessHandleImpl.java:392)
   |at (#2:1)
   ```





Re: [PR] Replace Collections.synchronizedSet() with ConcurrentHashMap.newKeySet() [lucene]

2024-03-13 Thread via GitHub


github-actions[bot] commented on PR #13142:
URL: https://github.com/apache/lucene/pull/13142#issuecomment-1996171442

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [PR] Make Lucene90 postings format to write FST off heap [lucene]

2024-03-13 Thread via GitHub


github-actions[bot] commented on PR #12985:
URL: https://github.com/apache/lucene/pull/12985#issuecomment-1996171619

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!

