date:20240730

Re: [PR] Udpate ReadTask to not rely on search(Query, Collector) [lucene]

2024-07-30 Thread via GitHub



javanna commented on code in PR #13602:
URL: https://github.com/apache/lucene/pull/13602#discussion_r1696627499


##
lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/SearchWithCollectorTask.java:
##
@@ -46,17 +46,17 @@ public boolean withCollector() {
   }
 
   @Override
-  protected Collector createCollector() throws Exception {
-Collector collector = null;
+  protected CollectorManager createCollectorManager() throws Exception {
+CollectorManager collectorManager;
 if (clnName.equalsIgnoreCase("topScoreDoc") == true) {
-  collector = TopScoreDocCollector.create(numHits(), Integer.MAX_VALUE);
+  collectorManager = new TopScoreDocCollectorManager(numHits(), 
Integer.MAX_VALUE);
 } else if (clnName.length() > 0) {
-  collector = 
Class.forName(clnName).asSubclass(Collector.class).getConstructor().newInstance();
-
+  collectorManager =
+  
Class.forName(clnName).asSubclass(CollectorManager.class).getConstructor().newInstance();
 } else {
-  collector = super.createCollector();
+  collectorManager = super.createCollectorManager();
 }
-return collector;
+return collectorManager;

Review Comment:
   I have now added a changes entry, and replaced collector.class with a new 
config parameter. We throw error once collector.class is provided, so we fail 
fast and users know what they have to do.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Udpate ReadTask to not rely on search(Query, Collector) [lucene]

2024-07-30 Thread via GitHub



javanna commented on code in PR #13602:
URL: https://github.com/apache/lucene/pull/13602#discussion_r1696628573


##
lucene/benchmark/conf/collector.alg:
##
@@ -17,11 +17,10 @@
 # 
-
 # multi val params are iterated by NewRound's, added to reports, start with 
column name.
 
-# collector.class can be:
-#Fully Qualified Class Name of a Collector with a empty constructor
-#topScoreDocOrdered - Creates a TopScoreDocCollector that requires in 
order docs
-#topScoreDocUnordered - Like above, but allows out of order
-collector.class=coll:topScoreDoc
+# collector.manager.class can be:
+#Fully Qualified Class Name of a CollectorManager with a empty constructor
+#topScoreDoc - Creates a TopScoreDocCollectorManager
+collector.manager.class=coll:topScoreDoc

Review Comment:
   This looked like it needed updating following my changes, but I could not 
tell where these files are loaded, and what I should run to verify that my 
update works.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Introduce IndexSearcher#searchLeaf(LeafReaderContext, Weight, Collector) method [lucene]

2024-07-30 Thread via GitHub



javanna commented on PR #13603:
URL: https://github.com/apache/lucene/pull/13603#issuecomment-2257899247

   Thanks a lot for your review @gsmiller , much appreciated, and for the 
reminder around the CHANGES entry. Would you like to review the wording 
perhaps? Otherwise, this should be good to go. My thinking is that it can be 
backported to branch_9x as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [I] Remove the @Deprecated methods from TopScoreDocCollector and TopFieldCollector [lucene]

2024-07-30 Thread via GitHub



javanna closed issue #13499: Remove the @Deprecated methods from 
TopScoreDocCollector and TopFieldCollector
URL: https://github.com/apache/lucene/issues/13499


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Removing all deprecated TopScoreDocCollector + TopFieldCollector methods (#create, #createSharedManager) [lucene]

2024-07-30 Thread via GitHub



javanna merged PR #13617:
URL: https://github.com/apache/lucene/pull/13617


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [I] Remove the @Deprecated methods from TopScoreDocCollector and TopFieldCollector [lucene]

2024-07-30 Thread via GitHub



javanna closed issue #13499: Remove the @Deprecated methods from 
TopScoreDocCollector and TopFieldCollector
URL: https://github.com/apache/lucene/issues/13499


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Removing all deprecated TopScoreDocCollector + TopFieldCollector methods (#create, #createSharedManager) [lucene]

2024-07-30 Thread via GitHub



javanna commented on PR #13617:
URL: https://github.com/apache/lucene/pull/13617#issuecomment-2257918667

   Thanks again @slow-J . Heads up, I moved the changes entry to the "API 
changes" section above after merging, I did not catch this while reviewing :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] SparseFixedBitSet#firstDoc: reduce number of `indices` iterations for a bit set that is not fully built yet. [lucene]

2024-07-30 Thread via GitHub



epotyom commented on PR #13559:
URL: https://github.com/apache/lucene/pull/13559#issuecomment-2258002016

   I've added new `BitSet#firstSetBitInRange` method with maybe naive 
implementations for both `SparseFixedBitSet` and `FixedBitSet`. I've also 
changed `BlockJoinSelector` to use it. `luceneutil` benchmark results seem to 
be neutral, but as far as I understand there are no tasks that trigger 
`BlockJoinSelector`?
   
   I'll run our internal tests to see if it helps, and will add a line to 
CHANGES if otherwise the change looks good.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Removing all deprecated TopScoreDocCollector + TopFieldCollector methods (#create, #createSharedManager) [lucene]

2024-07-30 Thread via GitHub



slow-J commented on PR #13617:
URL: https://github.com/apache/lucene/pull/13617#issuecomment-2258011155

   My bad, sorry for the mistake, thanks for catching it and the review!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Inline skip data into postings lists [lucene]

2024-07-30 Thread via GitHub



jpountz commented on code in PR #13585:
URL: https://github.com/apache/lucene/pull/13585#discussion_r1696774526


##
lucene/core/src/java/org/apache/lucene/codecs/lucene912/Lucene912PostingsReader.java:
##
@@ -0,0 +1,2028 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene912;
+
+import static org.apache.lucene.codecs.lucene912.ForUtil.BLOCK_SIZE;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.DOC_CODEC;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.PAY_CODEC;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.POS_CODEC;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.SKIP_TOTAL_SIZE;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.TERMS_CODEC;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.VERSION_CURRENT;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.VERSION_START;
+
+import java.io.IOException;
+import java.util.AbstractList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.RandomAccess;
+import org.apache.lucene.codecs.BlockTermState;
+import org.apache.lucene.codecs.CodecUtil;
+import org.apache.lucene.codecs.PostingsReaderBase;
+import 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.IntBlockTermState;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.Impact;
+import org.apache.lucene.index.Impacts;
+import org.apache.lucene.index.ImpactsEnum;
+import org.apache.lucene.index.IndexFileNames;
+import org.apache.lucene.index.IndexOptions;
+import org.apache.lucene.index.PostingsEnum;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SlowImpactsEnum;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.ReadAdvice;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BitUtil;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+import org.apache.lucene.util.IOUtils;
+
+/**
+ * Concrete class that reads docId(maybe frq,pos,offset,payloads) list with 
postings format.
+ *
+ * @lucene.experimental
+ */
+public final class Lucene912PostingsReader extends PostingsReaderBase {
+
+  private final IndexInput docIn;
+  private final IndexInput posIn;
+  private final IndexInput payIn;
+
+  private final int version;
+
+  /** Sole constructor. */
+  public Lucene912PostingsReader(SegmentReadState state) throws IOException {
+boolean success = false;
+IndexInput docIn = null;
+IndexInput posIn = null;
+IndexInput payIn = null;
+
+// NOTE: these data files are too costly to verify checksum against all 
the bytes on open,
+// but for now we at least verify proper structure of the checksum footer: 
which looks
+// for FOOTER_MAGIC + algorithmID. This is cheap and can detect some forms 
of corruption
+// such as file truncation.
+
+String docName =
+IndexFileNames.segmentFileName(
+state.segmentInfo.name, state.segmentSuffix, 
Lucene912PostingsFormat.DOC_EXTENSION);
+try {
+  // Postings have a forward-only access pattern, so pass 
ReadAdvice.NORMAL to perform
+  // readahead.
+  docIn = state.directory.openInput(docName, 
state.context.withReadAdvice(ReadAdvice.NORMAL));
+  version =
+  CodecUtil.checkIndexHeader(
+  docIn,
+  DOC_CODEC,
+  VERSION_START,
+  VERSION_CURRENT,
+  state.segmentInfo.getId(),
+  state.segmentSuffix);
+  CodecUtil.retrieveChecksum(docIn);
+
+  if (state.fieldInfos.hasProx()) {
+String proxName =
+IndexFileNames.segmentFileName(
+state.segmentInfo.name, state.segmentSuffix, 
Lucene912PostingsFormat.POS_EXTENSION);
+posIn = state.directory.openInput(proxName, state.context);
+CodecUtil.checkIndexHeader(
+posIn, POS_CODEC, version, v

Re: [PR] Inline skip data into postings lists [lucene]

2024-07-30 Thread via GitHub



jpountz commented on code in PR #13585:
URL: https://github.com/apache/lucene/pull/13585#discussion_r1696776133


##
lucene/core/src/java/org/apache/lucene/codecs/lucene912/Lucene912PostingsReader.java:
##
@@ -0,0 +1,2028 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene912;
+
+import static org.apache.lucene.codecs.lucene912.ForUtil.BLOCK_SIZE;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.DOC_CODEC;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.PAY_CODEC;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.POS_CODEC;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.SKIP_TOTAL_SIZE;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.TERMS_CODEC;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.VERSION_CURRENT;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.VERSION_START;
+
+import java.io.IOException;
+import java.util.AbstractList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.RandomAccess;
+import org.apache.lucene.codecs.BlockTermState;
+import org.apache.lucene.codecs.CodecUtil;
+import org.apache.lucene.codecs.PostingsReaderBase;
+import 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.IntBlockTermState;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.Impact;
+import org.apache.lucene.index.Impacts;
+import org.apache.lucene.index.ImpactsEnum;
+import org.apache.lucene.index.IndexFileNames;
+import org.apache.lucene.index.IndexOptions;
+import org.apache.lucene.index.PostingsEnum;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SlowImpactsEnum;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.ReadAdvice;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BitUtil;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+import org.apache.lucene.util.IOUtils;
+
+/**
+ * Concrete class that reads docId(maybe frq,pos,offset,payloads) list with 
postings format.
+ *
+ * @lucene.experimental
+ */
+public final class Lucene912PostingsReader extends PostingsReaderBase {
+
+  private final IndexInput docIn;
+  private final IndexInput posIn;
+  private final IndexInput payIn;
+
+  private final int version;
+
+  /** Sole constructor. */
+  public Lucene912PostingsReader(SegmentReadState state) throws IOException {
+boolean success = false;
+IndexInput docIn = null;
+IndexInput posIn = null;
+IndexInput payIn = null;
+
+// NOTE: these data files are too costly to verify checksum against all 
the bytes on open,
+// but for now we at least verify proper structure of the checksum footer: 
which looks
+// for FOOTER_MAGIC + algorithmID. This is cheap and can detect some forms 
of corruption
+// such as file truncation.
+
+String docName =
+IndexFileNames.segmentFileName(
+state.segmentInfo.name, state.segmentSuffix, 
Lucene912PostingsFormat.DOC_EXTENSION);
+try {
+  // Postings have a forward-only access pattern, so pass 
ReadAdvice.NORMAL to perform
+  // readahead.
+  docIn = state.directory.openInput(docName, 
state.context.withReadAdvice(ReadAdvice.NORMAL));
+  version =
+  CodecUtil.checkIndexHeader(
+  docIn,
+  DOC_CODEC,
+  VERSION_START,
+  VERSION_CURRENT,
+  state.segmentInfo.getId(),
+  state.segmentSuffix);
+  CodecUtil.retrieveChecksum(docIn);
+
+  if (state.fieldInfos.hasProx()) {
+String proxName =
+IndexFileNames.segmentFileName(
+state.segmentInfo.name, state.segmentSuffix, 
Lucene912PostingsFormat.POS_EXTENSION);
+posIn = state.directory.openInput(proxName, state.context);
+CodecUtil.checkIndexHeader(
+posIn, POS_CODEC, version, v

Re: [PR] Inline skip data into postings lists [lucene]

2024-07-30 Thread via GitHub



jpountz commented on code in PR #13585:
URL: https://github.com/apache/lucene/pull/13585#discussion_r1696777219


##
lucene/core/src/java/org/apache/lucene/codecs/lucene912/Lucene912PostingsReader.java:
##
@@ -0,0 +1,2028 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene912;
+
+import static org.apache.lucene.codecs.lucene912.ForUtil.BLOCK_SIZE;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.DOC_CODEC;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.PAY_CODEC;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.POS_CODEC;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.SKIP_TOTAL_SIZE;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.TERMS_CODEC;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.VERSION_CURRENT;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.VERSION_START;
+
+import java.io.IOException;
+import java.util.AbstractList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.RandomAccess;
+import org.apache.lucene.codecs.BlockTermState;
+import org.apache.lucene.codecs.CodecUtil;
+import org.apache.lucene.codecs.PostingsReaderBase;
+import 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.IntBlockTermState;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.Impact;
+import org.apache.lucene.index.Impacts;
+import org.apache.lucene.index.ImpactsEnum;
+import org.apache.lucene.index.IndexFileNames;
+import org.apache.lucene.index.IndexOptions;
+import org.apache.lucene.index.PostingsEnum;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SlowImpactsEnum;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.ReadAdvice;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BitUtil;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+import org.apache.lucene.util.IOUtils;
+
+/**
+ * Concrete class that reads docId(maybe frq,pos,offset,payloads) list with 
postings format.
+ *
+ * @lucene.experimental
+ */
+public final class Lucene912PostingsReader extends PostingsReaderBase {
+
+  private final IndexInput docIn;
+  private final IndexInput posIn;
+  private final IndexInput payIn;
+
+  private final int version;
+
+  /** Sole constructor. */
+  public Lucene912PostingsReader(SegmentReadState state) throws IOException {
+boolean success = false;
+IndexInput docIn = null;
+IndexInput posIn = null;
+IndexInput payIn = null;
+
+// NOTE: these data files are too costly to verify checksum against all 
the bytes on open,
+// but for now we at least verify proper structure of the checksum footer: 
which looks
+// for FOOTER_MAGIC + algorithmID. This is cheap and can detect some forms 
of corruption
+// such as file truncation.
+
+String docName =
+IndexFileNames.segmentFileName(
+state.segmentInfo.name, state.segmentSuffix, 
Lucene912PostingsFormat.DOC_EXTENSION);
+try {
+  // Postings have a forward-only access pattern, so pass 
ReadAdvice.NORMAL to perform
+  // readahead.
+  docIn = state.directory.openInput(docName, 
state.context.withReadAdvice(ReadAdvice.NORMAL));
+  version =
+  CodecUtil.checkIndexHeader(
+  docIn,
+  DOC_CODEC,
+  VERSION_START,
+  VERSION_CURRENT,
+  state.segmentInfo.getId(),
+  state.segmentSuffix);
+  CodecUtil.retrieveChecksum(docIn);
+
+  if (state.fieldInfos.hasProx()) {
+String proxName =
+IndexFileNames.segmentFileName(
+state.segmentInfo.name, state.segmentSuffix, 
Lucene912PostingsFormat.POS_EXTENSION);
+posIn = state.directory.openInput(proxName, state.context);
+CodecUtil.checkIndexHeader(
+posIn, POS_CODEC, version, v

Re: [PR] Inline skip data into postings lists [lucene]

2024-07-30 Thread via GitHub



jpountz commented on code in PR #13585:
URL: https://github.com/apache/lucene/pull/13585#discussion_r1696780288


##
lucene/core/src/java/org/apache/lucene/codecs/lucene912/Lucene912PostingsReader.java:
##
@@ -0,0 +1,2028 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene912;
+
+import static org.apache.lucene.codecs.lucene912.ForUtil.BLOCK_SIZE;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.DOC_CODEC;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.PAY_CODEC;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.POS_CODEC;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.SKIP_TOTAL_SIZE;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.TERMS_CODEC;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.VERSION_CURRENT;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.VERSION_START;
+
+import java.io.IOException;
+import java.util.AbstractList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.RandomAccess;
+import org.apache.lucene.codecs.BlockTermState;
+import org.apache.lucene.codecs.CodecUtil;
+import org.apache.lucene.codecs.PostingsReaderBase;
+import 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.IntBlockTermState;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.Impact;
+import org.apache.lucene.index.Impacts;
+import org.apache.lucene.index.ImpactsEnum;
+import org.apache.lucene.index.IndexFileNames;
+import org.apache.lucene.index.IndexOptions;
+import org.apache.lucene.index.PostingsEnum;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SlowImpactsEnum;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.ReadAdvice;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BitUtil;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+import org.apache.lucene.util.IOUtils;
+
+/**
+ * Concrete class that reads docId(maybe frq,pos,offset,payloads) list with 
postings format.
+ *
+ * @lucene.experimental
+ */
+public final class Lucene912PostingsReader extends PostingsReaderBase {
+
+  private final IndexInput docIn;
+  private final IndexInput posIn;
+  private final IndexInput payIn;
+
+  private final int version;
+
+  /** Sole constructor. */
+  public Lucene912PostingsReader(SegmentReadState state) throws IOException {
+boolean success = false;
+IndexInput docIn = null;
+IndexInput posIn = null;
+IndexInput payIn = null;
+
+// NOTE: these data files are too costly to verify checksum against all 
the bytes on open,
+// but for now we at least verify proper structure of the checksum footer: 
which looks
+// for FOOTER_MAGIC + algorithmID. This is cheap and can detect some forms 
of corruption
+// such as file truncation.
+
+String docName =
+IndexFileNames.segmentFileName(
+state.segmentInfo.name, state.segmentSuffix, 
Lucene912PostingsFormat.DOC_EXTENSION);
+try {
+  // Postings have a forward-only access pattern, so pass 
ReadAdvice.NORMAL to perform
+  // readahead.
+  docIn = state.directory.openInput(docName, 
state.context.withReadAdvice(ReadAdvice.NORMAL));
+  version =
+  CodecUtil.checkIndexHeader(
+  docIn,
+  DOC_CODEC,
+  VERSION_START,
+  VERSION_CURRENT,
+  state.segmentInfo.getId(),
+  state.segmentSuffix);
+  CodecUtil.retrieveChecksum(docIn);
+
+  if (state.fieldInfos.hasProx()) {
+String proxName =
+IndexFileNames.segmentFileName(
+state.segmentInfo.name, state.segmentSuffix, 
Lucene912PostingsFormat.POS_EXTENSION);
+posIn = state.directory.openInput(proxName, state.context);
+CodecUtil.checkIndexHeader(
+posIn, POS_CODEC, version, v

Re: [PR] Inline skip data into postings lists [lucene]

2024-07-30 Thread via GitHub



jpountz commented on code in PR #13585:
URL: https://github.com/apache/lucene/pull/13585#discussion_r1696779655


##
lucene/core/src/java/org/apache/lucene/codecs/lucene912/Lucene912PostingsReader.java:
##
@@ -0,0 +1,2028 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene912;
+
+import static org.apache.lucene.codecs.lucene912.ForUtil.BLOCK_SIZE;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.DOC_CODEC;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.PAY_CODEC;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.POS_CODEC;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.SKIP_TOTAL_SIZE;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.TERMS_CODEC;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.VERSION_CURRENT;
+import static 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.VERSION_START;
+
+import java.io.IOException;
+import java.util.AbstractList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.RandomAccess;
+import org.apache.lucene.codecs.BlockTermState;
+import org.apache.lucene.codecs.CodecUtil;
+import org.apache.lucene.codecs.PostingsReaderBase;
+import 
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat.IntBlockTermState;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.Impact;
+import org.apache.lucene.index.Impacts;
+import org.apache.lucene.index.ImpactsEnum;
+import org.apache.lucene.index.IndexFileNames;
+import org.apache.lucene.index.IndexOptions;
+import org.apache.lucene.index.PostingsEnum;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SlowImpactsEnum;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.ReadAdvice;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BitUtil;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+import org.apache.lucene.util.IOUtils;
+
+/**
+ * Concrete class that reads docId(maybe frq,pos,offset,payloads) list with 
postings format.
+ *
+ * @lucene.experimental
+ */
+public final class Lucene912PostingsReader extends PostingsReaderBase {
+
+  private final IndexInput docIn;
+  private final IndexInput posIn;
+  private final IndexInput payIn;
+
+  private final int version;
+
+  /** Sole constructor. */
+  public Lucene912PostingsReader(SegmentReadState state) throws IOException {
+boolean success = false;
+IndexInput docIn = null;
+IndexInput posIn = null;
+IndexInput payIn = null;
+
+// NOTE: these data files are too costly to verify checksum against all 
the bytes on open,
+// but for now we at least verify proper structure of the checksum footer: 
which looks
+// for FOOTER_MAGIC + algorithmID. This is cheap and can detect some forms 
of corruption
+// such as file truncation.
+
+String docName =
+IndexFileNames.segmentFileName(
+state.segmentInfo.name, state.segmentSuffix, 
Lucene912PostingsFormat.DOC_EXTENSION);
+try {
+  // Postings have a forward-only access pattern, so pass 
ReadAdvice.NORMAL to perform
+  // readahead.
+  docIn = state.directory.openInput(docName, 
state.context.withReadAdvice(ReadAdvice.NORMAL));
+  version =
+  CodecUtil.checkIndexHeader(
+  docIn,
+  DOC_CODEC,
+  VERSION_START,
+  VERSION_CURRENT,
+  state.segmentInfo.getId(),
+  state.segmentSuffix);
+  CodecUtil.retrieveChecksum(docIn);
+
+  if (state.fieldInfos.hasProx()) {
+String proxName =
+IndexFileNames.segmentFileName(
+state.segmentInfo.name, state.segmentSuffix, 
Lucene912PostingsFormat.POS_EXTENSION);
+posIn = state.directory.openInput(proxName, state.context);
+CodecUtil.checkIndexHeader(
+posIn, POS_CODEC, version, v

Re: [PR] Inline skip data into postings lists [lucene]

2024-07-30 Thread via GitHub



jpountz commented on PR #13585:
URL: https://github.com/apache/lucene/pull/13585#issuecomment-2258120692

   Thanks @mikemccand for taking a look at this large PR! I think I applied all 
your suggestions. The format docs should be up-to-date wrt how skip data is 
stored, and I did the codec dance (moving Lucene99PostingsFormat and 
Lucene99Codec to lucene/backward-codecs, making Lucene99PostingsFormat 
read-only, adding a test-only Lucene99RWPostingsFormat for testing, moving all 
our special terms dictionary formats to Lucene912PostingsFormat, moving the 
completion postings format to Lucene912PostingsFormat).
   
   Note that I removed `TestLazyProxSkipping`, which assumed separate skip data 
and postings.
   
   I plan on merging soon, as this PR is now in a state where conflicts easily 
arise and are hard to address. One cannot run luceneutil anymore as luceneutil 
complains that `Lucene99Codec` is missing, but my last run before doing the 
codec dance was very similar to the last one shared above.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Release memory for cancelled tasks earlier in TaskExecutor [lucene]

2024-07-30 Thread via GitHub



javanna commented on PR #13609:
URL: https://github.com/apache/lucene/pull/13609#issuecomment-2258193070

   Ok , my hope was that we can somehow test cancellations with a scenario 
that's closer to reality, but I do see challenges. I also spent quite a bit of 
time on this and hit different issues that made testing very difficult.
   
   Perhaps we could at least write a test that fails without the change you 
made, around nulling memory?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Release memory for cancelled tasks earlier in TaskExecutor [lucene]

2024-07-30 Thread via GitHub



original-brownbear commented on PR #13609:
URL: https://github.com/apache/lucene/pull/13609#issuecomment-2258283776

   @javanna 
   
   > Perhaps we could at least write a test that fails without the change you 
made, around nulling memory?
   
   I'm really sorry. I thought about this quite a bit, but outside of actually 
testing that the callables go out of scope via some `Cleaner` magic I think 
there's no way to do this, all of the functionality is internal to the task 
executor class. There's just not really any functional change here, it's just 
that I made the `Callable`s get GCed quicker and saved a little bit of 
indirection in general.
   I think this might be one of these things where we have to trust the code?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Deduplicate bytes for `FieldReader#rootCode` [lucene]

2024-07-30 Thread via GitHub



jpountz commented on code in PR #13610:
URL: https://github.com/apache/lucene/pull/13610#discussion_r1696951086


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/FieldReader.java:
##
@@ -100,6 +99,14 @@ public final class FieldReader extends Terms {
  w.close();
  }
 */
+BytesRef emptyOutput = metadata.getEmptyOutput();
+if (rootCode.equals(emptyOutput) == false) {
+  // TODO: this branch is never taken?

Review Comment:
   Nit: Remove the question mark, which suggests we're not sure, while both the 
code and Mike suggest that they are indeed always the same?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Introduce IndexSearcher#searchLeaf(LeafReaderContext, Weight, Collector) method [lucene]

2024-07-30 Thread via GitHub



javanna merged PR #13603:
URL: https://github.com/apache/lucene/pull/13603


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Introduce IndexSearcher#searchLeaf(LeafReaderContext, Weight, Collector) method [lucene]

2024-07-30 Thread via GitHub



javanna commented on PR #13603:
URL: https://github.com/apache/lucene/pull/13603#issuecomment-2258605434

   Thank you again for the review @gsmiller !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[PR] [9.x] Introduce IndexSearcher#searchLeaf(LeafReaderContext, Weight, Collector) method [lucene]

2024-07-30 Thread via GitHub



javanna opened a new pull request, #13619:
URL: https://github.com/apache/lucene/pull/13619

   Backport of #13603
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Introduce IndexSearcher#searchLeaf(LeafReaderContext, Weight, Collector) method [lucene]

2024-07-30 Thread via GitHub



gsmiller commented on PR #13603:
URL: https://github.com/apache/lucene/pull/13603#issuecomment-2258647048

   @javanna yeah no problem. All looks good to me. +1 to backporting to 9.12. 
Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] [9.x] Introduce IndexSearcher#searchLeaf(LeafReaderContext, Weight, Collector) method [lucene]

2024-07-30 Thread via GitHub



javanna merged PR #13619:
URL: https://github.com/apache/lucene/pull/13619


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Inline skip data into postings lists [lucene]

2024-07-30 Thread via GitHub



mikemccand commented on PR #13585:
URL: https://github.com/apache/lucene/pull/13585#issuecomment-2258788196

   > Note that I removed `TestLazyProxSkipping`, which assumed separate skip 
data and postings.
   
   YAY!
   
   > I plan on merging soon, as this PR is now in a state where conflicts 
easily arise and are hard to address.
   
   +1 to merge now.  This is an amazing improvement.  Thanks @jpountz!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [I] TestFSTs.testRandomWords reproducible NPE [lucene]

2024-07-30 Thread via GitHub



slow-J commented on issue #13174:
URL: https://github.com/apache/lucene/issues/13174#issuecomment-2258837473

   Should we close this if this issue no longer manifests? I cannot repro on 
`main` at the newest commit 
https://github.com/apache/lucene/commit/30c965ea575a5e75c2bf724a340aa690d82f1ec5.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Add support for intra-segment search concurrency [lucene]

2024-07-30 Thread via GitHub



javanna commented on code in PR #13542:
URL: https://github.com/apache/lucene/pull/13542#discussion_r1697317194


##
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##
@@ -328,42 +336,65 @@ protected LeafSlice[] slices(List 
leaves) {
   /** Static method to segregate LeafReaderContexts amongst multiple slices */
   public static LeafSlice[] slices(
   List leaves, int maxDocsPerSlice, int 
maxSegmentsPerSlice) {
+
+// TODO this is a temporary hack to force testing against multiple leaf 
reader context slices.
+// It must be reverted before merging.
+maxDocsPerSlice = 1;
+maxSegmentsPerSlice = 1;
+// end hack
+
 // Make a copy so we can sort:
 List sortedLeaves = new ArrayList<>(leaves);
 
 // Sort by maxDoc, descending:
-Collections.sort(
-sortedLeaves, Collections.reverseOrder(Comparator.comparingInt(l -> 
l.reader().maxDoc(;
+sortedLeaves.sort(Collections.reverseOrder(Comparator.comparingInt(l -> 
l.reader().maxDoc(;
 
-final List> groupedLeaves = new ArrayList<>();
-long docSum = 0;
-List group = null;
+final List> groupedLeafPartitions = new 
ArrayList<>();
+int currentSliceNumDocs = 0;
+List group = null;
 for (LeafReaderContext ctx : sortedLeaves) {
   if (ctx.reader().maxDoc() > maxDocsPerSlice) {
 assert group == null;
-groupedLeaves.add(Collections.singletonList(ctx));
+// if the segment does not fit in a single slice, we split it in 
multiple partitions of

Review Comment:
   Going back to this, especially on @stefanvodita 's comment above, it may 
make sense to introduce a different way to control the number of slices, 
instead of using the existing maxDocsPerSlice argument. It is kind of nice that 
we can apply the existing value and partition segments accordingly, but that 
can also be surprising and generate way too many partitions perhaps. Good thing 
to get further feedback on.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Removing all deprecated TopScoreDocCollector + TopFieldCollector methods (#create, #createSharedManager) [lucene]

2024-07-30 Thread via GitHub



javanna commented on PR #13617:
URL: https://github.com/apache/lucene/pull/13617#issuecomment-2258904454

   > My bad, sorry for the mistake, thanks for catching it and the review!
   
   no big deal, just a small oversight. Cheers!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Add timeout support to AbstractVectorSimilarityQuery [lucene]

2024-07-30 Thread via GitHub



kaivalnp commented on code in PR #13285:
URL: https://github.com/apache/lucene/pull/13285#discussion_r1697413263


##
lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java:
##
@@ -103,16 +114,18 @@ public Explanation explain(LeafReaderContext context, int 
doc) throws IOExceptio
   public ScorerSupplier scorerSupplier(LeafReaderContext context) throws 
IOException {
 LeafReader leafReader = context.reader();
 Bits liveDocs = leafReader.getLiveDocs();
-final Scorer vectorSimilarityScorer;
+
+QueryTimeout queryTimeout = searcher.getTimeout();
+TimeLimitingKnnCollectorManager timeLimitingKnnCollectorManager =

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Add timeout support to AbstractVectorSimilarityQuery [lucene]

2024-07-30 Thread via GitHub



kaivalnp commented on code in PR #13285:
URL: https://github.com/apache/lucene/pull/13285#discussion_r1697415078


##
lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java:
##
@@ -143,27 +156,23 @@ protected boolean match(int doc) {
   }
 
   // Perform an approximate search
-  TopDocs results = approximateSearch(context, acceptDocs, 
cardinality);
+  TopDocs results =
+  approximateSearch(context, acceptDocs, cardinality, 
timeLimitingKnnCollectorManager);
 
-  // If the limit was exhausted
-  if (results.totalHits.relation == 
TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO) {
-// Return a lazy-loading iterator
-vectorSimilarityScorer =
-VectorSimilarityScorer.fromAcceptDocs(
-this,
-boost,
-createVectorScorer(context),
-new BitSetIterator(acceptDocs, cardinality),
-resultSimilarity);
-  } else if (results.scoreDocs.length == 0) {
-return null;
-  } else {
+  if (results.totalHits.relation == TotalHits.Relation.EQUAL_TO
+  // Return partial results only when timeout is met
+  || (queryTimeout != null && queryTimeout.shouldExit())) {
 // Return an iterator over the collected results
-vectorSimilarityScorer =
-VectorSimilarityScorer.fromScoreDocs(this, boost, 
results.scoreDocs);
+return VectorSimilarityScorerSupplier.fromScoreDocs(boost, 
results.scoreDocs);
+  } else {

Review Comment:
   Makes sense, I've added tests to check for a filter + non-null timeout



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] SparseFixedBitSet#firstDoc: reduce number of `indices` iterations for a bit set that is not fully built yet. [lucene]

2024-07-30 Thread via GitHub



gsmiller commented on code in PR #13559:
URL: https://github.com/apache/lucene/pull/13559#discussion_r1697210371


##
lucene/core/src/java/org/apache/lucene/util/BitSet.java:
##
@@ -92,6 +92,12 @@ public void clear() {
*/
   public abstract int nextSetBit(int index);
 
+  /**
+   * Returns the index of the first set bit from start (inclusive) until end 
(inclusive). {@link

Review Comment:
   I would tend to expect `end` to be exclusive in an API like this. (e.g., 
think of something like `String#substring(from, to)` in Java). WDYT about 
changing this?



##
lucene/core/src/java/org/apache/lucene/util/FixedBitSet.java:
##
@@ -291,6 +291,32 @@ public int nextSetBit(int index) {
 return DocIdSetIterator.NO_MORE_DOCS;
   }
 
+  @Override
+  public int firstSetBitInRange(int start, int upperBound) {
+// Depends on the ghost bits being clear!
+assert start >= 0 && start < numBits : "index=" + start + ", numBits=" + 
numBits;
+assert start <= upperBound : "index=" + start + ", upperBound=" + 
upperBound;
+int i = start >> 6;
+long word = bits[i] >> start; // skip all the bits to the right of index
+
+if (word != 0) {
+  int res = start + Long.numberOfTrailingZeros(word);
+  return res > upperBound ? DocIdSetIterator.NO_MORE_DOCS : res;
+}
+
+int maxWord = Math.min((upperBound >> 6) + 1, numWords);

Review Comment:
   nit: `maxWord` is slightly confusing naming IMO. I would have expected it to 
be an _inclusive_ max index but it's _exclusive_. The logic all makes sense to 
me, but maybe rename to something like `limit` (or change it to inclusive and 
keep the name)?



##
lucene/core/src/java/org/apache/lucene/util/SparseFixedBitSet.java:
##
@@ -353,14 +355,47 @@ public int nextSetBit(int i) {
 final long indexBits = index >>> i64 >>> 1;
 if (indexBits == 0) {
   // no more bits are set in the current block of 4096 bits, go to the 
next one
-  return firstDoc(i4096 + 1);
+  return firstDoc(i4096 + 1, indices.length);
 }
 // there are still set bits
 i64 += 1 + Long.numberOfTrailingZeros(indexBits);
 final long bits = bitArray[o];
 return (i64 << 6) | Long.numberOfTrailingZeros(bits);
   }
 
+  @Override
+  public int firstSetBitInRange(int start, int upperBound) {
+assert start < length;
+assert upperBound >= start;
+final int i4096 = start >>> 12;
+final long index = indices[i4096];
+final long[] bitArray = this.bits[i4096];
+int i64 = start >>> 6;
+int o = Long.bitCount(index & ((1L << i64) - 1));

Review Comment:
   nit: since you repeat `1L << i64` a couple times here, maybe introduce 
`final long i64bit = 1L << i64;` (like we do in `#get`)



##
lucene/join/src/java/org/apache/lucene/search/join/BlockJoinSelector.java:
##
@@ -64,14 +64,19 @@ public boolean get(int docID) {
   return false;
 }
 
-final int firstChild = parents.prevSetBit(docID - 1) + 1;
-for (int child = children.nextSetBit(firstChild);
-child < docID;
-child = children.nextSetBit(child + 1)) {
-  if (docsWithValue.get(child)) {
-return true;
+final int lastPotentialChild = docID - 1;
+
+final int firstPotentialChild = parents.prevSetBit(lastPotentialChild) 
+ 1;
+if (firstPotentialChild < docID) {
+  for (int child = children.firstSetBitInRange(firstPotentialChild, 
lastPotentialChild);
+  child != DocIdSetIterator.NO_MORE_DOCS;
+  child = children.firstSetBitInRange(child + 1, 
lastPotentialChild)) {
+if (docsWithValue.get(child)) {
+  return true;
+}

Review Comment:
   nit: I would find this just slightly more readable, but I don't have a 
strong preference. Just a small suggestion.
   
   ```suggestion
   assert firstPotentialChild <= docID;
   if (firstPotentialChild == docID) {
 // no children
 return false;
   }
   for (int child = children.firstSetBitInRange(firstPotentialChild, 
lastPotentialChild);
child != DocIdSetIterator.NO_MORE_DOCS;
child = children.firstSetBitInRange(child + 1, 
lastPotentialChild)) {
 if (docsWithValue.get(child)) {
   return true;
 }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [I] TestFSTs.testRandomWords reproducible NPE [lucene]

2024-07-30 Thread via GitHub



stefanvodita commented on issue #13174:
URL: https://github.com/apache/lucene/issues/13174#issuecomment-2259007249

   It's likely that an issue might no longer be reproducible on a more recent 
commit because an unrelated change would have impacted the random numbers we 
generate. I wouldn't assume that this issue is fixed now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]

2024-07-30 Thread via GitHub



rmuir commented on code in PR #13572:
URL: https://github.com/apache/lucene/pull/13572#discussion_r1697452900


##
lucene/core/src/c/dotProduct.c:
##
@@ -0,0 +1,143 @@
+// dotProduct.c
+
+#include 
+#include 
+
+#ifdef __ARM_ACLE
+#include 
+#endif
+
+#if (defined(__ARM_FEATURE_SVE) && !defined(__APPLE__)) 
+#include 
+/*
+ * Unrolled and vectorized int8 dotProduct implementation using SVE 
instructions
+ * NOTE: Clang 15.0 compiler on Apple M3 Max compiles the code below 
sucessfully 
+ * with '-march=native+sve' option but throws "Illegal Hardware Instruction" 
error
+ * Looks like Apple M3 does not implement SVE and Apple's official 
documentation
+ * is not explicit about this or at least I could not find it. 
+ * 
+ */

Review Comment:
   I think we should remove the ifdef, this does not happen with 
`-march=native`, correct? The problem is only you try to "force" SVE? afaik, M3 
doesnt support it and so `-march=native` should automatically take the neon 
path.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]

2024-07-30 Thread via GitHub



rmuir commented on code in PR #13572:
URL: https://github.com/apache/lucene/pull/13572#discussion_r1697455240


##
lucene/core/src/c/dotProduct.h:
##
@@ -0,0 +1,4 @@
+
+int32_t vdot8s_sve(int8_t* vec1[], int8_t* vec2, int32_t limit);
+int32_t vdot8s_neon(int8_t* vec1[], int8_t* vec2[], int32_t limit);
+int32_t dot8s(int8_t* a, int8_t* b, int32_t limit);

Review Comment:
   can we fix these prototypes to all be the same? Can we include from the .c 
file? Maybe also dding `-Wall -Werror` will help keep the code tidy?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]

2024-07-30 Thread via GitHub



rmuir commented on code in PR #13572:
URL: https://github.com/apache/lucene/pull/13572#discussion_r1697462600


##
lucene/core/src/c/dotProduct.c:
##
@@ -0,0 +1,143 @@
+// dotProduct.c
+
+#include 
+#include 
+
+#ifdef __ARM_ACLE
+#include 
+#endif
+
+#if (defined(__ARM_FEATURE_SVE) && !defined(__APPLE__)) 
+#include 
+/*
+ * Unrolled and vectorized int8 dotProduct implementation using SVE 
instructions
+ * NOTE: Clang 15.0 compiler on Apple M3 Max compiles the code below 
sucessfully 
+ * with '-march=native+sve' option but throws "Illegal Hardware Instruction" 
error
+ * Looks like Apple M3 does not implement SVE and Apple's official 
documentation
+ * is not explicit about this or at least I could not find it. 
+ * 
+ */
+int32_t vdot8s_sve(int8_t *vec1, int8_t *vec2, int32_t limit) {

Review Comment:
   purely cosmetic, but maybe change these all to array syntax, which will be 
easier for java developers, and transform pointer arithmetic such as `vec1 + i 
+ 2 * vec_length` into `vec1[i + 2 * vec_length]`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]

2024-07-30 Thread via GitHub



rmuir commented on code in PR #13572:
URL: https://github.com/apache/lucene/pull/13572#discussion_r1697469288


##
lucene/core/src/c/dotProduct.c:
##
@@ -0,0 +1,143 @@
+// dotProduct.c
+
+#include 
+#include 
+
+#ifdef __ARM_ACLE
+#include 
+#endif
+
+#if (defined(__ARM_FEATURE_SVE) && !defined(__APPLE__)) 
+#include 
+/*
+ * Unrolled and vectorized int8 dotProduct implementation using SVE 
instructions
+ * NOTE: Clang 15.0 compiler on Apple M3 Max compiles the code below 
sucessfully 
+ * with '-march=native+sve' option but throws "Illegal Hardware Instruction" 
error
+ * Looks like Apple M3 does not implement SVE and Apple's official 
documentation
+ * is not explicit about this or at least I could not find it. 
+ * 
+ */
+int32_t vdot8s_sve(int8_t *vec1, int8_t *vec2, int32_t limit) {
+int32_t result = 0;
+int32_t i = 0;
+// Vectors of 8-bit signed integers
+svint8_t va1, va2, va3, va4;
+svint8_t vb1, vb2, vb3, vb4;
+// Init accumulators
+svint32_t acc1 = svdup_n_s32(0);
+svint32_t acc2 = svdup_n_s32(0);
+svint32_t acc3 = svdup_n_s32(0);
+svint32_t acc4 = svdup_n_s32(0);
+
+// Number of 8-bits elements in the SVE vector
+int32_t vec_length = svcntb();
+
+// Manually unroll the loop
+for (i = 0; i + 4 * vec_length <= limit; i += 4 * vec_length) {
+   // Load vectors into the Z registers which can range from 128-bit to 
2048-bit wide
+   // The predicate register - P determines which bytes are active
+   // svptrue_b8() returns a predictae in which every element is true
+   //
+va1 = svld1_s8(svptrue_b8(), vec1 + i);
+vb1 = svld1_s8(svptrue_b8(), vec2 + i);
+
+va2 = svld1_s8(svptrue_b8(), vec1 + i + vec_length);
+vb2 = svld1_s8(svptrue_b8(), vec2 + i + vec_length);
+
+va3 = svld1_s8(svptrue_b8(), vec1 + i + 2 * vec_length);
+vb3 = svld1_s8(svptrue_b8(), vec2 + i + 2 * vec_length);
+
+va4 = svld1_s8(svptrue_b8(), vec1 + i + 3 * vec_length);
+vb4 = svld1_s8(svptrue_b8(), vec2 + i + 3 * vec_length);
+
+   // Dot product using SDOT instruction on Z vectors
+   acc1 = svdot_s32(acc1, va1, vb1);
+   acc2 = svdot_s32(acc2, va2, vb2);
+   acc3 = svdot_s32(acc3, va3, vb3);
+   acc4 = svdot_s32(acc4, va4, vb4);
+}   
+// Add correspponding active elements in each of the vectors 
+acc1 = svadd_s32_x(svptrue_b8() , acc1, acc2);
+acc3 = svadd_s32_x(svptrue_b8() , acc3, acc4);
+acc1 = svadd_s32_x(svptrue_b8(), acc1, acc3);
+
+// REDUCE: Add every vector element in target and write result to scalar
+result = svaddv_s32(svptrue_b8(), acc1);
+
+// Scalar tail. TODO: Use FMA

Review Comment:
   I think you can remove TODO, since aarch64 "mul" is really "madd", i expect 
it already emits single instruction. look at assembler if you are curious.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[I] Reproducible unit test failure [lucene]

2024-07-30 Thread via GitHub



slow-J opened a new issue, #13620:
URL: https://github.com/apache/lucene/issues/13620

   ### Description
   
   https://jenkins.thetaphi.de/job/Lucene-main-MacOSX/11653/
   
   I can repro with `./gradlew test --tests 
TestTopDocsCollector.testResultsOrder -Dtests.seed=207A6071B3338CA6`
   
   The test passes if I revert 
https://github.com/apache/lucene/commit/d491dfe1315c80319547ec183be3d7aa902d2e9e
   
   This is bug happens I believe due to having 2 collectors with the same 
indexes and the score of `9.17561f` is never inserted into the pq.
   
   I found a fix, pending some tests.
   
   ### Version and environment details
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]

2024-07-30 Thread via GitHub



rmuir commented on code in PR #13572:
URL: https://github.com/apache/lucene/pull/13572#discussion_r1697474113


##
lucene/core/src/c/dotProduct.c:
##
@@ -0,0 +1,143 @@
+// dotProduct.c
+
+#include 
+#include 
+
+#ifdef __ARM_ACLE
+#include 
+#endif
+
+#if (defined(__ARM_FEATURE_SVE) && !defined(__APPLE__)) 
+#include 
+/*
+ * Unrolled and vectorized int8 dotProduct implementation using SVE 
instructions
+ * NOTE: Clang 15.0 compiler on Apple M3 Max compiles the code below 
sucessfully 
+ * with '-march=native+sve' option but throws "Illegal Hardware Instruction" 
error
+ * Looks like Apple M3 does not implement SVE and Apple's official 
documentation
+ * is not explicit about this or at least I could not find it. 
+ * 
+ */
+int32_t vdot8s_sve(int8_t *vec1, int8_t *vec2, int32_t limit) {
+int32_t result = 0;
+int32_t i = 0;
+// Vectors of 8-bit signed integers
+svint8_t va1, va2, va3, va4;
+svint8_t vb1, vb2, vb3, vb4;
+// Init accumulators
+svint32_t acc1 = svdup_n_s32(0);
+svint32_t acc2 = svdup_n_s32(0);
+svint32_t acc3 = svdup_n_s32(0);
+svint32_t acc4 = svdup_n_s32(0);
+
+// Number of 8-bits elements in the SVE vector
+int32_t vec_length = svcntb();
+
+// Manually unroll the loop
+for (i = 0; i + 4 * vec_length <= limit; i += 4 * vec_length) {
+   // Load vectors into the Z registers which can range from 128-bit to 
2048-bit wide
+   // The predicate register - P determines which bytes are active
+   // svptrue_b8() returns a predictae in which every element is true
+   //
+va1 = svld1_s8(svptrue_b8(), vec1 + i);
+vb1 = svld1_s8(svptrue_b8(), vec2 + i);
+
+va2 = svld1_s8(svptrue_b8(), vec1 + i + vec_length);
+vb2 = svld1_s8(svptrue_b8(), vec2 + i + vec_length);
+
+va3 = svld1_s8(svptrue_b8(), vec1 + i + 2 * vec_length);
+vb3 = svld1_s8(svptrue_b8(), vec2 + i + 2 * vec_length);
+
+va4 = svld1_s8(svptrue_b8(), vec1 + i + 3 * vec_length);
+vb4 = svld1_s8(svptrue_b8(), vec2 + i + 3 * vec_length);
+
+   // Dot product using SDOT instruction on Z vectors
+   acc1 = svdot_s32(acc1, va1, vb1);
+   acc2 = svdot_s32(acc2, va2, vb2);
+   acc3 = svdot_s32(acc3, va3, vb3);
+   acc4 = svdot_s32(acc4, va4, vb4);
+}   

Review Comment:
   maybe consider "vector tail", since 4 * vector length can be significant: 
and vector dimensions may not be exact multiple of that. It limits the worst 
case processing that "scalar tail" must do. Example from the java vector code: 
https://github.com/apache/lucene/blob/main/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java#L153-L158



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[PR] Fix failing unit test - TestTopDocsCollector#testResultsOrder [lucene]

2024-07-30 Thread via GitHub



slow-J opened a new pull request, #13621:
URL: https://github.com/apache/lucene/pull/13621

   Closes #13620
   
   I believe this bug happens due to having 2 collectors with the same indexes 
and the score of `9.17561f` is never inserted into the pq, failing the 
assertion.
   
   Another potential solution is having a SingletonMyTopDocsCollectorMananger 
which always returns the same `MyTopDocsCollector`.
   
   TEST: ` ./gradlew tidy && ./gradlew test` + `./gradlew test --tests 
TestTopDocsCollector.testResultsOrder -Dtests.seed=207A6071B3338CA6`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] SparseFixedBitSet#firstDoc: reduce number of `indices` iterations for a bit set that is not fully built yet. [lucene]

2024-07-30 Thread via GitHub



epotyom commented on code in PR #13559:
URL: https://github.com/apache/lucene/pull/13559#discussion_r1697607067


##
lucene/core/src/java/org/apache/lucene/util/BitSet.java:
##
@@ -92,6 +92,12 @@ public void clear() {
*/
   public abstract int nextSetBit(int index);
 
+  /**
+   * Returns the index of the first set bit from start (inclusive) until end 
(inclusive). {@link

Review Comment:
   Ah, makes sense, will change that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] SparseFixedBitSet#firstDoc: reduce number of `indices` iterations for a bit set that is not fully built yet. [lucene]

2024-07-30 Thread via GitHub



epotyom commented on code in PR #13559:
URL: https://github.com/apache/lucene/pull/13559#discussion_r1697608653


##
lucene/join/src/java/org/apache/lucene/search/join/BlockJoinSelector.java:
##
@@ -64,14 +64,19 @@ public boolean get(int docID) {
   return false;
 }
 
-final int firstChild = parents.prevSetBit(docID - 1) + 1;
-for (int child = children.nextSetBit(firstChild);
-child < docID;
-child = children.nextSetBit(child + 1)) {
-  if (docsWithValue.get(child)) {
-return true;
+final int lastPotentialChild = docID - 1;
+
+final int firstPotentialChild = parents.prevSetBit(lastPotentialChild) 
+ 1;
+if (firstPotentialChild < docID) {
+  for (int child = children.firstSetBitInRange(firstPotentialChild, 
lastPotentialChild);
+  child != DocIdSetIterator.NO_MORE_DOCS;
+  child = children.firstSetBitInRange(child + 1, 
lastPotentialChild)) {
+if (docsWithValue.get(child)) {
+  return true;
+}

Review Comment:
   I'll apply the suggestion, thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] SparseFixedBitSet#firstDoc: reduce number of `indices` iterations for a bit set that is not fully built yet. [lucene]

2024-07-30 Thread via GitHub



epotyom commented on code in PR #13559:
URL: https://github.com/apache/lucene/pull/13559#discussion_r1697615867


##
lucene/core/src/java/org/apache/lucene/util/FixedBitSet.java:
##
@@ -291,6 +291,32 @@ public int nextSetBit(int index) {
 return DocIdSetIterator.NO_MORE_DOCS;
   }
 
+  @Override
+  public int firstSetBitInRange(int start, int upperBound) {
+// Depends on the ghost bits being clear!
+assert start >= 0 && start < numBits : "index=" + start + ", numBits=" + 
numBits;
+assert start <= upperBound : "index=" + start + ", upperBound=" + 
upperBound;
+int i = start >> 6;
+long word = bits[i] >> start; // skip all the bits to the right of index
+
+if (word != 0) {
+  int res = start + Long.numberOfTrailingZeros(word);
+  return res > upperBound ? DocIdSetIterator.NO_MORE_DOCS : res;
+}
+
+int maxWord = Math.min((upperBound >> 6) + 1, numWords);

Review Comment:
   I've ended up removing `Math.min` here and making `upperBound <= bit set 
size` a requirement for the method - there is similar requirement for `index` 
parameter. Will push a commit shortly - please let me know what you think.



##
lucene/core/src/java/org/apache/lucene/util/SparseFixedBitSet.java:
##
@@ -353,14 +355,47 @@ public int nextSetBit(int i) {
 final long indexBits = index >>> i64 >>> 1;
 if (indexBits == 0) {
   // no more bits are set in the current block of 4096 bits, go to the 
next one
-  return firstDoc(i4096 + 1);
+  return firstDoc(i4096 + 1, indices.length);
 }
 // there are still set bits
 i64 += 1 + Long.numberOfTrailingZeros(indexBits);
 final long bits = bitArray[o];
 return (i64 << 6) | Long.numberOfTrailingZeros(bits);
   }
 
+  @Override
+  public int firstSetBitInRange(int start, int upperBound) {
+assert start < length;
+assert upperBound >= start;
+final int i4096 = start >>> 12;
+final long index = indices[i4096];
+final long[] bitArray = this.bits[i4096];
+int i64 = start >>> 6;
+int o = Long.bitCount(index & ((1L << i64) - 1));

Review Comment:
   Make sense!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] SparseFixedBitSet#firstDoc: reduce number of `indices` iterations for a bit set that is not fully built yet. [lucene]

2024-07-30 Thread via GitHub



gsmiller commented on code in PR #13559:
URL: https://github.com/apache/lucene/pull/13559#discussion_r1697623524


##
lucene/core/src/java/org/apache/lucene/util/FixedBitSet.java:
##
@@ -291,6 +291,32 @@ public int nextSetBit(int index) {
 return DocIdSetIterator.NO_MORE_DOCS;
   }
 
+  @Override
+  public int firstSetBitInRange(int start, int upperBound) {
+// Depends on the ghost bits being clear!
+assert start >= 0 && start < numBits : "index=" + start + ", numBits=" + 
numBits;
+assert start <= upperBound : "index=" + start + ", upperBound=" + 
upperBound;
+int i = start >> 6;
+long word = bits[i] >> start; // skip all the bits to the right of index
+
+if (word != 0) {
+  int res = start + Long.numberOfTrailingZeros(word);
+  return res > upperBound ? DocIdSetIterator.NO_MORE_DOCS : res;
+}
+
+int maxWord = Math.min((upperBound >> 6) + 1, numWords);

Review Comment:
   Ah, that's even better. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] SparseFixedBitSet#firstDoc: reduce number of `indices` iterations for a bit set that is not fully built yet. [lucene]

2024-07-30 Thread via GitHub



epotyom commented on code in PR #13559:
URL: https://github.com/apache/lucene/pull/13559#discussion_r1697616026


##
lucene/core/src/java/org/apache/lucene/util/SparseFixedBitSet.java:
##
@@ -353,14 +355,47 @@ public int nextSetBit(int i) {
 final long indexBits = index >>> i64 >>> 1;
 if (indexBits == 0) {
   // no more bits are set in the current block of 4096 bits, go to the 
next one
-  return firstDoc(i4096 + 1);
+  return firstDoc(i4096 + 1, indices.length);
 }
 // there are still set bits
 i64 += 1 + Long.numberOfTrailingZeros(indexBits);
 final long bits = bitArray[o];
 return (i64 << 6) | Long.numberOfTrailingZeros(bits);
   }
 
+  @Override
+  public int firstSetBitInRange(int start, int upperBound) {
+assert start < length;
+assert upperBound >= start;
+final int i4096 = start >>> 12;
+final long index = indices[i4096];
+final long[] bitArray = this.bits[i4096];
+int i64 = start >>> 6;
+int o = Long.bitCount(index & ((1L << i64) - 1));

Review Comment:
   Makes sense!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] SparseFixedBitSet#firstDoc: reduce number of `indices` iterations for a bit set that is not fully built yet. [lucene]

2024-07-30 Thread via GitHub



gsmiller commented on code in PR #13559:
URL: https://github.com/apache/lucene/pull/13559#discussion_r1697630535


##
lucene/core/src/java/org/apache/lucene/util/BitSet.java:
##
@@ -92,6 +92,12 @@ public void clear() {
*/
   public abstract int nextSetBit(int index);
 
+  /**
+   * Returns the index of the first set bit from start (inclusive) until end 
(inclusive). {@link

Review Comment:
   Also... looking at the `#substring` method signature got me thinking about 
the API name. We if we reuse the `nextSetBit` naming for the new method since 
it's essentially the same thing but with an early-termination optimization. We 
could just overload that method signature with `#nextSetBit(int beginIndex, int 
endIndex)`. We could also later overload `#prevSetBit` in the same way if 
there's ever a real use-case. Pretty minor suggestion but might be a nice 
naming convention. Thoughts?



##
lucene/core/src/java/org/apache/lucene/util/BitSet.java:
##
@@ -92,6 +92,12 @@ public void clear() {
*/
   public abstract int nextSetBit(int index);
 
+  /**
+   * Returns the index of the first set bit from start (inclusive) until end 
(inclusive). {@link

Review Comment:
   Also... looking at the `#substring` method signature got me thinking about 
the API name. What if we reuse the `nextSetBit` naming for the new method since 
it's essentially the same thing but with an early-termination optimization. We 
could just overload that method signature with `#nextSetBit(int beginIndex, int 
endIndex)`. We could also later overload `#prevSetBit` in the same way if 
there's ever a real use-case. Pretty minor suggestion but might be a nice 
naming convention. Thoughts?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] WIP do not merge [lucene]

2024-07-30 Thread via GitHub



github-actions[bot] commented on PR #13577:
URL: https://github.com/apache/lucene/pull/13577#issuecomment-2259403530

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [I] KnnFloatVectorQuery misses highest-ranking results that FloatVectorSimilarityQuery retrieves [lucene]

2024-07-30 Thread via GitHub



david-sitsky closed issue #13611: KnnFloatVectorQuery misses highest-ranking 
results that FloatVectorSimilarityQuery retrieves
URL: https://github.com/apache/lucene/issues/13611


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

46 matches

Mail list logo