[GitHub] [lucene] jpountz closed pull request #227: LUCENE-10033: Encode numeric doc values and ordinals of SORTED(_SET) doc values in blocks.

2022-10-06 Thread GitBox


jpountz closed pull request #227: LUCENE-10033: Encode numeric doc values and 
ordinals of SORTED(_SET) doc values in blocks.
URL: https://github.com/apache/lucene/pull/227


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] shubhamvishu commented on a diff in pull request #11832: Added static factory method for loading VectorValues

2022-10-06 Thread GitBox


shubhamvishu commented on code in PR #11832:
URL: https://github.com/apache/lucene/pull/11832#discussion_r988712945


##
lucene/core/src/java/org/apache/lucene/index/SlowCodecReaderWrapper.java:
##
@@ -163,7 +163,7 @@ private static KnnVectorsReader 
readerToVectorReader(LeafReader reader) {
 return new KnnVectorsReader() {
   @Override
   public VectorValues getVectorValues(String field) throws IOException {
-return reader.getVectorValues(field);
+return VectorValues.getVectorValues(reader, field);

Review Comment:
   > The new method looks good, but it's no longer used anywhere under src/java 
anymore, which makes me wonder if it's actually helpful.
   
   One reasoning is we can have this to be consistent like how we have similar 
static functions in `DocValues`. There no code using it right now but this 
could be used maybe in future?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on issue #11839: gradle aggregate coverage report

2022-10-06 Thread GitBox


rmuir commented on issue #11839:
URL: https://github.com/apache/lucene/issues/11839#issuecomment-1269611646

   @zhaih please take it!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #11832: Added static factory method for loading VectorValues

2022-10-06 Thread GitBox


jpountz commented on PR #11832:
URL: https://github.com/apache/lucene/pull/11832#issuecomment-1269629285

   > One reasoning is we can have this to be consistent like how we have 
similar static functions in DocValues. There no code using it right now but 
this could be used maybe in future?
   
   I'm starting to have more second thoughts about whether we should actually 
add this new method. On doc values, I'm seeing value because it helps save a 
few annoying null checks and then everything should work as expected on the 
empty instances. Saving null checks applies for vectors too, but the `EMPTY` 
vectors instance also returns `0` for the `dimension()` method, which is 
something I could see some code paths to complain about if they end up seeing 
different values for the dimension across different segments of the same index. 
So it's not clear to me that it would prevent more bugs than it would introduce.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #11831: GITHUB-11761: Move minimum TieredMergePolicy delete percentage from 2…

2022-10-06 Thread GitBox


jpountz commented on PR #11831:
URL: https://github.com/apache/lucene/pull/11831#issuecomment-1269633237

   Why did we have to update the number of allowed deletes back to 33% on some 
tests, did they fail otherwise? Is there another way how we could improve these 
tests to cope better with the new default?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #11831: GITHUB-11761: Move minimum TieredMergePolicy delete percentage from 2…

2022-10-06 Thread GitBox


jpountz commented on PR #11831:
URL: https://github.com/apache/lucene/pull/11831#issuecomment-1269635726

   E.g. maybe it would be better to add more documents in 
`testNRTIsCurrentAfterDelete` in order for the single delete not to introduce 
more than 20% deletes, and keep the `TieredMergePolicy` defaults?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on issue #11830: Store HNSW graph connections more compactly

2022-10-06 Thread GitBox


benwtrent commented on issue #11830:
URL: https://github.com/apache/lucene/issues/11830#issuecomment-1270205020

   I was able to replicate @jtibshirani results. 
   
   Using the `glove-100-angular` from ann-benchmarks, here is the graph data 
size:
   
   ```
   Baseline:
   154M  _0_Lucene94HnswVectorsFormat_0.vex
   
   PackedInts:
   103M _0_Lucene94HnswVectorsFormat_0.vex
   ```
   
   Now, the implementation for `PackedInts` followed closely to @jtibshirani 
"hacky patch". There are various changes required on read, and new information 
needs to be stored in regards to the PackedInts version used. In all 
(disregarding the fact that a new codec must be written), the code change is 
pretty small. 
   
   Other good news is that when running ann-benchmark on `glove-100-angular`, 
there is no discernible difference in search performance. I am not sure how 
much traffic the off-heap reader gets in this benchmark. It seems to be it 
would be read once and stored in memory, though I might be misunderstanding 
something. 
   
   @jtibshirani @mayya-sharipova what do y'all think? Any ideas on how to force 
off-heap for search for better numbers on search changes?
   
   Honestly, we may not care about slightly slower off-heap search performance 
as smaller sizes are more likely to be in the page cache.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jtibshirani commented on issue #11830: Store HNSW graph connections more compactly

2022-10-06 Thread GitBox


jtibshirani commented on issue #11830:
URL: https://github.com/apache/lucene/issues/11830#issuecomment-1270359029

   @benwtrent thanks for looking into this! To clarify, `OffHeapHnswGraph` will 
always be used for searches. Usually it will be paged in from disk before heavy 
searching, but it's still the same data structure. The on-heap version is only 
used for indexing.
   
   Could you include a comparison of the search latency and recall numbers you 
see with this approach? Sometimes with our benchmarks it's easy to miss small 
differences in performance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on issue #11830: Store HNSW graph connections more compactly

2022-10-06 Thread GitBox


benwtrent commented on issue #11830:
URL: https://github.com/apache/lucene/issues/11830#issuecomment-1270415680

   > Could you include a comparison of the search latency and recall numbers 
you see with this approach? Sometimes with our benchmarks it's easy to miss 
small differences in performance.
   
   Yes, I can provide some images and numbers. What I saw when running 
`glove-100-angular` batched is that there is no difference. Will run more tests 
for sure.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jtibshirani commented on issue #11838: Adding concurrency to query rewrite?

2022-10-06 Thread GitBox


jtibshirani commented on issue #11838:
URL: https://github.com/apache/lucene/issues/11838#issuecomment-1270564173

   This is a great question! To me it'd make sense to try to move costly steps 
into weight creation or scoring. It feels a little "off" to do the bulk of a 
query's work during rewrite.
   
   However, I understand why we chose to use rewrite for `KnnVectorQuery`. It 
allows us the query to match the global top `k` across segments, as opposed to 
matching `k` docs per segment. This means the number of documents it matches 
doesn't depend on the number of segments, and it's easier to combine with other 
queries. Maybe this is not the right trade-off though in terms of behavior vs. 
performance 🤔 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org