[GitHub] [lucene] jpountz opened a new issue, #11915: Make Lucene smarter about long runs of matches

2022-11-10 Thread GitBox
jpountz opened a new issue, #11915: URL: https://github.com/apache/lucene/issues/11915 ### Description Lucene's abstractions are good at dealing with long runs of documents that do not match a query, but much less at dealing with long runs of documents that match a query. In such cas

[GitHub] [lucene] rendel commented on issue #11702: Multi-Value Support for Binary DocValues [LUCENE-10666]

2022-11-10 Thread GitBox
rendel commented on issue #11702: URL: https://github.com/apache/lucene/issues/11702#issuecomment-1309968917 > I don't think ESQL is going to be different from existing faceting support: it will still want to use ordinals when it makes sense such as grouping by term. @jpountz This ma

[GitHub] [lucene] jpountz commented on pull request #11888: [Fix] Binary search the entries when all suffixes have the same length in a leaf block.

2022-11-10 Thread GitBox
jpountz commented on PR #11888: URL: https://github.com/apache/lucene/pull/11888#issuecomment-1310043887 Thanks @vsop-479. Do you know if the test you added to terms can be improved in such a way that it would have caught this bug? -- This is an automated message from the Apache Git Servi

[GitHub] [lucene] jfboeuf commented on a diff in pull request #11900: Reduce bloom filter size by using the optimal count for hash functions.

2022-11-10 Thread GitBox
jfboeuf commented on code in PR #11900: URL: https://github.com/apache/lucene/pull/11900#discussion_r1019029965 ## lucene/codecs/src/java/org/apache/lucene/codecs/bloom/FuzzySet.java: ## @@ -46,7 +46,9 @@ public class FuzzySet implements Accountable { public static final in
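For context on the PR title above: the textbook result for a Bloom filter with m bits and n inserted keys is that the false-positive rate is lowest at roughly k = (m / n) * ln 2 hash functions. The sketch below only computes that value and the matching approximate false-positive rate. The class and method names are hypothetical, and this is not the code in FuzzySet.java, whose diff is truncated in the preview.

```java
public final class BloomSizingSketch {
  private BloomSizingSketch() {}

  /** Textbook optimum k = (m / n) * ln 2, clamped to at least one hash function. */
  public static int optimalHashCount(long numBits, long numKeys) {
    if (numBits <= 0 || numKeys <= 0) {
      throw new IllegalArgumentException("numBits and numKeys must be positive");
    }
    long k = Math.round((double) numBits / numKeys * Math.log(2));
    return (int) Math.max(1, Math.min(k, Integer.MAX_VALUE));
  }

  /** Approximate false-positive rate p = (1 - e^(-k * n / m))^k. */
  public static double falsePositiveRate(long numBits, long numKeys, int hashCount) {
    double fillRatio = 1 - Math.exp(-(double) hashCount * numKeys / numBits);
    return Math.pow(fillRatio, hashCount);
  }

  public static void main(String[] args) {
    long numKeys = 1_000_000L;
    long numBits = 10 * numKeys; // hypothetical budget of 10 bits per key
    int k = optimalHashCount(numBits, numKeys);
    System.out.println("hash functions: " + k);
    System.out.println("approx. false-positive rate: " + falsePositiveRate(numBits, numKeys, k));
  }
}
```

Read the other way round, the same formula suggests why the hash count matters for size: at a fixed target false-positive rate, a filter that uses too few hash functions needs more bits per key than one near the optimum.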

[GitHub] [lucene] jfboeuf commented on a diff in pull request #11900: Reduce bloom filter size by using the optimal count for hash functions.

2022-11-10 Thread GitBox
jfboeuf commented on code in PR #11900: URL: https://github.com/apache/lucene/pull/11900#discussion_r1019030241 ## lucene/codecs/src/java/org/apache/lucene/codecs/bloom/MurmurHash64.java: ## @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or

[GitHub] [lucene] rmuir commented on issue #11911: improve checkindex to be more thorough for vectors (e.g. test seeking)

2022-11-10 Thread GitBox
rmuir commented on issue #11911: URL: https://github.com/apache/lucene/issues/11911#issuecomment-1310217848 "read every byte of the index" is the promise that checkindex makes. So this bug is really important.

[GitHub] [lucene] rmuir closed pull request #11906: Add monster test for many knn docs

2022-11-10 Thread GitBox
rmuir closed pull request #11906: Add monster test for many knn docs URL: https://github.com/apache/lucene/pull/11906

[GitHub] [lucene] rmuir commented on pull request #11906: Add monster test for many knn docs

2022-11-10 Thread GitBox
rmuir commented on PR #11906: URL: https://github.com/apache/lucene/pull/11906#issuecomment-1310221605 this test is folded into #11905

[GitHub] [lucene] rmuir merged pull request #11905: Fix integer overflow when seeking the vector index for connections

2022-11-10 Thread GitBox
rmuir merged PR #11905: URL: https://github.com/apache/lucene/pull/11905
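The class of bug named in this PR title can be illustrated with a small sketch. It assumes a hypothetical layout in which every node occupies a fixed number of bytes and is not the actual Lucene code: when a file offset is computed by multiplying two `int` values, the product wraps around once it exceeds `Integer.MAX_VALUE`, and widening the result afterwards is too late.

```java
public final class OffsetOverflowSketch {
  private OffsetOverflowSketch() {}

  /** Buggy: the multiplication happens in 32-bit arithmetic, then the result is widened. */
  public static long buggyOffset(int nodeOrd, int bytesPerNode) {
    return nodeOrd * bytesPerNode;
  }

  /** Fixed: one operand is widened first, so the multiplication happens in 64 bits. */
  public static long fixedOffset(int nodeOrd, int bytesPerNode) {
    return (long) nodeOrd * bytesPerNode;
  }

  public static void main(String[] args) {
    int nodeOrd = 20_000_000; // hypothetical node ordinal
    int bytesPerNode = 132;   // hypothetical fixed record size
    System.out.println("buggy offset: " + buggyOffset(nodeOrd, bytesPerNode));
    System.out.println("fixed offset: " + fixedOffset(nodeOrd, bytesPerNode));
  }
}
```

Bugs of this kind tend to show up only on very large indexes, which fits the large-document-count test mentioned elsewhere in this thread listing.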

[GitHub] [lucene] jpountz commented on pull request #907: LUCENE-10357 Ghost fields and postings/points

2022-11-10 Thread GitBox
jpountz commented on PR #907: URL: https://github.com/apache/lucene/pull/907#issuecomment-1310330397 Apologies Luca, but after looking more at your changes, I'm getting worried that this change is harder than I had anticipated. I was optimistically hoping that never returning null PointValu

[GitHub] [lucene] jpountz commented on issue #11393: Ghost fields and postings/points [LUCENE-10357]

2022-11-10 Thread GitBox
jpountz commented on issue #11393: URL: https://github.com/apache/lucene/issues/11393#issuecomment-1310333691 I had hoped that getting rid of ghost fields would automatically help avoid some bugs but after looking into it for both postings and points (thanks @javanna and @shahrs87 !) it loo

[GitHub] [lucene] jpountz closed issue #11393: Ghost fields and postings/points [LUCENE-10357]

2022-11-10 Thread GitBox
jpountz closed issue #11393: Ghost fields and postings/points [LUCENE-10357] URL: https://github.com/apache/lucene/issues/11393

[GitHub] [lucene] jpountz commented on pull request #11793: Prevent PointValues from returning null for ghost fields

2022-11-10 Thread GitBox
jpountz commented on PR #11793: URL: https://github.com/apache/lucene/pull/11793#issuecomment-1310334991 Apologies @javanna, but after looking more at your changes, I'm getting worried that this change is harder than I had anticipated. I was optimistically hoping that never returning null P

[GitHub] [lucene] javanna closed pull request #11793: Prevent PointValues from returning null for ghost fields

2022-11-10 Thread GitBox
javanna closed pull request #11793: Prevent PointValues from returning null for ghost fields URL: https://github.com/apache/lucene/pull/11793

[GitHub] [lucene] javanna commented on pull request #11793: Prevent PointValues from returning null for ghost fields

2022-11-10 Thread GitBox
javanna commented on PR #11793: URL: https://github.com/apache/lucene/pull/11793#issuecomment-1310370411 Agreed @jpountz I think it was a good experiment to spend some time on, and I have also been thinking along the same lines, that the changes I ended up making were not solving the proble

[GitHub] [lucene] benwtrent opened a new pull request, #11916: GITHUB#11911: improve checkindex to be more thorough for vectors

2022-11-10 Thread GitBox
benwtrent opened a new pull request, #11916: URL: https://github.com/apache/lucene/pull/11916 Checkindex with vectors should exercise the graph and seek operations. These are exposed via the search interface. There is the option to search EVERY stored vector value as we iterate it, b

[GitHub] [lucene] benwtrent commented on pull request #11916: GITHUB#11911: improve checkindex to be more thorough for vectors

2022-11-10 Thread GitBox
benwtrent commented on PR #11916: URL: https://github.com/apache/lucene/pull/11916#issuecomment-1310547475 @rmuir I took a stab at it. I am unfamiliar with checkindex, but this will search 64 vectors, seeking the graph to catch if there is something obscenely broken. A more complicated

[GitHub] [lucene] rmuir commented on pull request #11916: GITHUB#11911: improve checkindex to be more thorough for vectors

2022-11-10 Thread GitBox
rmuir commented on PR #11916: URL: https://github.com/apache/lucene/pull/11916#issuecomment-1310577261 Thank you, yeah this is fine as a start! I think, it would be an improvement in the future to not just search the first 64 vectors but maybe every n'th (just a different form of sampling).
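The sampling idea in this comment (check every n'th vector instead of the first 64) can be sketched as below. This is a hypothetical illustration with made-up names, not the CheckIndex implementation: it picks up to `maxSamples` ordinals spread evenly over the whole range, so a problem that only affects late entries still has a chance of being hit.

```java
import java.util.Arrays;

public final class VectorSamplingSketch {
  private VectorSamplingSketch() {}

  /** Returns up to maxSamples ordinals in [0, count), evenly spaced across the range. */
  public static int[] evenlySpacedOrdinals(int count, int maxSamples) {
    if (count < 0 || maxSamples < 0) {
      throw new IllegalArgumentException("count and maxSamples must be non-negative");
    }
    int samples = Math.min(count, maxSamples);
    int[] ordinals = new int[samples];
    for (int i = 0; i < samples; i++) {
      // Widen to long so that i * count cannot overflow for large counts.
      ordinals[i] = (int) ((long) i * count / samples);
    }
    return ordinals;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(evenlySpacedOrdinals(1000, 4)));
    System.out.println(Arrays.toString(evenlySpacedOrdinals(3, 64)));
  }
}
```

When there are fewer vectors than samples, every ordinal is returned, so small segments would still be checked in full under this scheme.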

[GitHub] [lucene] jpountz opened a new pull request, #11917: Automatically preload index files that are both tiny and very hot.

2022-11-10 Thread GitBox
jpountz opened a new pull request, #11917: URL: https://github.com/apache/lucene/pull/11917 The default codec has a number of small and hot files, that actually used to be fully loaded in memory before we moved them off-heap. In the general case, these files are expected to fully fit into t

[GitHub] [lucene] uschindler opened a new pull request, #11918: Port generic exception handling from MemorySegmentIndexInput to ByteBufferIndexInput

2022-11-10 Thread GitBox
uschindler opened a new pull request, #11918: URL: https://github.com/apache/lucene/pull/11918 This also adds incorrect (e.g., negative) positions to exception message. This also fixes some wrong exception messages (seek vs. read) in ByteBufferIndexInput. Sometimes it said "seek" alth

[GitHub] [lucene] uschindler commented on pull request #11918: Port generic exception handling from MemorySegmentIndexInput to ByteBufferIndexInput

2022-11-10 Thread GitBox
uschindler commented on PR #11918: URL: https://github.com/apache/lucene/pull/11918#issuecomment-1310726064 The new test is a bit bad, but unfortunately, MMapDirectory's multi-input only has an assert in seek(). If that hits, the test also passes. In reality negative offsets on slices should a

[GitHub] [lucene] jtibshirani commented on issue #11863: Add large-scale test for kNN vectors

2022-11-10 Thread GitBox
jtibshirani commented on issue #11863: URL: https://github.com/apache/lucene/issues/11863#issuecomment-1310797975 In https://github.com/apache/lucene/pull/11905 we added a test for a large number of documents (with a tiny dimension). It'd also be good to clean up and merge something l

[GitHub] [lucene] rmuir commented on pull request #11916: GITHUB#11911: improve checkindex to be more thorough for vectors

2022-11-10 Thread GitBox
rmuir commented on PR #11916: URL: https://github.com/apache/lucene/pull/11916#issuecomment-1310890561 thanks! I like it. Feel free to add a CHANGES entry if you want, it is a good one for that, because checkindex is user-visible and important. I would suggest in the 9.4.2 section as that's

[GitHub] [lucene] benwtrent commented on pull request #11916: GITHUB#11911: improve checkindex to be more thorough for vectors

2022-11-10 Thread GitBox
benwtrent commented on PR #11916: URL: https://github.com/apache/lucene/pull/11916#issuecomment-1310907074 pushed CHANGES under 9.4.2 as an `Improvement` @rmuir

[GitHub] [lucene] rmuir merged pull request #11916: GITHUB#11911: improve checkindex to be more thorough for vectors

2022-11-10 Thread GitBox
rmuir merged PR #11916: URL: https://github.com/apache/lucene/pull/11916

[GitHub] [lucene] rmuir closed issue #11911: improve checkindex to be more thorough for vectors (e.g. test seeking)

2022-11-10 Thread GitBox
rmuir closed issue #11911: improve checkindex to be more thorough for vectors (e.g. test seeking) URL: https://github.com/apache/lucene/issues/11911

[GitHub] [lucene] uschindler commented on pull request #11918: Port generic exception handling from MemorySegmentIndexInput to ByteBufferIndexInput

2022-11-10 Thread GitBox
uschindler commented on PR #11918: URL: https://github.com/apache/lucene/pull/11918#issuecomment-1310952543 Should I backport this also to 9.4.2 when it gets released next week? I am afraid of more horrible bugs in vectors and I'd like to give people a chance to report it. Problem of

[GitHub] [lucene] rmuir commented on pull request #11916: GITHUB#11911: improve checkindex to be more thorough for vectors

2022-11-10 Thread GitBox
rmuir commented on PR #11916: URL: https://github.com/apache/lucene/pull/11916#issuecomment-1310961095 @benwtrent I hit an issue upon backporting to branch_9x: it may be nothing specific to 9.x but just a random seed that hasn't been encountered yet on master? The checkindex error messa

[GitHub] [lucene] rmuir commented on pull request #11916: GITHUB#11911: improve checkindex to be more thorough for vectors

2022-11-10 Thread GitBox
rmuir commented on PR #11916: URL: https://github.com/apache/lucene/pull/11916#issuecomment-1310962304 Another idea, perhaps even simpler, is not to filter deleted docs at all here in this logic, because checkindex doesn't normally exclude deleted docs and just checks everything.

[GitHub] [lucene] benwtrent commented on pull request #11916: GITHUB#11911: improve checkindex to be more thorough for vectors

2022-11-10 Thread GitBox
benwtrent commented on PR #11916: URL: https://github.com/apache/lucene/pull/11916#issuecomment-1310966373 @rmuir 100%. I reproduced it with that seed, then removed the deleted docs check and it cleared up. I bet it's because ALL the docs were deleted or something.

[GitHub] [lucene] benwtrent opened a new pull request, #11919: Follow up to GITHUB#11916, remove deleted docs check

2022-11-10 Thread GitBox
benwtrent opened a new pull request, #11919: URL: https://github.com/apache/lucene/pull/11919 There is a chance that all the docs are deleted. This is ok in a checkindex scenario and other checks don't bother with verifying deleted docs like this. Removing the check. This repro

[GitHub] [lucene] rmuir commented on pull request #11919: Follow up to GITHUB#11916, remove deleted docs check

2022-11-10 Thread GitBox
rmuir commented on PR #11919: URL: https://github.com/apache/lucene/pull/11919#issuecomment-1310970223 looks good, thank you for making the PR so fast. The test failure reproduces and with this change it passes again.

[GitHub] [lucene] benwtrent commented on pull request #11919: Follow up to GITHUB#11916, remove deleted docs check

2022-11-10 Thread GitBox
benwtrent commented on PR #11919: URL: https://github.com/apache/lucene/pull/11919#issuecomment-1310971424 Ran ``` ./gradlew test --tests TestLucene94HnswVectorsFormat -Dtests.iters=1000 ``` Just to be sure we are good. All green locally. @rmuir

[GitHub] [lucene] benwtrent commented on pull request #11919: Follow up to GITHUB#11916, remove deleted docs check

2022-11-10 Thread GitBox
benwtrent commented on PR #11919: URL: https://github.com/apache/lucene/pull/11919#issuecomment-1310971694 Apologies for the noise! Still learning all of Lucene's edges :D

[GitHub] [lucene] rmuir commented on pull request #11917: Automatically preload index files that are both tiny and very hot.

2022-11-10 Thread GitBox
rmuir commented on PR #11917: URL: https://github.com/apache/lucene/pull/11917#issuecomment-1310974545 I think preload is different from mlock, mlock needs way more discussion and personally I'm against it. mlock would be an operational hassle because of default resource limits on linux as

[GitHub] [lucene] uschindler commented on a diff in pull request #11917: Automatically preload index files that are both tiny and very hot.

2022-11-10 Thread GitBox
uschindler commented on code in PR #11917: URL: https://github.com/apache/lucene/pull/11917#discussion_r1019662247 ## lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java: ## @@ -235,7 +235,7 @@ public IndexInput openInput(String name, IOContext context) throws IOExc

[GitHub] [lucene] uschindler commented on pull request #11917: Automatically preload index files that are both tiny and very hot.

2022-11-10 Thread GitBox
uschindler commented on PR #11917: URL: https://github.com/apache/lucene/pull/11917#issuecomment-1310976902 > I think preload is different from mlock, mlock needs way more discussion and personally I'm against it. mlock would be an operational hassle because of default resource limits on li

[GitHub] [lucene] uschindler commented on pull request #11918: Port generic exception handling from MemorySegmentIndexInput to ByteBufferIndexInput

2022-11-10 Thread GitBox
uschindler commented on PR #11918: URL: https://github.com/apache/lucene/pull/11918#issuecomment-1310991790 Ah you added milestone 9.4.2 already to issue. Will do same here.

[GitHub] [lucene] uschindler commented on pull request #912: MR-JAR rewrite of MMapDirectory with JDK-19 preview Panama APIs (>= JDK-19-ea+23)

2022-11-10 Thread GitBox
uschindler commented on PR #912: URL: https://github.com/apache/lucene/pull/912#issuecomment-1311033936 After Mike switched to preview mode the results look good. The speed with MemorySegmentIndexInput is similar to old ByteBuffer code. https://home.apache.org/~mikemccand/lucenebench

[GitHub] [lucene] rmuir merged pull request #11919: Follow up to GITHUB#11916, remove deleted docs check

2022-11-10 Thread GitBox
rmuir merged PR #11919: URL: https://github.com/apache/lucene/pull/11919

[GitHub] [lucene] rmuir commented on pull request #11918: Port generic exception handling from MemorySegmentIndexInput to ByteBufferIndexInput

2022-11-10 Thread GitBox
rmuir commented on PR #11918: URL: https://github.com/apache/lucene/pull/11918#issuecomment-1311107401 yes, +1 to backport. This way if there is another problem, it might be easier to debug.