[GitHub] [lucene] mikemccand commented on issue #12536: Remove `lastPosBlockOffset` from term metadata for Lucene90PostingsFormat

2023-09-07 Thread via GitHub
mikemccand commented on issue #12536: URL: https://github.com/apache/lucene/issues/12536#issuecomment-170145 > In theory, if the skipper can tell us how many positions it has skipped that would work. This will require storing more information in the skip data than the current scheme.

[GitHub] [lucene] mikemccand commented on pull request #12541: Document why we need `lastPosBlockOffset`

2023-09-07 Thread via GitHub
mikemccand commented on PR #12541: URL: https://github.com/apache/lucene/pull/12541#issuecomment-1710002412 Oh, it looks like `tidy` is angry -- can you run `./gradlew tidy` @Tony-X? This will re-style your new comment to match the required styling. Thanks! -- This is an automated messa

[GitHub] [lucene] mikemccand commented on issue #12513: Try out a tantivy's term dictionary format

2023-09-07 Thread via GitHub
mikemccand commented on issue #12513: URL: https://github.com/apache/lucene/issues/12513#issuecomment-1710036830 I'm poking around trying to understand Tantivy's FST implementation, and found it was forked originally from [this FST implementation](https://github.com/BurntSushi/fst) into thi

[GitHub] [lucene] mikemccand commented on issue #12513: Try out a tantivy's term dictionary format

2023-09-07 Thread via GitHub
mikemccand commented on issue #12513: URL: https://github.com/apache/lucene/issues/12513#issuecomment-1710044325 How the [blog post](https://blog.burntsushi.net/transducers/) models his cat reminds me of how I [modeled the scoring of a single tennis game as an FSA](https://blog.mikemccandle

[GitHub] [lucene] mikemccand commented on issue #12513: Try out a tantivy's term dictionary format

2023-09-07 Thread via GitHub
mikemccand commented on issue #12513: URL: https://github.com/apache/lucene/issues/12513#issuecomment-1710089496 Aha! This is an interesting approach: ``` It is possible to mitigate the onerous memory required by sacrificing guaranteed minimality of the resulting FST. Namely, on

[GitHub] [lucene] mikemccand commented on issue #12513: Try out a tantivy's term dictionary format

2023-09-07 Thread via GitHub
mikemccand commented on issue #12513: URL: https://github.com/apache/lucene/issues/12513#issuecomment-1710091688 The next paragraph in the blog is also very interesting! ``` An interesting consequence of using a bounded hash table which only stores some of the states is that cons
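
As a rough illustration of the "bounded hash table which only stores some of the states" idea quoted in the two comments above, here is a minimal sketch. It is not Lucene's or Tantivy's actual code. The class name `BoundedRegistry`, the `intern` method, and the state "signature" layout are all hypothetical, and LRU eviction is an assumed policy. The sketch only shows the trade-off: when an equivalent state has been evicted, a duplicate is created, so the automaton stays correct but is no longer guaranteed minimal.

```python
from collections import OrderedDict


class BoundedRegistry:
    """Hypothetical LRU-bounded map from a frozen state's signature to its id.

    A signature here is assumed to be (is_final, ((label, target_id), ...)).
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be >= 1")
        self.capacity = capacity
        self._seen = OrderedDict()  # signature -> state id, oldest first
        self.next_id = 0            # total number of states created so far

    def intern(self, signature):
        """Return (state_id, reused) for an equivalent or newly created state."""
        if signature in self._seen:
            # Hit: share the existing state and mark it as recently used.
            self._seen.move_to_end(signature)
            return self._seen[signature], True
        # Miss: either the state is new, or an equivalent one was evicted.
        state_id = self.next_id
        self.next_id += 1
        self._seen[signature] = state_id
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # evict the least recently used entry
        return state_id, False


registry = BoundedRegistry(capacity=2)
final_state = (True, ())
via_x = (False, ((ord("x"), 0),))
via_y = (False, ((ord("y"), 0),))

first = registry.intern(final_state)   # creates a new state
again = registry.intern(final_state)   # shared while still in the table
registry.intern(via_x)
registry.intern(via_y)                 # table is full, final_state is evicted
after_eviction = registry.intern(final_state)  # duplicate state is created
```

With an unbounded table the three distinct signatures would yield three states. The bounded table creates a fourth, which is the memory-versus-minimality trade-off under discussion.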

[GitHub] [lucene] mikemccand opened a new issue, #12543: `FSTCompiler.Builder` should have an option to stream the FST bytes directly to Directory

2023-09-07 Thread via GitHub
mikemccand opened a new issue, #12543: URL: https://github.com/apache/lucene/issues/12543 ### Description [Spinoff from [this comment](https://github.com/apache/lucene/issues/12513#issuecomment-1710091688) inspired by Tantivy's FST implementation] The building of an FST is inh

[GitHub] [lucene] mikemccand commented on issue #12527: Optimize readInts24 performance for DocIdsWriter

2023-09-07 Thread via GitHub
mikemccand commented on issue #12527: URL: https://github.com/apache/lucene/issues/12527#issuecomment-1710234616 OK I tested the "read into scratch array" approach from [this comment](https://github.com/apache/lucene/issues/12527#issuecomment-1708857931): ```

[GitHub] [lucene] mikemccand commented on pull request #12535: LockVerifyServer does not need to reuse addresses nor set accept timeout

2023-09-07 Thread via GitHub
mikemccand commented on PR #12535: URL: https://github.com/apache/lucene/pull/12535#issuecomment-1710238828 Thanks @uschindler and @rmuir -- I will restore the 30s timeout, try to improve logging on client and server errors (so we can figure out WTF happened that caused client not to connec

[GitHub] [lucene] jimczi commented on pull request #12529: Introduce a random vector scorer in HNSW builder/searcher

2023-09-07 Thread via GitHub
jimczi commented on PR #12529: URL: https://github.com/apache/lucene/pull/12529#issuecomment-1710241290 I merged with the latest changes in main; the new random vector scorer integrates nicely with the changes added in `https://github.com/apache/lucene/pull/12480`. The only difference is that

[GitHub] [lucene] mikemccand commented on pull request #12535: LockVerifyServer does not need to reuse addresses nor set accept timeout

2023-09-07 Thread via GitHub
mikemccand commented on PR #12535: URL: https://github.com/apache/lucene/pull/12535#issuecomment-1710320639 OK I made the changes! I also manually tested two failure modes of the clients: taking too long to initially connect, and throwing some sort of exception after testing the lock

[GitHub] [lucene] mikemccand commented on pull request #12535: LockVerifyServer does not need to reuse addresses nor set accept timeout

2023-09-07 Thread via GitHub
mikemccand commented on PR #12535: URL: https://github.com/apache/lucene/pull/12535#issuecomment-1710327041 > I did notice one odd thing: on test failure, I seemed to have a leftover test.lock in the root directory of the checkout, which is very odd. The test creates a new temp directory an

[GitHub] [lucene] benwtrent commented on pull request #12529: Introduce a random vector scorer in HNSW builder/searcher

2023-09-07 Thread via GitHub
benwtrent commented on PR #12529: URL: https://github.com/apache/lucene/pull/12529#issuecomment-1710335794 Well, actually looking at the JFR, I cannot see anything that stands out. The percentages of compute time are still VERY similar when building index & querying. I may just be detecting

[GitHub] [lucene] dweiss commented on issue #12542: Lucene's FST Builder should have a simpler "knob" to trade off memory/CPU required against minimality

2023-09-07 Thread via GitHub
dweiss commented on issue #12542: URL: https://github.com/apache/lucene/issues/12542#issuecomment-1710614193 I like it. These options we currently have are not even expert level, they're God-level...

[GitHub] [lucene] jimczi commented on pull request #12529: Introduce a random vector scorer in HNSW builder/searcher

2023-09-07 Thread via GitHub
jimczi commented on PR #12529: URL: https://github.com/apache/lucene/pull/12529#issuecomment-1710682508 Thanks for running the benchmarks @benwtrent. I agree that the difference seems to be in the noise.

[GitHub] [lucene] Tony-X commented on pull request #12541: Document why we need `lastPosBlockOffset`

2023-09-07 Thread via GitHub
Tony-X commented on PR #12541: URL: https://github.com/apache/lucene/pull/12541#issuecomment-1710736645 @mikemccand sure. Thanks for pointing out the useful `tidy` target :) I got it fixed.

[GitHub] [lucene] madrob commented on issue #12542: Lucene's FST Builder should have a simpler "knob" to trade off memory/CPU required against minimality

2023-09-07 Thread via GitHub
madrob commented on issue #12542: URL: https://github.com/apache/lucene/issues/12542#issuecomment-1710800611 What's the impact of having a non-minimal FST? Longer query times? Is that something that gets dwarfed by having multiple segments anyway? Maybe different merge policies have differe

[GitHub] [lucene] elliotzlin commented on pull request #1069: [LUCENE-2587] Highlighter fragment bug

2023-09-07 Thread via GitHub
elliotzlin commented on PR #1069: URL: https://github.com/apache/lucene/pull/1069#issuecomment-1711167476 @dsmiley apologies for my delay in getting back to your comment! I don't have any qualms about refactoring to deter people from using this. I took up this ticket more so to get involved