[GitHub] [lucene] mikemccand commented on issue #12536: Remove `lastPosBlockOffset` from term metadata for Lucene90PostingsFormat
mikemccand commented on issue #12536: URL: https://github.com/apache/lucene/issues/12536#issuecomment-170145

> In theory, if the skipper can tell us how many positions it has skipped that would work. This will require storing more information in the skip data than the current scheme.

OK, and it seems like that would not be a good tradeoff? Every skip entry would need to record how many total positions were skipped, whereas the `lastPosBlockOffset` is just a single long for the entire postings list?

Thanks for the PR improving the docs -- I'll review.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mikemccand commented on pull request #12541: Document why we need `lastPosBlockOffset`
mikemccand commented on PR #12541: URL: https://github.com/apache/lucene/pull/12541#issuecomment-1710002412 Oh, it looks like `tidy` is angry -- can you run `./gradlew tidy` @Tony-X? This will re-style your new comment to match the required styling. Thanks!
[GitHub] [lucene] mikemccand commented on issue #12513: Try out a tantivy's term dictionary format
mikemccand commented on issue #12513: URL: https://github.com/apache/lucene/issues/12513#issuecomment-1710036830

I'm poking around trying to understand Tantivy's FST implementation, and found it was forked originally from [this FST implementation](https://github.com/BurntSushi/fst) into this [Tantivy specific version](https://github.com/quickwit-inc/fst) (which seems to have fallen behind merging the upstream changes?).

There is a [wonderful blog post describing it](https://blog.burntsushi.net/transducers/). Now I want to try building a Lucene FST from that giant [Common Crawl corpus](https://commoncrawl.org/) -- 1.6 B URLs!

Some clear initial differences over Lucene's implementation:

* The original fst package (linked above) can build Levenshtein FSTs! Lucene can build Levenshtein Automata, but not FSTs.
* It can also search FSTs using regexps! Lucene can do that w/ Automaton, but not FSTs.
* Generally, the Rust FST implementation does a stronger job unifying Automata and FSTs, whereas in Lucene these are strongly divorced classes despite having clear overlapping functionality.
* Building the FST looks crazy fast compared to Lucene -- I'm really curious how it works :) Specifically, how the suffixes are shared -- this uses tons of RAM in Lucene to ensure precisely minimal FST.
[GitHub] [lucene] mikemccand commented on issue #12513: Try out a tantivy's term dictionary format
mikemccand commented on issue #12513: URL: https://github.com/apache/lucene/issues/12513#issuecomment-1710044325 How the [blog post](https://blog.burntsushi.net/transducers/) models his cat reminds me of how I [modeled the scoring of a single tennis game as an FSA](https://blog.mikemccandless.com/2014/08/scoring-tennis-using-finite-state.html), and uncovered that there is absolutely no difference between `deuce` and `30 all` if you just want to know who won the game! If there are any tennis players reading this, you can save a wee bit of brain state when tracking the score! Just announce `deuce` once you get to `30 all` and never say `30 all` again!
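The `deuce` / `30 all` claim can be checked by brute force. The sketch below is not the FSA from the linked post; it is a small, hypothetical `TennisGame` class that counts points as integers (0 = love, 1 = 15, 2 = 30, 3 = 40) and assumes the usual "at least four points and two clear" rule for winning a game. It compares two starting scores by replaying every point sequence up to a bounded length, so it is a bounded check rather than a proof.

```java
final class TennisGame {
  /** Returns 'A' or 'B' once that player has won the game, or '-' while it is undecided. */
  static char winner(int a, int b) {
    if (a >= 4 && a - b >= 2) return 'A';
    if (b >= 4 && b - a >= 2) return 'B';
    return '-';
  }

  /** Plays the given points ('A' or 'B' per point) from score (a, b); stops once the game is won. */
  static char play(int a, int b, String points) {
    for (int i = 0; i < points.length() && winner(a, b) == '-'; i++) {
      if (points.charAt(i) == 'A') a++; else b++;
    }
    return winner(a, b);
  }

  /** True if both start scores give the same result for every point sequence of up to maxLen points. */
  static boolean sameOutcomes(int a1, int b1, int a2, int b2, int maxLen) {
    for (int len = 0; len <= maxLen; len++) {
      for (int bits = 0; bits < (1 << len); bits++) {
        // Each bit of `bits` picks who wins one point.
        StringBuilder points = new StringBuilder();
        for (int i = 0; i < len; i++) {
          points.append(((bits >> i) & 1) == 0 ? 'A' : 'B');
        }
        String s = points.toString();
        if (play(a1, b1, s) != play(a2, b2, s)) return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // (2, 2) is 30 all, (3, 3) is deuce, (1, 1) is 15 all.
    System.out.println("30 all vs deuce: " + sameOutcomes(2, 2, 3, 3, 12));
    System.out.println("15 all vs deuce: " + sameOutcomes(1, 1, 3, 3, 12));
  }
}
```

Under these assumptions the two states behave identically because, from either one, the game ends the first time a player gets two points ahead. `15 all` does not behave the same way: two straight points end the game from deuce but not from `15 all`.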
[GitHub] [lucene] mikemccand commented on issue #12513: Try out a tantivy's term dictionary format
mikemccand commented on issue #12513: URL: https://github.com/apache/lucene/issues/12513#issuecomment-1710089496

Aha! This is an interesting approach:

```
It is possible to mitigate the onerous memory required by sacrificing guaranteed minimality of the resulting FST. Namely, one can maintain a hash table that is bounded in size. This means that commonly reused states are kept in the hash table while less commonly reused states are evicted. In practice, a hash table with about 10,000 slots achieves a decent compromise and closely approximates minimality in my own unscientific experiments. (The actual implementation does a little better and stores a small LRU cache in each slot, so that if two common but distinct nodes map to the same bucket, they can still be reused.)
```

Lucene can also bound the size of this "suffix hashmap" using the [crazy cryptic `minSuffixCount1`, `minSuffixCount2`, and `sharedMaxTailLength` parameters](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java#L107-L111), but these are a poor (and basically unintelligible!) way to bound the map versus what Tantivy FST does using this LRU style cache.

Yes, it sacrifices truly minimal FST, but in practice it looks like it gets close enough, while massively reducing RAM required during construction.

I'll open a spinoff issue for this...
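To make the quoted idea concrete, here is a minimal sketch of a size-bounded table with a small LRU list in each slot. It is an illustration only: `BoundedNodeCache`, `lookup`, and `insert` are hypothetical names, not Lucene's or the Rust fst crate's API, and a frozen node is represented by a plain `String` key mapped to a `long` address purely for brevity.

```java
final class BoundedNodeCache {
  // keys[slot][0] is the most recently used entry of that slot; the last index is evicted first.
  private final String[][] keys;
  private final long[][] addresses;

  BoundedNodeCache(int slotCount, int entriesPerSlot) {
    keys = new String[slotCount][entriesPerSlot];
    addresses = new long[slotCount][entriesPerSlot];
  }

  private int slot(String key) {
    return Math.floorMod(key.hashCode(), keys.length);
  }

  /** Returns the cached address for an equivalent node, or -1 if it is not (or no longer) cached. */
  long lookup(String key) {
    int s = slot(key);
    for (int i = 0; i < keys[s].length; i++) {
      if (key.equals(keys[s][i])) {
        long address = addresses[s][i];
        placeAtFront(s, i, key, address); // refresh recency on a hit
        return address;
      }
    }
    return -1;
  }

  /** Caches a newly frozen node; call only after lookup() missed. Evicts the slot's oldest entry. */
  void insert(String key, long address) {
    int s = slot(key);
    placeAtFront(s, keys[s].length - 1, key, address);
  }

  // Shifts entries [0, upTo) one position right (overwriting index upTo), then writes index 0.
  private void placeAtFront(int s, int upTo, String key, long address) {
    System.arraycopy(keys[s], 0, keys[s], 1, upTo);
    System.arraycopy(addresses[s], 0, addresses[s], 1, upTo);
    keys[s][0] = key;
    addresses[s][0] = address;
  }
}
```

Memory is fixed at `slotCount * entriesPerSlot` entries no matter how many nodes are frozen. A miss on an evicted node only means a duplicate node gets written, so the result stays correct but may not be minimal.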
[GitHub] [lucene] mikemccand commented on issue #12513: Try out a tantivy's term dictionary format
mikemccand commented on issue #12513: URL: https://github.com/apache/lucene/issues/12513#issuecomment-1710091688

The next paragraph in the blog is also very interesting!

```
An interesting consequence of using a bounded hash table which only stores some of the states is that construction of an FST can be streamed to a file on disk. Namely, when states are frozen as described in the previous two sections, there’s no reason to keep all of them in memory. Instead, we can immediately write them to disk (or a socket, or whatever).
```

Since Lucene's FSTs are now "off-heap", we could take a similar approach. I think this is lower priority, but I'll open a spinoff issue for it too...
[GitHub] [lucene] mikemccand opened a new issue, #12543: `FSTCompiler.Builder` should have an option to stream the FST bytes directly to Directory
mikemccand opened a new issue, #12543: URL: https://github.com/apache/lucene/issues/12543

### Description

[Spinoff from [this comment](https://github.com/apache/lucene/issues/12513#issuecomment-1710091688) inspired by Tantivy's FST implementation]

The building of an FST is inherently streamable: the way the FST freezes states as it processes inputs is a write-once, roughly append-only operation.

Today, Lucene holds this growing `byte[]` entirely in RAM, and once done, writes the whole thing to disk. Yet at search time, Lucene searches the FST off-heap, doing nearly random backwards IO through `IndexInput`.

Let's fix Lucene to stream the FST `byte[]` directly to `IndexOutput`? This would reduce the RAM required to build so that it is constant regardless of how large an FST you are building / how many input/output pairs you are adding to it.
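A rough sketch of the append-only idea, using a plain `java.io.OutputStream` as a stand-in for Lucene's `IndexOutput`. `StreamingNodeWriter` and its methods are hypothetical names, not an existing Lucene class. Each frozen node is written out immediately, and the writer keeps only a running byte count, from which it hands back the node's start address.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

final class StreamingNodeWriter {
  private final OutputStream out; // stand-in for an IndexOutput; the caller owns and closes it
  private long bytesWritten;      // the only state that grows is this counter

  StreamingNodeWriter(OutputStream out) {
    this.out = out;
  }

  /** Appends one frozen node's bytes and returns the offset at which the node starts. */
  long writeFrozenNode(byte[] nodeBytes) {
    long address = bytesWritten;
    try {
      out.write(nodeBytes);
    } catch (IOException e) {
      // Rethrown unchecked only to keep this sketch short.
      throw new UncheckedIOException(e);
    }
    bytesWritten += nodeBytes.length;
    return address;
  }

  /** Total number of bytes appended so far. */
  long bytesWritten() {
    return bytesWritten;
  }
}
```

The writer never reads back what it has written. Anything that still needs the bytes of an earlier node, such as exact suffix deduplication, would need its own bounded cache. That is an assumption of this sketch, not something the issue specifies.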
[GitHub] [lucene] mikemccand commented on issue #12527: Optimize readInts24 performance for DocIdsWriter
mikemccand commented on issue #12527: URL: https://github.com/apache/lucene/issues/12527#issuecomment-1710234616

OK I tested the "read into scratch array" approach from [this comment](https://github.com/apache/lucene/issues/12527#issuecomment-1708857931):

```
Task                        QPS base   StdDev QPS readLongs   StdDev  Pct diff                 p-value
IntNRQ                        676.26   (3.7%)        607.60   (3.2%)    -10.2% ( -16% -   -3%) 0.000
BrowseDayOfYearSSDVFacets      13.19  (12.2%)         12.70  (10.2%)     -3.7% ( -23% -   21%) 0.296
HighTermDayOfYearSort         676.54   (1.4%)        653.90   (1.2%)     -3.3% (  -5% -    0%) 0.000
TermDTSort                    291.54   (1.6%)        285.07   (1.4%)     -2.2% (  -5% -    0%) 0.000
BrowseDayOfYearTaxoFacets       8.47   (8.6%)          8.32   (5.0%)     -1.8% ( -14% -   12%) 0.417
BrowseDateTaxoFacets            8.50   (8.4%)          8.35   (5.1%)     -1.7% ( -14% -   12%) 0.428
HighSloppyPhrase               27.18   (2.5%)         26.90   (3.3%)     -1.0% (  -6% -    4%) 0.259
MedPhrase                      12.74   (7.0%)         12.65   (7.2%)     -0.7% ( -13% -   14%) 0.747
LowTerm                       820.76   (3.4%)        815.74   (3.5%)     -0.6% (  -7% -    6%) 0.574
HighTermTitleBDVSort            5.39   (3.5%)          5.35   (2.5%)     -0.6% (  -6% -    5%) 0.557
BrowseRandomLabelSSDVFacets     9.18   (5.4%)          9.13   (5.8%)     -0.6% ( -11% -   11%) 0.753
HighPhrase                     10.16   (4.9%)         10.11   (4.9%)     -0.5% (  -9% -    9%) 0.739
OrHighHigh                     26.04   (5.6%)         25.91   (4.4%)     -0.5% (  -9% -   10%) 0.745
LowPhrase                      13.05   (3.4%)         12.98   (3.4%)     -0.5% (  -7% -    6%) 0.650
BrowseRandomLabelTaxoFacets     7.68   (4.8%)          7.64   (3.1%)     -0.5% (  -8% -    7%) 0.703
BrowseDateSSDVFacets            2.19   (1.1%)          2.18   (1.4%)     -0.4% (  -2% -    2%) 0.323
Prefix3                       206.91   (5.3%)        206.16   (5.7%)     -0.4% ( -10% -   11%) 0.835
AndHighLow                    700.11   (2.1%)        697.60   (1.6%)     -0.4% (  -3% -    3%) 0.545
LowSloppyPhrase                71.11   (1.9%)         70.87   (2.2%)     -0.3% (  -4% -    3%) 0.599
OrNotHighLow                  666.63   (1.5%)        664.60   (1.6%)     -0.3% (  -3% -    2%) 0.539
MedSloppyPhrase                51.20   (4.6%)         51.07   (5.1%)     -0.3% (  -9% -    9%) 0.869
OrHighNotHigh                 356.18   (4.7%)        355.40   (4.7%)     -0.2% (  -9% -    9%) 0.882
HighTermMonthSort            3590.99   (1.1%)       3584.45   (0.9%)     -0.2% (  -2% -    1%) 0.576
```
[GitHub] [lucene] mikemccand commented on pull request #12535: LockVerifyServer does not need to reuse addresses nor set accept timeout
mikemccand commented on PR #12535: URL: https://github.com/apache/lucene/pull/12535#issuecomment-1710238828 Thanks @uschindler and @rmuir -- I will restore the 30s timeout, try to improve logging on client and server errors (so we can figure out WTF happened that caused client not to connect in 30s window in future build failures), increase client connect timeout from 500ms -> 3s.
[GitHub] [lucene] jimczi commented on pull request #12529: Introduce a random vector scorer in HNSW builder/searcher
jimczi commented on PR #12529: URL: https://github.com/apache/lucene/pull/12529#issuecomment-1710241290 I merged with the latest changes in main; the new random vector scorer integrates nicely with the changes added in `https://github.com/apache/lucene/pull/12480`. The only difference is that the scorer is now exposed in the API.
[GitHub] [lucene] mikemccand commented on pull request #12535: LockVerifyServer does not need to reuse addresses nor set accept timeout
mikemccand commented on PR #12535: URL: https://github.com/apache/lucene/pull/12535#issuecomment-1710320639

OK I made the changes! I also manually tested two failure modes of the clients: taking too long to initially connect, and throwing some sort of exception after testing the lock N times. In both cases the test stderr showed the stderr/out from the clients, hopefully making it easier to test in the future.

I did notice one odd thing: on test failure, I seemed to have a leftover `test.lock` in the root directory of the checkout, which is very odd. The test creates a new temp directory and is supposed to use that directory to create its test lock ... so I'm not yet sure how that happened. Or maybe I am just hallucinating ...
[GitHub] [lucene] mikemccand commented on pull request #12535: LockVerifyServer does not need to reuse addresses nor set accept timeout
mikemccand commented on PR #12535: URL: https://github.com/apache/lucene/pull/12535#issuecomment-1710327041

> I did notice one odd thing: on test failure, I seemed to have a leftover test.lock in the root directory of the checkout, which is very odd. The test creates a new temp directory and is supposed to use that directory to create its test lock ... so I'm not yet sure how that happened. Or maybe I am just hallucinating ...

OK never mind -- I'm pretty sure this was leftover from me manually invoking `LockVerifyServer` and `LockStressTest` myself. Phew.
[GitHub] [lucene] benwtrent commented on pull request #12529: Introduce a random vector scorer in HNSW builder/searcher
benwtrent commented on PR #12529: URL: https://github.com/apache/lucene/pull/12529#issuecomment-1710335794 Well, actually looking at the JFR, I cannot see anything that stands out. The percentages of compute time are still VERY similar when building index & querying. I may just be detecting noise. `lucene_candidate` jfr is this PR, `lucene_baseline` jfr is latest main. [Archive.zip](https://github.com/apache/lucene/files/12550717/Archive.zip)
[GitHub] [lucene] dweiss commented on issue #12542: Lucene's FST Builder should have a simpler "knob" to trade off memory/CPU required against minimality
dweiss commented on issue #12542: URL: https://github.com/apache/lucene/issues/12542#issuecomment-1710614193 I like it. These options we currently have are not even expert level, they're God-level...
[GitHub] [lucene] jimczi commented on pull request #12529: Introduce a random vector scorer in HNSW builder/searcher
jimczi commented on PR #12529: URL: https://github.com/apache/lucene/pull/12529#issuecomment-1710682508 Thanks for running the benchmarks @benwtrent . I agree that the difference seems to be in the noise.
[GitHub] [lucene] Tony-X commented on pull request #12541: Document why we need `lastPosBlockOffset`
Tony-X commented on PR #12541: URL: https://github.com/apache/lucene/pull/12541#issuecomment-1710736645 @mikemccand sure. Thanks for pointing out the useful `tidy` target :) I got it fixed
[GitHub] [lucene] madrob commented on issue #12542: Lucene's FST Builder should have a simpler "knob" to trade off memory/CPU required against minimality
madrob commented on issue #12542: URL: https://github.com/apache/lucene/issues/12542#issuecomment-1710800611 What's the impact of having a non-minimal FST? Longer query times? Is that something that gets dwarfed by having multiple segments anyway? Maybe different merge policies have different defaults - when using tiered merges we can have some slack and when merging everything down to a single segment we probably should take the time to ensure minimality anyway.
[GitHub] [lucene] elliotzlin commented on pull request #1069: [LUCENE-2587] Highlighter fragment bug
elliotzlin commented on PR #1069: URL: https://github.com/apache/lucene/pull/1069#issuecomment-1711167476 @dsmiley apologies for my delay in getting back to your comment! I don't have any qualms about refactoring to deter people from using this. I took up this ticket more so to get involved with contributing to the Lucene project and found this in the backlog, and less so because I was using the Highlighter in a project.