[GitHub] [lucene] mikemccand commented on issue #12536: Remove `lastPosBlockOffset` from term metadata for Lucene90PostingsFormat

2023-09-07 Thread via GitHub


mikemccand commented on issue #12536:
URL: https://github.com/apache/lucene/issues/12536#issuecomment-170145

   > In theory, if the skipper can tell us how many positions it has skipped 
that would work. This will require storing more information in the skip data 
than the current scheme.
   
   OK, and it seems like that would not be a good tradeoff?  Every skip entry 
would need to record how many total positions were skipped, whereas 
`lastPosBlockOffset` is just a single long for the entire postings list?  
Thanks for the PR improving the docs -- I'll review.
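
To put rough numbers on that tradeoff, here is a back-of-envelope sketch in Python. It assumes a multi-level skip list with one level-0 entry per 128-doc block and a fan-out of 8 between levels. Those constants and the `skip_entries` helper are illustrative assumptions, not values taken from this thread.

```python
def skip_entries(doc_freq, block_size=128, multiplier=8, max_levels=10):
    """Count skip entries across all levels for a term with `doc_freq` docs.

    block_size, multiplier and max_levels are assumed values for illustration.
    """
    total = 0
    interval = block_size
    for _ in range(max_levels):
        entries = doc_freq // interval
        if entries == 0:
            break
        total += entries
        interval *= multiplier
    return total


# Each skip entry would carry one extra counter (at least one byte each),
# versus a single long per term for lastPosBlockOffset.
extra_fields = skip_entries(1_000_000)
print(extra_fields, "extra per-entry counters vs. 1 long per term")
```

Under these assumptions the per-entry cost grows linearly with the length of the postings list, while the single offset stays constant per term.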


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mikemccand commented on pull request #12541: Document why we need `lastPosBlockOffset`

2023-09-07 Thread via GitHub


mikemccand commented on PR #12541:
URL: https://github.com/apache/lucene/pull/12541#issuecomment-1710002412

   Oh, it looks like `tidy` is angry -- can you run `./gradlew tidy` @Tony-X?  
This will re-style your new comment to match the required styling.  Thanks!





[GitHub] [lucene] mikemccand commented on issue #12513: Try out a tantivy's term dictionary format

2023-09-07 Thread via GitHub


mikemccand commented on issue #12513:
URL: https://github.com/apache/lucene/issues/12513#issuecomment-1710036830

   I'm poking around trying to understand Tantivy's FST implementation, and 
found it was forked originally from [this FST 
implementation](https://github.com/BurntSushi/fst) into this [Tantivy specific 
version](https://github.com/quickwit-inc/fst) (which seems to have fallen 
behind merging the upstream changes?).
   
   There is a [wonderful blog post describing 
it](https://blog.burntsushi.net/transducers/).  Now I want to try building a 
Lucene FST from that giant [Common Crawl corpus](https://commoncrawl.org/) -- 
1.6 B URLs!
   
   Some clear initial differences from Lucene's implementation:
 * The original fst package (linked above) can build Levenshtein FSTs!  
Lucene can build Levenshtein Automata, but not FSTs.
 * It can also search FSTs using regexps!  Lucene can do that w/ Automaton, 
but not FSTs.
 * Generally, the Rust FST implementation does a stronger job unifying 
Automata and FSTs, whereas in Lucene these are strongly divorced classes 
despite having clearly overlapping functionality.
 * Building the FST looks crazy fast compared to Lucene -- I'm really 
curious how it works :)  Specifically, how the suffixes are shared -- this uses 
tons of RAM in Lucene to ensure a precisely minimal FST.





[GitHub] [lucene] mikemccand commented on issue #12513: Try out a tantivy's term dictionary format

2023-09-07 Thread via GitHub


mikemccand commented on issue #12513:
URL: https://github.com/apache/lucene/issues/12513#issuecomment-1710044325

   How the author of the [blog post](https://blog.burntsushi.net/transducers/) 
models his cat reminds me of how I [modeled the scoring of a single tennis game 
as an 
FSA](https://blog.mikemccandless.com/2014/08/scoring-tennis-using-finite-state.html),
 and uncovered that there is absolutely no difference between `deuce` and `30 
all` if you just want to know who won the game!
   
   If there are any tennis players reading this, you can save a wee bit of 
brain state when tracking the score!  Just announce `deuce` once you get to `30 
all` and never say `30 all` again!
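
The `deuce` / `30 all` equivalence can be checked mechanically. The sketch below is a hypothetical Python model of one tennis game as a finite-state automaton (it is not the code from the linked post): states are point counts, and a Moore-style partition refinement groups states that lead to the same winner for every possible sequence of future points.

```python
def step(state, scorer):
    """Next state after `scorer` ("A" or "B") wins a point."""
    if state in ("A", "B"):            # a finished game absorbs further input
        return state
    a, b = state
    if scorer == "A":
        a += 1
    else:
        b += 1
    if a >= 4 and a - b >= 2:
        return "A"
    if b >= 4 and b - a >= 2:
        return "B"
    if a >= 3 and b >= 3:              # fold long rallies back onto deuce/advantage
        return (3, 3) if a == b else ((4, 3) if a > b else (3, 4))
    return (a, b)


def reachable(start=(0, 0)):
    """All states reachable from love-all."""
    seen, todo = {start}, [start]
    while todo:
        state = todo.pop()
        for scorer in "AB":
            nxt = step(state, scorer)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


def equivalence_classes(states):
    """Refine a partition until states in one class are indistinguishable."""
    label = {s: {"A": 1, "B": 2}.get(s, 0) for s in states}
    while True:
        ids = {}
        refined = {}
        for s in states:
            signature = (label[s], label[step(s, "A")], label[step(s, "B")])
            refined[s] = ids.setdefault(signature, len(ids))
        if len(ids) == len(set(label.values())):   # no class was split: stable
            return refined
        label = refined


states = reachable()
classes = equivalence_classes(states)
print("30 all == deuce:", classes[(2, 2)] == classes[(3, 3)])
print("40-30 == advantage:", classes[(3, 2)] == classes[(4, 3)])
```

Here `(2, 2)` stands for `30 all` and `(3, 3)` for `deuce`. The same refinement also merges `40-30` with advantage, which is why the minimized automaton is smaller than the naive one.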





[GitHub] [lucene] mikemccand commented on issue #12513: Try out a tantivy's term dictionary format

2023-09-07 Thread via GitHub


mikemccand commented on issue #12513:
URL: https://github.com/apache/lucene/issues/12513#issuecomment-1710089496

   Aha!  This is an interesting approach:
   
   ```
   It is possible to mitigate the onerous memory required by sacrificing
   guaranteed minimality of the resulting FST. Namely, one can maintain a
   hash table that is bounded in size. This means that commonly reused
   states are kept in the hash table while less commonly reused states
   are evicted. In practice, a hash table with about 10,000 slots
   achieves a decent compromise and closely approximates minimality in my
   own unscientific experiments. (The actual implementation does a little
   better and stores a small LRU cache in each slot, so that if two
   common but distinct nodes map to the same bucket, they can still be
   reused.)
   ```
   
   Lucene can also bound the size of this "suffix hashmap" using the [crazy 
cryptic `minSuffixCount1`, `minSuffixCount2`, and `sharedMaxTailLength` 
parameters](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java#L107-L111),
 but these are a poor (and basically unintelligible!) way to bound the map 
versus what Tantivy FST does using this LRU style cache.  Yes, it sacrifices a 
truly minimal FST, but in practice it looks like it gets close enough, while 
massively reducing the RAM required during construction.  I'll open a spinoff 
issue for this...
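
The bounded-cache idea quoted above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not Lucene's `FSTCompiler` or the Rust `fst` crate: it builds a plain acceptor (no outputs) from sorted keys, and uses one global LRU map rather than a hash table with a small LRU per slot. The names `BoundedSuffixCacheBuilder`, `build` and `accepts` are hypothetical.

```python
from collections import OrderedDict


class BoundedSuffixCacheBuilder:
    """Builds an acyclic word automaton from sorted keys (illustrative only)."""

    def __init__(self, max_cache_entries):
        self.max_cache_entries = max_cache_entries
        self.cache = OrderedDict()        # node signature -> frozen node id, in LRU order
        self.frozen = []                  # frozen nodes: (is_final, ((label, target), ...))
        self.stack = [[None, False, {}]]  # unfrozen path: [incoming label, is_final, edges]
        self.prev = ""

    def _freeze(self, node):
        _, is_final, edges = node
        signature = (is_final, tuple(sorted(edges.items())))
        hit = self.cache.get(signature)
        if hit is not None:               # reuse an identical, already frozen suffix
            self.cache.move_to_end(signature)
            return hit
        self.frozen.append(signature)
        node_id = len(self.frozen) - 1
        if self.max_cache_entries > 0:
            self.cache[signature] = node_id
            if len(self.cache) > self.max_cache_entries:
                self.cache.popitem(last=False)   # evict the least recently used entry
        return node_id

    def _freeze_tail(self, keep):
        # Freeze every node deeper than the prefix shared with the next key.
        while len(self.stack) > keep + 1:
            node = self.stack.pop()
            self.stack[-1][2][node[0]] = self._freeze(node)

    def add(self, key):
        if key < self.prev:
            raise ValueError("keys must be added in sorted order")
        shared = 0
        while shared < min(len(key), len(self.prev)) and key[shared] == self.prev[shared]:
            shared += 1
        self._freeze_tail(shared)
        for ch in key[shared:]:
            self.stack.append([ch, False, {}])
        self.stack[-1][1] = True
        self.prev = key

    def finish(self):
        self._freeze_tail(0)
        return self._freeze(self.stack.pop())    # id of the root node


def build(keys, max_cache_entries):
    builder = BoundedSuffixCacheBuilder(max_cache_entries)
    for key in keys:
        builder.add(key)
    root = builder.finish()
    return builder.frozen, root


def accepts(frozen, root, key):
    node = root
    for ch in key:
        edges = dict(frozen[node][1])
        if ch not in edges:
            return False
        node = edges[ch]
    return frozen[node][0]


keys = ["bake", "cake", "lake", "make"]
minimal, minimal_root = build(keys, max_cache_entries=10_000)
unshared, unshared_root = build(keys, max_cache_entries=0)
print(len(minimal), "nodes with a large cache,", len(unshared), "nodes with no cache")
```

A cache miss only costs a duplicate node, so the automaton accepts the same keys at any cache size. The bound trades size of the result against RAM used while building.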





[GitHub] [lucene] mikemccand commented on issue #12513: Try out a tantivy's term dictionary format

2023-09-07 Thread via GitHub


mikemccand commented on issue #12513:
URL: https://github.com/apache/lucene/issues/12513#issuecomment-1710091688

   The next paragraph in the blog is also very interesting!
   
   ```
   An interesting consequence of using a bounded hash table which only
   stores some of the states is that construction of an FST can be
   streamed to a file on disk. Namely, when states are frozen as
   described in the previous two sections, there’s no reason to keep all
   of them in memory. Instead, we can immediately write them to disk (or
   a socket, or whatever).
   ```
   
   Since Lucene's FSTs are now "off-heap", we could take a similar approach.  I 
think this is lower priority, but I'll open a spinoff issue for it too...





[GitHub] [lucene] mikemccand opened a new issue, #12543: `FSTCompiler.Builder` should have an option to stream the FST bytes directly to Directory

2023-09-07 Thread via GitHub


mikemccand opened a new issue, #12543:
URL: https://github.com/apache/lucene/issues/12543

   ### Description
   
   [Spinoff from [this 
comment](https://github.com/apache/lucene/issues/12513#issuecomment-1710091688) 
inspired by Tantivy's FST implementation]
   
   The building of an FST is inherently streamable: the way the FST freezes 
states as it processes inputs is a write-once, roughly append-only operation.  
Today, Lucene holds this growing `byte[]` entirely in RAM, and once done, 
writes the whole thing to disk.
   
   Yet at search time, Lucene searches the FST off-heap, doing nearly random 
backwards IO through `IndexInput`.
   
   Let's fix Lucene to stream the FST `byte[]` directly to `IndexOutput`?  This 
would reduce the RAM required to build so that it is constant regardless of how 
large an FST you are building / how many input/output pairs you are adding to 
it.
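
A minimal sketch of the append-only idea, in Python rather than Lucene's Java: nodes are written to the output the moment they are frozen, child addresses are byte offsets that are already known, and the root lands last, so a reader starts at the end and follows pointers backwards. Everything here is an assumption made for illustration (the `StreamingTrieWriter` name, the byte layout, `io.BytesIO` standing in for `IndexOutput`). It writes a plain trie with no suffix sharing, so RAM is bounded by the longest key; a real builder would also hold a suffix cache.

```python
import io
import struct


class StreamingTrieWriter:
    """Writes a byte-keyed trie to `out` as nodes freeze (illustrative layout)."""

    def __init__(self, out):
        self.out = out
        self.pos = 0                      # bytes written so far; output is never rewound
        self.stack = [[None, False, {}]]  # unfrozen path: [incoming label, is_final, edges]
        self.prev = b""

    def _write(self, data):
        self.out.write(data)
        self.pos += len(data)

    def _freeze(self, node):
        offset = self.pos
        _, is_final, edges = node
        self._write(struct.pack(">BH", int(is_final), len(edges)))
        for label, target in sorted(edges.items()):
            self._write(struct.pack(">BI", label, target))   # target is an earlier offset
        return offset

    def _freeze_tail(self, keep):
        while len(self.stack) > keep + 1:
            node = self.stack.pop()
            self.stack[-1][2][node[0]] = self._freeze(node)

    def add(self, key):
        if key < self.prev:
            raise ValueError("keys must be added in sorted order")
        shared = 0
        while shared < min(len(key), len(self.prev)) and key[shared] == self.prev[shared]:
            shared += 1
        self._freeze_tail(shared)
        for label in key[shared:]:
            self.stack.append([label, False, {}])
        self.stack[-1][1] = True
        self.prev = key

    def finish(self):
        self._freeze_tail(0)
        root = self._freeze(self.stack.pop())
        self._write(struct.pack(">I", root))   # footer: offset of the root node


def contains(data, key):
    """Look up `key`, starting from the footer and following offsets backwards."""
    (pos,) = struct.unpack_from(">I", data, len(data) - 4)
    for label in key:
        _, count = struct.unpack_from(">BH", data, pos)
        for i in range(count):
            edge_label, target = struct.unpack_from(">BI", data, pos + 3 + 5 * i)
            if edge_label == label:
                pos = target
                break
        else:
            return False
    is_final, _ = struct.unpack_from(">BH", data, pos)
    return bool(is_final)


buf = io.BytesIO()
writer = StreamingTrieWriter(buf)
for key in [b"car", b"cat", b"dog"]:
    writer.add(key)
writer.finish()
data = buf.getvalue()
print(contains(data, b"cat"), contains(data, b"cow"))
```

Because every edge points at a smaller offset than the node that holds it, the writer never needs to seek, which is the property that makes streaming to disk possible.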





[GitHub] [lucene] mikemccand commented on issue #12527: Optimize readInts24 performance for DocIdsWriter

2023-09-07 Thread via GitHub


mikemccand commented on issue #12527:
URL: https://github.com/apache/lucene/issues/12527#issuecomment-1710234616

   OK I tested the "read into scratch array" approach from [this 
comment](https://github.com/apache/lucene/issues/12527#issuecomment-1708857931):
   
   ```
                          Task QPS base   StdDev QPS readLongs   StdDev Pct diff                   p-value
                        IntNRQ   676.26   (3.7%)        607.60   (3.2%)   -10.2% ( -16% -   -3%)   0.000
     BrowseDayOfYearSSDVFacets    13.19  (12.2%)         12.70  (10.2%)    -3.7% ( -23% -   21%)   0.296
         HighTermDayOfYearSort   676.54   (1.4%)        653.90   (1.2%)    -3.3% (  -5% -    0%)   0.000
                    TermDTSort   291.54   (1.6%)        285.07   (1.4%)    -2.2% (  -5% -    0%)   0.000
     BrowseDayOfYearTaxoFacets     8.47   (8.6%)          8.32   (5.0%)    -1.8% ( -14% -   12%)   0.417
          BrowseDateTaxoFacets     8.50   (8.4%)          8.35   (5.1%)    -1.7% ( -14% -   12%)   0.428
              HighSloppyPhrase    27.18   (2.5%)         26.90   (3.3%)    -1.0% (  -6% -    4%)   0.259
                     MedPhrase    12.74   (7.0%)         12.65   (7.2%)    -0.7% ( -13% -   14%)   0.747
                       LowTerm   820.76   (3.4%)        815.74   (3.5%)    -0.6% (  -7% -    6%)   0.574
          HighTermTitleBDVSort     5.39   (3.5%)          5.35   (2.5%)    -0.6% (  -6% -    5%)   0.557
   BrowseRandomLabelSSDVFacets     9.18   (5.4%)          9.13   (5.8%)    -0.6% ( -11% -   11%)   0.753
                    HighPhrase    10.16   (4.9%)         10.11   (4.9%)    -0.5% (  -9% -    9%)   0.739
                    OrHighHigh    26.04   (5.6%)         25.91   (4.4%)    -0.5% (  -9% -   10%)   0.745
                     LowPhrase    13.05   (3.4%)         12.98   (3.4%)    -0.5% (  -7% -    6%)   0.650
   BrowseRandomLabelTaxoFacets     7.68   (4.8%)          7.64   (3.1%)    -0.5% (  -8% -    7%)   0.703
          BrowseDateSSDVFacets     2.19   (1.1%)          2.18   (1.4%)    -0.4% (  -2% -    2%)   0.323
                       Prefix3   206.91   (5.3%)        206.16   (5.7%)    -0.4% ( -10% -   11%)   0.835
                    AndHighLow   700.11   (2.1%)        697.60   (1.6%)    -0.4% (  -3% -    3%)   0.545
               LowSloppyPhrase    71.11   (1.9%)         70.87   (2.2%)    -0.3% (  -4% -    3%)   0.599
                  OrNotHighLow   666.63   (1.5%)        664.60   (1.6%)    -0.3% (  -3% -    2%)   0.539
               MedSloppyPhrase    51.20   (4.6%)         51.07   (5.1%)    -0.3% (  -9% -    9%)   0.869
                 OrHighNotHigh   356.18   (4.7%)        355.40   (4.7%)    -0.2% (  -9% -    9%)   0.882
             HighTermMonthSort  3590.99   (1.1%)       3584.45   (0.9%)    -0.2% (  -2% -    1%)   0.576
   ```
   

[GitHub] [lucene] mikemccand commented on pull request #12535: LockVerifyServer does not need to reuse addresses nor set accept timeout

2023-09-07 Thread via GitHub


mikemccand commented on PR #12535:
URL: https://github.com/apache/lucene/pull/12535#issuecomment-1710238828

   Thanks @uschindler and @rmuir -- I will restore the 30s timeout, try to 
improve logging on client and server errors (so we can figure out WTF happened 
that caused the client not to connect in the 30s window in future build 
failures), and increase the client connect timeout from 500ms to 3s.





[GitHub] [lucene] jimczi commented on pull request #12529: Introduce a random vector scorer in HNSW builder/searcher

2023-09-07 Thread via GitHub


jimczi commented on PR #12529:
URL: https://github.com/apache/lucene/pull/12529#issuecomment-1710241290

   I merged with the latest changes in main; the new random vector scorer 
integrates nicely with the changes added in 
`https://github.com/apache/lucene/pull/12480`. The only difference is that the 
scorer is now exposed in the API.





[GitHub] [lucene] mikemccand commented on pull request #12535: LockVerifyServer does not need to reuse addresses nor set accept timeout

2023-09-07 Thread via GitHub


mikemccand commented on PR #12535:
URL: https://github.com/apache/lucene/pull/12535#issuecomment-1710320639

   OK I made the changes!
   
   I also manually tested two failure modes of the clients: taking too long to 
initially connect, and throwing some sort of exception after testing the lock N 
times.  In both cases the test stderr showed the stderr/out from the clients, 
hopefully making it easier to test in the future.
   
   I did notice one odd thing: on test failure, I seemed to have a leftover 
`test.lock` in the root directory of the checkout, which is very odd.  The test 
creates a new temp directory and is supposed to use that directory to create 
its test lock ... so I'm not yet sure how that happened.  Or maybe I am just 
hallucinating ...





[GitHub] [lucene] mikemccand commented on pull request #12535: LockVerifyServer does not need to reuse addresses nor set accept timeout

2023-09-07 Thread via GitHub


mikemccand commented on PR #12535:
URL: https://github.com/apache/lucene/pull/12535#issuecomment-1710327041

   > I did notice one odd thing: on test failure, I seemed to have a leftover 
test.lock in the root directory of the checkout, which is very odd. The test 
creates a new temp directory and is supposed to use that directory to create 
its test lock ... so I'm not yet sure how that happened. Or maybe I am just 
hallucinating ...
   
   OK never mind -- I'm pretty sure this was left over from me manually invoking 
`LockVerifyServer` and `LockStressTest` myself.  Phew.





[GitHub] [lucene] benwtrent commented on pull request #12529: Introduce a random vector scorer in HNSW builder/searcher

2023-09-07 Thread via GitHub


benwtrent commented on PR #12529:
URL: https://github.com/apache/lucene/pull/12529#issuecomment-1710335794

   Well, actually looking at the JFR, I cannot see anything that stands out. 
The percentages of compute time are still VERY similar when building index & 
querying. I may just be detecting noise.
   
   `lucene_candidate` jfr is this PR, `lucene_baseline` jfr is latest main.
   [Archive.zip](https://github.com/apache/lucene/files/12550717/Archive.zip)
   





[GitHub] [lucene] dweiss commented on issue #12542: Lucene's FST Builder should have a simpler "knob" to trade off memory/CPU required against minimality

2023-09-07 Thread via GitHub


dweiss commented on issue #12542:
URL: https://github.com/apache/lucene/issues/12542#issuecomment-1710614193

   I like it. These options we currently have are not even expert level, 
they're God-level...





[GitHub] [lucene] jimczi commented on pull request #12529: Introduce a random vector scorer in HNSW builder/searcher

2023-09-07 Thread via GitHub


jimczi commented on PR #12529:
URL: https://github.com/apache/lucene/pull/12529#issuecomment-1710682508

   Thanks for running the benchmarks @benwtrent . I agree that the difference 
seems to be in the noise.





[GitHub] [lucene] Tony-X commented on pull request #12541: Document why we need `lastPosBlockOffset`

2023-09-07 Thread via GitHub


Tony-X commented on PR #12541:
URL: https://github.com/apache/lucene/pull/12541#issuecomment-1710736645

   @mikemccand sure. Thanks for pointing out the useful `tidy` target :) I got 
it fixed 





[GitHub] [lucene] madrob commented on issue #12542: Lucene's FST Builder should have a simpler "knob" to trade off memory/CPU required against minimality

2023-09-07 Thread via GitHub


madrob commented on issue #12542:
URL: https://github.com/apache/lucene/issues/12542#issuecomment-1710800611

   What's the impact of having a non-minimal FST? Longer query times? Is that 
something that gets dwarfed by having multiple segments anyway? Maybe different 
merge policies have different defaults - when using tiered merges we can have 
some slack and when merging everything down to a single segment we probably 
should take the time to ensure minimality anyway.





[GitHub] [lucene] elliotzlin commented on pull request #1069: [LUCENE-2587] Highlighter fragment bug

2023-09-07 Thread via GitHub


elliotzlin commented on PR #1069:
URL: https://github.com/apache/lucene/pull/1069#issuecomment-1711167476

   @dsmiley apologies for my delay in getting back to your comment! I don't 
have any qualms about refactoring to deter people from using this. I took up 
this ticket more so to get involved with contributing to the Lucene project and 
found this in the backlog, and less so because I was using the Highlighter in a 
project.

