mikemccand closed issue #12696: Adding option to codec to disable patching in
Lucene's PFOR encoding
URL: https://github.com/apache/lucene/issues/12696
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
slow-J commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1790638953
> Another exciting optimization such a "patch-less" encoding could implement
is within-block skipping (I believe Tantivy does this).
>
> Today, our skipper is forced to align to
slow-J commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1787521225
Thanks all for the feedback. Will proceed with removing patching only for
doc blocks (reverting some of https://github.com/apache/lucene/pull/69)
All the changes needed to crea
jpountz commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1786969801
> Normally the IntNRQ (1D points numeric range query) is very noisy, but
maybe this gain is real? p-value seems to think it could be close to real?
I'm not sure how it could n
mikemccand commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1786949852
Thanks for testing @jpountz.
I think at some point we also enabled patching for the freq blocks inside
`.doc` file too?
Normally the `IntNRQ` (1D points numeric range query) is very noisy, but
maybe this gain is real? p-value seems to think it could be close to real?
jpountz commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1782814872
FWIW I could reproduce the speedup from disabling patching locally on
wikibigall:
```
                    TaskQPS baseline      StdDevQPS my_modified_version
```
jpountz commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1779221543
For reference, Lucene used to use FOR for postings and PFOR for positions in
8.x. This was changed in 9.0 via #69 to use PFOR for both postings and
positions. This PR says it made t
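For a concrete feel for the FOR vs. PFOR trade-off, here is a simplified sketch (hypothetical class and helper names, not Lucene's actual `ForUtil`/`PForUtil`) of the bits-per-value each scheme needs for one block of deltas:

```java
import java.util.Arrays;

public class ForVsPfor {
    // Number of bits needed to represent v (at least 1).
    static int bitsRequired(long v) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(v));
    }

    public static void main(String[] args) {
        // 128 doc-id deltas: mostly small, with a couple of large outliers.
        long[] deltas = new long[128];
        Arrays.fill(deltas, 7);   // fits in 3 bits
        deltas[17] = 70_000;      // outlier needs 17 bits
        deltas[90] = 70_000;

        // FOR: every value is packed with the width of the largest value.
        int forBits = 0;
        for (long d : deltas) forBits = Math.max(forBits, bitsRequired(d));

        // PFOR (simplified): pick a smaller width covering ~98% of the values
        // and store the few outliers separately as "patches".
        long[] sorted = deltas.clone();
        Arrays.sort(sorted);
        int pforBits = bitsRequired(sorted[125]); // ignore the top ~2%

        System.out.println("FOR  bits/value: " + forBits);  // 17
        System.out.println("PFOR bits/value: " + pforBits); // 3 (+ patch storage)
    }
}
```

With outliers, FOR pays 17 bits for all 128 values while PFOR pays 3 bits plus a small patch list, which is where the ~4% disk savings discussed below comes from.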
Tony-X commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1775993940
> would the goal here be to eliminate overhead of having to read the number
of patches when decoding each block?
Yes. This means we could know upfront at segment opening time w
gsmiller commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1775871779
> Maybe write something in the index header to indicate if patching is there
(default to yes in 9.x). Then new indexes will write an additional header to
indicate there is no patching
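The header-flag idea could be sketched roughly like this (a hypothetical illustration; the `VERSION_*` constants and `hasPatching` are made-up names, not Lucene's actual codec versioning):

```java
public class PostingsHeader {
    static final int VERSION_PATCHED = 1;   // 9.x segments: PFOR with patches
    static final int VERSION_PATCHLESS = 2; // new segments: plain FOR, no patches

    // Decide once, at segment open time, which decode path to use.
    static boolean hasPatching(int headerVersion) {
        // Old segments default to "patched"; only new segments opt out.
        return headerVersion < VERSION_PATCHLESS;
    }
}
```

The point is that the reader makes this decision once per segment from the header version, rather than re-checking a patch count on every block decode.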
Tony-X commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1775807115
> In 11.0, remove all patching logic which will, a) simplify the code a bit,
and b) remove the (likely minor) overhead on read of looking up the number of
patches in a block, which i
gsmiller commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1775725064
> +1. I recalled that @gsmiller was playing with some SIMD algos for
decoding blocks of delta-encoded ints. Even if that is fruitful it'd be tricky
to apply it because of the patch
msokolov commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1775716306
> Hmm, can you elaborate how it can be fully backwards-compatible with the
indexes that have patching?
I think the idea is that because we always maintain readers that can
gsmiller commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1775717147
I like the idea of removing the complexity associated with patching if we're
convinced it's the right trade-off (and +1 to the pain of vectorizing with
patching going away).
Tony-X commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1775698779
> It is a lot of complexity, especially to vectorize.
+1. I recalled that @gsmiller was playing with some SIMD algos for decoding
blocks of delta-encoded ints. Even if that is fruitful it'd be tricky to
apply it because of the patch
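A minimal sketch of the point (hypothetical code, not Lucene's actual decoder): the patch-free prefix-sum loop is uniform across the whole block, which is what makes it SIMD-friendly:

```java
public class DeltaDecode {
    /** Turn a block of deltas into absolute doc IDs (prefix sum). */
    static void decodeDeltas(long[] deltas, long base, long[] docIds) {
        long acc = base;
        for (int i = 0; i < deltas.length; i++) {
            acc += deltas[i]; // uniform dependency chain, no per-value branches
            docIds[i] = acc;
        }
    }
    // With patching, something like this (hypothetical) must run beforehand:
    //   for (Patch p : patches) deltas[p.index] |= p.highBits << bitsPerValue;
    // i.e. scattered, data-dependent writes that defeat straight-line
    // vectorization of the block.
}
```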
mikemccand commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1775638986
> Are there any additional corpora that we should also test this with?
Maybe the NYC taxis? This is more sparse, with tiny docs (vs dense and
medium/large docs in `enwiki`).
slow-J commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1775353027
If we want to remove the patching entirely, which Lucene version (and which
Codec) should we implement this in? Would this be a potential change for Lucene
9.9 or perhaps 10.0?
mikemccand commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1774236604
> Should we just do more tests and start writing indexes without patching?
Only a 4 percent disk savings? It is a lot of complexity, especially to
vectorize. A runtime option is
rmuir commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1773935712
Should we just do more tests and start writing indexes without patching?
Only a 4 percent disk savings? It is a lot of complexity, especially to
vectorize. A runtime option is more ex
slow-J commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1771275256
>Did you turn off patching for all encoded int[] blocks (docs, freqs,
positions)?
Yes, I think so. All uses of `pforUtil` in the postingsReader and writer
were replaced with t
mikemccand commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1770461719
Another exciting optimization such a "patch-less" encoding could implement
is within-block skipping (I believe Tantivy does this).
Today, our skipper is forced to align to
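Within-block skipping is plausible precisely because a patch-free block has one fixed bit width, so the i-th value's position is a pure function of i. A hypothetical sketch (`pack`/`valueAt` are illustrative names, not Tantivy's or Lucene's actual code):

```java
public class WithinBlockSkip {
    /** Pack values little-endian into longs at a fixed bit width (< 64). */
    static long[] pack(long[] values, int bitsPerValue) {
        long[] out = new long[(values.length * bitsPerValue + 63) / 64];
        for (int i = 0; i < values.length; i++) {
            long start = (long) i * bitsPerValue;
            int word = (int) (start >>> 6), shift = (int) (start & 63);
            out[word] |= values[i] << shift;
            if (64 - shift < bitsPerValue) { // value straddles two longs
                out[word + 1] |= values[i] >>> (64 - shift);
            }
        }
        return out;
    }

    /** Read the i-th value directly, without decoding the block prefix. */
    static long valueAt(long[] packed, int index, int bitsPerValue) {
        long start = (long) index * bitsPerValue; // offset is a pure function of index
        int word = (int) (start >>> 6), shift = (int) (start & 63);
        long mask = (1L << bitsPerValue) - 1;
        long v = (packed[word] >>> shift) & mask;
        int read = 64 - shift;
        if (read < bitsPerValue) { // value straddles two longs
            v |= (packed[word + 1] << read) & mask;
        }
        return v;
    }
}
```

With patches, `valueAt` would be wrong for any patched index unless the patch list were consulted on every read, which is exactly the alignment constraint the comment describes.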
mikemccand commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1770457453
> Posting results below
The results are impressive! Conjunctive (-like) queries see sizable gains.
Did you turn off patching for all encoded `int[]` blocks (docs, freqs,
positions)?
mikemccand commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1770452771
That's a neat idea (separate codec that trades off index size for faster
search performance). Maybe it could also fold in the [fully in RAM FST term
dictionary](https://github.c
gsmiller commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1769095158
These results are really interesting! As another option, I wonder if it's
worth thinking about this problem as a new codec (sandbox module to start?)
that biases towards query speed
slow-J opened a new issue, #12696:
URL: https://github.com/apache/lucene/issues/12696
### Description
Background: In https://github.com/Tony-X/search-benchmark-game we were
comparing performance of Tantivy and Lucene. "One difference between Lucene and
Tantivy is Lucene uses the "pat