[GitHub] [lucene] Tony-X opened a new issue, #12513: Try out a simpler term dictionary

via GitHub Thu, 17 Aug 2023 11:59:37 -0700


Tony-X opened a new issue, #12513:
URL: https://github.com/apache/lucene/issues/12513

### Description

Hello!

I've been working on a
[benchmark](https://tony-x.github.io/search-benchmark-game/) for a while to
compare the features and performance of Lucene and
[Tantivy](https://github.com/quickwit-oss/tantivy), a Rust search engine
library that was heavily inspired by Lucene.

The benchmark uses the corpus and queries from luceneutil (the framework behind
Lucene's nightly benchmarks). Since Tantivy does not support all query types,
the benchmark currently focuses on Term/Boolean/PhraseQuery. So far Tantivy has
generally shown performance advantages, and I got motivated to understand why.

I documented the two engines' inverted index implementations as I understand
them. Here is the
[wiki](https://github.com/Tony-X/search-benchmark-game/wiki/Inverted-index-deep-dive).
Specifically, both engines use an FST to aid term lookup, but the ways they use
it are quite different. In summary, Lucene uses the FST to map term prefixes to
on-disk blocks of terms, and then scans the matching block. Tantivy uses the FST
to map every term to its ordinal and uses that ordinal as an index to decode at
most one full block.
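
To make the ordinal-based lookup concrete, here is a minimal toy sketch in Python. It is not Tantivy's code: a plain `dict` stands in for the FST, the per-term metadata is reduced to a single document frequency, and `BLOCK_SIZE`, `build`, and `lookup` are hypothetical names and values chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple

BLOCK_SIZE = 4  # hypothetical block size, kept tiny for illustration


def build(doc_freqs: Dict[str, int]) -> Tuple[Dict[str, int], List[List[int]]]:
    """Build a toy term dictionary from a term -> doc frequency mapping."""
    sorted_terms = sorted(doc_freqs)
    # Stand-in for the FST: every term maps to its ordinal in sorted order.
    term_to_ord = {term: i for i, term in enumerate(sorted_terms)}
    # Term metadata lives in fixed-size blocks addressed by ordinal.
    infos = [doc_freqs[term] for term in sorted_terms]
    blocks = [infos[i:i + BLOCK_SIZE] for i in range(0, len(infos), BLOCK_SIZE)]
    return term_to_ord, blocks


def lookup(term_to_ord: Dict[str, int], blocks: List[List[int]],
           term: str) -> Optional[int]:
    """Return the doc frequency of `term`, or None if it is absent."""
    ordinal = term_to_ord.get(term)
    if ordinal is None:
        # Absence is decided by the "FST" alone; no block is touched.
        return None
    # The ordinal picks exactly one block and one slot inside it.
    block = blocks[ordinal // BLOCK_SIZE]
    return block[ordinal % BLOCK_SIZE]
```

In this model a hit touches exactly one block and a miss touches none, which is the property the summary above describes.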

The proposal here is to try Tantivy's term dictionary, in which I see some
advantages:
1. It can determine that a term does not exist using only FST operations.
2. It decodes fewer terms in the worst case (a term within a large gap between
two prefixes).
3. It is simpler? (This might be subjective, but it took me days to digest
[Lucene90BlockTreeTermsWriter](https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.html)
and I'm still not sure I got every bit correct...)
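
Points 1 and 2 can be illustrated with a rough toy model in Python that counts how many terms a prefix-then-scan lookup has to decode. This is a deliberate simplification and not Lucene's actual BlockTree layout: the prefix is a fixed length instead of being chosen adaptively, blocks are plain lists, and `build_prefix_blocks` and `prefix_scan` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_prefix_blocks(terms: List[str], prefix_len: int = 1) -> Dict[str, List[str]]:
    """Group sorted terms into blocks keyed by a fixed-length prefix."""
    blocks: Dict[str, List[str]] = defaultdict(list)
    for term in sorted(terms):
        blocks[term[:prefix_len]].append(term)
    return dict(blocks)


def prefix_scan(blocks: Dict[str, List[str]], term: str,
                prefix_len: int = 1) -> Tuple[bool, int]:
    """Return (found, number of terms decoded while scanning)."""
    block = blocks.get(term[:prefix_len])
    if block is None:
        # No block shares the prefix, so nothing is decoded.
        return False, 0
    decoded = 0
    for candidate in block:
        decoded += 1
        if candidate == term:
            return True, decoded
        if candidate > term:
            # Terms are sorted, so the target cannot appear later.
            break
    return False, decoded
```

With the terms `ca`, `cb`, `cc`, `cd`, a lookup for the missing term `cz` walks the whole `c` block before giving up. An index that maps every term to an ordinal would reject `cz` without decoding any block.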

What do you think?

