[GitHub] [lucene] javanna commented on issue #12498: Simplify task executor for concurrent operations
javanna commented on issue #12498: URL: https://github.com/apache/lucene/issues/12498#issuecomment-1681815467 Thanks for the feedback everyone, there's a PR up as a first step (does not include single slice offloading yet): https://github.com/apache/lucene/pull/12499 . Reviews are welcome. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] easyice opened a new pull request, #12512: remove unused variable
easyice opened a new pull request, #12512: URL: https://github.com/apache/lucene/pull/12512 ### Description The variable `scratch2` is never used
[GitHub] [lucene] benwtrent commented on pull request #12512: Remove unused variable in BKDWriter
benwtrent commented on PR #12512: URL: https://github.com/apache/lucene/pull/12512#issuecomment-1682335093 From what I can tell, it stopped being used after this commit: https://github.com/apache/lucene/commit/5d1d6448b91abdebbc68b54b062340c5baa88f4c#diff-12a84802489e3d06f69565f493e8f1b19147e26b1bbb995551a58351ff6f8c92 @iverase does this removal seem cool? I agree, it doesn't look like it's used anywhere.
[GitHub] [lucene] mbrette commented on issue #12440: Make HNSW merges faster
mbrette commented on issue #12440: URL: https://github.com/apache/lucene/issues/12440#issuecomment-1682522530 What is your take on the existing merge optimization #12050? The approach seems very effective (*); however, it does not work if there are deleted documents, which is likely to happen in most usage scenarios. (*) To assess how effective it is, I ran a small unit test using HnswGraphBuilder and OnHeapHnswGraph, simulating a big segment of 1M vectors (128 dims) merging with a smaller segment of 10k vectors. Initializing the HNSW graph from the big segment is 100 times faster than rebuilding the graph, even though initializing the HNSW graph recomputes neighbor scores. A given vector/document will go through many segment merges in its lifetime, so the benefits of this optimization accrue a lot. Caveat: I used random vectors. Otherwise, a naive question: have we considered integrating other, native, libraries (faiss, raft, nmslib...) like what is done in OpenSearch (at a higher abstraction level, though)?
[GitHub] [lucene] benwtrent commented on issue #12440: Make HNSW merges faster
benwtrent commented on issue #12440: URL: https://github.com/apache/lucene/issues/12440#issuecomment-1682640430 > What is your take on existing merge optimization https://github.com/apache/lucene/pull/12050? I think it's a good start. One problem I have is that typical Lucene segment merging attempts to merge segments of equal size together, making "segment tiers" of similar sizes. Maybe an adequate "make merges faster" optimization is a better segment merge policy that takes advantage of the inherent advantages of this optimization. > A given vector/document will go through many segment merges in its lifetime, so the benefits of this optimization accrue a lot. Caveat: I used random vectors. This is true, but it depends on the merge policy and which segments are merged. When merging 10 segments of equal size, this optimization has almost no impact. > have we considered integrating other, native, libraries (faiss, raft, nmslib...) like what is done in OpenSearch (at a higher abstraction level, though)? I am unsure about this. A new codec could be made integrating those native libraries, but they should fit within the Lucene segment model and not use JNI. From what I can tell, those integrations don't do either of those things. Additionally, there shouldn't be any external dependencies (if directly integrated into the Lucene repo). See discussion: https://github.com/apache/lucene/issues/12502 Another option for "making merges faster" is to just provide scalar quantization for users. This will make merges as a whole faster, as the computations required will be much cheaper. It bugs me that we have all this distributed work across segments that just gets ignored. Whether or not this is a native implementation, merging similarly sized HNSW graphs from 9 segments into 1 will still be costly.
[GitHub] [lucene] Tony-X opened a new issue, #12513: Try out a simpler term dictionary
Tony-X opened a new issue, #12513: URL: https://github.com/apache/lucene/issues/12513 ### Description Hello! I've been working on a [benchmark](https://tony-x.github.io/search-benchmark-game/) for a while to compare the features and performance of Lucene and [Tantivy](https://github.com/quickwit-oss/tantivy), a Rust search engine library heavily inspired by Lucene. The benchmark uses the corpus and queries from luceneutil (the framework for Lucene nightly bench). Since not all query types are supported by Tantivy, it currently focuses on Term/Boolean/PhraseQuery. Tantivy has in general shown performance advantages so far, and I was motivated to understand why. I documented the two engines' inverted index implementations per my understanding; here is the [wiki](https://github.com/Tony-X/search-benchmark-game/wiki/Inverted-index-deep-dive). Specifically, both engines use an FST to aid term lookup, but the way they use it is quite different. In summary, Lucene uses the FST to map term prefixes, followed by scanning the on-disk blocks of terms. Tantivy uses the FST to map all the terms to their ordinals, and uses that ordinal/index to decode at most one full block. The proposal here is to try Tantivy's term dictionary, in which I can see some advantages: 1. it can determine that a term does not exist with only FST operations. 2. it decodes fewer terms in the worst case (a term within a large gap between two prefixes). 3. it is simpler? (might be subjective, but it took me days to digest [Lucene90BlockTreeTermsWriter](https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.html) and I'm still not sure I got every bit correct...) What do you think?
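The first advantage listed above, rejecting a missing term with FST operations alone, can be sketched in Rust. This is a hypothetical illustration, not tantivy code: a sorted slice plus binary search stands in for the FST, since both provide the same `term -> ordinal` contract and both detect a missing term without touching any on-disk term block.

```rust
// Hypothetical stand-in for tantivy's term FST: a sorted term list.
// binary_search over sorted terms yields the same contract the FST
// provides: the term's ordinal (its index in sorted order) if present,
// or None -- decided without reading any term block.
fn term_ordinal(sorted_terms: &[&str], term: &str) -> Option<u64> {
    sorted_terms.binary_search(&term).ok().map(|i| i as u64)
}

fn main() {
    let terms = ["apple", "banana", "cherry", "date"];
    // Existing term: ordinal is its index in sorted order.
    assert_eq!(term_ordinal(&terms, "cherry"), Some(2));
    // Missing term: detected up front, no block decoding needed.
    assert_eq!(term_ordinal(&terms, "coconut"), None);
}
```

By contrast, a prefix-based dictionary must scan a block of terms before it can conclude the term is absent.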
[GitHub] [lucene] Tony-X commented on issue #12513: Try out tantivy's term dictionary format
Tony-X commented on issue #12513: URL: https://github.com/apache/lucene/issues/12513#issuecomment-1682810659 Copy-pasting the relevant wiki section about tantivy's TermDictionary

---

## tantivy TermDictionary

Tantivy's term dictionary has two components: an FST that encodes `term -> term_ordinal (u64)`, where the term ordinal is the index of the term in sorted order, and the `TermInfoStore`, which maps the ordinal to the postings metadata. Conceptually, `TermInfoStore` is just a big vector of postings metadata ordered by the term ordinal. Internally, it applies additional compression but still offers random access.

### TermInfoStore encodings

```
│      Metadata Section             │           Blocks
┌───┬───┬──────────┬───┬───┬───────┼─────────┬─────────┬─────────┐
│ 1 │ 2 │ ...   ...│n-1│ n │       │ block 1 │ ... ... │ block n │
└───┴───┴──────────┴───┴───┴───────┼─────────┴─────────┴─────────┘
│      n fixed-size records        │
```

The metadata record contains all the information needed to decode the corresponding block.

```rust
struct TermInfoBlockMeta {
    offset: u64,
    ref_term_info: TermInfo,
    doc_freq_nbits: u8,
    postings_offset_nbits: u8,
    positions_offset_nbits: u8,
}

pub struct TermInfo {
    /// Number of documents in the segment containing the term
    pub doc_freq: u32,
    /// Byte range of the posting list within the postings (`.idx`) file.
    pub postings_range: Range,
    /// Byte range of the positions of this term in the positions (`.pos`) file.
    pub positions_range: Range,
}
```

* offset: the start offset of the data block in the block section
* ref_term_info: the reference `TermInfo` on top of which the delta is applied
* *_nbits: the bit widths used to bit-unpack the doc freq and postings/position offsets in the data block

At search time, both the FST and the `TermInfoStore` are loaded into memory. To search for a term, it first consults the FST to get back the term ordinal, if the term exists. Note that the FST contains all the terms, so the lookup process can tell if a term does not exist.
The term ordinal is used to determine which metadata record to read (ordinal divided by the fixed block size) and the term's relative offset within the data block (ordinal modulo the block size). It then decodes the metadata record, which locates the data block and allows decoding the `TermInfo` (postings metadata).
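The ordinal-to-block arithmetic described above can be sketched as follows. `BLOCK_SIZE` here is an assumed constant for illustration only, not tantivy's actual value.

```rust
// Assumed fixed block size for illustration; tantivy's real constant may differ.
const BLOCK_SIZE: u64 = 256;

/// Maps a term ordinal to (metadata record index, relative offset
/// inside that record's data block). The record index picks which
/// fixed-size metadata record to decode; the remainder is the term's
/// position within the block it governs.
fn locate(term_ordinal: u64) -> (u64, u64) {
    (term_ordinal / BLOCK_SIZE, term_ordinal % BLOCK_SIZE)
}

fn main() {
    // Ordinal 300 with 256-term blocks lands in block 1, at offset 44.
    assert_eq!(locate(300), (1, 44));
    // Ordinal 255 is the last entry of block 0.
    assert_eq!(locate(255), (0, 255));
}
```

Because the metadata records are fixed-size, the record for any ordinal can be read with one random access, after which only that one block needs decoding.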
[GitHub] [lucene] javanna commented on pull request #12499: Simplify task executor for concurrent operations
javanna commented on PR #12499: URL: https://github.com/apache/lucene/pull/12499#issuecomment-1682812371 Thanks for looking @reta ! I plan on merging this soon; if anyone else would like to review, please go ahead.
[GitHub] [lucene] reta commented on pull request #12499: Simplify task executor for concurrent operations
reta commented on PR #12499: URL: https://github.com/apache/lucene/pull/12499#issuecomment-1682819596 @sohami fyi
[GitHub] [lucene] sohami commented on pull request #12499: Simplify task executor for concurrent operations
sohami commented on PR #12499: URL: https://github.com/apache/lucene/pull/12499#issuecomment-1683053853 > @sohami fyi Thanks @reta for tagging me.
[GitHub] [lucene] sohami commented on pull request #12499: Simplify task executor for concurrent operations
sohami commented on PR #12499: URL: https://github.com/apache/lucene/pull/12499#issuecomment-1683057195 @javanna For the single slice case, are we thinking to leave it as is, or change that as well to offload to the provided `executor`?