[GitHub] [lucene] javanna commented on issue #12498: Simplify task executor for concurrent operations

2023-08-17 Thread via GitHub


javanna commented on issue #12498:
URL: https://github.com/apache/lucene/issues/12498#issuecomment-1681815467

   Thanks for the feedback everyone, there's a PR up as a first step (does not 
include single slice offloading yet): 
https://github.com/apache/lucene/pull/12499 . Reviews are welcome.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] easyice opened a new pull request, #12512: remove unused variable

2023-08-17 Thread via GitHub


easyice opened a new pull request, #12512:
URL: https://github.com/apache/lucene/pull/12512

   ### Description
   
   
   
   The variable `scratch2` is never used





[GitHub] [lucene] benwtrent commented on pull request #12512: Remove unused variable in BKDWriter

2023-08-17 Thread via GitHub


benwtrent commented on PR #12512:
URL: https://github.com/apache/lucene/pull/12512#issuecomment-1682335093

   From what I can tell, it stopped being used after this commit: 
https://github.com/apache/lucene/commit/5d1d6448b91abdebbc68b54b062340c5baa88f4c#diff-12a84802489e3d06f69565f493e8f1b19147e26b1bbb995551a58351ff6f8c92
   
   @iverase does this removal seem cool? I agree, it doesn't look like it's used 
anywhere.





[GitHub] [lucene] mbrette commented on issue #12440: Make HNSW merges faster

2023-08-17 Thread via GitHub


mbrette commented on issue #12440:
URL: https://github.com/apache/lucene/issues/12440#issuecomment-1682522530

   What is your take on the existing merge optimization #12050?
   The approach seems very effective (*), but it does not work when there are 
deleted documents, which is likely in most usage scenarios.
   
   (*) To assess how effective it is, I ran a small test.
   I ran a unit test using HnswGraphBuilder and OnHeapHnswGraph, simulating a 
big segment of 1M vectors (128 dims) merging with a smaller segment of 10k 
vectors. 
   Comparing initializing the HNSW graph from the big segment vs. rebuilding 
the graph, initialization is 100 times faster, even though it recomputes 
neighbor scores.
   A given vector/document will go through many segment merges in its lifetime, 
so the benefit of this optimization accrues a lot.
   Caveat: I used random vectors.
   
   Otherwise, a naive question: have we considered integrating other, native, 
libraries (faiss, raft, nmslib...), like what is done in OpenSearch (at a 
higher abstraction level, though)?
   





[GitHub] [lucene] benwtrent commented on issue #12440: Make HNSW merges faster

2023-08-17 Thread via GitHub


benwtrent commented on issue #12440:
URL: https://github.com/apache/lucene/issues/12440#issuecomment-1682640430

   > What is your take on existing merge optimization 
https://github.com/apache/lucene/pull/12050?
   
   I think it's a good start. One problem I have is that typical Lucene segment 
merging attempts to merge segments of equal size together, making "segment 
tiers" of similar sizes. Maybe an adequate "make merges faster" optimization is 
a better segment merge policy that takes advantage of the inherent strengths of 
this optimization.
   
   > A given vector/document will go through many segment merges in its 
lifetime, so the benefit of this optimization accrues a lot.
   > Caveat: I used random vectors.
   
   This is true, but it depends on the merge policy and which segments are 
merged. When merging 10 segments of equal size, this optimization has 
almost no impact.
   
   > have we considered integrating other, native, libraries (faiss, raft, 
nmslib...), like what is done in OpenSearch (at a higher abstraction level, 
though)?
   
   I am unsure about this. A new codec could be made integrating those native 
libraries, but they should fit within the Lucene segment model and not use JNI. 
From what I can tell, those integrations don't do either of those things.
   
   Additionally, there shouldn't be any external dependencies (if directly 
integrated into the Lucene repo). See Discussion: 
https://github.com/apache/lucene/issues/12502
   
   
   
   Another option for "making merges faster" is to just provide scalar 
quantization for users. This will make merges faster as a whole, since the 
required computations will be much cheaper.
   
   It bugs me that we have all this distributed work across segments that just 
gets ignored. Whether or not this is a native implementation, merging 
similarly sized HNSW graphs from 9 segments into 1 will still be costly.





[GitHub] [lucene] Tony-X opened a new issue, #12513: Try out a simpler term dictionary

2023-08-17 Thread via GitHub


Tony-X opened a new issue, #12513:
URL: https://github.com/apache/lucene/issues/12513

   ### Description
   
   Hello! 
   
   I've been working on a 
[benchmark](https://tony-x.github.io/search-benchmark-game/) for a while to 
compare the features and performance of Lucene and 
[Tantivy](https://github.com/quickwit-oss/tantivy), a Rust search engine 
library that was heavily inspired by Lucene.
   
   The benchmark uses the corpus and queries from luceneutil (the framework for 
the Lucene nightly bench). Since not all query types are supported by Tantivy, 
it currently focuses on Term/Boolean/PhraseQuery. Tantivy has generally shown 
performance advantages so far, and I got motivated to understand why.
   
   I documented the two engines' inverted index implementations per my 
understanding; here is the 
[wiki](https://github.com/Tony-X/search-benchmark-game/wiki/Inverted-index-deep-dive).
 Specifically, both engines use an FST to aid term lookup, but the way they use 
it is quite different. In summary, Lucene uses the FST to map term prefixes, 
then scans the on-disk blocks of terms. Tantivy uses the FST to map all the 
terms to their ordinals, and uses that ordinal/index to decode at most one 
full block. 
   
   The proposal here is to try Tantivy's term dictionary, in which I see some 
advantages:
   1. it can determine that a term does not exist using only FST operations.
   2. it decodes fewer terms in the worst case (a term within a large gap 
between two prefixes).
   3. it is simpler? (might be subjective, but it took me days to digest 
[Lucene90BlockTreeTermsWriter](https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.html)
 and I'm still not sure I got every bit correct...)
   
   
   What do you think?  
   





[GitHub] [lucene] Tony-X commented on issue #12513: Try out Tantivy's term dictionary format

2023-08-17 Thread via GitHub


Tony-X commented on issue #12513:
URL: https://github.com/apache/lucene/issues/12513#issuecomment-1682810659

   Copy-pasting the relevant wiki section about Tantivy's TermDictionary:
   
   ---
   
   ## tantivy TermDictionary
   
   Tantivy's term dictionary has two components: an FST that encodes `term -> 
term_ordinal (u64)`, where the term ordinal is the index of the term in sorted 
order, and a `TermInfoStore` that maps the ordinal to the postings metadata. 
Conceptually, the `TermInfoStore` is just a big vector of postings metadata 
ordered by term ordinal. Internally, it applies additional compression but 
still offers random access. 
   
   ### TermInfoStore encodings
   
   ```
       Metadata Section            │            Blocks
    ┌───┬───┬─────────┬───┬───┐   │   ┌─────────┬─────────┬─────────┐
    │ 1 │ 2 │ ... ... │n-1│ n │   │   │ block 1 │ ... ... │ block n │
    └───┴───┴─────────┴───┴───┘   │   └─────────┴─────────┴─────────┘
      n fixed-size records        │
   ```
   
   
   The metadata record contains all information needed to decode the 
corresponding block. 
   
   ```rust
   struct TermInfoBlockMeta {
       offset: u64,
       ref_term_info: TermInfo,
       doc_freq_nbits: u8,
       postings_offset_nbits: u8,
       positions_offset_nbits: u8,
   }
   
   pub struct TermInfo {
       /// Number of documents in the segment containing the term
       pub doc_freq: u32,
       /// Byte range of the posting list within the postings (`.idx`) file.
       pub postings_range: Range,
       /// Byte range of the positions of this term in the positions (`.pos`) file.
       pub positions_range: Range,
   }
   ```
   * offset: the start offset of the data block in the block section
   * ref_term_info: the reference `TermInfo` on top of which the delta is 
applied
   * *_nbits: the bit widths used to bit-unpack the freq and postings/positions 
offsets in the data block.
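   As a rough illustration of what the `*_nbits` fields enable, here is a 
minimal bit-unpacking sketch (an assumed little-endian bit layout, not 
tantivy's actual code): read the `idx`-th `nbits`-wide unsigned value out of 
a bit-packed byte buffer.
   
   ```rust
   /// Illustrative sketch only. Extract the `idx`-th value of width `nbits`
   /// from a bit-packed buffer, assuming values are laid out back-to-back
   /// starting at bit 0, least-significant bit first.
   fn unpack(data: &[u8], nbits: u8, idx: usize) -> u64 {
       let start_bit = idx * nbits as usize;
       let mut value: u64 = 0;
       for i in 0..nbits as usize {
           let bit = start_bit + i;
           // Test bit `bit` of the buffer and set bit `i` of the result.
           if data[bit / 8] & (1 << (bit % 8)) != 0 {
               value |= 1 << i;
           }
       }
       value
   }
   ```
   
   Because every value in a block uses the same fixed width, any entry can be 
located with pure arithmetic, which is what preserves random access despite 
the compression.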
   
   At search time, both the FST and the `TermInfoStore` are loaded into memory. 
To search for a term, Tantivy first consults the FST to get back the term 
ordinal, if the term exists. Note that here the FST contains all the terms, so 
the lookup can tell whether a term does not exist. The term ordinal determines 
which metadata record to read, and the term's relative offset within the data 
block is simply the ordinal modulo the (fixed) block size. Tantivy then 
decodes the metadata record, which helps locate the data block and decode the 
`TermInfo` (postings metadata).
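   The ordinal-to-block arithmetic above can be sketched as follows; the 
block size value and the function name are hypothetical stand-ins, not 
tantivy's actual internals:
   
   ```rust
   /// Assumed fixed number of terms per block (illustrative value).
   const BLOCK_SIZE: u64 = 256;
   
   /// Map a term ordinal to (metadata record index, slot within the block).
   /// The record index picks the block to decode; the slot says how far
   /// into that block the term's entry sits.
   fn locate(term_ordinal: u64) -> (u64, u64) {
       let block_idx = term_ordinal / BLOCK_SIZE;
       let slot_in_block = term_ordinal % BLOCK_SIZE;
       (block_idx, slot_in_block)
   }
   ```
   
   Since each metadata record is fixed-size, `block_idx` also gives the byte 
offset of the record directly, so no scanning is needed before decoding the 
single relevant block.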
   





[GitHub] [lucene] javanna commented on pull request #12499: Simplify task executor for concurrent operations

2023-08-17 Thread via GitHub


javanna commented on PR #12499:
URL: https://github.com/apache/lucene/pull/12499#issuecomment-1682812371

   Thanks for looking @reta! I plan on merging this soon; if anyone else would 
like to review, please go ahead.





[GitHub] [lucene] reta commented on pull request #12499: Simplify task executor for concurrent operations

2023-08-17 Thread via GitHub


reta commented on PR #12499:
URL: https://github.com/apache/lucene/pull/12499#issuecomment-1682819596

   @sohami fyi





[GitHub] [lucene] sohami commented on pull request #12499: Simplify task executor for concurrent operations

2023-08-17 Thread via GitHub


sohami commented on PR #12499:
URL: https://github.com/apache/lucene/pull/12499#issuecomment-1683053853

   > @sohami fyi
   
   Thanks @reta  for tagging me.





[GitHub] [lucene] sohami commented on pull request #12499: Simplify task executor for concurrent operations

2023-08-17 Thread via GitHub


sohami commented on PR #12499:
URL: https://github.com/apache/lucene/pull/12499#issuecomment-1683057195

   @javanna For the single-slice case, are we thinking of leaving it as is, or 
changing it as well to offload to the provided `executor`?

