easyice commented on issue #12717:
URL: https://github.com/apache/lucene/issues/12717#issuecomment-1793763101

   I reproduced this using low-cardinality fields: for instance, with a posting 
size of 100, I wrote 10 million docs and force-merged to a single segment, then 
used a `TermInSetQuery` with 512 terms as the search benchmark. The flame graph 
shows that `readVIntBlock` accounted for 27% of the time.
   
   
![image](https://github.com/apache/lucene/assets/23521001/487fca37-6e2a-4582-8b75-d8fee5f344ef)
   
   I implemented a first version of group-varint; the decoding process is:
   * `input.readByte()` for the flag
   * `input.readBytes()` to buffer for the int bytes
   * `BitUtil.VH_LE_INT.get(buffer, off) & MASKS` to get the value
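   
   A minimal sketch of the three decode steps above. This is my own reconstruction, not necessarily the patch's exact layout: it assumes the flag byte packs four 2-bit "numBytes - 1" descriptors (lowest bits first), and that the buffer is padded with 3 trailing bytes so the 4-byte `VarHandle` read never runs past the end:
   
   ```java
   import java.lang.invoke.MethodHandles;
   import java.lang.invoke.VarHandle;
   import java.nio.ByteOrder;
   
   public class GroupVarIntDecode {
     private static final VarHandle VH_LE_INT =
         MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);
     // MASKS[n - 1] keeps the low n bytes of a little-endian 4-byte read.
     private static final int[] MASKS = {0xFF, 0xFFFF, 0xFFFFFF, 0xFFFFFFFF};
   
     /**
      * Decodes one group of 4 ints from buffer[pos] into dst[dstOff..dstOff+3].
      * buffer must be over-allocated by 3 bytes. Returns the new read position.
      */
     static int decodeGroup(byte[] buffer, int pos, int[] dst, int dstOff) {
       int flag = buffer[pos++] & 0xFF; // four 2-bit length descriptors
       for (int i = 0; i < 4; i++) {
         int numBytes = ((flag >>> (i << 1)) & 0x3) + 1;
         // Always read 4 bytes, then mask down to the bytes owned by this value.
         dst[dstOff + i] = (int) VH_LE_INT.get(buffer, pos) & MASKS[numBytes - 1];
         pos += numBytes;
       }
       return pos;
     }
   }
   ```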
   
   But the benchmark showed a slight regression; it seems the 
`readBytes()` -> `Unsafe.copyMemory()` path is slow.
   
![image](https://github.com/apache/lucene/assets/23521001/b284f55a-88fa-4413-b71d-526f68fdd4e0)
   
   So I read the whole posting into a buffer up front; with that change, search 
performance improved by ~18% for a posting size of 50, while the .doc file grew 
by ~9%. I also tried `bkd.DocIdsWriter#writeDocIds` to encode the docs, and the 
search performance improvement was similar.
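   
   The ~9% growth of the .doc file is roughly what the format costs: one flag byte per group of four values, plus each value rounded up to whole bytes. For illustration, a hedged sketch of the matching writer side under the same assumed layout (four 2-bit length descriptors in the flag byte, lowest bits first, value bytes little-endian) — not the patch's actual encoder:
   
   ```java
   import java.io.ByteArrayOutputStream;
   
   public class GroupVarIntEncode {
     /** Encodes 4 ints as one group: a flag byte with four 2-bit lengths, then the value bytes. */
     static void encodeGroup(int[] values, int off, ByteArrayOutputStream out) {
       int flag = 0;
       for (int i = 0; i < 4; i++) {
         flag |= (numBytes(values[off + i]) - 1) << (i << 1);
       }
       out.write(flag);
       for (int i = 0; i < 4; i++) {
         int v = values[off + i];
         for (int n = numBytes(v), b = 0; b < n; b++) {
           out.write((v >>> (b << 3)) & 0xFF); // little-endian, only the bytes needed
         }
       }
     }
   
     /** Number of bytes needed to represent v (treated as unsigned), 1..4. */
     static int numBytes(int v) {
       if ((v & 0xFFFFFF00) == 0) return 1;
       if ((v & 0xFFFF0000) == 0) return 2;
       if ((v & 0xFF000000) == 0) return 3;
       return 4;
     }
   }
   ```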
   
   benchmark code:
   ```java
    import java.io.IOException;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Random;
    
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryCachingPolicy;
    import org.apache.lucene.search.TermInSetQuery;
    import org.apache.lucene.search.TotalHitCountCollectorManager;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.MMapDirectory;
    import org.apache.lucene.util.BytesRef;
    
    public class SingleBlockPosting {
      private static final boolean INDEX = true;
      private static final boolean SEARCH = true;
      private static final int BENCHMARK_ITERATION = 10;
      private static final long SEED = 3;
      private static final String FIELD = "f1";
    
      private static final int numDocs = 1000_0000;
      private static final int postingSize = 100;
      private static int cardinality = numDocs / postingSize;
    
      public static Long[] randomLongs() {
        Random rand = new Random(SEED);
        HashSet<Long> setLongs = new HashSet<>();
        while (setLongs.size() < cardinality) {
          setLongs.add(rand.nextLong());
        }
        return setLongs.toArray(new Long[0]);
      }
    
      public static void index() throws IOException {
        Long[] longs = randomLongs();
        Directory dir = MMapDirectory.open(Paths.get("/Volumes/RamDisk/singleblock"));
        IndexWriterConfig iwc = new IndexWriterConfig(null);
        iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        iwc.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);
        IndexWriter indexWriter = new IndexWriter(dir, iwc);
        for (int i = 0; i < numDocs; i++) {
          Document doc = new Document();
          doc.add(new StringField(FIELD, String.valueOf(longs[i % cardinality]), Field.Store.NO));
          indexWriter.addDocument(doc);
        }
        indexWriter.commit();
        indexWriter.forceMerge(1);
        indexWriter.close();
        dir.close();
      }
    
      public static void doSearchBenchMark(int termCount) throws IOException {
        List<Long> times = new ArrayList<>();
        for (int i = 0; i < BENCHMARK_ITERATION; i++) {
          times.add(doSearch(termCount));
        }
        long took = times.stream().mapToLong(Number::longValue).min().getAsLong();
        System.out.println("best result: term count: " + termCount + ", took(ms): " + took);
      }
    
      public static long doSearch(int termCount) throws IOException {
        Directory directory = FSDirectory.open(Paths.get("/Volumes/RamDisk/singleblock"));
        IndexReader indexReader = DirectoryReader.open(directory);
        IndexSearcher searcher = new IndexSearcher(indexReader);
        // Disable the query cache so every iteration measures postings decoding.
        searcher.setQueryCachingPolicy(
            new QueryCachingPolicy() {
              @Override
              public void onUse(Query query) {}
    
              @Override
              public boolean shouldCache(Query query) throws IOException {
                return false;
              }
            });
    
        long total = 0;
        Query query = getQuery(termCount);
        for (int i = 0; i < 1000; i++) {
          long start = System.currentTimeMillis();
          doQuery(searcher, query);
          long end = System.currentTimeMillis();
          total += end - start;
        }
        System.out.println("term count: " + termCount + ", took(ms): " + total);
        indexReader.close();
        directory.close();
        return total;
      }
    
      private static Query getQuery(int termCount) {
        List<BytesRef> terms = new ArrayList<>();
        Long[] longs = randomLongs();
        for (int i = 0; i < termCount; i++) {
          terms.add(new BytesRef(Long.toString(longs[i % cardinality])));
        }
        return new TermInSetQuery(FIELD, terms);
      }
    
      private static void doQuery(IndexSearcher searcher, Query query) throws IOException {
        TotalHitCountCollectorManager collectorManager = new TotalHitCountCollectorManager();
        int totalHits = searcher.search(query, collectorManager);
        // System.out.println(totalHits);
      }
    
      public static void main(String[] args) throws IOException {
        if (INDEX) {
          index();
        }
        if (SEARCH) {
          doSearchBenchMark(512);
        }
      }
    }
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

