easyice commented on issue #12717: URL: https://github.com/apache/lucene/issues/12717#issuecomment-1793763101
I reproduced this using low-cardinality fields. For instance, let the posting size be 100, write 10 million docs, then force merge to a single segment, and use a `TermInSetQuery` with 512 terms as the search benchmark; the flame graph shows `readVIntBlock` accounting for 27%.

I implemented a first version of group-varint. The decoding process is:
* `input.readByte()` for the flag
* `input.readBytes()` into a buffer for the int bytes
* `BitUtil.VH_LE_INT.get(buffer, off) & MASKS` to get each value

But the benchmark showed a slight regression; it seems `readBytes()` -> `Unsafe.copyMemory()` is slow. So I fully read the posting into a buffer first, after which performance improved by ~18% for a posting size of 50, while the `.doc` file grew by ~9%. I also tried using `bkd.DocIdsWriter#writeDocIds` to encode the docs; the search performance improvement was similar.

benchmark code:
```java
import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryCachingPolicy;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.search.TotalHitCountCollectorManager;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.util.BytesRef;

public class SingleBlockPosting {
  private static final boolean INDEX = true;
  private static final boolean SEARCH = true;
  private static final int BENCHMARK_ITERATION = 10;
  private static final long SEED = 3;
  private static final String FIELD = "f1";
  private static final int numDocs = 1000_0000;
  private static final int postingSize = 100;
  private static int cardinality = numDocs / postingSize;

  public static Long[] randomLongs() {
    Random rand = new Random(SEED);
    HashSet<Long> setLongs = new HashSet<>();
    while (setLongs.size() < cardinality) {
      setLongs.add(rand.nextLong());
    }
    return setLongs.toArray(new Long[0]);
  }

  public static void index() throws IOException {
    Long[] longs = randomLongs();
    Directory dir = MMapDirectory.open(Paths.get("/Volumes/RamDisk/singleblock"));
    IndexWriterConfig iwc = new IndexWriterConfig(null);
    iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    iwc.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);
    IndexWriter indexWriter = new IndexWriter(dir, iwc);
    for (int i = 0; i < numDocs; i++) {
      Document doc = new Document();
      doc.add(new StringField(FIELD, String.valueOf(longs[i % cardinality]), Field.Store.NO));
      indexWriter.addDocument(doc);
    }
    indexWriter.commit();
    indexWriter.forceMerge(1);
    indexWriter.close();
    dir.close();
  }

  public static void doSearchBenchMark(int termCount) throws IOException {
    List<Long> times = new ArrayList<>();
    for (int i = 0; i < BENCHMARK_ITERATION; i++) {
      times.add(doSearch(termCount));
    }
    long took = times.stream().mapToLong(Number::longValue).min().getAsLong();
    System.out.println("best result: term count: " + termCount + ", took(ms): " + took);
  }

  public static long doSearch(int termCount) throws IOException {
    Directory directory = FSDirectory.open(Paths.get("/Volumes/RamDisk/singleblock"));
    IndexReader indexReader = DirectoryReader.open(directory);
    IndexSearcher searcher = new IndexSearcher(indexReader);
    searcher.setQueryCachingPolicy(
        new QueryCachingPolicy() {
          @Override
          public void onUse(Query query) {}

          @Override
          public boolean shouldCache(Query query) throws IOException {
            return false;
          }
        });
    long total = 0;
    Query query = getQuery(termCount);
    for (int i = 0; i < 1000; i++) {
      long start = System.currentTimeMillis();
      doQuery(searcher, query);
      long end = System.currentTimeMillis();
      total += end - start;
    }
    System.out.println("term count: " + termCount + ", took(ms): " + total);
    indexReader.close();
    directory.close();
    return total;
  }

  private static Query getQuery(int termCount) {
    List<BytesRef> terms = new ArrayList<>();
    Long[] longs = randomLongs();
    for (int i = 0; i < termCount; i++) {
      terms.add(new BytesRef(Long.toString(longs[i % cardinality])));
    }
    return new TermInSetQuery(FIELD, terms);
  }

  private static void doQuery(IndexSearcher searcher, Query query) throws IOException {
    TotalHitCountCollectorManager collectorManager = new TotalHitCountCollectorManager();
    int totalHits = searcher.search(query, collectorManager);
    // System.out.println(totalHits);
  }

  public static void main(String[] args) throws IOException {
    if (INDEX) {
      index();
    }
    if (SEARCH) {
      doSearchBenchMark(512);
    }
  }
}
```
-- This is an automated
message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
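As a supplement, here is a minimal, self-contained sketch of the group-varint scheme the comment describes (one flag byte carrying four 2-bit length codes, followed by the value bytes, decoded with a masked little-endian 4-byte load in the spirit of `BitUtil.VH_LE_INT.get(buffer, off) & MASKS`). This is an illustrative toy under my own assumptions, not Lucene's actual implementation; the class and method names are made up.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Toy group-varint codec: 4 ints per group, flag byte stores (byteLength - 1)
// for each value in 2 bits, value bytes follow in little-endian order.
public class GroupVarintDemo {

  static byte[] encode(int[] values) {
    byte[] out = new byte[1 + 16]; // worst case: flag + 4 * 4 bytes
    int flag = 0, pos = 1;
    for (int i = 0; i < 4; i++) {
      int v = values[i];
      // number of bytes needed to hold v (1..4); v|1 avoids len 0 for v == 0
      int len = (32 - Integer.numberOfLeadingZeros(v | 1) + 7) / 8;
      flag |= (len - 1) << (i * 2);
      for (int b = 0; b < len; b++) {
        out[pos++] = (byte) (v >>> (8 * b));
      }
    }
    out[0] = (byte) flag;
    byte[] trimmed = new byte[pos];
    System.arraycopy(out, 0, trimmed, 0, pos);
    return trimmed;
  }

  static int[] decode(byte[] in) {
    int flag = in[0] & 0xFF;
    // pad with 3 bytes so an unconditional 4-byte load never reads past the end
    byte[] padded = new byte[in.length + 3];
    System.arraycopy(in, 0, padded, 0, in.length);
    ByteBuffer buf = ByteBuffer.wrap(padded).order(ByteOrder.LITTLE_ENDIAN);
    int[] values = new int[4];
    int pos = 1;
    for (int i = 0; i < 4; i++) {
      int len = ((flag >>> (i * 2)) & 0x3) + 1;
      int mask = (int) ((1L << (8 * len)) - 1);
      // branch-free inner read: full little-endian int load, masked to len bytes
      values[i] = buf.getInt(pos) & mask;
      pos += len;
    }
    return values;
  }

  public static void main(String[] args) {
    int[] original = {3, 300, 70000, 123456789}; // 1-, 2-, 3-, and 4-byte values
    int[] decoded = decode(encode(original));
    System.out.println(java.util.Arrays.equals(original, decoded)); // prints true
  }
}
```

The point of the masked load is that each value is decoded with one fixed-width read plus a mask instead of a per-byte loop, which is why avoiding the intermediate `readBytes()` copy matters for it to pay off.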