easyice commented on PR #12842: URL: https://github.com/apache/lucene/pull/12842#issuecomment-1856293777
Sorry for the late update! I spent some more time on another PR. I now encode the positions with group-varint when `storeOffsets` is false and there are no payloads. As of the last commit, it uses a `long[]` buffer of size 128 to encode/decode.

I wrote a simple benchmark to measure `flush()` performance. It shows no significant improvement, because `readVInt` and `readGroupVInt` have similar performance in `ByteBuffersDataOutput` on the current branch. I'll test it again tomorrow with the optimized code from https://github.com/apache/lucene/pull/12841.

The simple benchmark summary:

* 200 terms per field.
* Frequency per term set to 100, which means the cardinality of a field is 2 (200 / 100 distinct terms). The group-varint encoding of the positions does not cross doc boundaries.
* 10000 docs total.

<details>
<summary>Benchmark code</summary>

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.tests.util.TestUtil;

public class SortedStringWriteBenchmark {

  static class Benchmark {
    final Random rand = new Random(0);

    // Builds one field value: (termsPerField / freqPerTerm) distinct terms,
    // each repeated freqPerTerm times, in shuffled order.
    String randomString(int termsPerField, int freqPerTerm) {
      List<String> values = new ArrayList<>();
      for (int i = 0; i < termsPerField; i += freqPerTerm) {
        String s = TestUtil.randomSimpleString(rand, 5, 10);
        for (int j = 0; j < freqPerTerm; j++) {
          values.add(s);
        }
      }
      // Shuffle with the seeded Random so runs are reproducible.
      Collections.shuffle(values, rand);
      return String.join(" ", values);
    }

    List<String> randomStrings(int max, int termsPerField, int freqPerTerm) {
      List<String> values = new ArrayList<>();
      for (int i = 0; i < max; i++) {
        values.add(randomString(termsPerField, freqPerTerm));
      }
      return values;
    }

    // Indexes all docs into a sorted index and returns the time (ms) spent in flush().
    long write() throws IOException {
      List<String> terms = randomStrings(10000, 200, 100);
      Path temp = Files.createTempDirectory(Paths.get("/Volumes/RamDisk"), "tmpDirPrefix");
      Directory dir = new MMapDirectory(temp);
      IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
      config.setIndexSort(new Sort(new SortField("sort", SortField.Type.LONG)));
      config.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);
      IndexWriter w = new IndexWriter(dir, config);

      // Positions only: no offsets, no payloads.
      FieldType ft = new FieldType();
      ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
      ft.setTokenized(true);
      ft.freeze();

      for (int i = 0; i < terms.size(); ++i) {
        Document doc = new Document();
        doc.add(new NumericDocValuesField("sort", rand.nextInt()));
        doc.add(new Field("field", terms.get(i), ft));
        w.addDocument(doc);
      }

      long t0 = System.currentTimeMillis();
      w.flush();
      long took = System.currentTimeMillis() - t0;
      w.close();
      dir.close();
      return took;
    }
  }

  public static void main(final String[] args) throws Exception {
    int iter = 50;
    Benchmark benchmark = new Benchmark();
    List<Long> times = new ArrayList<>();
    for (int i = 0; i < iter; i++) {
      long took = benchmark.write();
      times.add(took);
      System.out.println("iteration " + i + ",took(ms):" + took);
    }
    // Average over the second half only, treating the first half as warm-up.
    double avg =
        times.stream().skip(iter / 2).mapToLong(Number::longValue).average().getAsDouble();
    long min = times.stream().mapToLong(Number::longValue).min().getAsLong();
    System.out.println("best took(ms) avg:" + avg + ", min:" + min);
  }
}
```

</details>
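
For context on the encoding being benchmarked, here is a minimal, self-contained sketch of group-varint for one group of four non-negative ints. It is not the code from this PR: the class and method names are hypothetical, and the byte layout (one flag byte holding four 2-bit lengths, then each value in little-endian order) is an assumption that may differ in detail from Lucene's implementation.

```java
// Hypothetical illustration of group-varint; not taken from the PR.
class GroupVarintSketch {

  // Number of bytes (1..4) needed to store v, treating v as unsigned.
  static int numBytes(int v) {
    if (v >>> 8 == 0) return 1;
    if (v >>> 16 == 0) return 2;
    if (v >>> 24 == 0) return 3;
    return 4;
  }

  // Encodes values[0..3] into out starting at pos; returns the position after the group.
  static int encode(int[] values, byte[] out, int pos) {
    int flagPos = pos++; // reserve one byte for the four 2-bit lengths
    int flag = 0;
    for (int i = 0; i < 4; i++) {
      int v = values[i];
      int n = numBytes(v);
      flag |= (n - 1) << (6 - 2 * i); // first value uses the top two bits
      for (int b = 0; b < n; b++) {
        out[pos++] = (byte) (v >>> (8 * b)); // little-endian bytes
      }
    }
    out[flagPos] = (byte) flag;
    return pos;
  }

  // Decodes one group of four ints from in at pos; returns the position after the group.
  static int decode(byte[] in, int pos, int[] values) {
    int flag = in[pos++] & 0xFF;
    for (int i = 0; i < 4; i++) {
      int n = ((flag >>> (6 - 2 * i)) & 0x3) + 1;
      int v = 0;
      for (int b = 0; b < n; b++) {
        v |= (in[pos++] & 0xFF) << (8 * b);
      }
      values[i] = v;
    }
    return pos;
  }

  public static void main(String[] args) {
    int[] positions = {1, 300, 70000, 1 << 30};
    byte[] buf = new byte[17]; // worst case: 1 flag byte + 4 * 4 value bytes
    int end = encode(positions, buf, 0);
    int[] decoded = new int[4];
    decode(buf, 0, decoded);
    System.out.println("bytes used: " + end + ", decoded: " + java.util.Arrays.toString(decoded));
  }
}
```

The decoder reads all four lengths from a single flag byte, so it avoids the per-byte continuation-bit check that a plain VInt decoder performs. Whether that translates into a measurable win depends on how the underlying input implements the read, which is what the follow-up test against the other PR is meant to show.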