[ https://issues.apache.org/jira/browse/LUCENE-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003640#comment-17003640 ]
Adrien Grand commented on LUCENE-4702:
--------------------------------------

I finally explored a different path: JDK 13 added more auto-vectorization optimizations on byte[] arrays, so I wanted to see whether we could leverage them for compression. I ended up with a few lines of code that can encode/decode byte[] arrays with a compression ratio of ~75% when most bytes (there is support for exceptions) fall in the [0x1F,0x3F) or [0x5F,0x7F) ranges, which notably include all digits, lowercase letters, '.', '-' and '_'. So it should be applicable most of the time to terms dictionaries of analyzed content. It already helps on our nightly benchmarks, even though very little normalization is performed there (e.g. no ASCII folding). It is usually faster than LZ4 for short sequences of text like our blocks of suffixes (several times faster on JDK 13+, and a bit faster on earlier JDKs). LZ4's ability to remove duplicate strings is still helpful, but since it hurts multi-term queries I only enable it when it yields a compression ratio below 75%.

I got the following results on a force-merged wikibigall index. Note that the results are not comparable at all with previous results on this issue: this is a different dataset, and there have been many other changes in Lucene that affect these benchmarks, especially the fact that benchmarks now only count 1,000 hits.

{noformat}
              Task QPS baseline StdDev QPS patch StdDev Pct diff
           Respell 164.33 (6.7%) 140.08 (4.3%) -14.8% ( -24% - -4%)
            Fuzzy2 108.19 (7.7%) 101.51 (6.6%) -6.2% ( -19% - 8%)
          Wildcard 94.23 (2.8%) 88.42 (2.6%) -6.2% ( -11% - 0%)
           Prefix3 247.07 (5.1%) 244.95 (4.0%) -0.9% ( -9% - 8%)
      TermBGroup1M 24.38 (6.4%) 24.17 (6.3%) -0.8% ( -12% - 12%)
       TermGroup1M 23.12 (6.6%) 23.02 (6.0%) -0.4% ( -12% - 13%)
       AndHighHigh 35.88 (4.8%) 35.78 (5.0%) -0.3% ( -9% - 9%)
      TermGroup10K 45.63 (5.7%) 45.53 (5.4%) -0.2% ( -10% - 11%)
          SpanNear 10.89 (1.4%) 10.87 (1.5%) -0.2% ( -3% - 2%)
      SloppyPhrase 19.57 (4.1%) 19.54 (4.1%) -0.1% ( -8% - 8%)
            Phrase 69.13 (3.5%) 69.05 (3.9%) -0.1% ( -7% - 7%)
        AndHighMed 50.75 (4.6%) 50.70 (4.6%) -0.1% ( -8% - 9%)
  IntervalsOrdered 23.97 (0.8%) 23.96 (0.6%) -0.0% ( -1% - 1%)
              Term 1432.69 (3.8%) 1432.25 (3.7%) -0.0% ( -7% - 7%)
   AndHighOrMedMed 37.71 (1.7%) 37.72 (1.7%) 0.0% ( -3% - 3%)
    TermBGroup1M1P 25.61 (3.4%) 25.62 (3.4%) 0.1% ( -6% - 7%)
        TermDTSort 41.04 (4.9%) 41.06 (4.6%) 0.1% ( -9% - 10%)
         OrHighMed 35.05 (3.2%) 35.08 (3.4%) 0.1% ( -6% - 6%)
  AndMedOrHighHigh 34.22 (3.5%) 34.26 (3.7%) 0.1% ( -6% - 7%)
 TermDayOfYearSort 93.34 (7.6%) 93.60 (7.2%) 0.3% ( -13% - 16%)
      TermGroup100 15.21 (3.1%) 15.27 (3.0%) 0.4% ( -5% - 6%)
     TermMonthSort 49.27 (2.7%) 49.53 (2.3%) 0.5% ( -4% - 5%)
     TermTitleSort 127.41 (2.8%) 128.12 (2.2%) 0.6% ( -4% - 5%)
        OrHighHigh 10.14 (3.3%) 10.20 (3.5%) 0.6% ( -5% - 7%)
            Fuzzy1 159.76 (8.2%) 161.68 (6.6%) 1.2% ( -12% - 17%)
            IntNRQ 266.89 (8.8%) 280.44 (11.6%) 5.1% ( -14% - 27%)
{noformat}

The hit on {{Respell}} is significant, but on the other multi-term queries it looks reasonable to me. It gave a ~9.3% reduction of the {{tim}} file, from 937MB to 850MB.
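To make the ~75% figure concrete, here is a minimal sketch of the packing arithmetic only: each byte from the two 32-value ranges maps to a 6-bit code, and four 6-bit codes fit into three output bytes. This is not the patch itself (the real encoder also records out-of-range bytes as exceptions and is shaped so the JDK 13+ byte[] auto-vectorization mentioned above can kick in); the {{SixBitPackingSketch}} class and its method names are made up for this example.

{noformat}
// Illustrative sketch only: map bytes from [0x1F,0x3F) and [0x5F,0x7F) to
// 6-bit codes and pack four codes into three bytes (~75% of the original size).
// Exception handling for out-of-range bytes is deliberately omitted.
class SixBitPackingSketch {

  // Map an in-range byte to a 6-bit code in [0,64).
  static int code(byte b) {
    if (b >= 0x1F && b < 0x3F) {
      return b - 0x1F;          // 0..31: digits, '.', '-', ...
    } else if (b >= 0x5F && b < 0x7F) {
      return 32 + (b - 0x5F);   // 32..63: '_', lowercase letters, ...
    } else {
      throw new IllegalArgumentException("out of range, handled as an exception in the real encoder");
    }
  }

  // Inverse of code().
  static byte decode(int code) {
    return (byte) (code < 32 ? code + 0x1F : (code - 32) + 0x5F);
  }

  // Encode 4 input bytes into 3 output bytes.
  static void encodeGroup(byte[] in, int inOff, byte[] out, int outOff) {
    int packed = (code(in[inOff]) << 18)
        | (code(in[inOff + 1]) << 12)
        | (code(in[inOff + 2]) << 6)
        | code(in[inOff + 3]);
    out[outOff] = (byte) (packed >>> 16);
    out[outOff + 1] = (byte) (packed >>> 8);
    out[outOff + 2] = (byte) packed;
  }

  // Decode 3 packed bytes back into the 4 original bytes.
  static void decodeGroup(byte[] in, int inOff, byte[] out, int outOff) {
    int packed = ((in[inOff] & 0xFF) << 16)
        | ((in[inOff + 1] & 0xFF) << 8)
        | (in[inOff + 2] & 0xFF);
    out[outOff] = decode((packed >>> 18) & 0x3F);
    out[outOff + 1] = decode((packed >>> 12) & 0x3F);
    out[outOff + 2] = decode((packed >>> 6) & 0x3F);
    out[outOff + 3] = decode(packed & 0x3F);
  }
}
{noformat}

Judging from the per-algorithm counters in the stats below, each block of suffixes ends up either compressed with this lowercase-ASCII scheme, compressed with LZ4 (only when LZ4 yields a ratio below 75%, to limit the impact on multi-term queries), or stored uncompressed.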
Here are the detailed stats for the "body" field:

{noformat}
index FST:
  72 bytes
terms:
  46916528 terms
  595069147 bytes (12.7 bytes/term)
blocks:
  1507239 blocks
  1158537 terms-only blocks
  471 sub-block-only blocks
  348231 mixed blocks
  318391 floor blocks
  491775 non-floor blocks
  1015464 floor sub-blocks
  359890173 term suffix bytes before compression (196.4 suffix-bytes/block)
  296029380 compressed term suffix bytes (0.82 compression ratio - compression count by algorithm: uncompressed:225133, lowercase_ascii:1217151, LZ4:64955)
  94426201 term stats bytes (62.6 stats-bytes/block)
  236025336 other bytes (156.6 other-bytes/block)
by prefix length:
  0: 4
  1: 403
  2: 12500
  3: 135458
  4: 214723
  5: 445741
  6: 279299
  7: 120403
  8: 95046
  9: 65611
  10: 42914
  11: 25225
  12: 15910
  13: 8865
  14: 9029
  15: 13485
  16: 10549
  17: 3412
  18: 1234
  19: 1003
  20: 1197
  21: 753
  22: 436
  23: 510
  24: 328
  25: 494
  26: 396
  27: 723
  28: 246
  29: 310
  30: 103
  31: 60
  32: 58
  33: 36
  34: 61
  35: 83
  36: 118
  37: 44
  38: 48
  39: 81
  40: 16
  41: 29
  42: 12
  43: 12
  44: 44
  45: 16
  46: 54
  47: 18
  48: 10
  49: 5
  50: 6
  51: 2
  52: 4
  53: 13
  55: 2
  56: 11
  57: 6
  58: 7
  59: 8
  60: 2
  61: 11
  62: 8
  63: 8
  64: 4
  65: 5
  66: 7
  67: 4
  68: 1
  69: 1
  70: 4
  73: 2
  74: 1
  76: 1
  77: 1
  78: 2
  79: 2
  81: 1
{noformat}

When I simulate 1M flake IDs with a 1,000 docs/s indexing rate, I get the following stats:

{noformat}
index FST:
  134007 bytes
terms:
  1000000 terms
  16000000 bytes (16.0 bytes/term)
blocks:
  39215 blocks
  39062 terms-only blocks
  153 sub-block-only blocks
  0 mixed blocks
  3923 floor blocks
  1 non-floor blocks
  39214 floor sub-blocks
  10019627 term suffix bytes before compression (165.6 suffix-bytes/block)
  6492123 compressed term suffix bytes (0.65 compression ratio - compression count by algorithm: uncompressed:137, lowercase_ascii:15, LZ4:39063)
  1000000 term stats bytes (25.5 stats-bytes/block)
  4101135 other bytes (104.6 other-bytes/block)
by prefix length:
  0: 1
  6: 152
  7: 39062
{noformat}

> Terms dictionary compression
> ----------------------------
>
>                 Key: LUCENE-4702
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4702
>             Project: Lucene - Core
>          Issue Type: Wish
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Trivial
>         Attachments: LUCENE-4702.patch, LUCENE-4702.patch
>
>
> I've done a quick test with the block tree terms dictionary by replacing a call to
> IndexOutput.writeBytes to write suffix bytes with a call to LZ4.compressHC to test the
> performance hit. Interestingly, search performance was very good (see comparison table
> below) and the tim files were 14% smaller (from 150432 bytes overall to 129516).
> {noformat}
>              Task QPS baseline StdDev QPS compressed StdDev Pct diff
>            Fuzzy1 111.50 (2.0%) 78.78 (1.5%) -29.4% ( -32% - -26%)
>            Fuzzy2 36.99 (2.7%) 28.59 (1.5%) -22.7% ( -26% - -18%)
>           Respell 122.86 (2.1%) 103.89 (1.7%) -15.4% ( -18% - -11%)
>          Wildcard 100.58 (4.3%) 94.42 (3.2%) -6.1% ( -13% - 1%)
>           Prefix3 124.90 (5.7%) 122.67 (4.7%) -1.8% ( -11% - 9%)
>         OrHighLow 169.87 (6.8%) 167.77 (8.0%) -1.2% ( -15% - 14%)
>           LowTerm 1949.85 (4.5%) 1929.02 (3.4%) -1.1% ( -8% - 7%)
>        AndHighLow 2011.95 (3.5%) 1991.85 (3.3%) -1.0% ( -7% - 5%)
>        OrHighHigh 155.63 (6.7%) 154.12 (7.9%) -1.0% ( -14% - 14%)
>       AndHighHigh 341.82 (1.2%) 339.49 (1.7%) -0.7% ( -3% - 2%)
>         OrHighMed 217.55 (6.3%) 216.16 (7.1%) -0.6% ( -13% - 13%)
>            IntNRQ 53.10 (10.9%) 52.90 (8.6%) -0.4% ( -17% - 21%)
>           MedTerm 998.11 (3.8%) 994.82 (5.6%) -0.3% ( -9% - 9%)
>       MedSpanNear 60.50 (3.7%) 60.36 (4.8%) -0.2% ( -8% - 8%)
>      HighSpanNear 19.74 (4.5%) 19.72 (5.1%) -0.1% ( -9% - 9%)
>       LowSpanNear 101.93 (3.2%) 101.82 (4.4%) -0.1% ( -7% - 7%)
>        AndHighMed 366.18 (1.7%) 366.93 (1.7%) 0.2% ( -3% - 3%)
>          PKLookup 237.28 (4.0%) 237.96 (4.2%) 0.3% ( -7% - 8%)
>         MedPhrase 173.17 (4.7%) 174.69 (4.7%) 0.9% ( -8% - 10%)
>   LowSloppyPhrase 180.91 (2.6%) 182.79 (2.7%) 1.0% ( -4% - 6%)
>         LowPhrase 374.64 (5.5%) 379.11 (5.8%) 1.2% ( -9% - 13%)
>          HighTerm 253.14 (7.9%) 256.97 (11.4%) 1.5% ( -16% - 22%)
>        HighPhrase 19.52 (10.6%) 19.83 (11.0%) 1.6% ( -18% - 25%)
>   MedSloppyPhrase 141.90 (2.6%) 144.11 (2.5%) 1.6% ( -3% - 6%)
>  HighSloppyPhrase 25.26 (4.8%) 25.97 (5.0%) 2.8% ( -6% - 13%)
> {noformat}
>
> Only queries which are very terms-dictionary-intensive got a performance hit
> (Fuzzy1, Fuzzy2, Respell, Wildcard); other queries, including Prefix3, behaved
> (surprisingly) well.
>
> Do you think of it as something worth exploring?
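For context on the quick test described in the quoted report, here is a rough sketch of the idea of routing a block of suffix bytes through LZ4 instead of writing it raw. The {{SuffixCompressionSketch}} class and its {{writeSuffixBytes}} method are invented for illustration, and the sketch uses the {{org.apache.lucene.util.compress.LZ4}} helper from recent Lucene versions rather than the older {{LZ4.compressHC}} mentioned above; the exact helper names and signatures differ across Lucene versions.

{noformat}
// Illustration only, not the original experiment's code: compress a block of
// suffix bytes with LZ4 (high-compression mode) before writing it, instead of
// writing the raw bytes with writeBytes.
import java.io.IOException;

import org.apache.lucene.store.DataOutput;
import org.apache.lucene.util.compress.LZ4;

class SuffixCompressionSketch {

  // Reused across blocks; the high-compression table trades speed for ratio.
  private final LZ4.HighCompressionHashTable ht = new LZ4.HighCompressionHashTable();

  void writeSuffixBytes(DataOutput out, byte[] suffixBytes, int len) throws IOException {
    // Before: out.writeBytes(suffixBytes, 0, len);
    out.writeVInt(len);                         // original length, needed at read time
    LZ4.compress(suffixBytes, 0, len, out, ht); // compressed suffix bytes
  }
}
{noformat}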