[ https://issues.apache.org/jira/browse/LUCENE-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17024532#comment-17024532 ]
Adrien Grand commented on LUCENE-4702:
--------------------------------------

OK I benchmarked with multi-segment indices this time to try to better replicate nightly benchmarks. I opened a pull request at https://github.com/apache/lucene-solr/pull/1216 that:
 - removes compression of suffix lengths since it didn't help much anyway,
 - replaces LZ4 on stats with explicit run-length compression,
 - only tries out LZ4 for suffix bytes if the average suffix length is > 6, to reduce index-time overhead, since it's unlikely to meet the saving expectations otherwise anyway.

On wikibigall, the specialized RLE makes the tim file even smaller with this change (969MB vs. 996MB) and luceneutil seems to be a bit happier:

{noformat}
             Task  QPS baseline   StdDev  QPS patch   StdDev  Pct diff
           IntNRQ        144.16   (1.2%)     143.47   (1.9%)     -0.5%  ( -3% -   2%)
     TermBGroup1M         32.04   (5.1%)      31.93   (5.1%)     -0.4%  (-10% -  10%)
       TermDTSort         39.13   (0.9%)      39.05   (1.0%)     -0.2%  ( -2% -   1%)
      TermGroup1M         40.18   (4.0%)      40.12   (3.4%)     -0.2%  ( -7% -   7%)
    TermTitleSort        124.62   (1.9%)     124.54   (1.6%)     -0.1%  ( -3% -   3%)
TermDayOfYearSort         88.37   (6.9%)      88.34   (7.1%)     -0.0%  (-13% -  14%)
     TermGroup10K         28.56   (5.0%)      28.56   (4.4%)      0.0%  ( -8% -   9%)
 IntervalsOrdered          4.50   (1.1%)       4.51   (0.6%)      0.0%  ( -1% -   1%)
   TermBGroup1M1P         45.83   (4.1%)      45.85   (4.0%)      0.0%  ( -7% -   8%)
    TermMonthSort        137.33   (1.8%)     137.40   (1.3%)      0.1%  ( -2% -   3%)
      AndHighHigh         72.97   (2.8%)      73.05   (2.7%)      0.1%  ( -5% -   5%)
        OrHighMed         77.75   (2.7%)      77.85   (2.7%)      0.1%  ( -5% -   5%)
         SpanNear         10.66   (1.2%)      10.68   (1.2%)      0.2%  ( -2% -   2%)
           Phrase         59.75   (4.9%)      59.91   (5.2%)      0.3%  ( -9% -  10%)
             Term       1358.87   (6.8%)    1363.02   (6.1%)      0.3%  (-11% -  14%)
 AndMedOrHighHigh         28.18   (3.0%)      28.27   (2.5%)      0.3%  ( -5% -   6%)
       OrHighHigh         18.55   (3.2%)      18.61   (2.2%)      0.3%  ( -4% -   5%)
     SloppyPhrase         19.41   (3.9%)      19.49   (3.5%)      0.4%  ( -6% -   8%)
       AndHighMed         65.81   (2.8%)      66.15   (2.4%)      0.5%  ( -4% -   5%)
  AndHighOrMedMed         36.49   (2.5%)      36.69   (1.9%)      0.5%  ( -3% -   5%)
     TermGroup100         12.19   (3.9%)      12.27   (4.0%)      0.6%  ( -7% -   8%)
         PKLookup        217.61   (3.2%)     220.39   (3.3%)      1.3%  ( -5% -   8%)
          Prefix3        197.95   (3.3%)     202.32   (3.4%)      2.2%  ( -4% -   9%)
         Wildcard         37.78   (2.2%)      41.43   (2.8%)      9.6%  (  4% -  14%)
           Fuzzy1         47.77   (5.5%)      53.35   (8.4%)     11.7%  ( -2% -  27%)
           Fuzzy2         43.69   (7.5%)      49.50  (10.7%)     13.3%  ( -4% -  34%)
          Respell         34.05   (1.6%)      41.94   (1.4%)     23.2%  ( 19% -  26%)
{noformat}

I plan to commit it and see how that affects nightly benchmarks.

> Terms dictionary compression
> ----------------------------
>
>                 Key: LUCENE-4702
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4702
>             Project: Lucene - Core
>          Issue Type: Wish
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Trivial
>         Attachments: LUCENE-4702.patch, LUCENE-4702.patch
>
>          Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> I've done a quick test with the block tree terms dictionary by replacing a
> call to IndexOutput.writeBytes to write suffix bytes with a call to
> LZ4.compressHC to test the performance hit. Interestingly, search performance
> was very good (see comparison table below) and the tim files were 14% smaller
> (from 150432 bytes overall to 129516).
> {noformat}
>             Task  QPS baseline   StdDev  QPS compressed   StdDev  Pct diff
>           Fuzzy1        111.50   (2.0%)           78.78   (1.5%)    -29.4%  (-32% - -26%)
>           Fuzzy2         36.99   (2.7%)           28.59   (1.5%)    -22.7%  (-26% - -18%)
>          Respell        122.86   (2.1%)          103.89   (1.7%)    -15.4%  (-18% - -11%)
>         Wildcard        100.58   (4.3%)           94.42   (3.2%)     -6.1%  (-13% -   1%)
>          Prefix3        124.90   (5.7%)          122.67   (4.7%)     -1.8%  (-11% -   9%)
>        OrHighLow        169.87   (6.8%)          167.77   (8.0%)     -1.2%  (-15% -  14%)
>          LowTerm       1949.85   (4.5%)         1929.02   (3.4%)     -1.1%  ( -8% -   7%)
>       AndHighLow       2011.95   (3.5%)         1991.85   (3.3%)     -1.0%  ( -7% -   5%)
>       OrHighHigh        155.63   (6.7%)          154.12   (7.9%)     -1.0%  (-14% -  14%)
>      AndHighHigh        341.82   (1.2%)          339.49   (1.7%)     -0.7%  ( -3% -   2%)
>        OrHighMed        217.55   (6.3%)          216.16   (7.1%)     -0.6%  (-13% -  13%)
>           IntNRQ         53.10  (10.9%)           52.90   (8.6%)     -0.4%  (-17% -  21%)
>          MedTerm        998.11   (3.8%)          994.82   (5.6%)     -0.3%  ( -9% -   9%)
>      MedSpanNear         60.50   (3.7%)           60.36   (4.8%)     -0.2%  ( -8% -   8%)
>     HighSpanNear         19.74   (4.5%)           19.72   (5.1%)     -0.1%  ( -9% -   9%)
>      LowSpanNear        101.93   (3.2%)          101.82   (4.4%)     -0.1%  ( -7% -   7%)
>       AndHighMed        366.18   (1.7%)          366.93   (1.7%)      0.2%  ( -3% -   3%)
>         PKLookup        237.28   (4.0%)          237.96   (4.2%)      0.3%  ( -7% -   8%)
>        MedPhrase        173.17   (4.7%)          174.69   (4.7%)      0.9%  ( -8% -  10%)
>  LowSloppyPhrase        180.91   (2.6%)          182.79   (2.7%)      1.0%  ( -4% -   6%)
>        LowPhrase        374.64   (5.5%)          379.11   (5.8%)      1.2%  ( -9% -  13%)
>         HighTerm        253.14   (7.9%)          256.97  (11.4%)      1.5%  (-16% -  22%)
>       HighPhrase         19.52  (10.6%)           19.83  (11.0%)      1.6%  (-18% -  25%)
>  MedSloppyPhrase        141.90   (2.6%)          144.11   (2.5%)      1.6%  ( -3% -   6%)
> HighSloppyPhrase         25.26   (4.8%)           25.97   (5.0%)      2.8%  ( -6% -  13%)
> {noformat}
> Only queries which are very terms-dictionary-intensive got a performance hit
> (Fuzzy1, Fuzzy2, Respell, Wildcard); other queries, including Prefix3, behaved
> (surprisingly) well.
> Do you think this is worth exploring?
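
For readers who want to see the two ideas from the comment in code, here is a minimal Java sketch of (a) only attempting compression of suffix bytes when the average suffix length is > 6, and (b) run-length encoding a sequence of per-term stats. It is an illustration under assumptions, not the code from the pull request: the class `TermsDictSketch`, its method names, and the (value, runLength) pair layout are all hypothetical, and no Lucene API is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the implementation from the pull request.
public class TermsDictSketch {

    // Threshold quoted in the comment: LZ4 is only tried when the
    // average suffix length is strictly greater than 6.
    static final int MIN_AVG_SUFFIX_LENGTH = 6;

    /** Returns true if compressing this block's suffix bytes is worth attempting. */
    static boolean shouldTryCompression(int[] suffixLengths) {
        if (suffixLengths.length == 0) {
            return false;
        }
        long total = 0;
        for (int len : suffixLengths) {
            total += len;
        }
        // average > 6  <=>  total > 6 * count (avoids floating point)
        return total > (long) MIN_AVG_SUFFIX_LENGTH * suffixLengths.length;
    }

    /** Generic run-length encoding: one {value, runLength} pair per run (assumed layout). */
    static List<int[]> runLengthEncode(int[] values) {
        List<int[]> runs = new ArrayList<>();
        int i = 0;
        while (i < values.length) {
            int j = i + 1;
            while (j < values.length && values[j] == values[i]) {
                j++;
            }
            runs.add(new int[] {values[i], j - i});
            i = j;
        }
        return runs;
    }

    /** Inverse of runLengthEncode. */
    static int[] runLengthDecode(List<int[]> runs) {
        int length = 0;
        for (int[] run : runs) {
            length += run[1];
        }
        int[] values = new int[length];
        int pos = 0;
        for (int[] run : runs) {
            Arrays.fill(values, pos, pos + run[1], run[0]);
            pos += run[1];
        }
        return values;
    }

    public static void main(String[] args) {
        int[] shortSuffixes = {3, 4, 5, 6};
        int[] longSuffixes = {9, 12, 7, 8};
        System.out.println("try compression (short): " + shouldTryCompression(shortSuffixes));
        System.out.println("try compression (long): " + shouldTryCompression(longSuffixes));

        // Made-up stats where most terms share the same value, which is
        // the case where run-length encoding pays off.
        int[] stats = {1, 1, 1, 5, 1, 1};
        List<int[]> runs = runLengthEncode(stats);
        System.out.println("runs: " + runs.size());
        System.out.println("round trip ok: " + Arrays.equals(stats, runLengthDecode(runs)));
    }
}
```

The gate is a cheap check computed from lengths alone, so blocks with short suffixes skip the compression attempt entirely, which is the index-time saving the comment refers to.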