[ https://issues.apache.org/jira/browse/LUCENE-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adrien Grand resolved LUCENE-4702.
----------------------------------
    Fix Version/s: 8.5
       Resolution: Fixed

I pushed a last change that recovers some of the slowdown by being less aggressive on blocks whose prefix length is 2 or less, which fuzzy queries with an edit distance of 2 always visit in full. This only increases the size of the tim files from 969MB to 973MB. The nightly benchmarks should get another (small) bump when moving to JDK13, which better optimizes the decompression logic.

> Terms dictionary compression
> ----------------------------
>
>                 Key: LUCENE-4702
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4702
>             Project: Lucene - Core
>          Issue Type: Wish
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Trivial
>             Fix For: 8.5
>
>         Attachments: LUCENE-4702.patch, LUCENE-4702.patch
>
>          Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> I've done a quick test with the block tree terms dictionary by replacing a
> call to IndexOutput.writeBytes to write suffix bytes with a call to
> LZ4.compressHC to test the performance hit. Interestingly, search performance
> was very good (see comparison table below) and the tim files were 14% smaller
> (from 150432 bytes overall to 129516).
> {noformat}
> Task              QPS baseline   StdDev  QPS compressed   StdDev  Pct diff
> Fuzzy1                  111.50   (2.0%)           78.78   (1.5%)  -29.4% ( -32% - -26%)
> Fuzzy2                   36.99   (2.7%)           28.59   (1.5%)  -22.7% ( -26% - -18%)
> Respell                 122.86   (2.1%)          103.89   (1.7%)  -15.4% ( -18% - -11%)
> Wildcard                100.58   (4.3%)           94.42   (3.2%)   -6.1% ( -13% - 1%)
> Prefix3                 124.90   (5.7%)          122.67   (4.7%)   -1.8% ( -11% - 9%)
> OrHighLow               169.87   (6.8%)          167.77   (8.0%)   -1.2% ( -15% - 14%)
> LowTerm                1949.85   (4.5%)         1929.02   (3.4%)   -1.1% ( -8% - 7%)
> AndHighLow             2011.95   (3.5%)         1991.85   (3.3%)   -1.0% ( -7% - 5%)
> OrHighHigh              155.63   (6.7%)          154.12   (7.9%)   -1.0% ( -14% - 14%)
> AndHighHigh             341.82   (1.2%)          339.49   (1.7%)   -0.7% ( -3% - 2%)
> OrHighMed               217.55   (6.3%)          216.16   (7.1%)   -0.6% ( -13% - 13%)
> IntNRQ                   53.10  (10.9%)           52.90   (8.6%)   -0.4% ( -17% - 21%)
> MedTerm                 998.11   (3.8%)          994.82   (5.6%)   -0.3% ( -9% - 9%)
> MedSpanNear              60.50   (3.7%)           60.36   (4.8%)   -0.2% ( -8% - 8%)
> HighSpanNear             19.74   (4.5%)           19.72   (5.1%)   -0.1% ( -9% - 9%)
> LowSpanNear             101.93   (3.2%)          101.82   (4.4%)   -0.1% ( -7% - 7%)
> AndHighMed              366.18   (1.7%)          366.93   (1.7%)    0.2% ( -3% - 3%)
> PKLookup                237.28   (4.0%)          237.96   (4.2%)    0.3% ( -7% - 8%)
> MedPhrase               173.17   (4.7%)          174.69   (4.7%)    0.9% ( -8% - 10%)
> LowSloppyPhrase         180.91   (2.6%)          182.79   (2.7%)    1.0% ( -4% - 6%)
> LowPhrase               374.64   (5.5%)          379.11   (5.8%)    1.2% ( -9% - 13%)
> HighTerm                253.14   (7.9%)          256.97  (11.4%)    1.5% ( -16% - 22%)
> HighPhrase               19.52  (10.6%)           19.83  (11.0%)    1.6% ( -18% - 25%)
> MedSloppyPhrase         141.90   (2.6%)          144.11   (2.5%)    1.6% ( -3% - 6%)
> HighSloppyPhrase         25.26   (4.8%)           25.97   (5.0%)    2.8% ( -6% - 13%)
> {noformat}
> Only queries which are very terms-dictionary-intensive got a performance hit
> (Fuzzy, Fuzzy2, Respell, Wildcard), other queries including Prefix3 behaved
> (surprisingly) well.
> Do you think of it as something worth exploring?
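
As a rough illustration of the idea in the quoted description (compress the suffix bytes of a block instead of writing them raw), here is a minimal Python sketch. It is not Lucene code: `zlib` stands in for LZ4 because LZ4 is not in the Python standard library, the helper names `encode_suffixes` and `decode_suffixes` are hypothetical, and skipping blocks whose prefix length is 2 or less is a simplified reading of "less aggressive", not the committed behavior. A real block would also record per-term suffix lengths, which are omitted here.

```python
import zlib


def encode_suffixes(terms, prefix_len, min_prefix_len=3):
    """Concatenate the suffix bytes of a block of terms and maybe compress them.

    Returns (compressed_flag, payload). Hypothetical helper, not Lucene's API.
    """
    # All terms in a block share their first `prefix_len` bytes; only the
    # remaining suffix bytes are written.
    suffixes = b"".join(term[prefix_len:] for term in terms)

    # Assumed policy: leave short-prefix blocks uncompressed, since the issue
    # notes that fuzzy queries with an edit distance of 2 visit all of them.
    if prefix_len < min_prefix_len:
        return False, suffixes

    # zlib is a stand-in for LZ4 here (assumption made for portability).
    compressed = zlib.compress(suffixes, 9)

    # Keep the compressed form only when it actually saves space.
    if len(compressed) < len(suffixes):
        return True, compressed
    return False, suffixes


def decode_suffixes(compressed_flag, payload):
    """Invert encode_suffixes: return the raw concatenated suffix bytes."""
    return zlib.decompress(payload) if compressed_flag else payload


if __name__ == "__main__":
    # Made-up block of terms sharing the 4-byte prefix b"comp".
    block = [b"compressibility_%d" % (i % 4) for i in range(64)]
    flag, payload = encode_suffixes(block, prefix_len=4)
    raw_size = sum(len(term) - 4 for term in block)
    print("compressed:", flag, "raw bytes:", raw_size, "stored bytes:", len(payload))
```

The fallback to raw bytes reflects the trade-off discussed in this issue: compression is only worth its decoding cost on blocks where it saves space and that are not read by nearly every terms-dictionary-intensive query.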

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org