Re: [PR] Reduce duplication in taxonomy facets; always do counts [lucene]
stefanvodita commented on PR #12966: URL: https://github.com/apache/lucene/pull/12966#issuecomment-1890420916 I found a fun HeisenBug in one of the tests. When we iterate cursors from `IntFloatHashMap`, the order is not deterministic. Float summation is not commutative, so the result we get by aggregating the floats in the map can be different depending on the order in which we perform the iteration. For a particular seed, running the test was producing an ordering that was not favorable, while running the debugger produced an ordering that was. The test is fixed in the latest commit and I've opened an [issue to do Kahan summation](https://github.com/apache/lucene/issues/13011) over the floats instead, to reduce the error we're seeing. For those who want to follow along, here are the exact numbers we are adding in the test in two orderings which produce different results: ``` class FloatSunIsNotCommutative { public static void main(String[] args) { float x = 177182.61f; float y = 238089.27f; float z = 255214.66f; float acc; acc = 0; acc += x; acc += y; acc += z; System.out.println(acc); acc = 0; acc += z; acc += y; acc += x; System.out.println(acc); } } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Reduce duplication in taxonomy facets; always do counts [lucene]
stefanvodita commented on PR #12966: URL: https://github.com/apache/lucene/pull/12966#issuecomment-1890420978 I've also run the benchmarks (`python3 src/python/localrun.py -source wikimediumall`). There is measurable regression in the `BrowseRandomLabelTaxoFacets` task, but not in other taxonomy tasks. The benchmarker also reports improvements in `PKLookup`, `Wildcard`, `Respell`, `Fuzzy2`, `Fuzzy1`. The regression in the taxo task is explained in the profiler. Boxing is not cheap: `11.24%10402Mjava.lang.Integer#valueOf()` @mikecan (thank you for the review!) - how should I interpret the other tasks which show a significant change? Are they just noisy? ``` TaskQPS baseline StdDevQPS my_modified_version StdDevPct diff p-value BrowseRandomLabelTaxoFacets3.75 (1.8%)3.53 (1.6%) -6.0% ( -9% - -2%) 0.000 OrHighMedDayTaxoFacets1.35 (7.4%)1.31 (9.2%) -2.7% ( -17% - 15%) 0.308 IntNRQ 21.64 (7.0%) 21.35 (7.4%) -1.3% ( -14% - 14%) 0.561 AndHighLow 366.49 (11.2%) 362.21 (10.3%) -1.2% ( -20% - 22%) 0.731 OrHighNotLow 271.40 (5.3%) 269.03 (4.5%) -0.9% ( -10% -9%) 0.573 LowTerm 604.77 (5.9%) 599.96 (4.8%) -0.8% ( -10% - 10%) 0.640 TermDTSort 140.65 (2.3%) 139.58 (1.4%) -0.8% ( -4% -3%) 0.210 LowSpanNear5.00 (2.8%)4.96 (4.1%) -0.7% ( -7% -6%) 0.522 HighSpanNear4.77 (3.0%)4.74 (3.6%) -0.7% ( -7% -6%) 0.522 MedSpanNear 11.24 (2.1%) 11.18 (2.5%) -0.6% ( -5% -4%) 0.432 MedPhrase 242.61 (2.2%) 241.23 (2.0%) -0.6% ( -4% -3%) 0.386 HighPhrase 83.17 (2.1%) 82.75 (2.9%) -0.5% ( -5% -4%) 0.538 OrHighNotHigh 160.48 (4.5%) 159.81 (3.5%) -0.4% ( -8% -7%) 0.744 HighTermDayOfYearSort 215.60 (2.2%) 214.81 (2.0%) -0.4% ( -4% -3%) 0.576 MedSloppyPhrase 14.07 (2.0%) 14.03 (2.4%) -0.3% ( -4% -4%) 0.655 LowPhrase 21.15 (1.3%) 21.09 (1.5%) -0.3% ( -3% -2%) 0.508 AndHighHighDayTaxoFacets 10.49 (1.2%) 10.46 (1.6%) -0.3% ( -3% -2%) 0.547 HighSloppyPhrase 13.80 (3.0%) 13.77 (3.1%) -0.3% ( -6% -5%) 0.791 MedTerm 479.88 (5.1%) 478.82 (4.8%) -0.2% ( -9% - 10%) 0.887 OrHighNotMed 329.08 (4.5%) 328.39 (3.5%) -0.2% ( -7% -8%) 0.870 HighTerm 264.78 (5.3%) 264.27 (5.2%) -0.2% ( -10% - 10%) 0.908 HighTermMonthSort 1930.74 (4.4%) 1928.03 (5.2%) -0.1% ( -9% -9%) 0.926 OrNotHighMed 217.72 (2.9%) 217.51 (2.2%) -0.1% ( -5% -5%) 0.905 MedTermDayTaxoFacets 16.72 (2.1%) 16.71 (1.7%) -0.1% ( -3% -3%) 0.892 BrowseDayOfYearSSDVFacets4.12 (2.7%)4.11 (2.9%) -0.1% ( -5% -5%) 0.931 BrowseDateTaxoFacets4.68 (5.1%)4.67 (4.6%) -0.1% ( -9% - 10%) 0.970 OrNotHighHigh 231.09 (4.5%) 230.99 (3.5%) -0.0% ( -7% -8%) 0.975 AndHighMedDayTaxoFacets 16.88 (1.1%) 16.88 (1.5%) -0.0% ( -2% -2%) 0.963 BrowseDayOfYearTaxoFacets4.76 (5.2%)4.76 (4.6%)0.0% ( -9% - 10%) 1.000 OrNotHighLow 464.54 (2.6%) 464.56 (2.3%)0.0% ( -4% -5%) 0.995 HighIntervalsOrdered1.81 (4.6%)1.81 (5.0%)0.0% ( -9% - 10%) 0.990 HighTermTitleBDVSort5.39 (4.8%)5.40 (4.4%)0.1% ( -8% -9%) 0.968 BrowseMonthSSDVFacets4.40 (2.6%)4.40 (2.6%)0.1% ( -4% -5%) 0.873 MedIntervalsOrdered1.84 (5.5%)1.84 (5.8%)0.2% ( -10% - 12%) 0.918 LowIntervalsOrdered 32.12 (5.4%) 32.18 (5.6%)0.2% ( -10% - 11%) 0.913 OrHighMed 67.77 (3.1%) 67.97 (3.4%)0.3% ( -5% -6%) 0.779 Browse
Re: [I] Use Kahan summation for float aggregations to reduce errors [lucene]
mikemccand commented on issue #13011: URL: https://github.com/apache/lucene/issues/13011#issuecomment-1890454889 Neat -- I had never heard of Kahan summation. Here is its [Wikipedia page](https://en.wikipedia.org/wiki/Kahan_summation_algorithm). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Reduce duplication in taxonomy facets; always do counts [lucene]
mikemccand commented on PR #12966: URL: https://github.com/apache/lucene/pull/12966#issuecomment-1890455264 > I found a fun HeisenBug in one of the tests. Oh the joys of floating point math. > For those who want to follow along, here are the exact numbers we are adding in the test in two orderings which produce different results: Thank you for diving deep here and making such a simple reproduction. > how should I interpret the other tasks which show a significant change? Are they just noisy? Good question -- it makes no sense that e.g. `Respell/Fuzzy1/2` got faster with this change, though the benchy seems to think it is significant (`p=0.000`). I'm not sure what to make of it! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Initial impl of MMapDirectory for Java 22 [lucene]
uschindler commented on PR #12706: URL: https://github.com/apache/lucene/pull/12706#issuecomment-1890455325 I also committed supprot for incubator SIMD vectorization in Java 22. According to Java's change logs there were no changes to API at all, so code runs as is. @ChrisHegarty just have a look, but to me it looks like the changes are the only ones needed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Reduce duplication in taxonomy facets; always do counts [lucene]
mikemccand commented on PR #12966: URL: https://github.com/apache/lucene/pull/12966#issuecomment-1890455536 > The regression in the taxo task is explained in the profiler. Boxing is not cheap: > `11.24% 10402M java.lang.Integer#valueOf()` Hmm this is sort of spooky -- should we aim to keep the specialization somehow (avoid the boxing)? Is there a middle ground where we can avoid the boxing but still remove much of / some of this duplicated code? Java is annoying sometimes :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Split taxonomy arrays across chunks [lucene]
stefanvodita commented on code in PR #12995: URL: https://github.com/apache/lucene/pull/12995#discussion_r1451664416 ## lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java: ## @@ -68,25 +90,49 @@ public TaxonomyIndexArrays(IndexReader reader, TaxonomyIndexArrays copyFrom) thr // it may be caused if e.g. the taxonomy segments were merged, and so an updated // NRT reader was obtained, even though nothing was changed. this is not very likely // to happen. -int[] copyParents = copyFrom.parents(); -this.parents = new int[reader.maxDoc()]; -System.arraycopy(copyParents, 0, parents, 0, copyParents.length); -initParents(reader, copyParents.length); - +int[][] parentArray = allocateChunkedArray(reader.maxDoc()); +copyChunkedArray(copyFrom.parents.values, parentArray); +initParents(parentArray, reader, copyFrom.parents.length()); +parents = new ChunkedArray(parentArray); if (copyFrom.initializedChildren) { initChildrenSiblings(copyFrom); } } + private static int[][] allocateChunkedArray(int size) { +int chunkCount = size / CHUNK_SIZE + 1; Review Comment: I missed this thread that was still pending. @msfroh - what do you think, can we simplify this method? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Reduce duplication in taxonomy facets; always do counts [lucene]
stefanvodita commented on PR #12966: URL: https://github.com/apache/lucene/pull/12966#issuecomment-1890862937 What I've done is I've only taken advantage of the boxing for genericity when collecting results `getTop...` and not use it while performing the aggregations themselves. Most of the taxonomy tasks are not showing a significant performance change. I wonder if the one that has slowed down spends more time collecting the aggregation values than calculating them. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org