Re: [PR] Reduce duplication in taxonomy facets; always do counts [lucene]

2024-01-13 Thread via GitHub


stefanvodita commented on PR #12966:
URL: https://github.com/apache/lucene/pull/12966#issuecomment-1890420916

   I found a fun HeisenBug in one of the tests. When we iterate cursors from 
`IntFloatHashMap`, the order is not deterministic. Float summation is not 
commutative, so the result we get by aggregating the floats in the map can be 
different depending on the order in which we perform the iteration. For a 
particular seed, running the test was producing an ordering that was not 
favorable, while running the debugger produced an ordering that was. The test 
is fixed in the latest commit and I've opened an [issue to do Kahan 
summation](https://github.com/apache/lucene/issues/13011) over the floats 
instead, to reduce the error we're seeing.
   
   For those who want to follow along, here are the exact numbers we are adding 
in the test in two orderings which produce different results:
   ```
   class FloatSunIsNotCommutative {
   public static void main(String[] args) {
   float x = 177182.61f;
   float y = 238089.27f;
   float z = 255214.66f;
   float acc;
   
   acc = 0;
   acc += x;
   acc += y;
   acc += z;
   System.out.println(acc);
   
   acc = 0;
   acc += z;
   acc += y;
   acc += x;
   System.out.println(acc);
   }
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Reduce duplication in taxonomy facets; always do counts [lucene]

2024-01-13 Thread via GitHub


stefanvodita commented on PR #12966:
URL: https://github.com/apache/lucene/pull/12966#issuecomment-1890420978

   I've also run the benchmarks (`python3 src/python/localrun.py -source 
wikimediumall`). There is measurable regression in the 
`BrowseRandomLabelTaxoFacets` task, but not in other taxonomy tasks. The 
benchmarker also reports improvements in `PKLookup`, `Wildcard`, `Respell`, 
`Fuzzy2`, `Fuzzy1`.
   
   The regression in the taxo task is explained in the profiler. Boxing is not 
cheap:
   `11.24%10402Mjava.lang.Integer#valueOf()`
   
   @mikecan (thank you for the review!) - how should I interpret the other 
tasks which show a significant change? Are they just noisy?
   
   ```
   TaskQPS baseline  StdDevQPS 
my_modified_version  StdDevPct diff p-value
BrowseRandomLabelTaxoFacets3.75  (1.8%)3.53  
(1.6%)   -6.0% (  -9% -   -2%) 0.000
 OrHighMedDayTaxoFacets1.35  (7.4%)1.31  
(9.2%)   -2.7% ( -17% -   15%) 0.308
 IntNRQ   21.64  (7.0%)   21.35  
(7.4%)   -1.3% ( -14% -   14%) 0.561
 AndHighLow  366.49 (11.2%)  362.21 
(10.3%)   -1.2% ( -20% -   22%) 0.731
   OrHighNotLow  271.40  (5.3%)  269.03  
(4.5%)   -0.9% ( -10% -9%) 0.573
LowTerm  604.77  (5.9%)  599.96  
(4.8%)   -0.8% ( -10% -   10%) 0.640
 TermDTSort  140.65  (2.3%)  139.58  
(1.4%)   -0.8% (  -4% -3%) 0.210
LowSpanNear5.00  (2.8%)4.96  
(4.1%)   -0.7% (  -7% -6%) 0.522
   HighSpanNear4.77  (3.0%)4.74  
(3.6%)   -0.7% (  -7% -6%) 0.522
MedSpanNear   11.24  (2.1%)   11.18  
(2.5%)   -0.6% (  -5% -4%) 0.432
  MedPhrase  242.61  (2.2%)  241.23  
(2.0%)   -0.6% (  -4% -3%) 0.386
 HighPhrase   83.17  (2.1%)   82.75  
(2.9%)   -0.5% (  -5% -4%) 0.538
  OrHighNotHigh  160.48  (4.5%)  159.81  
(3.5%)   -0.4% (  -8% -7%) 0.744
  HighTermDayOfYearSort  215.60  (2.2%)  214.81  
(2.0%)   -0.4% (  -4% -3%) 0.576
MedSloppyPhrase   14.07  (2.0%)   14.03  
(2.4%)   -0.3% (  -4% -4%) 0.655
  LowPhrase   21.15  (1.3%)   21.09  
(1.5%)   -0.3% (  -3% -2%) 0.508
   AndHighHighDayTaxoFacets   10.49  (1.2%)   10.46  
(1.6%)   -0.3% (  -3% -2%) 0.547
   HighSloppyPhrase   13.80  (3.0%)   13.77  
(3.1%)   -0.3% (  -6% -5%) 0.791
MedTerm  479.88  (5.1%)  478.82  
(4.8%)   -0.2% (  -9% -   10%) 0.887
   OrHighNotMed  329.08  (4.5%)  328.39  
(3.5%)   -0.2% (  -7% -8%) 0.870
   HighTerm  264.78  (5.3%)  264.27  
(5.2%)   -0.2% ( -10% -   10%) 0.908
  HighTermMonthSort 1930.74  (4.4%) 1928.03  
(5.2%)   -0.1% (  -9% -9%) 0.926
   OrNotHighMed  217.72  (2.9%)  217.51  
(2.2%)   -0.1% (  -5% -5%) 0.905
   MedTermDayTaxoFacets   16.72  (2.1%)   16.71  
(1.7%)   -0.1% (  -3% -3%) 0.892
  BrowseDayOfYearSSDVFacets4.12  (2.7%)4.11  
(2.9%)   -0.1% (  -5% -5%) 0.931
   BrowseDateTaxoFacets4.68  (5.1%)4.67  
(4.6%)   -0.1% (  -9% -   10%) 0.970
  OrNotHighHigh  231.09  (4.5%)  230.99  
(3.5%)   -0.0% (  -7% -8%) 0.975
AndHighMedDayTaxoFacets   16.88  (1.1%)   16.88  
(1.5%)   -0.0% (  -2% -2%) 0.963
  BrowseDayOfYearTaxoFacets4.76  (5.2%)4.76  
(4.6%)0.0% (  -9% -   10%) 1.000
   OrNotHighLow  464.54  (2.6%)  464.56  
(2.3%)0.0% (  -4% -5%) 0.995
   HighIntervalsOrdered1.81  (4.6%)1.81  
(5.0%)0.0% (  -9% -   10%) 0.990
   HighTermTitleBDVSort5.39  (4.8%)5.40  
(4.4%)0.1% (  -8% -9%) 0.968
  BrowseMonthSSDVFacets4.40  (2.6%)4.40  
(2.6%)0.1% (  -4% -5%) 0.873
MedIntervalsOrdered1.84  (5.5%)1.84  
(5.8%)0.2% ( -10% -   12%) 0.918
LowIntervalsOrdered   32.12  (5.4%)   32.18  
(5.6%)0.2% ( -10% -   11%) 0.913
  OrHighMed   67.77  (3.1%)   67.97  
(3.4%)0.3% (  -5% -6%) 0.779
Browse

Re: [I] Use Kahan summation for float aggregations to reduce errors [lucene]

2024-01-13 Thread via GitHub


mikemccand commented on issue #13011:
URL: https://github.com/apache/lucene/issues/13011#issuecomment-1890454889

   Neat -- I had never heard of Kahan summation.  Here is its [Wikipedia 
page](https://en.wikipedia.org/wiki/Kahan_summation_algorithm).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Reduce duplication in taxonomy facets; always do counts [lucene]

2024-01-13 Thread via GitHub


mikemccand commented on PR #12966:
URL: https://github.com/apache/lucene/pull/12966#issuecomment-1890455264

   > I found a fun HeisenBug in one of the tests.
   
   Oh the joys of floating point math.
   
   > For those who want to follow along, here are the exact numbers we are 
adding in the test in two orderings which produce different results:
   
   Thank you for diving deep here and making such a simple reproduction.
   
   > how should I interpret the other tasks which show a significant change? 
Are they just noisy?
   
   Good question -- it makes no sense that e.g. `Respell/Fuzzy1/2`  got faster 
with this change, though the benchy seems to think it is significant 
(`p=0.000`).  I'm not sure what to make of it!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Initial impl of MMapDirectory for Java 22 [lucene]

2024-01-13 Thread via GitHub


uschindler commented on PR #12706:
URL: https://github.com/apache/lucene/pull/12706#issuecomment-1890455325

   I also committed supprot for incubator SIMD vectorization in Java 22. 
According to Java's change logs there were no changes to API at all, so code 
runs as is.
   
   @ChrisHegarty just have a look, but to me it looks like the changes are the 
only ones needed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Reduce duplication in taxonomy facets; always do counts [lucene]

2024-01-13 Thread via GitHub


mikemccand commented on PR #12966:
URL: https://github.com/apache/lucene/pull/12966#issuecomment-1890455536

   > The regression in the taxo task is explained in the profiler. Boxing is 
not cheap:
   > `11.24% 10402M java.lang.Integer#valueOf()`
   
   Hmm this is sort of spooky -- should we aim to keep the specialization 
somehow (avoid the boxing)?  Is there a middle ground where we can avoid the 
boxing but still remove much of / some of this duplicated code?  Java is 
annoying sometimes :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Split taxonomy arrays across chunks [lucene]

2024-01-13 Thread via GitHub


stefanvodita commented on code in PR #12995:
URL: https://github.com/apache/lucene/pull/12995#discussion_r1451664416


##
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/TaxonomyIndexArrays.java:
##
@@ -68,25 +90,49 @@ public TaxonomyIndexArrays(IndexReader reader, 
TaxonomyIndexArrays copyFrom) thr
 // it may be caused if e.g. the taxonomy segments were merged, and so an 
updated
 // NRT reader was obtained, even though nothing was changed. this is not 
very likely
 // to happen.
-int[] copyParents = copyFrom.parents();
-this.parents = new int[reader.maxDoc()];
-System.arraycopy(copyParents, 0, parents, 0, copyParents.length);
-initParents(reader, copyParents.length);
-
+int[][] parentArray = allocateChunkedArray(reader.maxDoc());
+copyChunkedArray(copyFrom.parents.values, parentArray);
+initParents(parentArray, reader, copyFrom.parents.length());
+parents = new ChunkedArray(parentArray);
 if (copyFrom.initializedChildren) {
   initChildrenSiblings(copyFrom);
 }
   }
 
+  private static int[][] allocateChunkedArray(int size) {
+int chunkCount = size / CHUNK_SIZE + 1;

Review Comment:
   I missed this thread that was still pending. @msfroh - what do you think, 
can we simplify this method?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Reduce duplication in taxonomy facets; always do counts [lucene]

2024-01-13 Thread via GitHub


stefanvodita commented on PR #12966:
URL: https://github.com/apache/lucene/pull/12966#issuecomment-1890862937

   What I've done is I've only taken advantage of the boxing for genericity 
when collecting results `getTop...` and not use it while performing the 
aggregations themselves. Most of the taxonomy tasks are not showing a 
significant performance change. I wonder if the one that has slowed down spends 
more time collecting the aggregation values than calculating them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org