[ https://issues.apache.org/jira/browse/LUCENE-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528369#comment-17528369 ]
Kevin Risden commented on LUCENE-10534: --------------------------------------- I'm still getting used to flamegraphs so I'll explain my understanding and see if we are on the same page. I'm going to put up a PR with suggested changes as well in case that helps. Flamegraphs - x axis = time spent as a percentage of time being profiled, y axis = stack trace bottom being first call top being last call Looking only at the left most getValueForDoc highlight only (and it helps to make it bigger or download the original) !flamegraph_getValueForDoc.png|height=250,width=250! LongFieldSource#exists spends MOST of its time doing a LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its time doing two things primarily: * FilterNumericDocValues#longValue() * advance() This makes sense based on looking at the code (copied below to make it easier to see at once) https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72 {code:java} private long getValueForDoc(int doc) throws IOException { if (doc < lastDocID) { throw new IllegalArgumentException( "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc); } lastDocID = doc; int curDocID = arr.docID(); if (doc > curDocID) { curDocID = arr.advance(doc); } if (doc == curDocID) { return arr.longValue(); } else { return 0; } } {code} LongFieldSource#exists - doesn't care about the actual longValue. Just that there was a value found when iterating through the doc values. https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95 {code:java} @Override public boolean exists(int doc) throws IOException { getValueForDoc(doc); return arr.docID() == doc; } {code} So putting this all together for exists calling getValueForDoc, we spent ~50% of the time trying to get the long.Value when we don't need it in exists. We can save that 50% of time making exists not care about the actual value and just return if doc == curDocID basically. This 50% extra is exaggerated in MaxFloatFunction (and other places) since exists() is being called a bunch. Eventually the value will be needed from longVal(), but if we call exists() say 3 times for every longVal(), we are spending a lot of time computing the value when we only need to check for existence. I found the same pattern in DoubleFieldSource, EnumFieldSource, FloatFieldSource, IntFieldSource, LongFieldSource. I put together a change showing what this would look like: https://github.com/apache/lucene/pull/840 ---- I also fixed the Min/Max function PR to remove the duplicate exists checks and remove the special 0.0f casing. https://github.com/apache/lucene/pull/837 > MinFloatFunction / MaxFloatFunction exists check can be slow > ------------------------------------------------------------ > > Key: LUCENE-10534 > URL: https://issues.apache.org/jira/browse/LUCENE-10534 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Kevin Risden > Assignee: Kevin Risden > Priority: Minor > Attachments: flamegraph.png, flamegraph_getValueForDoc.png > > Time Spent: 20m > Remaining Estimate: 0h > > MinFloatFunction > (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java) > and MaxFloatFunction > (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java) > both check if values exist. This is needed since the underlying valuesource > returns 0.0f as either a valid value or as a value when the document doesn't > have a value. > Even though this is changed to anyExists and short circuits in the case a > value is found in any document, the worst case is that there is no value > found and requires checking all the way through to the raw data. This is only > needed when 0.0f is returned and need to determine if it is a valid value or > the not found case. -- This message was sent by Atlassian Jira (v8.20.7#820007) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org