[ https://issues.apache.org/jira/browse/LUCENE-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lu Xugang updated LUCENE-10281:
-------------------------------
    Description: 
In the constructor StringValueFacetCounts(StringDocValuesReaderState state, FacetsCollector facetsCollector), if a facetsCollector is provided, the condition *(totalHits < totalDocs / 10)* is used to decide whether to use IntIntHashMap (the sparse representation) to store term ords and counts.

But a document counted in totalHits does not necessarily contain an SSDV value, and the same is true of totalDocs. So the right calculation should be *(totalHits has SSDV) / (totalDocs has SSDV)*. *(totalDocs has SSDV)* is easy to get via SortedSetDocValues#getValueCount(). *totalHits has SSDV* is hard to get, because we can only read the index by the docIds provided by FacetsCollector, and computing it that way is slow and redundant.

Solution:
If we don't want to break the old logic (use denseCounts when cardinality < 1024, use IntIntHashMap under the 10% threshold, and use denseCounts in all remaining cases), then we could still use denseCounts if cardinality < 1024, and otherwise use IntIntHashMap. Once 10% of the unique terms have been collected, switch to denseCounts.

  was:
Description:
In the constructor StringValueFacetCounts(StringDocValuesReaderState state, FacetsCollector facetsCollector), if a facetsCollector is provided, the condition *totalHits < totalDocs / 10* is used to decide whether to use IntIntHashMap (the sparse representation) to store term ords and counts.

But a document counted in totalHits does not necessarily contain an SSDV value, and the same is true of totalDocs. So the right calculation should be *(totalHits has SSDV) / (totalDocs has SSDV)*. *totalDocs has SSDV* is easy to get via SortedSetDocValues#getValueCount(). *totalHits has SSDV* is hard to get, because we can only read the index by the docIds provided in FacetsCollector, and computing it that way is slow and redundant.

Solution:
If we don't want to break the old logic (use denseCounts when cardinality < 1024, use IntIntHashMap under the 10% threshold, and use denseCounts in all remaining cases), then we could still use denseCounts while cardinality < 1024, and otherwise we use IntIntHashMap. Once 10% of the unique terms have been collected, switch to denseCounts.


> Error condition used to judge whether hits are sparse in StringValueFacetCounts
> -------------------------------------------------------------------------------
>
>                 Key: LUCENE-10281
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10281
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/facet
>    Affects Versions: 8.11
>            Reporter: Lu Xugang
>            Priority: Major
>
> Description:
> In the constructor StringValueFacetCounts(StringDocValuesReaderState
> state, FacetsCollector facetsCollector), if a facetsCollector is provided,
> the condition *(totalHits < totalDocs / 10)* is used to decide whether to
> use IntIntHashMap (the sparse representation) to store term ords and counts.
> But a document counted in totalHits does not necessarily contain an SSDV
> value, and the same is true of totalDocs. So the right calculation should be
> *(totalHits has SSDV) / (totalDocs has SSDV)*. *(totalDocs has SSDV)* is
> easy to get via SortedSetDocValues#getValueCount(). *totalHits has SSDV* is
> hard to get, because we can only read the index by the docIds provided by
> FacetsCollector, and computing it that way is slow and redundant.
> Solution:
> If we don't want to break the old logic (use denseCounts when
> cardinality < 1024, use IntIntHashMap under the 10% threshold, and use
> denseCounts in all remaining cases), then we could still use denseCounts
> if cardinality < 1024, and otherwise use IntIntHashMap. Once 10% of the
> unique terms have been collected, switch to denseCounts.
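
The proposed solution could be sketched roughly as follows. This is only an illustration of the idea, not the actual StringValueFacetCounts code: the class AdaptiveOrdCounts and its methods are hypothetical names, and java.util.HashMap stands in for Lucene's IntIntHashMap so that the sketch compiles without Lucene on the classpath. It starts dense when cardinality < 1024, otherwise starts sparse, and switches to a dense array once the number of unique ords collected reaches 10% of the cardinality.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the proposed counting strategy; not Lucene code. */
final class AdaptiveOrdCounts {
  // Threshold named in the issue: below this, always count densely.
  static final int DENSE_CARDINALITY_LIMIT = 1024;

  private final int cardinality;
  private int[] denseCounts;                   // non-null once counting is dense
  private Map<Integer, Integer> sparseCounts;  // assumption: stand-in for IntIntHashMap

  AdaptiveOrdCounts(int cardinality) {
    this.cardinality = cardinality;
    if (cardinality < DENSE_CARDINALITY_LIMIT) {
      denseCounts = new int[cardinality];
    } else {
      sparseCounts = new HashMap<>();
    }
  }

  /** Records one occurrence of the given term ord. */
  void increment(int ord) {
    if (denseCounts != null) {
      denseCounts[ord]++;
      return;
    }
    sparseCounts.merge(ord, 1, Integer::sum);
    // Once 10% of the unique terms have been seen, copy the counts
    // into a dense array and drop the map.
    if (sparseCounts.size() * 10L >= cardinality) {
      int[] dense = new int[cardinality];
      for (Map.Entry<Integer, Integer> e : sparseCounts.entrySet()) {
        dense[e.getKey()] = e.getValue();
      }
      denseCounts = dense;
      sparseCounts = null;
    }
  }

  /** Returns the count recorded so far for the given term ord. */
  int count(int ord) {
    return denseCounts != null ? denseCounts[ord] : sparseCounts.getOrDefault(ord, 0);
  }

  boolean isDense() {
    return denseCounts != null;
  }

  public static void main(String[] args) {
    AdaptiveOrdCounts counts = new AdaptiveOrdCounts(2000);
    for (int ord = 0; ord < 300; ord++) {
      counts.increment(ord);
    }
    System.out.println("dense after 300 unique ords: " + counts.isDense());
  }
}
```

With this approach the decision depends only on the ords actually collected, so it does not need totalHits or totalDocs at all. The cost is one copy from the map to the array at the moment of the switch.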