[ https://issues.apache.org/jira/browse/SOLR-14365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071619#comment-17071619 ]
Cao Manh Dat commented on SOLR-14365:
-------------------------------------

[~jbernste] [~shalin] please take a look at the patch.

> CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-14365
>                 URL: https://issues.apache.org/jira/browse/SOLR-14365
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public)
>    Affects Versions: 8.4.1
>            Reporter: Cao Manh Dat
>            Assignee: Cao Manh Dat
>            Priority: Major
>         Attachments: SOLR-14365.patch
>
>
> Since Collapsing is a PostFilter, documents that reach Collapsing must
> already match all filters and queries, so the number of documents for which
> Collapsing needs to collect/compute scores is a small fraction of the total
> number of documents in the index. So why do we always need to consume memory
> (for the int[] and float[] arrays) for all unique values of the collapsed
> field? If the number of unique values of the collapsed field found in the
> documents that match the queries and filters is 300, then we only need int[]
> and float[] arrays of size 300, not 1.2 million. However, we don't know in
> advance which values of the collapsed field will show up in the results, so
> we cannot simply use a smaller array.
> The easy fix for this problem is to use only as much memory as we need, by
> using IntIntMap and IntFloatMap, which hold primitives and are much more
> space efficient than the Java HashMap. These maps can be slower (10x or 20x)
> than plain int[] and float[] if the number of matched documents is large
> (almost all documents match the queries and other filters). But we believe
> that does not happen that frequently (how frequently do we run collapsing on
> the entire index?).
> For this issue I propose adding 2 methods for collapsing:
> * array : the current implementation
> * hash : the new approach, which will be the default method
> Later we can add another method, {{smart}}, which automatically picks a
> method based on a comparison between {{number of docs matched queries and
> filters}} and {{number of unique values of the field}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
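
To illustrate the array versus hash trade-off described in the issue, here is a minimal Java sketch. It is not the code from SOLR-14365.patch: the class names (`CollapseScoreSketch`, `ArrayScores`, `HashScores`), the hand-rolled open-addressing map, and the synthetic ordinals are all assumptions made so the example is self-contained. The patch itself refers to IntIntMap and IntFloatMap, which are not reproduced here.

```java
import java.util.Arrays;

public class CollapseScoreSketch {

    /** "array" method: one slot per unique value (ordinal) of the collapse field. */
    static final class ArrayScores {
        private final float[] scores;

        ArrayScores(int numUniqueValues) {
            scores = new float[numUniqueValues];
            Arrays.fill(scores, Float.NEGATIVE_INFINITY);
        }

        void collect(int ord, float score) {
            if (score > scores[ord]) scores[ord] = score;
        }

        float best(int ord) { return scores[ord]; }

        int allocatedSlots() { return scores.length; }
    }

    /** "hash" method: primitive open-addressing map sized by the ordinals actually seen. */
    static final class HashScores {
        private static final int EMPTY = -1; // assumption: ordinals are non-negative
        private int[] keys = new int[16];    // capacity is always a power of two
        private float[] vals = new float[16];
        private int size;

        HashScores() { Arrays.fill(keys, EMPTY); }

        void collect(int ord, float score) {
            if ((size + 1) * 2 > keys.length) grow(); // keep load factor at or below 0.5
            int i = slot(keys, ord);
            if (keys[i] == EMPTY) {
                keys[i] = ord;
                vals[i] = score;
                size++;
            } else if (score > vals[i]) {
                vals[i] = score;
            }
        }

        float best(int ord) {
            int i = slot(keys, ord);
            return keys[i] == EMPTY ? Float.NEGATIVE_INFINITY : vals[i];
        }

        int allocatedSlots() { return keys.length; }

        /** Linear probing: returns the slot holding ord, or the first empty slot for it. */
        private static int slot(int[] table, int ord) {
            int mask = table.length - 1;
            int h = ord * 0x9E3779B9;
            int i = (h ^ (h >>> 16)) & mask;
            while (table[i] != EMPTY && table[i] != ord) i = (i + 1) & mask;
            return i;
        }

        /** Doubles the table and re-inserts every stored ordinal. */
        private void grow() {
            int[] oldKeys = keys;
            float[] oldVals = vals;
            keys = new int[oldKeys.length * 2];
            vals = new float[keys.length];
            Arrays.fill(keys, EMPTY);
            for (int j = 0; j < oldKeys.length; j++) {
                if (oldKeys[j] != EMPTY) {
                    int i = slot(keys, oldKeys[j]);
                    keys[i] = oldKeys[j];
                    vals[i] = oldVals[j];
                }
            }
        }
    }

    public static void main(String[] args) {
        int uniqueValues = 1_200_000; // figure taken from the issue description
        ArrayScores array = new ArrayScores(uniqueValues);
        HashScores hash = new HashScores();
        for (int doc = 0; doc < 3_000; doc++) {
            int ord = (doc % 300) * 4_000; // synthetic: only 300 distinct ordinals match
            float score = doc * 0.5f;
            array.collect(ord, score);
            hash.collect(ord, score);
        }
        System.out.println("array slots: " + array.allocatedSlots());
        System.out.println("hash slots:  " + hash.allocatedSlots());
    }
}
```

Both structures return the same best score per ordinal; the difference is that the array is sized by the field's cardinality up front, while the map grows only with the ordinals that actually reach the collector. The probing on every `collect` call is where the slowdown mentioned in the issue would come from when most of the index matches.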