Do you have a sense of what your typical queries would look like? I mean, maybe you wouldn't actually need to fetch more than a tiny fraction of those million documents. Do you only need to determine the top 10 or 20 or 50 unique field value row sets, or do you need to determine ALL unique row sets? The latter would never be very performant even as a custom handler/collector since it would have to scan all rows.
Try a client-side solution that reads 100 (or 50 or 20 or 200) rows at a time, storing rows by the unique combination of field values, until you hit the threshold needed for number of unique row sets.

-- Jack Krupansky

On Tue, Jan 13, 2015 at 4:29 PM, tedsolr <tsm...@sciquest.com> wrote:

> I have a complicated problem to solve, and I don't know enough about
> lucene/solr to phrase the question properly. This is kind of a shot in the
> dark. My requirement is to return search results always in completely
> "collapsed" form, rolling up duplicates with a count. Duplicates are defined
> by whatever fields are requested. If the search requests fields A, B, C,
> then all matched documents that have identical values for those 3 fields are
> "dupes". The field list may change with every new search request. What I do
> know is the super set of all fields that may be part of the field list at
> index time.
>
> I know this can't be done with configuration alone. It doesn't seem
> performant to retrieve all 1M+ docs and post process in Java. A very smart
> person told me that a custom hit collector should be able to do the
> filtering for me. So, maybe I create a custom search handler that somehow
> exposes this custom hit collector that can use FieldCache or DocValues to
> examine all the matches and filter the results in the way I've described
> above.
>
> So assuming this is a viable solution path, can anyone suggest some helpful
> posts, code fragments, books for me to review? I admit to being out of my
> depth, but this requirement isn't going away. I'm grasping for straws right
> now.
>
> thanks
> (using Solr 4.9)
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Engage-custom-hit-collector-for-special-search-processing-tp4179348.html
> Sent from the Solr - User mailing list archive at Nabble.com.
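
For what it's worth, here is a rough sketch of the client-side paging idea in plain Java. It is only an illustration under some assumptions: CollapsePager and the fetchPage callback are hypothetical names, not Solr or SolrJ APIs. In practice fetchPage would wrap whatever paged query you run against Solr (a start offset plus a rows count) and hand back each document as a field-to-value map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical helper for illustration; not part of Solr or SolrJ.
class CollapsePager {

    /**
     * Pages through results and groups rows by their values for the requested fields.
     *
     * @param fetchPage assumed callback: (start, rows) -> one page of documents,
     *                  returning an empty list once the results are exhausted
     * @param fields    fields whose combined values define a duplicate
     * @param pageSize  number of rows requested per page
     * @param maxUnique stop paging once at least this many unique combinations are seen
     * @return each unique value combination mapped to the number of rows scanned so far
     */
    static Map<List<Object>, Integer> collapse(
            BiFunction<Integer, Integer, List<Map<String, Object>>> fetchPage,
            List<String> fields, int pageSize, int maxUnique) {
        // LinkedHashMap keeps combinations in the order they were first seen.
        Map<List<Object>, Integer> counts = new LinkedHashMap<>();
        int start = 0;
        // The threshold is checked between pages, so the final page is always
        // processed in full and the result may hold more than maxUnique keys.
        while (counts.size() < maxUnique) {
            List<Map<String, Object>> page = fetchPage.apply(start, pageSize);
            if (page.isEmpty()) {
                break; // no more matching documents
            }
            for (Map<String, Object> doc : page) {
                // The key is the ordered list of values for the requested fields;
                // a missing field contributes null.
                List<Object> key = new ArrayList<>(fields.size());
                for (String field : fields) {
                    key.add(doc.get(field));
                }
                counts.merge(key, 1, Integer::sum);
            }
            start += page.size();
        }
        return counts;
    }
}
```

One caveat: if paging stops early because the threshold was reached, the counts only cover the rows scanned up to that point, not every matching document. Getting exact counts for all combinations still means reading every row, which is the expensive case mentioned above.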