: thanks, but that's what i started with; it took an even longer time and
: threw this:
: Approaching too many values for UnInvertedField faceting on field 'text' :
: bucket size=15560140
: Approaching too many values for UnInvertedField faceting on field 'text' :
: bucket size=15619075
: Exception during facet counts:org.apache.solr.common.SolrException: Too many
: values for UnInvertedField faceting on field text

right ... facet.method=fc is a good default, but cases like full text 
faceting can cause it to seriously blow up the memory ... i didn't even 
realize it was possible to get it to fail this way, i would have just 
expected an OutOfMemoryError.

facet.method=enum is probably your best bet in this situation precisely 
because it does a linear scan over the terms ... it's slower because it's 
safer.

the one speed-up you might be able to get is to ensure you don't use the 
filterCache -- that way you don't waste time constantly caching/overwriting 
DocSets.

and FWIW...

: > If facet search is not the correct approach, i thought about using
: > something
: > like org.apache.lucene.misc.HighFreqTerms, but i'm not sure how to do this
: > in solr. Should i implement a request handler that executes this kind of

HighFreqTerms just looks at the raw docFreq for the terms, nearly 
identical to the TermsComponent -- there is no way to deal with your 
"subset of documents" requirements using an approach like that.

If the number of subsets you have to deal with is fixed, finite, and 
non-overlapping, using distinct cores for each subset (which you can 
aggregate using distributed search when you don't want this type of query) 
can also be a wise choice in many situations.

(ie: if you have a "books" core and a "movies" core you can search both 
using distributed search, or hit the terms component on just one of them 
to get the top terms for that core)
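
concretely, assuming hypothetical host and core names, a distributed 
search across both cores might look like...

  http://host1:8983/solr/books/select?q=solr
      &shards=host1:8983/solr/books,host2:8983/solr/movies

...while the top terms for just the "books" subset would come from hitting 
that core's terms handler directly...

  http://host1:8983/solr/books/terms?terms.fl=text&terms.limit=10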

-Hoss
