Re: Highest frequency terms for a subset of documents

Ofer Fort Wed, 20 Apr 2011 16:23:55 -0700

Thanks, but that's what I started with. It took even longer and
threw this:
Approaching too many values for UnInvertedField faceting on field 'text' :
bucket size=15560140
Approaching too many values for UnInvertedField faceting on field 'text' :
bucket size=15619075
Exception during facet counts:org.apache.solr.common.SolrException: Too many
values for UnInvertedField faceting on field text
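
For reference, the two request variants discussed in this thread differ only in facet.method. The following is a minimal sketch that builds both query strings from the parameters quoted below; the endpoint URL and the helper name facet_url are assumptions for illustration and do not come from the thread.

```python
from urllib.parse import urlencode

# Assumed endpoint: the thread does not give the host, port, or core name.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def facet_url(method):
    """Build the facet request from the thread with the given facet.method.

    facet_url is a hypothetical helper, not part of any Solr client library.
    """
    params = [
        ("q", "in_subset:1"),        # the query that selects the subset
        ("rows", 0),                 # no documents wanted, only facet counts
        ("facet", "true"),
        ("facet.field", "text"),
        ("facet.method", method),    # "enum" or "fc"
        ("facet.sort", "count"),
        ("facet.limit", 500),
        ("facet.offset", 0),
        ("facet.mincount", 3),
        ("wt", "xml"),
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


enum_url = facet_url("enum")
fc_url = facet_url("fc")
print(enum_url)
print(fc_url)
```

Keeping everything else identical makes it easier to compare timings and errors between the two methods on the same subset.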



On Thu, Apr 21, 2011 at 2:11 AM, Jonathan Rochkind <[email protected]> wrote:

> I think faceting is probably the best way to do that, indeed. It might be
> slow, but it's kind of set up for exactly that case, I can't imagine any
> other technique being faster -- there's stuff that has to be done to look up
> the info you want.
>
> BUT, I see your problem:  don't use facet.method=enum. Use facet.method=fc.
>  Works a LOT better for very high arity fields (lots and lots of unique
> values) like you have. I bet you'll see significant speed-up if you use
> facet.method=fc instead, hopefully fast enough to be workable.
>
> With facet.method=enum, I would indeed have predicted it would be horribly
> slow. Before Solr 1.4, when facet.method=fc became available, it was nearly
> impossible to facet on very high arity fields; facet.method=fc is the magic.
> I think facet.method=fc is even the default in Solr 1.4+, if you hadn't
> explicitly set it to enum instead!
>
> Jonathan
> ________________________________________
> From: Ofer Fort [[email protected]]
> Sent: Wednesday, April 20, 2011 6:49 PM
> To: [email protected]
> Subject: Highest frequency terms for a subset of documents
> Hi,
> I am looking for the best way to find the terms with the highest frequency
> for a given subset of documents (terms in the text field).
> My first thought was to do a count facet search, where the query defines
> the subset of documents and the facet.field is the text field. This gives
> me the result, but it is very slow.
> These are my params:
> <lst name="params">
>   <str name="facet">true</str>
>   <str name="facet.offset">0</str>
>   <str name="facet.mincount">3</str>
>   <str name="indent">on</str>
>   <str name="facet.limit">500</str>
>   <str name="facet.method">enum</str>
>   <str name="wt">xml</str>
>   <str name="rows">0</str>
>   <str name="version">2.2</str>
>   <str name="facet.sort">count</str>
>   <str name="q">in_subset:1</str>
>   <str name="facet.field">text</str>
> </lst>
>
> The index contains 7M documents, the subset is about 200K. A simple query
> for the subset takes around 100ms, but the facet search takes 40s.
>
> Am I doing something wrong?
>
> If facet search is not the correct approach, I thought about using
> something like org.apache.lucene.misc.HighFreqTerms, but I'm not sure how
> to do this in Solr. Should I implement a request handler that executes
> this kind of code?
>
> Thanks for any help
>
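
To make the goal of the original question concrete, here is a minimal in-memory sketch of what a count-sorted facet over a subset computes: for each term, the number of subset documents that contain it, filtered by a minimum count and cut off at a limit. It is only an illustration, not how Solr or HighFreqTerms works internally. The function name top_terms, the document layout, and the whitespace tokenization are all assumptions; a real index would use the field's analyzer.

```python
from collections import Counter


def top_terms(docs, in_subset, limit=500, mincount=3):
    """Return (term, count) pairs for the subset, highest count first.

    Hypothetical helper: `docs` is a list of dicts with a "text" key and
    `in_subset` is a predicate selecting the documents of interest.
    """
    counts = Counter()
    for doc in docs:
        if not in_subset(doc):
            continue
        # Count each term once per document, as a facet count does.
        # Lowercased whitespace splitting stands in for the real analyzer.
        counts.update(set(doc["text"].lower().split()))
    # Sort by descending count, breaking ties alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(term, n) for term, n in ranked if n >= mincount][:limit]


docs = [
    {"in_subset": 1, "text": "solr facet solr"},
    {"in_subset": 1, "text": "facet lucene"},
    {"in_subset": 0, "text": "solr solr solr"},
    {"in_subset": 1, "text": "solr index"},
]
result = top_terms(docs, lambda d: d["in_subset"] == 1, limit=2, mincount=2)
print(result)
```

The cost of this loop grows with the number of term occurrences in the subset, which gives a rough sense of why a 200K-document subset over a tokenized text field is expensive to facet.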
