Thanks, but that's what I started with. It took even longer and threw this:

Approaching too many values for UnInvertedField faceting on field 'text' : bucket size=15560140
Approaching too many values for UnInvertedField faceting on field 'text' : bucket size=15619075
Exception during facet counts:org.apache.solr.common.SolrException: Too many values for UnInvertedField faceting on field text

On Thu, Apr 21, 2011 at 2:11 AM, Jonathan Rochkind <rochk...@jhu.edu> wrote:

> I think faceting is probably the best way to do that, indeed. It might be
> slow, but it's kind of set up for exactly that case. I can't imagine any
> other technique being faster -- there's stuff that has to be done to look
> up the info you want.
>
> BUT, I see your problem: don't use facet.method=enum. Use facet.method=fc.
> It works a LOT better for very high arity fields (lots and lots of unique
> values) like you have. I bet you'll see significant speed-up if you use
> facet.method=fc instead, hopefully fast enough to be workable.
>
> With facet.method=enum, I would have indeed predicted it would be horribly
> slow. Before Solr 1.4, when facet.method=fc became available, it was nearly
> impossible to facet on very high arity fields; facet.method=fc is the magic.
> I think facet.method=fc is even the default in Solr 1.4+, if you hadn't
> explicitly set it to enum instead!
>
> Jonathan
> ________________________________________
> From: Ofer Fort [ofer...@gmail.com]
> Sent: Wednesday, April 20, 2011 6:49 PM
> To: solr-user@lucene.apache.org
> Subject: Highest frequency terms for a subset of documents
>
> Hi,
> I am looking for the best way to find the terms with the highest frequency
> for a given subset of documents (terms in the text field).
> My first thought was to do a count facet search, where the query defines
> the subset of documents and the facet.field is the text field. This gives
> me the result, but it is very, very slow.
> These are my params:
> <str name="facet">true</str>
> <str name="facet.offset">0</str>
> <str name="facet.mincount">3</str>
> <str name="indent">on</str>
> <str name="facet.limit">500</str>
> <str name="facet.method">enum</str>
> <str name="wt">xml</str>
> <str name="rows">0</str>
> <str name="version">2.2</str>
> <str name="facet.sort">count</str>
> <str name="q">in_subset:1</str>
> <str name="facet.field">text</str>
> </lst>
>
> The index contains 7M documents; the subset is about 200K. A simple query
> for the subset takes around 100ms, but the facet search takes 40s.
>
> Am I doing something wrong?
>
> If facet search is not the correct approach, I thought about using
> something like org.apache.lucene.misc.HighFreqTerms, but I'm not sure how
> to do this in Solr. Should I implement a request handler that executes
> this kind of code?
>
> Thanks for any help
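
For reference, the result that the facet request above asks for can be sketched in a few lines of plain Python. This is only a toy illustration of the computation, not how Solr implements faceting: the function name `top_terms`, the whitespace tokenizer, and the sample documents are all hypothetical, and the `mincount` and `limit` arguments merely mirror facet.mincount and facet.limit. Like a facet count, it counts each term once per matching document.

```python
from collections import Counter


def top_terms(docs, in_subset, mincount=3, limit=500):
    """Return (term, doc_count) pairs for the subset, most frequent first."""
    counts = Counter()
    for doc in docs:
        if not in_subset(doc):
            continue
        # Count each term once per document, the way a facet count does.
        counts.update(set(doc["text"].lower().split()))
    # Sort by descending count, breaking ties alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(term, n) for term, n in ranked if n >= mincount][:limit]


# Hypothetical sample data standing in for the index.
docs = [
    {"in_subset": 1, "text": "solr facet search"},
    {"in_subset": 1, "text": "solr facet count"},
    {"in_subset": 1, "text": "solr index"},
    {"in_subset": 0, "text": "lucene index index"},
]

result = top_terms(docs, lambda d: d["in_subset"] == 1, mincount=2, limit=10)
print(result)
```

The cost is proportional to the number of terms in the subset's documents, which is roughly why a high number of unique values in the field makes this kind of request expensive.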