Re: Fast faceting over large number of distinct terms

Brendan Grainger Wed, 22 May 2013 19:32:51 -0700

Hi David,

Out of interest, what are you trying to accomplish by faceting over the
story_text field? Is it generally the case that the story_text field will
contain values that are repeated or categorize your documents somehow?
From your description: "story_text is used to store free form text
obtained by crawling new papers and blogs", it doesn't seem that way, so
I'm not sure faceting is what you want in this situation.
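
If what you are after is word counts rather than categories, one alternative is to pull back the stored story_text for the matching documents and tally the words on the client. The following is only a rough sketch under that assumption: the function name and the regex tokenizer are made up for illustration, and the tokenizer will not match what the text_general analyzer produces at index time.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of word characters, lowercased.
# This is NOT the same analysis chain as Solr's text_general field type.
TOKEN_RE = re.compile(r"\w+")


def cumulative_word_counts(texts, top_n=10):
    """Sum word frequencies across an iterable of document texts.

    `texts` is assumed to be the stored story_text values of the
    documents that matched the query, already fetched by the caller.
    Returns the `top_n` most frequent (word, count) pairs.
    """
    totals = Counter()
    for text in texts:
        totals.update(TOKEN_RE.findall(text.lower()))
    return totals.most_common(top_n)


if __name__ == "__main__":
    sample = ["The cat sat on the mat", "The dog chased the cat"]
    for word, count in cumulative_word_counts(sample, top_n=3):
        print(word, count)
```

For a query that matches only a few hundred documents this keeps the work proportional to the size of the result set rather than to the number of distinct terms in the index. Whether it would hold up at hundreds of millions of documents is a separate question.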


Cheers,
Brendan


On Wed, May 22, 2013 at 9:49 PM, David Larochelle <
dlaroche...@cyber.law.harvard.edu> wrote:

> I'm trying to quickly obtain cumulative word frequency counts over all
> documents matching a particular query.
>
> I'm running Solr 4.3.0 on a machine with 16 GB of RAM. My index is 2.5 GB
> and has around 350,000 documents.
>
> My schema includes the following fields:
>
> <field name="id" type="string" indexed="true" stored="true" required="true"
> multiValued="false" />
> <field name="media_id" type="int" indexed="true" stored="true"
> required="true" multiValued="false" />
> <field name="story_text"  type="text_general" indexed="true" stored="true"
> termVectors="true" termPositions="true" termOffsets="true" />
>
>
> story_text is used to store free form text obtained by crawling new papers
> and blogs.
>
> Running faceted searches with the fc or fcs methods fails with the error
> "Too many values for UnInvertedField faceting on field story_text"
>
> http://localhost:8983/solr/query?q=id:106714828_6621&facet=true&facet.limit=10&facet.pivot=publish_date,story_text&rows=0&facet.method=fcs
>
> Running faceted search with the 'enum' method succeeds but takes a very
> long time.
>
> http://localhost:8983/solr/query?q=includes:foobar&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0
> <
> http://localhost:8983/solr/query?q=includes:mccain&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0
> >
>
> The frustrating thing is even if the query only returns a few hundred
> documents, it still takes 10 minutes or longer to get the cumulative word
> count results.
>
> Eventually we're hoping to build a system that will return results in a few
> seconds and scale to hundreds of millions of documents.
> Is there any way to get this level of performance out of Solr/Lucene?
>
> Thanks,
>
> David
>



-- 
Brendan Grainger
www.kuripai.com
