The goal of the system is to obtain data that can be used to generate word clouds, so that users can quickly get a sense of the aggregate contents of all documents matching a particular query. For example, a user might want to see a word cloud of all documents discussing 'Iraq' in a particular newspaper.
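As a minimal sketch of the aggregation described above, the following Python sums per-document term frequencies into cumulative counts for a word cloud. It assumes the term frequencies for the matching documents have already been retrieved somehow; the `aggregate_term_counts` and `top_terms` helpers and the sample data are hypothetical and not part of Solr.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def aggregate_term_counts(per_doc_counts: Iterable[Dict[str, int]]) -> Counter:
    """Sum per-document term frequencies into one cumulative Counter."""
    totals: Counter = Counter()
    for doc_counts in per_doc_counts:
        # Counter.update adds the counts rather than replacing them.
        totals.update(doc_counts)
    return totals


def top_terms(totals: Counter, limit: int) -> List[Tuple[str, int]]:
    """Return the `limit` most frequent terms, e.g. to size words in a cloud."""
    return totals.most_common(limit)


# Hypothetical term frequencies for two documents matching a query on 'Iraq'.
matching_docs = [
    {"iraq": 3, "war": 2, "troops": 1},
    {"iraq": 1, "election": 2, "war": 1},
]

totals = aggregate_term_counts(matching_docs)
print(top_terms(totals, 2))
```

This only illustrates the desired output; doing the summation client-side means fetching term data for every matching document, which may not scale to large result sets.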
Faceting on story_text gives counts of individual words rather than entire text strings. I think this is because of the tokenization that happens automatically as part of the text_general type. I'm happy to look at alternatives to faceting, but I wasn't able to find one that provided aggregate word counts for just the documents matching a particular query rather than for an individual document or the entire index.

--

David


On Wed, May 22, 2013 at 10:32 PM, Brendan Grainger <brendan.grain...@gmail.com> wrote:

> Hi David,
>
> Out of interest, what are you trying to accomplish by faceting over the
> story_text field? Is it generally the case that the story_text field will
> contain values that are repeated or categorize your documents somehow?
> From your description: "story_text is used to store free form text
> obtained by crawling newspapers and blogs", it doesn't seem that way, so
> I'm not sure faceting is what you want in this situation.
>
> Cheers,
> Brendan
>
>
> On Wed, May 22, 2013 at 9:49 PM, David Larochelle <dlaroche...@cyber.law.harvard.edu> wrote:
>
> > I'm trying to quickly obtain cumulative word frequency counts over all
> > documents matching a particular query.
> >
> > I'm running Solr 4.3.0 on a machine with 16 GB of RAM. My index is
> > 2.5 GB and has around 350,000 documents.
> >
> > My schema includes the following fields:
> >
> > <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
> > <field name="media_id" type="int" indexed="true" stored="true" required="true" multiValued="false" />
> > <field name="story_text" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
> >
> > story_text is used to store free form text obtained by crawling
> > newspapers and blogs.
> >
> > Running faceted searches with the fc or fcs methods fails with the error
> > "Too many values for UnInvertedField faceting on field story_text"
> >
> > http://localhost:8983/solr/query?q=id:106714828_6621&facet=true&facet.limit=10&facet.pivot=publish_date,story_text&rows=0&facet.method=fcs
> >
> > Running a faceted search with the 'enum' method succeeds but takes a very
> > long time.
> >
> > http://localhost:8983/solr/query?q=includes:foobar&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0
> > <http://localhost:8983/solr/query?q=includes:mccain&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0>
> >
> > The frustrating thing is that even if the query only returns a few hundred
> > documents, it still takes 10 minutes or longer to get the cumulative word
> > count results.
> >
> > Eventually we're hoping to build a system that will return results in a
> > few seconds and scale to hundreds of millions of documents.
> > Is there any way to get this level of performance out of Solr/Lucene?
> >
> > Thanks,
> >
> > David
>
>
> --
> Brendan Grainger
> www.kuripai.com
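
For reference, a plain (non-pivot) facet request on story_text restricted to a query could be assembled as in the sketch below. The `build_facet_url` helper and the example query `story_text:iraq` are hypothetical; the parameters (`q`, `rows`, `facet`, `facet.field`, `facet.limit`, `facet.method`) are standard Solr request parameters, and the base URL is the one used in the thread.

```python
from urllib.parse import urlencode


def build_facet_url(base_url: str, query: str, field: str,
                    limit: int, method: str = "enum") -> str:
    """Build a Solr facet request URL (hypothetical helper, not a Solr API)."""
    params = [
        ("q", query),             # restrict faceting to matching documents
        ("rows", 0),              # only facet counts are wanted, no documents
        ("facet", "true"),
        ("facet.field", field),   # facet on the tokenized text field
        ("facet.limit", limit),   # number of top terms to return
        ("facet.method", method),
    ]
    # urlencode percent-escapes reserved characters such as ':' in the query.
    return base_url + "?" + urlencode(params)


url = build_facet_url("http://localhost:8983/solr/query",
                      "story_text:iraq", "story_text", 100)
print(url)
```

Note that facet counts report how many matching documents contain each term, not the total number of occurrences, so they only approximate cumulative word frequencies.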