Here's a possibility: at index time, extract important terms (and/or phrases) from the story_text field and store the top N of them in a separate field (which will be much smaller/shorter). Then facet on that. Or just retrieve story_text and manually parse and count in the client, if that turns out to be faster. I did this in the previous decade, before Solr was available, and it worked well. I limited my counting to the top N (200?) hits.
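A minimal sketch of that index-time idea, in Python. It assumes a hypothetical multiValued string field called "top_terms" added to the schema next to story_text; the update URL, stop-word list, and field names are illustrative placeholders, not something from your current setup:

import re
import requests
from collections import Counter

# Adjust to your core's update handler; exact path depends on your Solr setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update/json?commit=true"
TOP_N = 200  # cap the number of stored terms, as suggested above

# Tiny illustrative stop-word list; use whatever list fits your corpus.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def top_terms(text, n=TOP_N):
    """Return the n most frequent non-stopword tokens in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [term for term, _ in counts.most_common(n)]

def index_story(doc_id, media_id, story_text):
    """Index a document with a precomputed top_terms field to facet on."""
    doc = {
        "id": doc_id,
        "media_id": media_id,
        "story_text": story_text,
        "top_terms": top_terms(story_text),
    }
    requests.post(SOLR_UPDATE_URL, json=[doc]).raise_for_status()

# At query time, facet on the much smaller top_terms field instead of story_text, e.g.:
# http://localhost:8983/solr/query?q=story_text:iraq&rows=0&facet=true&facet.field=top_terms&facet.limit=100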
Otis
--
Solr & ElasticSearch Support
http://sematext.com/


On Wed, May 22, 2013 at 10:54 PM, David Larochelle <
dlaroche...@cyber.law.harvard.edu> wrote:

> The goal of the system is to obtain data that can be used to generate word
> clouds so that users can quickly get a sense of the aggregate contents of
> all documents matching a particular query. For example, a user might want
> to see a word cloud of all documents discussing 'Iraq' in particular
> newspapers.
>
> Faceting on story_text gives counts of individual words rather than entire
> text strings. I think this is because of the tokenization that happens
> automatically as part of the text_general type. I'm happy to look at
> alternatives to faceting, but I wasn't able to find one that provided
> aggregate word counts for just the documents matching a particular query
> rather than for individual documents or the entire index.
>
> --
>
> David
>
>
> On Wed, May 22, 2013 at 10:32 PM, Brendan Grainger <
> brendan.grain...@gmail.com> wrote:
>
>> Hi David,
>>
>> Out of interest, what are you trying to accomplish by faceting over the
>> story_text field? Is it generally the case that the story_text field will
>> contain values that are repeated or that categorize your documents
>> somehow? From your description: "story_text is used to store free-form
>> text obtained by crawling newspapers and blogs", it doesn't seem that
>> way, so I'm not sure faceting is what you want in this situation.
>>
>> Cheers,
>> Brendan
>>
>>
>> On Wed, May 22, 2013 at 9:49 PM, David Larochelle <
>> dlaroche...@cyber.law.harvard.edu> wrote:
>>
>> > I'm trying to quickly obtain cumulative word frequency counts over all
>> > documents matching a particular query.
>> >
>> > I'm running Solr 4.3.0 on a machine with 16GB of RAM. My index is 2.5 GB
>> > and has around ~350,000 documents.
>> >
>> > My schema includes the following fields:
>> >
>> > <field name="id" type="string" indexed="true" stored="true"
>> > required="true" multiValued="false" />
>> > <field name="media_id" type="int" indexed="true" stored="true"
>> > required="true" multiValued="false" />
>> > <field name="story_text" type="text_general" indexed="true" stored="true"
>> > termVectors="true" termPositions="true" termOffsets="true" />
>> >
>> > story_text is used to store free-form text obtained by crawling
>> > newspapers and blogs.
>> >
>> > Running faceted searches with the fc or fcs methods fails with the error
>> > "Too many values for UnInvertedField faceting on field story_text":
>> >
>> > http://localhost:8983/solr/query?q=id:106714828_6621&facet=true&facet.limit=10&facet.pivot=publish_date,story_text&rows=0&facet.method=fcs
>> >
>> > Running a faceted search with the 'enum' method succeeds but takes a very
>> > long time:
>> >
>> > http://localhost:8983/solr/query?q=includes:foobar&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0
>> > <http://localhost:8983/solr/query?q=includes:mccain&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0>
>> >
>> > The frustrating thing is that even if the query only returns a few
>> > hundred documents, it still takes 10 minutes or longer to get the
>> > cumulative word count results.
>> >
>> > Eventually we're hoping to build a system that will return results in a
>> > few seconds and scale to hundreds of millions of documents.
>> > Is there any way to get this level of performance out of Solr/Lucene?
>> >
>> > Thanks,
>> >
>> > David
>> >
>>
>>
>>
>> --
>> Brendan Grainger
>> www.kuripai.com
>>