Interesting solution. My concern is how to select the most frequent terms in the story_text field in a way that would make sense to the user. Only including the X most common non-stopword terms in a document could easily cause important patterns to be missed. There's a similar issue with only returning counts for terms in the top N documents matching a particular query.
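To make the concern concrete, here is a minimal sketch in Python. The toy documents, the stopword list, and the whitespace tokenizer are all made up for illustration and do not come from the actual index. It compares summing full per-document counts against summing only each document's top N terms.

```python
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "of", "in", "and", "to"}


def term_counts(text):
    """Count non-stopword terms in one document (naive whitespace tokenizer)."""
    return Counter(t for t in text.lower().split() if t not in STOPWORDS)


def aggregate(docs, top_n=None):
    """Sum term counts over all docs.

    If top_n is set, keep only each document's top_n terms before summing,
    which mimics storing the top N terms per document at index time.
    """
    total = Counter()
    for text in docs:
        counts = term_counts(text)
        if top_n is not None:
            counts = Counter(dict(counts.most_common(top_n)))
        total.update(counts)
    return total


# Toy documents: "sanctions" appears once in every document
# but is never among any single document's two most common terms.
docs = [
    "iraq iraq iraq oil oil sanctions",
    "election election election vote vote sanctions",
    "budget budget budget tax tax sanctions",
]

full = aggregate(docs)
truncated = aggregate(docs, top_n=2)

print(full["sanctions"])       # 3
print(truncated["sanctions"])  # 0
```

In this toy case "sanctions" ties for the most frequent term across the result set, yet it disappears entirely once each document is cut down to its top 2 terms.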
Also, is there an efficient way to sum term counts on the client side? I thought of using the TermVectorComponent to get document-level frequency counts and then using something like Hadoop to add them up. However, I couldn't find any documentation on using the results of a Solr query to feed a MapReduce operation.

--

David

On Wed, May 22, 2013 at 11:12 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:

> Here's a possibility:
>
> At index time extract important terms (and/or phrases) from this
> story_text and store top N of them in a separate field (which will be
> much smaller/shorter). Then facet on that. Or just retrieve it and
> manually parse and count in the client if that turns out to be faster.
> I did this in the previous decade before Solr was available and it
> worked well. I limited my counting to top N (200?) hits.
>
> Otis
> --
> Solr & ElasticSearch Support
> http://sematext.com/
>
> On Wed, May 22, 2013 at 10:54 PM, David Larochelle
> <dlaroche...@cyber.law.harvard.edu> wrote:
> > The goal of the system is to obtain data that can be used to generate word
> > clouds so that users can quickly get a sense of the aggregate contents of
> > all documents matching a particular query. For example, a user might want
> > to see a word cloud of all documents discussing 'Iraq' in a particular
> > newspaper.
> >
> > Faceting on story_text gives counts of individual words rather than entire
> > text strings. I think this is because of the tokenization that happens
> > automatically as part of the text_general type. I'm happy to look at
> > alternatives to faceting but I wasn't able to find one that
> > provided aggregate word counts for just the documents matching a particular
> > query rather than individual documents or the entire index.
> >
> > --
> >
> > David
> >
> > On Wed, May 22, 2013 at 10:32 PM, Brendan Grainger <brendan.grain...@gmail.com> wrote:
> >
> >> Hi David,
> >>
> >> Out of interest, what are you trying to accomplish by faceting over the
> >> story_text field? Is it generally the case that the story_text field will
> >> contain values that are repeated or categorize your documents somehow?
> >> From your description: "story_text is used to store free form text
> >> obtained by crawling newspapers and blogs", it doesn't seem that way, so
> >> I'm not sure faceting is what you want in this situation.
> >>
> >> Cheers,
> >> Brendan
> >>
> >> On Wed, May 22, 2013 at 9:49 PM, David Larochelle <dlaroche...@cyber.law.harvard.edu> wrote:
> >>
> >> > I'm trying to quickly obtain cumulative word frequency counts over all
> >> > documents matching a particular query.
> >> >
> >> > I'm running Solr 4.3.0 on a machine with 16 GB of RAM. My index is 2.5 GB
> >> > and has around 350,000 documents.
> >> >
> >> > My schema includes the following fields:
> >> >
> >> > <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
> >> > <field name="media_id" type="int" indexed="true" stored="true" required="true" multiValued="false" />
> >> > <field name="story_text" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
> >> >
> >> > story_text is used to store free form text obtained by crawling newspapers
> >> > and blogs.
> >> >
> >> > Running faceted searches with the fc or fcs methods fails with the error
> >> > "Too many values for UnInvertedField faceting on field story_text"
> >> >
> >> > http://localhost:8983/solr/query?q=id:106714828_6621&facet=true&facet.limit=10&facet.pivot=publish_date,story_text&rows=0&facet.method=fcs
> >> >
> >> > Running faceted search with the 'enum' method succeeds but takes a very
> >> > long time.
> >> >
> >> > http://localhost:8983/solr/query?q=includes:foobar&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0
> >> > <http://localhost:8983/solr/query?q=includes:mccain&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0>
> >> >
> >> > The frustrating thing is even if the query only returns a few hundred
> >> > documents, it still takes 10 minutes or longer to get the cumulative word
> >> > count results.
> >> >
> >> > Eventually we're hoping to build a system that will return results in a few
> >> > seconds and scale to hundreds of millions of documents.
> >> > Is there any way to get this level of performance out of Solr/Lucene?
> >> >
> >> > Thanks,
> >> >
> >> > David
> >> >
> >>
> >> --
> >> Brendan Grainger
> >> www.kuripai.com
> >>