Interesting solution. My concern is how to select the most frequent terms
in the story_text field in a way that would make sense to the user. Only
including the X most common non-stopword terms in a document could easily
cause important patterns to be missed. There's a similar issue with only
returning counts for terms in the top N documents matching a particular
query.

Also, is there an efficient way to sum term counts on the client side? I
thought of using the TermVectorComponent to get document-level frequency
counts and then using something like Hadoop to add them up. However, I
couldn't find any documentation on feeding the results of a Solr query
into a MapReduce job.
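
To make the question concrete, here's the sort of client-side aggregation
I had in mind. It's an untested Python sketch: it assumes the example
/tvrh handler (TermVectorComponent) is enabled and that the requests
library is available, and the exact shape of the termVectors response may
differ between Solr versions:

    from collections import Counter

    import requests

    SOLR = "http://localhost:8983/solr"

    def aggregate_term_counts(query, rows=200):
        """Sum per-document story_text term frequencies on the client."""
        params = {
            "q": query, "rows": rows, "fl": "id",
            "tv": "true", "tv.tf": "true", "tv.fl": "story_text",
            "wt": "json", "json.nl": "map",  # serialize NamedLists as maps
        }
        response = requests.get(SOLR + "/tvrh", params=params).json()
        totals = Counter()
        for entry in response.get("termVectors", {}).values():
            # skip bookkeeping entries such as "uniqueKeyFieldName"
            if not isinstance(entry, dict):
                continue
            for term, stats in entry.get("story_text", {}).items():
                totals[term] += stats.get("tf", 0)
        return totals

    print(aggregate_term_counts("includes:foobar").most_common(20))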

--

David


On Wed, May 22, 2013 at 11:12 PM, Otis Gospodnetic <
otis.gospodne...@gmail.com> wrote:

> Here's a possibility:
>
> At index time, extract important terms (and/or phrases) from the
> story_text and store the top N of them in a separate field (which will
> be much smaller/shorter).  Then facet on that.  Or just retrieve it and
> manually parse and count in the client, if that turns out to be faster.
> I did this in the previous decade, before Solr was available, and it
> worked well.  I limited my counting to the top N (200?) hits.
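>
> Something like this, as a rough illustration (untested; "top_terms" is
> a hypothetical multiValued string field you would add to the schema,
> and the tokenizer and stopword list here are toy placeholders for
> whatever term extraction you prefer):
>
>     from collections import Counter
>     import re
>
>     STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "that"}
>
>     def top_terms(text, n=25):
>         # keep the N most frequent non-stopword tokens in this story
>         tokens = re.findall(r"[a-z']+", text.lower())
>         counts = Counter(t for t in tokens
>                          if t not in STOPWORDS and len(t) > 2)
>         return [term for term, _ in counts.most_common(n)]
>
>     # index time:  doc["top_terms"] = top_terms(doc["story_text"])
>     # query time:  facet on the short field instead of the full text, e.g.
>     #   /solr/query?q=...&facet=true&facet.field=top_terms&facet.limit=100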
>
> Otis
> --
> Solr & ElasticSearch Support
> http://sematext.com/
>
>
>
>
>
> On Wed, May 22, 2013 at 10:54 PM, David Larochelle
> <dlaroche...@cyber.law.harvard.edu> wrote:
> > The goal of the system is to obtain data that can be used to generate
> > word clouds, so that users can quickly get a sense of the aggregate
> > contents of all documents matching a particular query. For example, a
> > user might want to see a word cloud of all documents discussing 'Iraq'
> > in a particular newspaper.
> >
> > Faceting on story_text gives counts of individual words rather than
> > entire text strings. I think this is because of the tokenization that
> > happens automatically as part of the text_general type. I'm happy to
> > look at alternatives to faceting, but I wasn't able to find one that
> > provided aggregate word counts for just the documents matching a
> > particular query, rather than for individual documents or for the
> > entire index.
> >
> > --
> >
> > David
> >
> >
> > On Wed, May 22, 2013 at 10:32 PM, Brendan Grainger <
> > brendan.grain...@gmail.com> wrote:
> >
> >> Hi David,
> >>
> >> Out of interest, what are you trying to accomplish by faceting over
> >> the story_text field? Is it generally the case that the story_text
> >> field will contain values that are repeated or that categorize your
> >> documents somehow?  From your description, "story_text is used to
> >> store free-form text obtained by crawling newspapers and blogs", it
> >> doesn't seem that way, so I'm not sure faceting is what you want in
> >> this situation.
> >>
> >> Cheers,
> >> Brendan
> >>
> >>
> >> On Wed, May 22, 2013 at 9:49 PM, David Larochelle <
> >> dlaroche...@cyber.law.harvard.edu> wrote:
> >>
> >> > I'm trying to quickly obtain cumulative word frequency counts over all
> >> > documents matching a particular query.
> >> >
> >> > I'm running Solr 4.3.0 on a machine with 16GB of RAM. My index is
> >> > 2.5 GB and has around 350,000 documents.
> >> >
> >> > My schema includes the following fields:
> >> >
> >> > <field name="id" type="string" indexed="true" stored="true"
> >> >        required="true" multiValued="false" />
> >> > <field name="media_id" type="int" indexed="true" stored="true"
> >> >        required="true" multiValued="false" />
> >> > <field name="story_text" type="text_general" indexed="true"
> >> >        stored="true" termVectors="true" termPositions="true"
> >> >        termOffsets="true" />
> >> >
> >> >
> >> > story_text is used to store free-form text obtained by crawling
> >> > newspapers and blogs.
> >> >
> >> > Running faceted searches with the fc or fcs methods fails with the
> >> > error "Too many values for UnInvertedField faceting on field
> >> > story_text":
> >> >
> >> > http://localhost:8983/solr/query?q=id:106714828_6621&facet=true&facet.limit=10&facet.pivot=publish_date,story_text&rows=0&facet.method=fcs
> >> >
> >> > Running a faceted search with the 'enum' method succeeds but takes
> >> > a very long time:
> >> >
> >> > http://localhost:8983/solr/query?q=includes:foobar&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0
> >> >
> >> > The frustrating thing is that even if the query only returns a few
> >> > hundred documents, it still takes 10 minutes or longer to get the
> >> > cumulative word count results.
> >> >
> >> > Eventually we're hoping to build a system that will return results
> >> > in a few seconds and scale to hundreds of millions of documents.
> >> > Is there any way to get this level of performance out of
> >> > Solr/Lucene?
> >> >
> >> > Thanks,
> >> >
> >> > David
> >> >
> >>
> >>
> >>
> >> --
> >> Brendan Grainger
> >> www.kuripai.com
> >>
>
