The goal of the system is to obtain data that can be used to generate word
clouds so that users can quickly get a sense of the aggregate contents of
all documents matching a particular query. For example, a user might want
to see a word cloud of all documents discussing 'Iraq' in a particular
newspaper.

Faceting on story_text gives counts of individual words rather than entire
text strings. I think this is because of the tokenization that happens
automatically as part of the text_general type. I'm happy to look at
alternatives to faceting, but I wasn't able to find one that provides
aggregate word counts for just the documents matching a particular query,
rather than for an individual document or the entire index.
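
To make the target concrete, here is a minimal sketch of the aggregation I'm
after. It assumes the per-document term frequencies have already been
extracted somehow (for example from the term vectors stored for story_text)
and parsed into plain dicts. The function name and input shape are
hypothetical and are not anything Solr provides.

```python
from collections import Counter


def aggregate_word_counts(per_doc_term_freqs, top_n=None):
    """Sum per-document term frequencies into cumulative counts.

    per_doc_term_freqs: iterable of {term: frequency} dicts, one per
    document matching the query (hypothetical input shape).
    Returns (term, count) pairs, most frequent first.
    """
    totals = Counter()
    for term_freqs in per_doc_term_freqs:
        # Counter.update adds the counts rather than replacing them.
        totals.update(term_freqs)
    return totals.most_common(top_n)


# Two made-up documents standing in for the query results.
docs = [
    {"iraq": 3, "war": 1},
    {"iraq": 2, "election": 4},
]
print(aggregate_word_counts(docs, top_n=2))
```

The summing itself is cheap; the open question is how to get the
per-document frequencies for only the matching documents quickly.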

--

David


On Wed, May 22, 2013 at 10:32 PM, Brendan Grainger <
brendan.grain...@gmail.com> wrote:

> Hi David,
>
> Out of interest, what are you trying to accomplish by faceting over the
> story_text field? Is it generally the case that the story_text field will
> contain values that are repeated or categorize your documents somehow?
> From your description: "story_text is used to store free form text
> obtained by crawling newspapers and blogs", it doesn't seem that way, so
> I'm not sure faceting is what you want in this situation.
>
> Cheers,
> Brendan
>
>
> On Wed, May 22, 2013 at 9:49 PM, David Larochelle <
> dlaroche...@cyber.law.harvard.edu> wrote:
>
> > I'm trying to quickly obtain cumulative word frequency counts over all
> > documents matching a particular query.
> >
> > I'm running Solr 4.3.0 on a machine with 16 GB of RAM. My index is
> > 2.5 GB and has around 350,000 documents.
> >
> > My schema includes the following fields:
> >
> > <field name="id" type="string" indexed="true" stored="true"
> >        required="true" multiValued="false" />
> > <field name="media_id" type="int" indexed="true" stored="true"
> >        required="true" multiValued="false" />
> > <field name="story_text" type="text_general" indexed="true" stored="true"
> >        termVectors="true" termPositions="true" termOffsets="true" />
> >
> >
> > story_text is used to store free-form text obtained by crawling
> > newspapers and blogs.
> >
> > Running faceted searches with the fc or fcs methods fails with the error
> > "Too many values for UnInvertedField faceting on field story_text"
> >
> >
> http://localhost:8983/solr/query?q=id:106714828_6621&facet=true&facet.limit=10&facet.pivot=publish_date,story_text&rows=0&facet.method=fcs
> >
> > Running faceted search with the 'enum' method succeeds but takes a very
> > long time.
> >
> >
> http://localhost:8983/solr/query?q=includes:foobar&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0
> > <
> >
> http://localhost:8983/solr/query?q=includes:mccain&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0
> > >
> >
> > The frustrating thing is that even if the query only returns a few
> > hundred documents, it still takes 10 minutes or longer to get the
> > cumulative word count results.
> >
> > Eventually we're hoping to build a system that will return results in a
> > few seconds and scale to hundreds of millions of documents.
> > Is there any way to get this level of performance out of Solr/Lucene?
> >
> > Thanks,
> >
> > David
> >
>
>
>
> --
> Brendan Grainger
> www.kuripai.com
>
