Here's a possibility:

At index time, extract important terms (and/or phrases) from the
story_text and store the top N of them in a separate field (which will be
much smaller/shorter).  Then facet on that.  Or just retrieve that field
and manually parse and count in the client if that turns out to be faster.
I did this in the previous decade, before Solr was available, and it
worked well.  I limited my counting to the top N (200?) hits.
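A rough sketch of both ideas in Python (purely illustrative -- the field
name `top_terms_txt`, the stopword list, and the tokenizer are my own
assumptions, not anything Solr ships with):

```python
import re
from collections import Counter

# Minimal stopword list for illustration; use a real one in practice.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is", "it"}

def top_terms(text, n=200):
    """Extract the top-n most frequent terms from one document's text.

    Run this at index time and store the result in a separate, much
    smaller field (e.g. a hypothetical top_terms_txt), then facet on
    that field instead of the full story_text.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens
                     if t not in STOPWORDS and len(t) > 2)
    return [term for term, _ in counts.most_common(n)]

def aggregate_counts(docs, n=200, per_doc_top=200):
    """Client-side alternative: retrieve the stored field for the top
    matching docs and count, across documents, how often each term
    appears among the per-document top terms."""
    total = Counter()
    for text in docs:
        total.update(top_terms(text, per_doc_top))
    return total.most_common(n)

# Example: aggregate over two tiny "documents"
docs = ["Iraq war coverage in the news",
        "news coverage of the Iraq election"]
print(aggregate_counts(docs, n=5))
```

Note that because each document contributes its top terms once, the
client-side count is a document frequency over top terms, which is
usually what you want for a word cloud anyway.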

Otis
--
Solr & ElasticSearch Support
http://sematext.com/

On Wed, May 22, 2013 at 10:54 PM, David Larochelle
<dlaroche...@cyber.law.harvard.edu> wrote:
> The goal of the system is to obtain data that can be used to generate word
> clouds so that users can quickly get a sense of the aggregate contents of
> all documents matching a particular query. For example, a user might want
> to see a word cloud of all documents discussing 'Iraq' in a particular
> newspaper.
>
> Faceting on story_text gives counts of individual words rather than entire
> text strings. I think this is because of the tokenization that happens
> automatically as part of the text_general type. I'm happy to look at
> alternatives to faceting but I wasn't able to find one that
> provided aggregate word counts for just the documents matching a particular
> query rather than an individual document or the entire index.
>
> --
>
> David
>
>
> On Wed, May 22, 2013 at 10:32 PM, Brendan Grainger <
> brendan.grain...@gmail.com> wrote:
>
>> Hi David,
>>
>> Out of interest, what are you trying to accomplish by faceting over the
>> story_text field? Is it generally the case that the story_text field will
>> contain values that are repeated or categorize your documents somehow?
>> From your description: "story_text is used to store free-form text
>> obtained by crawling newspapers and blogs", it doesn't seem that way, so
>> I'm not sure faceting is what you want in this situation.
>>
>> Cheers,
>> Brendan
>>
>>
>> On Wed, May 22, 2013 at 9:49 PM, David Larochelle <
>> dlaroche...@cyber.law.harvard.edu> wrote:
>>
>> > I'm trying to quickly obtain cumulative word frequency counts over all
>> > documents matching a particular query.
>> >
>> > I'm running Solr 4.3.0 on a machine with 16 GB of RAM. My index is 2.5 GB
>> > and has around ~350,000 documents.
>> >
>> > My schema includes the following fields:
>> >
>> > <field name="id" type="string" indexed="true" stored="true"
>> >        required="true" multiValued="false" />
>> > <field name="media_id" type="int" indexed="true" stored="true"
>> >        required="true" multiValued="false" />
>> > <field name="story_text" type="text_general" indexed="true" stored="true"
>> >        termVectors="true" termPositions="true" termOffsets="true" />
>> >
>> >
>> > story_text is used to store free-form text obtained by crawling
>> > newspapers and blogs.
>> >
>> > Running faceted searches with the fc or fcs methods fails with the error
>> > "Too many values for UnInvertedField faceting on field story_text"
>> >
>> >
>> > http://localhost:8983/solr/query?q=id:106714828_6621&facet=true&facet.limit=10&facet.pivot=publish_date,story_text&rows=0&facet.method=fcs
>> >
>> > Running faceted search with the 'enum' method succeeds but takes a very
>> > long time.
>> >
>> >
>> > http://localhost:8983/solr/query?q=includes:foobar&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0
>> > <http://localhost:8983/solr/query?q=includes:mccain&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0>
>> >
>> > The frustrating thing is that even if the query only returns a few
>> > hundred documents, it still takes 10 minutes or longer to get the
>> > cumulative word count results.
>> >
>> > Eventually we're hoping to build a system that will return results in a
>> > few seconds and scale to hundreds of millions of documents.
>> > Is there any way to get this level of performance out of Solr/Lucene?
>> >
>> > Thanks,
>> >
>> > David
>> >
>>
>>
>>
>> --
>> Brendan Grainger
>> www.kuripai.com
>>