Re: Fast faceting over large number of distinct terms

2013-05-23 Thread David Larochelle
Interesting solution. My concern is how to select the most frequent terms in the story_text field in a way that would make sense to the user. Only including the X most common non-stopword terms in a document could easily cause important patterns to be missed. There's a similar issue with only retur

Re: Fast faceting over large number of distinct terms

2013-05-22 Thread Walter Underwood
I would fetch the term vectors for the top N documents and add them up myself. You could even scale the term counts by the relevance score for the document. That would avoid problems with analyzing ten documents where only the first three were really good matches. I did something similar in a d
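
A minimal sketch of the "add them up myself" step, assuming the term vectors for the top N documents have already been fetched and turned into plain Python data. The function name `weighted_term_counts` and the `(score, {term: count})` input shape are hypothetical, chosen for illustration; the request that retrieves the term vectors from the search server is not shown.

```python
from collections import Counter


def weighted_term_counts(docs, top_k=10):
    """Sum per-document term counts, scaling each by the document's relevance score.

    docs is an iterable of (relevance_score, {term: term_frequency}) pairs.
    This input shape is an assumption made for this sketch.
    """
    totals = Counter()
    for score, term_freqs in docs:
        for term, tf in term_freqs.items():
            # A weak match contributes less than a strong one.
            totals[term] += score * tf
    # Highest weighted totals first.
    return totals.most_common(top_k)


if __name__ == "__main__":
    sample = [
        (2.0, {"iraq": 3, "war": 1}),  # strong match
        (0.5, {"war": 4, "oil": 2}),   # weak match
    ]
    print(weighted_term_counts(sample, top_k=2))
```

Passing a score of 1.0 for every document reduces this to a plain sum of the term counts.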

Re: Fast faceting over large number of distinct terms

2013-05-22 Thread Otis Gospodnetic
Here's a possibility: at index time, extract important terms (and/or phrases) from this story_text and store the top N of them in a separate field (which will be much smaller/shorter). Then facet on that. Or just retrieve it and manually parse and count in the client if that turns out to be faster. I
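
A rough sketch of the index-time extraction step, assuming "important" simply means the most frequent non-stopword terms (the post does not say how importance would be measured). The `top_terms` function, the tiny `STOPWORDS` set, and the regex tokenizer are all hypothetical stand-ins for whatever analysis chain the index actually uses.

```python
import re
from collections import Counter

# Illustrative stopword list only; a real deployment would use a fuller one.
STOPWORDS = frozenset({"the", "a", "an", "and", "of", "in", "to", "is", "on", "for"})


def top_terms(text, n=5, stopwords=STOPWORDS):
    """Return the n most frequent non-stopword terms in text.

    Frequency is used here as a stand-in for "importance"; that choice
    is an assumption of this sketch.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    return [term for term, _ in counts.most_common(n)]


if __name__ == "__main__":
    story_text = "The war in Iraq and the war on oil: Iraq, Iraq."
    # The returned list would be stored in a separate, much smaller field.
    print(top_terms(story_text, n=2))
```

The returned list is what would go into the separate field, which could then be faceted on or counted in the client.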

Re: Fast faceting over large number of distinct terms

2013-05-22 Thread David Larochelle
The goal of the system is to obtain data that can be used to generate word clouds so that users can quickly get a sense of the aggregate contents of all documents matching a particular query. For example, a user might want to see a word cloud of all documents discussing 'Iraq' in a particular new p

Re: Fast faceting over large number of distinct terms

2013-05-22 Thread Brendan Grainger
Hi David, out of interest, what are you trying to accomplish by faceting over the story_text field? Is it generally the case that the story_text field will contain values that are repeated or that categorize your documents somehow? From your description: "story_text is used to store free form text obt