Interesting solution. My concern is how to select the most frequent terms
in the story_text field in a way that would make sense to the user. Including
only the X most common non-stopword terms in a document could easily
cause important patterns to be missed. There's a similar issue with only
retur
I would fetch the term vectors for the top N documents and add them up myself.
You could even scale the term counts by the relevance score for the document.
That would avoid problems with analyzing ten documents where only the first
three were really good matches.
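
A rough sketch of that aggregation, assuming the term vectors for the top N
hits have already been fetched and parsed into plain term-to-count dicts (the
fetching step is not shown, and the names aggregate_terms and hits are
hypothetical):

```python
from collections import Counter

def aggregate_terms(docs, top_k=10):
    """Sum per-document term counts, each scaled by that document's score.

    docs is an iterable of (score, {term: count}) pairs, one per hit.
    """
    totals = Counter()
    for score, term_counts in docs:
        for term, count in term_counts.items():
            # A weak match contributes proportionally less to the total.
            totals[term] += score * count
    return totals.most_common(top_k)

# Made-up data: two strong matches and one weak match.
hits = [
    (2.0, {"iraq": 3, "baghdad": 1}),
    (1.0, {"iraq": 1, "oil": 2}),
    (0.1, {"football": 5}),
]
weighted = aggregate_terms(hits, top_k=3)
```

With this weighting, the frequent term from the weak third hit drops out of
the top three even though its raw count is the highest.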
I did something similar in a d
Here's a possibility:
At index time, extract the important terms (and/or phrases) from the
story_text field and store the top N of them in a separate field (which
will be much smaller/shorter). Then facet on that. Or just retrieve it and
manually parse and count in the client if that turns out to be faster.
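
A minimal sketch of both halves of that idea, assuming "important" just means
most frequent after stopword removal (a real setup might use a better measure).
The stopword list and the names top_terms and cloud_counts are illustrative
only:

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is"}

def top_terms(text, n=5):
    """Index-time step: the n most frequent non-stopword terms in one text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [term for term, _ in counts.most_common(n)]

def cloud_counts(docs_top_terms):
    """Client-side step: count how many documents list each stored term."""
    counts = Counter()
    for terms in docs_top_terms:
        counts.update(terms)
    return counts

# The per-document lists stand in for the values of the separate field.
stored = [
    top_terms("The war in Iraq and the oil of Iraq", n=2),
    top_terms("Oil prices and the oil market", n=2),
]
cloud = cloud_counts(stored)
```

In this example "oil" is counted for only one document, because it did not
make the top two terms of the first text. That is the trade-off of keeping
only N terms per document.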
I
The goal of the system is to obtain data that can be used to generate word
clouds so that users can quickly get a sense of the aggregate contents of
all documents matching a particular query. For example, a user might want
to see a word cloud of all documents discussing 'Iraq' in a particular new
p
Hi David,
Out of interest, what are you trying to accomplish by faceting over the
story_text field? Is it generally the case that the story_text field will
contain values that are repeated or categorize your documents somehow?
From your description: "story_text is used to store free form text
obt