TermsQuery works by pulling the postings lists for each term and OR-ing them
together to create a bitset, which is very memory-efficient but means that you
don't know at doc collection time which term has actually matched.
For your case you probably want to create a SpanOrQuery, and then iterate
over the resulting Spans yourself to see which terms matched in which documents.
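Not from the original mail, but roughly what that looks like against the
Lucene 5.x span API (names such as createWeight/getSpans have shifted between
releases, so treat this as a sketch, not a drop-in). It keeps one SpanTermQuery
per term instead of a single SpanOrQuery so that each match can be attributed
back to the term that produced it; the class and field names are made up:

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.SpanWeight;
import org.apache.lucene.search.spans.Spans;

public class MatchedTermCounter {

  /** For every doc matching at least one term, count how many distinct terms matched. */
  public static Map<Integer, Integer> countMatchedTerms(IndexReader reader,
                                                        String field,
                                                        String... terms) throws IOException {
    IndexSearcher searcher = new IndexSearcher(reader);
    Map<Integer, Integer> matchedTermsPerDoc = new LinkedHashMap<>();

    // One pass per term: walk its Spans and bump the per-document counter.
    for (String t : terms) {
      SpanTermQuery stq = new SpanTermQuery(new Term(field, t));
      SpanWeight weight = stq.createWeight(searcher, false); // false = no scores needed
      for (LeafReaderContext ctx : reader.leaves()) {
        Spans spans = weight.getSpans(ctx, SpanWeight.Postings.POSITIONS);
        if (spans == null) {
          continue; // term not present in this segment
        }
        int doc;
        while ((doc = spans.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
          matchedTermsPerDoc.merge(ctx.docBase + doc, 1, Integer::sum);
        }
      }
    }
    return matchedTermsPerDoc;
  }
}

In practice you'd probably fold the counting into a custom Collector rather
than doing a separate pass over the index per term, but the idea is the same.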
Let's say we're trying to do document-to-document matching (not with
MLT). We have a shingling analysis chain. The query is a document, which
is itself shingled. We then look up those shingles in the index. The %
of shingles found is in some sense a marker as to the extent to which
the documents are similar.
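As a loose sketch of that coverage idea (not code from the thread): shingle
the query document with ShingleAnalyzerWrapper, then see what fraction of
those shingles occur anywhere in the index via docFreq. It assumes the indexed
field was analyzed with the same shingle chain; the names are invented for the
example, and a real version would likely score per candidate document rather
than against the whole index:

import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class ShingleCoverage {

  /** Fraction of the query document's shingles that occur anywhere in the index. */
  public static double coverage(IndexReader reader, String field, String queryDoc)
      throws IOException {
    // Wrap the base analyzer so it emits word shingles (here: bigrams) in addition to unigrams.
    Analyzer shingler = new ShingleAnalyzerWrapper(new StandardAnalyzer(), 2, 2);

    // Collect the distinct shingle terms produced for the query document.
    Set<String> shingles = new LinkedHashSet<>();
    try (TokenStream ts = shingler.tokenStream(field, queryDoc)) {
      CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        shingles.add(termAtt.toString());
      }
      ts.end();
    }

    if (shingles.isEmpty()) {
      return 0.0;
    }
    // Count how many of those shingles exist in the index at all.
    int found = 0;
    for (String shingle : shingles) {
      if (reader.docFreq(new Term(field, shingle)) > 0) {
        found++;
      }
    }
    return (double) found / shingles.size();
  }
}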
Or a really simple-minded approach: just use the frequency
as a ratio of numFound to estimate terms.
Doesn't work, of course, if you need precise counts.
On Mon, Nov 2, 2015 at 9:50 AM, Doug Turnbull wrote:
How precise do you need to be?
I wonder if you could efficiently approximate "number of matches" by
getting the document frequency of each term. I realize this is an
approximation, but the highest document frequency would be your floor.
Let's say you have terms t1, t2, and t3 ... tn, and t1 has the highest
document frequency.
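A minimal sketch of that floor (not from the thread; the field and terms are
placeholders): look up docFreq for each term and take the maximum, since every
document containing the most frequent term matches at least one term of the OR:

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class MatchCountEstimate {

  /**
   * Lower bound on the number of documents matching (t1 OR t2 OR ... OR tn):
   * every document containing the most frequent term is a hit, so the largest
   * per-term docFreq is a floor for the true hit count.
   */
  public static int floorEstimate(IndexReader reader, String field, String... terms)
      throws IOException {
    int floor = 0;
    for (String t : terms) {
      floor = Math.max(floor, reader.docFreq(new Term(field, t)));
    }
    return floor;
  }
}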