On 30-Aug-07, at 1:22 PM, Jed Reynolds wrote:
Jed Reynolds wrote:
Apologies if this is in the Lucene FAQ, but I was looking thru the
Lucene syntax and I just didn't see it.
Is there a way to search for documents that have a certain number
of occurrences of a term in the document? Like, I want to find all
documents that have the term Calico mentioned three or more times
in the document?
Apologies for the ignorant question. I believe what I'm looking to
do is filter results on term frequency. I of course can get term
frequency data from the debug output, but I'd rather not engage in
application-level filtering by parsing the debug output.
It looks like there could be a few ways to pursue incorporating a
term frequency modifier into a search. I'd think that results could
be filtered in the fq step, if I could change the fq step to
filter on term freq. I presume a QueryHandler could be made to do
that, too. I presume that a QueryParser and a Searcher could do the
job.
Any suggestions about a reasonable way to go about this would be
appreciated.
You could accomplish the goal without any coding by using phrase
queries: "calico calico calico"~10000 will match only documents that
have at least three occurrences of calico. If this is performant
enough, you are done. Otherwise, you'll have to do some custom coding.
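
For what it's worth, the phrase-query string above can be generated for any term and minimum count. This is only a rough sketch of the string building: the class and method names (MinOccurrencesQuery, build) are made up for illustration and are not part of Lucene or Solr, and it assumes a single-word term with no characters that need escaping in the query syntax.

```java
// Hypothetical helper, not a Lucene/Solr API: it only builds the query string.
final class MinOccurrencesQuery {
    private MinOccurrencesQuery() {}

    /**
     * Repeats the term minCount times inside a sloppy phrase,
     * e.g. term=calico, minCount=3, slop=10000 gives "calico calico calico"~10000.
     */
    static String build(String term, int minCount, int slop) {
        if (term == null || term.isEmpty() || term.indexOf(' ') >= 0) {
            throw new IllegalArgumentException("term must be a single non-empty word");
        }
        if (minCount < 1 || slop < 0) {
            throw new IllegalArgumentException("minCount must be >= 1 and slop >= 0");
        }
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < minCount; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(term);
        }
        // The slop should be at least as large as the longest document (in positions).
        return sb.append("\"~").append(slop).toString();
    }
}
```

The resulting string can be passed as the query (or as an fq) unchanged; whether it is fast enough will depend on the index.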
One way would be to create your own Query subclass (similar to
TermQuery) that returns a score of zero for docs below a certain tf
threshold. This is probably the most efficient. Rather than
creating a custom query parser, it would probably be easier to add an
extra parameter to a custom request handler that parses
(<field>:<term>:<count>) into your custom query class and adds it in
the appropriate place (e.g. as a filter).
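
To make the parameter idea a bit more concrete, here is a minimal sketch of just the parsing and threshold check, in plain Java with no Lucene or Solr classes. Everything named here (TermCountParam, parse, accepts) is hypothetical; a real request handler would hand the parsed values to the custom Query, which would get the actual term frequency from the index. It also assumes that neither the field nor the term contains a colon.

```java
// Hypothetical holder for a <field>:<term>:<count> request parameter.
final class TermCountParam {
    final String field;
    final String term;
    final int minCount;

    private TermCountParam(String field, String term, int minCount) {
        this.field = field;
        this.term = term;
        this.minCount = minCount;
    }

    /** Parses e.g. "body:calico:3". Assumes no colons inside field or term. */
    static TermCountParam parse(String raw) {
        if (raw == null) {
            throw new IllegalArgumentException("parameter is missing");
        }
        String[] parts = raw.split(":", -1);
        if (parts.length != 3 || parts[0].isEmpty() || parts[1].isEmpty()) {
            throw new IllegalArgumentException("expected <field>:<term>:<count>, got: " + raw);
        }
        // NumberFormatException is itself an IllegalArgumentException.
        int count = Integer.parseInt(parts[2].trim());
        if (count < 1) {
            throw new IllegalArgumentException("count must be >= 1");
        }
        return new TermCountParam(parts[0], parts[1], count);
    }

    /** The check a custom query or filter would apply to each doc's term frequency. */
    boolean accepts(int termFreq) {
        return termFreq >= minCount;
    }
}
```

For example, TermCountParam.parse("body:calico:3").accepts(2) is false and accepts(3) is true, which is the "three or more times" condition from the original question.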
best,
-Mike