On 30-Aug-07, at 1:22 PM, Jed Reynolds wrote:

Jed Reynolds wrote:

Apologies if this is in the Lucene FAQ, but I was looking through the Lucene query syntax and I just didn't see it.

Is there a way to search for documents that have a certain number of occurrences of a term? For example, I want to find all documents that mention the term Calico three or more times.

Apologies for the ignorant question. I believe what I'm looking to do is filter results on term frequency. I of course can get term frequency data from the debug output, but I'd rather not engage in application-level filtering by parsing the debug output.

It looks like there could be a few ways to pursue incorporating a term frequency modifier into a search. I'd think that results could be filtered through the fq step, if I could change the fq step to filter on term freq. I presume a QueryHandler could be made to do that, too. I presume that a QueryParser and a Searcher could do the job.

Any suggestions about a reasonable way to go about this would be appreciated.

You could accomplish the goal without any coding by using phrase queries: "calico calico calico"~10000 will match only documents that have at least three occurrences of calico. If this is performant enough, you are done. Otherwise, you'll have to do some custom coding.
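If the threshold varies, the repeated-term phrase can be built as a string instead of typed by hand. This is only a sketch: the class and method names (MinOccurrencesQuery, build) are made up for illustration and are not part of Lucene or Solr, and it assumes the term is a single token with no whitespace or quotes.

```java
import java.util.Collections;

// Hypothetical helper (not a Lucene/Solr class): builds the query string
// for the repeated-term sloppy phrase trick described above.
final class MinOccurrencesQuery {

    private MinOccurrencesQuery() {}

    // Returns "term term ... term"~slop with the term repeated minCount times.
    static String build(String term, int minCount, int slop) {
        if (term == null || term.isEmpty()
                || term.contains("\"") || term.matches(".*\\s.*")) {
            throw new IllegalArgumentException("term must be a single unquoted token");
        }
        if (minCount < 1 || slop < 0) {
            throw new IllegalArgumentException("minCount must be >= 1 and slop >= 0");
        }
        // Repeat the term minCount times, separated by single spaces.
        String phrase = String.join(" ", Collections.nCopies(minCount, term));
        return "\"" + phrase + "\"~" + slop;
    }
}
```

Calling MinOccurrencesQuery.build("calico", 3, 10000) produces the same string as the example above, which can then be passed as an ordinary query.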

One way would be to create your own Query subclass (similar to TermQuery) that returns a score of zero for docs below a certain tf threshold. This is probably the most efficient approach. Rather than creating a custom query parser, it would probably be easier to add an extra parameter to a custom request handler that parses it (<field>:<term>:<count>) into your custom query class and adds it in the appropriate place (e.g. as a filter).
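The parameter-parsing half of that idea might look roughly like the following. This is a sketch in plain Java only: the class name TfThresholdParam and its methods are hypothetical, it assumes neither the field nor the term contains a colon, and it deliberately leaves out the Lucene Query subclass and the request handler wiring, which depend on the version in use.

```java
// Hypothetical value class (not part of Lucene/Solr): holds a parsed
// <field>:<term>:<count> parameter and the tf threshold test itself.
final class TfThresholdParam {
    final String field;
    final String term;
    final int minCount;

    private TfThresholdParam(String field, String term, int minCount) {
        this.field = field;
        this.term = term;
        this.minCount = minCount;
    }

    // Parses "field:term:count"; assumes field and term contain no ':'.
    static TfThresholdParam parse(String raw) {
        if (raw == null) {
            throw new IllegalArgumentException("parameter is missing");
        }
        String[] parts = raw.split(":", -1);
        if (parts.length != 3 || parts[0].isEmpty() || parts[1].isEmpty()) {
            throw new IllegalArgumentException("expected <field>:<term>:<count>, got: " + raw);
        }
        int minCount;
        try {
            minCount = Integer.parseInt(parts[2]);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("count is not an integer: " + parts[2]);
        }
        if (minCount < 1) {
            throw new IllegalArgumentException("count must be >= 1");
        }
        return new TfThresholdParam(parts[0], parts[1], minCount);
    }

    // The check a custom query or filter would apply to each document's
    // term frequency: keep the doc only at or above the threshold.
    boolean accepts(int termFreq) {
        return termFreq >= minCount;
    }
}
```

For example, TfThresholdParam.parse("body:calico:3") yields a threshold of 3 on the (assumed) field name body, and accepts(tf) is the test the custom query class would run per document before scoring or filtering.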

best,
-Mike
