On Wed, Oct 9, 2013, at 02:45 PM, shahzad73 wrote:
> My client has a strange requirement. He will give a list of 500 words
> and then set a percentage, say 80%. He wants to find those pages or
> documents that consist only of words from that list, with at most 20%
> unknown words.
>
> For example, take this document:
>
> word1 word2 word3 word4
>
> If he gives the list "word1 word2 word3" and sets the accuracy to 75%,
> the document above meets the criteria: first, it matches all of the
> listed words, and second, only 25% of its words are unknown (not in
> the list).
>
> Another way to say it: if 500 words are provided in the search, then
> all 500 words must exist in the document, and unknown words may make
> up only 20% of it if the accuracy is 80%.
As best I can see, Solr can't quite do this, at least not without
enhancement.

There are two parts to how Solr works. The first is boolean querying,
in which a document either matches or it doesn't; this is where you
work out how to select the documents you are interested in. The second
is scoring, which calculates a score for each document that made it
through the first round.

The boolean portion could be achieved with minimum-should-match=100%,
that is, requiring all query terms to be present (there's a sketch of
this at the bottom of this mail).

You can almost do the scoring portion by sorting on function queries:
sorting on sum(termfreq(text, 'word1'), termfreq(text, 'word2')) etc.
would give you the number of times your query terms appear in the
field. The problem is that there's no way to retrieve the total number
of terms in a particular field.

Perhaps you could pre-tokenise the field before indexing it and store
the number of terms in your index. Then your score would be the sum of
the termfreq(text, '<yourterms>') values divided by the total number
of terms in the document (also sketched below). Almost there, but that
last leg is the uncertain one: I don't know whether it is possible to
write a fieldlength(text) function that returns the number of terms in
a field.

Upayavira
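
P.S. Some rough sketches of the above. These are untested, and the
field name "text" and the three terms are placeholders for whatever
your schema actually uses. First, the boolean part, using edismax with
minimum-should-match:

    q=word1 word2 word3
    defType=edismax
    qf=text
    mm=100%

With mm=100%, a document only matches if it contains every query term.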
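
The function-query sort I described would look something like:

    sort=sum(termfreq(text,'word1'),termfreq(text,'word2'),termfreq(text,'word3')) desc

That ranks documents by the raw number of occurrences of the listed
terms, but on its own it can't be turned into the percentage you want.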
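
If you do manage to store a per-document term count at index time, in
(say) a hypothetical integer field called num_terms, the normalised
version would be:

    sort=div(sum(termfreq(text,'word1'),termfreq(text,'word2'),termfreq(text,'word3')),field(num_terms)) desc

Here div() divides the matched-term count by the total term count,
giving the fraction of the document made up of known words, which is
the accuracy figure your client is asking about.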
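
Populating num_terms is the part Solr won't do for you. A minimal
sketch of the client-side counting in Python, assuming plain
whitespace tokenisation (a real version should mirror the text field's
analyzer - lowercasing, stemming and so on):

    # Count tokens the way the index will (approximately) see them.
    def num_terms(body):
        return len(body.split())

    body = "word1 word2 word3 word4"
    doc = {"id": "1", "text": body, "num_terms": num_terms(body)}
    # Index 'doc' as usual; num_terms is then available to field().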