Just to add my $0.02. Often this kind of requirement rests on a mistaken assumption on the part of the client that they know how to score documents better than the really bright people who put a lot of time and energy into scoring (note, I'm _certainly_ not one of those people!). Rather than making something like this work, I'll often see if I can tweak the scoring for a "good enough" solution. Requirements like this can be a time sink of the first magnitude for very little actual benefit.
Very often, if you get "good enough" results and put this kind of refinement on the back burner until the "more important" features are done, it never seems to percolate up to the point of needing work. And it's a disservice to clients to agree to implement something like this without at least discussing what you _won't_ be able to do if you do.

Best,
Erick

On Thu, Oct 10, 2013 at 7:51 AM, Upayavira <u...@odoko.co.uk> wrote:
>
> On Wed, Oct 9, 2013, at 02:45 PM, shahzad73 wrote:
>> My client has a strange requirement: he will give a list of 500 words
>> and then set a percentage, say 80%. He wants to find those pages or
>> documents that consist of 80% words from the list and only 20%
>> unknown words. For example, take this document:
>>
>> word1 word2 word3 word4
>>
>> If he gives the list "word1 word2 word3" and sets the accuracy to
>> 75%, the document above meets the criteria: first, it matches all the
>> words, and second, only 25% of its words are unknown, i.e. not in the
>> search list.
>>
>> Put another way: "if 500 words are provided in the search, then all
>> 500 words must exist in the document, and unknown words should be at
>> most 20% if the accuracy is 80%."
>
> As best as I can see, Solr can't quite do this, at least not without
> enhancement.
>
> There are two parts to how Solr works. The first is boolean querying,
> in which a document either matches or doesn't; this is how you select
> the documents you are interested in.
>
> The second part is scoring, which involves calculating a score for
> each of the documents that got through the previous round.
>
> It seems the boolean portion could be achieved using
> minimum-should-match=100%. That is, all terms must be there.
>
> You can almost do the scoring portion by sorting on function queries:
> sorting on sum(termfreq(text, 'word1'), termfreq(text, 'word2')), etc.
> That would give you the number of times your query terms appear in the
> field, but the issue is that there's no way to retrieve the total
> number of terms in a particular field.
>
> Perhaps you could pre-tokenise the field before indexing it, and store
> the number of terms in your index. Then your score would be the sum of
> the termfreq(text, '<yourterms>') values, divided by the total number
> of terms in the document.
>
> Almost there, but the last leg is not quite within reach.
>
> I don't know whether it is possible to write a fieldlength(text)
> function that returns the number of terms in the field.
>
> Upayavira
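To make the quoted suggestion concrete, the boolean part plus the sort-on-function-queries part could look roughly like the following Solr request parameters. This is an untested sketch: the field name `text` is illustrative, and with 500 terms the query and sort expression would of course be much longer (likely generated programmatically).

```text
q=word1 word2 word3
defType=edismax
qf=text
mm=100%
sort=sum(termfreq(text,'word1'),termfreq(text,'word2'),termfreq(text,'word3')) desc
```

As Upayavira notes, this sorts by the raw count of matching term occurrences; it does not by itself divide by the document's total term count, which is the missing piece.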
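For reference, the acceptance rule the client describes — every query word present, and unknown words at most (100 − accuracy)% of the document — can be sketched as plain client-side logic. This is a toy illustration of the rule itself, not a Solr feature; tokenisation here is naive whitespace splitting.

```python
def matches(doc_text, query_words, accuracy_pct):
    """Return True if the document contains every query word and at most
    (100 - accuracy_pct)% of its tokens are unknown (not in the list)."""
    tokens = doc_text.split()
    query = set(query_words)
    # Boolean part: every query word must appear (the mm=100% analogue).
    if not query.issubset(tokens):
        return False
    # Scoring part: fraction of tokens that are not in the query list.
    unknown = sum(1 for t in tokens if t not in query)
    return unknown / len(tokens) <= (100 - accuracy_pct) / 100

# The example from the thread: "word1 word2 word3 word4" against the list
# word1 word2 word3 at 75% accuracy -> 25% unknown, so it is accepted.
print(matches("word1 word2 word3 word4", ["word1", "word2", "word3"], 75))  # -> True
```

Doing this client-side over a large result set is exactly the kind of expensive post-filtering the thread is trying to push into the index, but it pins down what "accuracy" means here.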