Right - aside from the interesting intellectual exercise, the correct question to ask is, "why?"
Why would you want to do this? What's the benefit, and is there a way of
doing it that is more in keeping with how Solr has been designed?

Upayavira

On Thu, Oct 10, 2013, at 01:17 PM, Erick Erickson wrote:
> Just to add my $0.02. Often this kind of thing is a mistaken
> assumption on the part of the client that they know how to score
> documents better than the really bright people who put a lot of time
> and energy into scoring (note, I'm _certainly_ not one of those
> people!). Instead of making something like this work, I'll often see
> whether I can tweak the scoring for a "good enough" solution. This can
> be a time-sink of the first magnitude for very little actual benefit.
>
> Very often, if you get "good enough" results and put this kind of
> refinement on the back burner until the "more important" features are
> done, it never seems to percolate up to the point of needing work. And
> it's a disservice to clients to agree to implement something like this
> without at least discussing what you _won't_ be able to do if you do.
>
> Best,
> Erick
>
>
> On Thu, Oct 10, 2013 at 7:51 AM, Upayavira <u...@odoko.co.uk> wrote:
> >
> > On Wed, Oct 9, 2013, at 02:45 PM, shahzad73 wrote:
> >> My client has a strange requirement: he will give a list of 500
> >> words and then set a percentage, say 80%. He then wants to find
> >> those pages or documents that consist at least 80% of those 500
> >> words, with at most 20% unknown words. Say we have this document:
> >>
> >> word1 word2 word3 word4
> >>
> >> and he gives the list word1 word2 word3 and sets the accuracy to
> >> 75%. The document above meets the criteria because, first, it
> >> matches all the listed words, and second, only 25% of its words are
> >> unknown, i.e. absent from the search list.
> >>
> >> Another way to say this: "if 500 words are provided in the search,
> >> then all 500 words must exist in the document, and unknown words may
> >> make up only 20% of it if the accuracy is 80%".
> >
> > As best as I can see, Solr can't quite do this, at least not without
> > enhancement.
> >
> > There are two parts to how Solr works. The first is boolean
> > querying, in which a document either matches or doesn't; it works
> > out which documents you are interested in at all.
> >
> > The second part is scoring, which involves calculating a score for
> > each of the documents that got through the previous round.
> >
> > It seems the boolean portion could be achieved using
> > minimum-should-match=100%. That is, all terms must be there. (A
> > concrete request is sketched in the postscript below.)
> >
> > You can almost do the scoring portion by sorting on function
> > queries, e.g. on sum(termfreq(text, 'word1'), termfreq(text,
> > 'word2')) and so on. That would give you the number of times your
> > query terms appear in the field, but the issue is that there's no
> > way to get at the total number of terms in a particular field.
> >
> > Perhaps you could pre-tokenise the field before indexing it, and
> > store the number of terms in your index. Then your score would be
> > the sum of the termfreq(text, '<yourterms>') values, divided by the
> > total number of terms in the document. (Also sketched below.)
> >
> > Almost there, but the last leg is not quite within reach.
> >
> > I don't know whether it is possible to write a fieldlength(text)
> > function that returns the number of terms in the field.
> >
> > Upayavira
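
P.S. To make my earlier suggestions concrete: assuming an edismax
handler, a field called "text", and three query terms (placeholders -
the real request would carry the client's 500 words), the boolean part
plus a raw "matched tokens" sort might look like this (untested):

    q=word1 word2 word3
    defType=edismax
    qf=text
    mm=100%
    sort=sum(termfreq(text,'word1'),termfreq(text,'word2'),termfreq(text,'word3')) desc

mm=100% keeps only documents containing every listed term; the sort
then ranks by how many of a document's tokens came from the list.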
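
If the term count were stored at index time as suggested - say in an
integer field called "numterms" (a made-up name) - the sort could be
normalised into the fraction of known words, and a {!frange} filter
could even enforce the client's threshold directly (again untested):

    sort=div(sum(termfreq(text,'word1'),termfreq(text,'word2'),termfreq(text,'word3')),field(numterms)) desc
    fq={!frange l=0.75}div(sum(termfreq(text,'word1'),termfreq(text,'word2'),termfreq(text,'word3')),field(numterms))

A document passes the frange filter when at least 75% of its tokens
come from the supplied list, which is exactly the "accuracy" rule
described above.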
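
As for fieldlength(): it should be writable as a custom
ValueSourceParser plugin, though I haven't tried it. A rough sketch
against the Solr/Lucene 4.x APIs (package and class names made up),
reading the length from term vectors - so the field would need
termVectors="true", and every evaluation pays a term-vector read;
indexing the count yourself is likely cheaper:

    package com.example.solr; // hypothetical package/class names

    import java.io.IOException;
    import java.util.Map;

    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.queries.function.FunctionValues;
    import org.apache.lucene.queries.function.ValueSource;
    import org.apache.lucene.queries.function.docvalues.LongDocValues;
    import org.apache.solr.search.FunctionQParser;
    import org.apache.solr.search.SyntaxError;
    import org.apache.solr.search.ValueSourceParser;

    // Register in solrconfig.xml:
    //   <valueSourceParser name="fieldlength"
    //       class="com.example.solr.FieldLengthValueSourceParser"/>
    public class FieldLengthValueSourceParser extends ValueSourceParser {

      @Override
      public ValueSource parse(FunctionQParser fp) throws SyntaxError {
        return new FieldLengthValueSource(fp.parseArg());
      }

      private static class FieldLengthValueSource extends ValueSource {
        private final String field;

        FieldLengthValueSource(String field) {
          this.field = field;
        }

        @Override
        public FunctionValues getValues(Map context,
            final AtomicReaderContext readerContext) throws IOException {
          return new LongDocValues(this) {
            @Override
            public long longVal(int doc) {
              try {
                // A term vector is a per-document inverted index; its
                // sumTotalTermFreq is the field's total token count.
                Terms terms = readerContext.reader().getTermVector(doc, field);
                return terms == null ? 0L : terms.getSumTotalTermFreq();
              } catch (IOException e) {
                throw new RuntimeException(e);
              }
            }
          };
        }

        @Override
        public String description() {
          return "fieldlength(" + field + ")";
        }

        @Override
        public boolean equals(Object o) {
          return o instanceof FieldLengthValueSource
              && field.equals(((FieldLengthValueSource) o).field);
        }

        @Override
        public int hashCode() {
          return field.hashCode();
        }
      }
    }

With something like that in place, div(sum(termfreq(...)),
fieldlength(text)) would avoid the extra indexed field, at the cost of
those per-document term-vector reads.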