Just to add my $0.02. Often this kind of requirement rests on a mistaken assumption on the part of the client that they know how to score documents better than the really bright people who put a lot of time and energy into scoring (note, I'm _certainly_ not one of those people!). Rather than making something like this work, I'll often see if I can tweak the scoring for a "good enough" solution. Requirements like this can be a time sink of the first magnitude for very little actual benefit.
Very often, if you get "good enough" results and put this kind of refinement on the back burner until the "more important" features are done, it never seems to percolate up to the point of needing work. And it's a disservice to clients to agree to implement something like this without at least discussing what you _won't_ be able to do if you do.

Best,
Erick

On Thu, Oct 10, 2013 at 7:51 AM, Upayavira <u...@odoko.co.uk> wrote:
>
> On Wed, Oct 9, 2013, at 02:45 PM, shahzad73 wrote:
>> My client has a strange requirement: he will give a list of 500 words
>> and then set a percentage, say 80%. He wants to find those pages or
>> documents that consist of 80% words from the list and only 20%
>> unknown words. For example, take this document:
>>
>> word1 word2 word3 word4
>>
>> If he gives the list "word1 word2 word3" and sets the accuracy to
>> 75%, the document above meets the criteria: first, it matches all the
>> words, and second, only 25% of its words are unknown, i.e. not in the
>> search list.
>>
>> Put another way: "if 500 words are provided in the search, then all
>> 500 words must exist in the document, and unknown words should be at
>> most 20% if the accuracy is 80%."
>
> As best as I can see, Solr can't quite do this, at least not without
> enhancement.
>
> There are two parts to how Solr works. The first is boolean querying,
> in which a document either matches or doesn't; this is how you select
> the documents you are interested in.
>
> The second part is scoring, which involves calculating a score for
> each of the documents that got through the previous round.
>
> It seems the boolean portion could be achieved using
> minimum-should-match=100%. That is, all terms must be there.
>
> You can almost do the scoring portion by sorting on function queries:
> sorting on sum(termfreq(text, 'word1'), termfreq(text, 'word2')), etc.
> That would give you the number of times your query terms appear in the
> field, but the issue is that there's no way to retrieve the total
> number of terms in a particular field.
>
> Perhaps you could pre-tokenise the field before indexing it, and store
> the number of terms in your index. Then your score would be the sum of
> the termfreq(text, '<yourterms>') values, divided by the total number
> of terms in the document.
>
> Almost there, but the last leg is not quite within reach.
>
> I don't know whether it is possible to write a fieldlength(text)
> function that returns the number of terms in the field.
>
> Upayavira
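To make the quoted suggestion concrete, the boolean part plus the sort-on-function-queries part could look roughly like the following Solr request parameters. This is an untested sketch: the field name `text` is illustrative, and with 500 terms the query and sort expression would of course be much longer (likely generated programmatically).

```text
q=word1 word2 word3
defType=edismax
qf=text
mm=100%
sort=sum(termfreq(text,'word1'),termfreq(text,'word2'),termfreq(text,'word3')) desc
```

As Upayavira notes, this sorts by the raw count of matching term occurrences; it does not by itself divide by the document's total term count, which is the missing piece.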
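For reference, the acceptance rule the client describes — every query word present, and unknown words at most (100 − accuracy)% of the document — can be sketched as plain client-side logic. This is a toy illustration of the rule itself, not a Solr feature; tokenisation here is naive whitespace splitting.

```python
def matches(doc_text, query_words, accuracy_pct):
    """Return True if the document contains every query word and at most
    (100 - accuracy_pct)% of its tokens are unknown (not in the list)."""
    tokens = doc_text.split()
    query = set(query_words)
    # Boolean part: every query word must appear (the mm=100% analogue).
    if not query.issubset(tokens):
        return False
    # Scoring part: fraction of tokens that are not in the query list.
    unknown = sum(1 for t in tokens if t not in query)
    return unknown / len(tokens) <= (100 - accuracy_pct) / 100

# The example from the thread: "word1 word2 word3 word4" against the list
# word1 word2 word3 at 75% accuracy -> 25% unknown, so it is accepted.
print(matches("word1 word2 word3 word4", ["word1", "word2", "word3"], 75))  # -> True
```

Doing this client-side over a large result set is exactly the kind of expensive post-filtering the thread is trying to push into the index, but it pins down what "accuracy" means here.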