Re: Querying multivalued field - can scoring formula consider only matched values?

Svein Parnas Sat, 11 Oct 2008 00:56:29 -0700


On 7 Oct 2008, at 21:49, abhishek007 wrote:

Hi,
My application needs to handle synonyms for courses. The most natural way to
achieve this would be to make the field "course" multivalued.

Now, say I add documents like:

<document>
 <field name="professor">John Dane</field>
 <field name="course">Algorithms</field>
 <field name="course">Theory</field>
 <field name="course">Computability, Complexity and Algorithms</field>
</document>

<document>
 <field name="professor">Mary Arriaga</field>
 <field name="course">Algorithms for Pattern Matching</field>
</document>

Now, if I query for "Algorithms", I get a higher score for document 2 than
document 1.
1) I have noticed that this is because the length norm factor of Lucene
scoring considers all values of the multivalued field, which reduces the
overall score of document 1. How can I avoid this?
2) Is there an alternate way to achieve what I want here? I can think of
changing the schema of my index by making the field "course"
single-valued and creating separate documents for each synonym for a course.
But won't that explode the index size?

One way to boost an exact match on one occurrence of a multivalued field is to add some kind of special start-of-field token and end-of-field token to the data, e.g.:


<document>
 <field name="professor">John Dane</field>
 <field name="course">softok Algorithms eoftok</field>
 <field name="course">softok Theory eoftok</field>
 <field name="course">softok Computability, Complexity and Algorithms eoftok</field>
</document>

Then, in your query, you can boost hits on the complete phrase "softok queryword eoftok" by doing something like


queryword OR "softok queryword eoftok"^10
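
As a rough illustration of the two steps above (wrapping each value before indexing, then building the boosted query), here is a minimal Python sketch. The function names wrap_value and boosted_query are hypothetical, not part of Solr or Lucene, and it assumes the user query consists of plain words with no query-syntax special characters.

```python
# Marker tokens taken from the example above; choose strings that
# never occur in real data and that your analyzer does not strip.
SOF = "softok"
EOF = "eoftok"


def wrap_value(value: str) -> str:
    """Surround one field value with the start/end marker tokens (index time)."""
    return f"{SOF} {value.strip()} {EOF}"


def boosted_query(user_query: str, boost: int = 10) -> str:
    """Build: queryword OR "softok queryword eoftok"^boost (query time).

    Assumption: user_query contains no quotes or other query-syntax
    characters, so no escaping is done here.
    """
    q = user_query.strip()
    return f'{q} OR "{SOF} {q} {EOF}"^{boost}'


if __name__ == "__main__":
    for course in ["Algorithms", "Theory"]:
        print(wrap_value(course))
    print(boosted_query("Algorithms"))
```

The same analyzer has to be applied to the markers at index time and at query time, otherwise the phrase part will not match.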

If you want to boost shorter fields in general and not only exact matches, add some distance (slop) to the phrase part.
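
A sketch of that variant, again in Python with a hypothetical helper name (sloppy_boosted_query). It relies on the standard Lucene query syntax "phrase"~N for phrase slop; the default of 2 is only an illustrative value, and as far as I understand a slop of N tolerates roughly N extra tokens between the markers.

```python
def sloppy_boosted_query(user_query: str, slop: int = 2, boost: int = 10) -> str:
    """Build: queryword OR "softok queryword eoftok"~slop^boost

    Assumption: user_query is plain words without query-syntax
    special characters, so no escaping is done here.
    """
    q = user_query.strip()
    return f'{q} OR "softok {q} eoftok"~{slop}^{boost}'


if __name__ == "__main__":
    # Values with up to about two extra words besides "Algorithms"
    # can still match the boosted phrase part.
    print(sloppy_boosted_query("Algorithms"))
```

A larger slop lets longer values match the boosted clause, so the preference for short values gets weaker as the slop grows.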


Of course, this will have a cost with regards to performance.

Could any of you Lucene experts out there explain to me why it isn't possible to do field boosting per occurrence? I know Solr doesn't support it because Lucene doesn't, but I can't figure out the underlying reason. I think even a per-token kind of boosting (e.g. supporting something like foobar^10 at indexing time) should be easy to implement in the Lucene relevance model and would be very useful.


Svein
