On 7 Oct 2008, at 21:49, abhishek007 wrote:


Hi,
My application needs to handle synonyms for courses. The most natural way to
achieve this would be to make the field "course" multivalued.

Now, say I add documents like:

<document>
 <field name="professor">John Dane</field>
 <field name="course">Algorithms</field>
 <field name="course">Theory</field>
 <field name="course">Computability, Complexity and Algorithms</field>
</document>

<document>
 <field name="professor">Mary Arriaga</field>
 <field name="course">Algorithms for Pattern Matching</field>
</document>

Now, if I query for "Algorithms", I get a higher score for document 2 than
document 1.

1) I have noticed that this is because the length norm factor of Lucene
scoring considers all values of the multivalued field, which reduces the
overall score of document 1. How can I avoid this?

2) Is there an alternative way to achieve what I want here? I can think of
changing the schema of my index by making the field "course"
single-valued and creating a separate document for each synonym of a course.
But won't that explode the index size?


One way to boost an exact match on a single occurrence of a multivalued field is to add a special start-of-field token and end-of-field token to the data, e.g.:

<document>
 <field name="professor">John Dane</field>
 <field name="course">softok Algorithms eoftok</field>
 <field name="course">softok Theory eoftok</field>
 <field name="course">softok Computability, Complexity and Algorithms eoftok</field>
</document>

Then, in your query you can boost hits with the complete phrase "softok queryword eoftok" by doing something like

queryword OR "softok queryword eoftok"^10

If you want to boost shorter field values in general, and not only exact matches, add some slop (allowed distance) to the phrase part.

Of course, this will have a cost with regards to performance.
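
As a rough illustration of the approach above, here is a minimal Python sketch that wraps each field value in the marker tokens at indexing time and builds the boosted query string at search time. The helper names wrap_value and build_query are hypothetical, not part of Solr or Lucene. The sketch assumes the query term is a single word with no special query-parser characters, and that your analyzer leaves the marker tokens "softok" and "eoftok" intact.

```python
# Marker tokens taken from the example above; any tokens that never
# occur in the real data would do.
SOF = "softok"  # start-of-field marker
EOF = "eoftok"  # end-of-field marker


def wrap_value(value: str) -> str:
    """Hypothetical helper: surround one field value with the markers
    before it is sent to the index."""
    return f"{SOF} {value} {EOF}"


def build_query(term: str, boost: int = 10, slop: int = 0) -> str:
    """Hypothetical helper: build 'term OR "softok term eoftok"^boost'.

    With slop > 0 the phrase part becomes a sloppy phrase ("..."~slop),
    so values with a few extra tokens between the markers also match.
    Assumes `term` needs no escaping.
    """
    phrase = f'"{SOF} {term} {EOF}"'
    if slop > 0:
        phrase += f"~{slop}"
    return f"{term} OR {phrase}^{boost}"


if __name__ == "__main__":
    for course in ["Algorithms", "Theory"]:
        print(wrap_value(course))
    print(build_query("Algorithms"))
    print(build_query("Algorithms", slop=2))
```

The slop value is something you would have to tune against your own data; a larger slop lets longer field values receive the boost as well.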

Could any of you Lucene experts out there explain to me why it isn't possible to do field boosting per occurrence? I know Solr doesn't support it because Lucene doesn't, but I can't figure out the underlying reason. I think even a per-token kind of boosting (e.g. supporting something like foobar^10 at indexing time) should be easy to implement in the Lucene relevance model and would be very useful.

Svein
