On 7 Oct 2008, at 21:49, abhishek007 wrote:
Hi,
My application needs to handle synonyms for courses. The most natural way
to achieve this would be to make the field "course" multivalued.
Now, say I add documents like:
<document>
<field name="professor">John Dane</field>
<field name="course">Algorithms</field>
<field name="course">Theory</field>
<field name="course">Computability, Complexity and Algorithms</field>
</document>
<document>
<field name="professor">Mary Arriaga</field>
<field name="course">Algorithms for Pattern Matching</field>
</document>
Now, if I query for "Algorithms", I get a higher score for document 2
than for document 1.
1) I have noticed that this is because the length norm factor of Lucene
scoring considers all values of the multivalued field, which reduces the
overall score of document 1. How can I avoid this?
2) Is there an alternate way to achieve what I want here? I can think of
changing the schema of my index by making the field "course"
single-valued and creating a separate document for each synonym of a
course. But won't that explode the index size?
One way to boost an exact match on one value of a multivalued field
is to add special start-of-field and end-of-field tokens to the data,
e.g.:
<document>
<field name="professor">John Dane</field>
<field name="course">softok Algorithms eoftok</field>
<field name="course">softok Theory eoftok</field>
<field name="course">softok Computability, Complexity and Algorithms eoftok</field>
</document>
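A minimal sketch of how the values could be wrapped before they are sent to the index, in Python. The helper names (wrap_value, course_fields) are made up for illustration, and it assumes the two marker tokens survive your analyzer unchanged (not removed as stopwords, not stemmed):

```python
from xml.sax.saxutils import escape

START_TOKEN = "softok"  # marker names taken from the example above
END_TOKEN = "eoftok"


def wrap_value(value, start=START_TOKEN, end=END_TOKEN):
    """Surround one field value with the start/end marker tokens."""
    return f"{start} {value.strip()} {end}"


def course_fields(courses, field_name="course"):
    """Return one <field> line per course value, each wrapped in markers."""
    return [
        # escape() protects &, < and > inside the field text
        f'<field name="{field_name}">{escape(wrap_value(c))}</field>'
        for c in courses
    ]


if __name__ == "__main__":
    for line in course_fields(["Algorithms", "Theory"]):
        print(line)
```

Any rare strings would do as markers, as long as they never occur in real course names.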
Then, in your query you can boost hits with the complete phrase
"softok queryword eoftok" by doing something like
queryword OR "softok queryword eoftok"^10
If you want to boost shorter fields in general, and not only exact
matches, add some slop (allowed word distance) to the phrase part.
Of course, this will have a cost with regards to performance.
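The query string could be put together along these lines. This is only a sketch: boosted_query is a hypothetical helper, it assumes a single plain query word, and it relies on the standard Lucene query parser syntax where phrase slop is written as ~N and a boost as ^N:

```python
def boosted_query(term, boost=10, slop=0, start="softok", end="eoftok"):
    """Build: term OR "softok term eoftok"[~slop]^boost

    Assumes `term` is a single plain word; the bare term is passed
    through unchanged, so special query characters are not handled there.
    """
    # Escape backslashes and double quotes so the phrase stays well-formed.
    phrase_term = term.replace("\\", "\\\\").replace('"', '\\"')
    phrase = f'"{start} {phrase_term} {end}"'
    if slop > 0:
        # A slop of N lets the marker tokens sit up to N positions away,
        # so short values still match the phrase while long ones do not.
        phrase += f"~{slop}"
    return f"{term} OR {phrase}^{boost}"


if __name__ == "__main__":
    print(boosted_query("Algorithms"))
    print(boosted_query("Algorithms", slop=2))
```

With slop=0 only a value that consists of exactly the query word gets the extra boost; a larger slop widens that to values with a few additional words.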
Could any of you Lucene experts out there explain to me why it isn't
possible to do field boosting per occurrence? I know Solr doesn't
support it because Lucene doesn't, but I can't figure out the
underlying reason. I think even a per-token kind of boosting (e.g.
supporting something like foobar^10 at indexing time) should be easy to
implement in the Lucene relevance model and would be very useful.
Svein