how are you defining/specifying these field weights?
I define weights inside of a query (name:SomeName^7).
it would help if you could give a concrete example of some sample docs, a
sample query, and what results you would expect ... the sample input and
sample output of the system you are interested in.
Sure. Imagine we have 2 docs:
doc1
-----------------
name:DocumentOne
place:34 High Street (StandardTokenizerFactory, i.e. 3 tokens created)
doc2
-----------------
name:DocumentTwo
place:34 High Street (StandardTokenizerFactory, i.e. 3 tokens created)
I want the following queries to return docs with these scores:
1. name:DocumentOne^7 => doc1(score=7)
2. name:DocumentOne^7 AND place:notExist^3 => doc1(score=7)
3. place:(34\ High\ Street)^3 => doc1(score=3), doc2(score=3)
4. name:DocumentOne^7 OR place:(34\ High\ Street)^3 => doc1(score=10),
doc2(score=3)
If you're curious why I need this, i.e. about my initial "problem X":
I need this scoring to be able to calculate a matching percentage.
That's a separate topic; I have read a lot about it (including
http://wiki.apache.org/lucene-java/ScoresAsPercentages), and people say
it's either not doable or very complicated with Solr. So I just want to
give it a try. For case #3 above, the matching percentage is 100% for
both docs. For case #4 it's doc1: 100% and doc2: 30%.
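
To make the percentages concrete, here is a minimal sketch of the
arithmetic, assuming the percentage is the doc's score divided by the sum
of the boosts of all fields in the query. That formula is my reading of
the numbers above, and the class and method names (MatchPercentage,
percentage) are made up for illustration; nothing here is Solr or Lucene
API.

```java
import java.util.Map;

final class MatchPercentage {
    // Hypothetical helper: assumes the best possible score for a query is
    // the sum of the boosts of all fields it mentions.
    static int percentage(int docScore, Map<String, Integer> fieldBoosts) {
        int max = 0;
        for (int boost : fieldBoosts.values()) {
            max += boost;
        }
        if (max == 0) {
            return 0; // no boosted fields, nothing to compare against
        }
        return Math.round(100f * docScore / max);
    }

    public static void main(String[] args) {
        // Case #4: name:DocumentOne^7 OR place:(34\ High\ Street)^3
        Map<String, Integer> case4 = Map.of("name", 7, "place", 3);
        System.out.println("doc1: " + percentage(10, case4) + "%");
        System.out.println("doc2: " + percentage(3, case4) + "%");
    }
}
```

This only works if each field contributes its boost at most once per doc,
which is exactly what the custom scoring has to guarantee.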
it's not clear why you need any sort of unique document identification for
your scoring algorithm .. from what you described, matches on fieldA should
get score "A" matches on fieldB should get score "B" ... why does it matter
which doc is which?
For case #3, for example, the method SimScorer.score is called 3 times
for each of these documents, 6 times in total. I have added a
ThreadLocal<HashSet<String>> to my custom similarity, which is cleared
before every new scoring session (after each query execution). This
HashSet stores strings consisting of fieldName + docID. Every time
score() is called, I check this HashSet: if fieldName + docID already
exists, I return 0 as the score, otherwise the field weight.
If the string contained only the field name, without the docID, case #3
would return the following: doc1(score=3), doc2(score=0). If there were
no HashSet at all, case #3 would return: doc1(score=9), doc2(score=9),
since the query matches all 3 tokens in each doc.
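
Here is a small standalone simulation of the three variants for case #3.
It is only a sketch of the idea, not my actual similarity: it does not
use any Lucene classes, the names (DedupScoringSketch, KeyMode, scoreDoc)
are made up for illustration, and a plain instance field stands in for
the ThreadLocal. Each call to score() plays the role of one
SimScorer.score call for one matched token.

```java
import java.util.HashSet;
import java.util.Set;

final class DedupScoringSketch {
    // Which key (if any) is used to suppress repeated scores.
    enum KeyMode { FIELD_AND_DOC, FIELD_ONLY, NO_DEDUP }

    private final Set<String> seen = new HashSet<>();
    private final KeyMode mode;

    DedupScoringSketch(KeyMode mode) {
        this.mode = mode;
    }

    // Stand-in for one score() call: one matched token in one field of one doc.
    float score(String field, int docId, float fieldWeight) {
        if (mode == KeyMode.NO_DEDUP) {
            return fieldWeight;
        }
        String key = (mode == KeyMode.FIELD_AND_DOC) ? field + ":" + docId : field;
        // Set.add returns false when the key was already present.
        return seen.add(key) ? fieldWeight : 0f;
    }

    // Sums the per-token scores for a doc whose field matched tokenCount tokens.
    float scoreDoc(String field, int docId, float fieldWeight, int tokenCount) {
        float total = 0f;
        for (int i = 0; i < tokenCount; i++) {
            total += score(field, docId, fieldWeight);
        }
        return total;
    }

    // Called between queries, like clearing the HashSet after each execution.
    void reset() {
        seen.clear();
    }

    public static void main(String[] args) {
        for (KeyMode mode : KeyMode.values()) {
            DedupScoringSketch s = new DedupScoringSketch(mode);
            // place:(34\ High\ Street)^3 matches 3 tokens in doc1 and in doc2
            float doc1 = s.scoreDoc("place", 1, 3f, 3);
            float doc2 = s.scoreDoc("place", 2, 3f, 3);
            System.out.println(mode + ": doc1=" + doc1 + ", doc2=" + doc2);
        }
    }
}
```

Only the FIELD_AND_DOC variant gives both docs a score of 3, which is why
the key needs the docID in it.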
I know that what I'm doing is a "hack", but that's the only way I've
found so far to implement percentage matching. I just want to play
around with it, see how it performs and decide whether to use it or not.
But for that I need to uniquely identify a document while scoring :)