how are you defining/specifying these field weights?
I define weights inside of a query (name:SomeName^7).


it would help if you could give a concrete example of some sample docs, a
sample query, and what results you would expect ... the sample input and
sample output of the system you are interested in.
Sure. Imagine we have 2 docs:

doc1
-----------------
name:DocumentOne
place:34 High Street (StandardTokenizerFactory, i.e. 3 tokens created)

doc2
-----------------
name:DocumentTwo
place:34 High Street (StandardTokenizerFactory, i.e. 3 tokens created)

I want the following queries to return docs with these scores:

1. name:DocumentOne^7 => doc1(score=7)
2. name:DocumentOne^7 AND place:notExist^3 => doc1(score=7)
3. place:(34\ High\ Street)^3 => doc1(score=3), doc2(score=3)
4. name:DocumentOne^7 OR place:(34\ High\ Street)^3 => doc1(score=10), doc2(score=3)
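The rule behind these four cases can be sketched in plain Java as "add up the boost of every query clause whose field matched at least once". This is only an illustration of the expected numbers, not Lucene or Solr code; the class name ExpectedScore and its method are hypothetical, and it assumes one clause per field.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene/Solr: models the desired scores only.
final class ExpectedScore {
    /** Sums the boost of every query clause whose field matched the document at least once. */
    static float score(Map<String, Float> clauseBoosts, Set<String> matchedFields) {
        float total = 0f;
        for (Map.Entry<String, Float> clause : clauseBoosts.entrySet()) {
            // A field counts once, no matter how many of its tokens matched.
            if (matchedFields.contains(clause.getKey())) {
                total += clause.getValue();
            }
        }
        return total;
    }
}
```

For query 4, doc1 matches both name and place, so 7 + 3 = 10; doc2 matches only place, so 3.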

If you're curious why I need this, i.e. my original "problem X": I need this scoring to calculate a matching percentage. That's a separate topic. I have read a lot about it (including http://wiki.apache.org/lucene-java/ScoresAsPercentages), and people say it's either not doable or very complicated with Solr, so I just want to give it a try. For case #3 above, the matching percentage is 100% for both docs. For case #4, it's doc1: 100% and doc2: 30%.
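As a rough sketch of that arithmetic, assuming the percentage is the document's score divided by the sum of all boosts in the query (my reading of the numbers above; the class and method names are hypothetical):

```java
import java.util.Map;

// Hypothetical helper: turns a flat field-weight score into a matching percentage.
final class MatchPercentage {
    /** Percentage of the maximum achievable score, taken as the sum of all query boosts. */
    static double percent(double docScore, Map<String, Double> queryBoosts) {
        double max = 0.0;
        for (double boost : queryBoosts.values()) {
            max += boost;
        }
        // Guard against an empty query so we never divide by zero.
        return max == 0.0 ? 0.0 : 100.0 * docScore / max;
    }
}
```

With boosts name=7 and place=3, a score of 10 gives 100% and a score of 3 gives 30%, which matches case #4.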

it's not clear why you need any sort of unique document identification for
your scoring algorithm .. from what you described, matches on fieldA should
get score "A" matches on fieldB should get score "B" ... why does it matter
which doc is which?
For case #3, for example, the method SimScorer.score is called 3 times
for each of these documents, 6 times in total. I have added a
ThreadLocal<HashSet<String>> to my custom similarity, which is cleared
before each new scoring session (i.e. after each query execution). This
HashSet stores strings consisting of fieldName + docID. Every time
score() is called, I check this HashSet: if fieldName + docID already
exists, I return 0 as the score, otherwise the field weight.
If the string contained no docID (only the field name), case #3
would return doc1(score=3), doc2(score=0). If there were no HashSet at
all, case #3 would return doc1(score=9), doc2(score=9), since the query
matches all 3 tokens in every doc.
I know that what I'm doing is a "hack", but it's the only way I've
found so far to implement percentage matching. I just want to play
around with it, see how it performs, and decide whether to use it or not.
But for that I need to uniquely identify a document while scoring :)
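To make the bookkeeping concrete, here is a minimal standalone sketch of the de-duplication idea. It does not use the real Lucene Similarity API; the class DedupFieldScorer, its methods, and the ':' separator in the key are all hypothetical stand-ins for what my custom similarity does.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the custom similarity; no Lucene classes involved.
final class DedupFieldScorer {
    // Per-thread record of the (field, doc) pairs already scored in the current query.
    private static final ThreadLocal<Set<String>> SEEN = ThreadLocal.withInitial(HashSet::new);

    private final Map<String, Float> fieldWeights;

    DedupFieldScorer(Map<String, Float> fieldWeights) {
        this.fieldWeights = fieldWeights;
    }

    /** Clears the record; call this between queries so old entries do not leak in. */
    void startQuery() {
        SEEN.get().clear();
    }

    /** Returns the field weight the first time a (field, doc) pair is seen, 0 afterwards. */
    float score(String fieldName, int docId) {
        // The separator is an addition of this sketch to keep keys unambiguous.
        String key = fieldName + ':' + docId;
        boolean firstTime = SEEN.get().add(key);
        return firstTime ? fieldWeights.getOrDefault(fieldName, 0f) : 0f;
    }
}
```

Calling score("place", docId) three times per document with a place weight of 3 adds up to 3 for each doc, as in case #3. The sketch takes docId as a plain argument, which is exactly the piece I am missing inside the real scorer.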

