Re: Getting unique key of a document inside of a Similarity class.

J-Pro Thu, 19 Feb 2015 16:04:48 -0800

how are you defining/specifying these field weights?


I define weights inside of a query (name:SomeName^7).

it would help if you could give a concrete example of some sample docs, a
sample query, and what results you would expect ... the sample input and
sample output of the system you are interested in.


Sure. Imagine we have 2 docs:

doc1
-----------------
name:DocumentOne
place:34 High Street (StandardTokenizerFactory, i.e. 3 tokens created)

doc2
-----------------
name:DocumentTwo
place:34 High Street (StandardTokenizerFactory, i.e. 3 tokens created)

I want the following queries to return docs with these scores:

1. name:DocumentOne^7 => doc1(score=7)
2. name:DocumentOne^7 AND place:notExist^3 => doc1(score=7)
3. place:(34\ High\ Street)^3 => doc1(score=3), doc2(score=3)

4. name:DocumentOne^7 OR place:(34\ High\ Street)^3 => doc1(score=10),doc2(score=3)

If you're curious about why I need this, i.e. about my initial "problem X": I need this scoring to be able to calculate a matching percentage. That's a separate topic; I have read a lot about it (including http://wiki.apache.org/lucene-java/ScoresAsPercentages), and people say it's either not doable or very complicated with Solr. So I just want to give it a try. For case #3 above, the matching percentage is 100% for both docs. For case #4, it's 100% for doc1 and 30% for doc2.
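To make the percentages concrete, here is a minimal sketch of the arithmetic in plain Java, assuming the maximum possible score is simply the sum of the boosts of all fields in the query. The class and method names (MatchPercentage, percentage) are hypothetical and are not part of Solr or Lucene.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not a Solr/Lucene API: turns "which fields matched"
// into a percentage of the best score the query could have produced.
class MatchPercentage {

    // Assumption: the best possible score is the sum of all field boosts in the query.
    static double maxScore(Map<String, Double> fieldBoosts) {
        double total = 0;
        for (double boost : fieldBoosts.values()) {
            total += boost;
        }
        return total;
    }

    // Each matched field contributes its boost exactly once.
    static double percentage(Map<String, Double> fieldBoosts, Set<String> matchedFields) {
        double score = 0;
        for (String field : matchedFields) {
            score += fieldBoosts.getOrDefault(field, 0.0);
        }
        double max = maxScore(fieldBoosts);
        return max == 0 ? 0 : 100.0 * score / max;
    }
}
```

With boosts name=7 and place=3 (case #4), a doc matching both fields scores 10 out of 10, and a doc matching only place scores 3 out of 10, which gives the 100% and 30% figures above.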

it's not clear why you need any sort of unique document identification for
your scoring algorithm .. from what you described, matches on fieldA should
get score "A", matches on fieldB should get score "B" ... why does it matter
which doc is which?

For case #3, for example, the method SimScorer.score is called 3 times for each of these documents, 6 times in total. I have added a ThreadLocal<HashSet<String>> to my custom similarity, which is cleared before every new scoring session (after each query execution). This HashSet stores strings consisting of fieldName + docID. Every time score() is called, I check this HashSet: if fieldName + docID already exists, I return 0 as the score, otherwise the field weight.

If there were no docID in this string (only the field name), case #3 would return doc1(score=3), doc2(score=0). If there were no HashSet at all, case #3 would return doc1(score=9), doc2(score=9), since the query matched all 3 tokens for every doc.
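The bookkeeping described above can be sketched in plain Java, independent of Lucene. This is only an illustration of the "count each field once per document" idea: the class OncePerFieldScorer, its methods, and the ':' separator in the key are assumptions of this sketch, and it does not implement the real SimScorer interface.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the custom similarity's bookkeeping; not a Lucene class.
class OncePerFieldScorer {

    // One set of already-scored (field, doc) keys per thread.
    private static final ThreadLocal<Set<String>> SEEN =
            ThreadLocal.withInitial(HashSet::new);

    // Call before each new scoring session (i.e. each query execution).
    static void reset() {
        SEEN.get().clear();
    }

    // Returns the field weight the first time a (field, doc) pair is scored, 0 afterwards.
    static float score(String fieldName, int docId, float fieldWeight) {
        // Assumption: a separator is added so "f1" + 23 and "f12" + 3 cannot collide.
        String key = fieldName + ':' + docId;
        return SEEN.get().add(key) ? fieldWeight : 0f;
    }
}
```

Calling score("place", docId, 3f) three times for the same document, once per matched token, adds up to 3 rather than 9, and a second document with a different docId still receives its own 3.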

I know that what I'm doing is a "hack", but that's the only way I've found so far to implement percentage matching. I just want to play around with it, see how it performs, and decide whether to use it or not. But for that I need to uniquely identify a document while scoring :)
