Getting unique key of a document inside of a Similarity class.
Good afternoon.

I need to uniquely identify a document inside a Similarity class during scoring. Is it possible to get the value of a document's unique key at this point?

For some time I thought I could use the internal docID for this. The method score(int doc, float freq) is called after every query execution for each matched doc. For the indexed docs the doc argument equals 0, 1, 2, etc., but only when the documents are indexed in bulk, i.e. in a single HTTP request. When docs are indexed in separate requests, the docID equals 0 for all documents.

To summarize, here are my 2 questions:

1. Is the docID behavior described above a bug or a feature? Obviously, if it's a bug and I can use docID to uniquely identify a document, then my question is answered once the bug is fixed.
2. If the docID behavior described above is normal, then what is an alternative way to uniquely identify a document inside a Similarity class during scoring? Can I get the unique key of the document being scored in Similarity?

FYI: I asked the 1st question in the #solr IRC channel. A person named hoss answered the following: "you're seeing the *internal* docIds ... you can't assign any special meaning to them ... i believe that at the level of the Similarity class, these may even be per segment, which means that in the context of a SegmentReader they can be used to get things like docValues, but they don't have any meaning compared to your uniqueKey (for example)".

This makes me think that the answer to the 1st question is "it's a feature". But I am still not sure, and I don't know the answer to the 2nd question.

Please help. Thank you very much in advance.
Re: Getting unique key of a document inside of a Similarity class.
Thank you for your answer, Chris. I will reply with inline comments as well. Please see below.

> : I need to uniquely identify a document inside of a Similarity class during
> : scoring. Is it possible to get value of unique key of a document at this
> : point?
>
> Can you tell us a bit more about your usecase ... your problem description is a bit vague, and sounds like it may be an "XY Problem"...

Sure, sorry I did not do it before; I just wanted to take up as little of your valuable time as possible. In my custom Similarity class I am trying to implement logic where the score calculation is based only on field weight and field match, nothing else. In other words, if a field matches the query, I want the "score" method to return only this field's weight, regardless of factors like: norms; coord; doc frequencies; the fact that the field was multivalued and more than one value matched; the fact that the field was tokenized into multiple tokens and more than one token matched, etc. As far as I know, no such similarity exists among the built-in ones.

To implement this, I am trying to score only once per combination of a specific field + unique doc identifier. I don't care what this unique doc identifier is: it can be the unique key or it can be the internal doc ID. I had my implementation working, but as I understand from your answer, it worked only for one segment. So now I need to add a segment ID or something similar to my combination.

> Assuming the method you are referring to (you didn't give a specific class/interface name) is SimScorer.score(doc, freq) then the javadocs say...
>
>   doc - document id within the inverted index segment
>   freq - sloppy term frequency
>
> ...so for #1, yes this is definitely the per-segment docId.

Yes, it's ExactSimScorer.score(int doc, int freq). Ah! Per segment! Now I understand why it's 0 after every new commit: the Solr docs say new docs are written to a new segment. So question #1 is clear to me. Thanks, Chris!

> for #2: the method for providing a SimScorer to lucene is by implementing Similarity.simScorer(...) -- that method gets as an argument an AtomicReaderContext context, which not only has an AtomicReader for the individual segment, but also details about that segment's role in the larger index.

Interesting details, that may be exactly what I need. If I can somehow uniquely identify a document using its internal doc ID + data from the context (like a segment ID or something), that would be awesome. I have checked AtomicReaderContext; it has 'ord' (The readers ord in the top-level's leaves array) and 'docBase' (The readers absolute doc base), which is probably what I need.

Do you have any more information (maybe links to wikis) about AtomicReaderContext, DocValues, and the "low" and "top" levels (other than the javadoc in the source code)? I have a high-level understanding, but it's obviously not enough for the problem I am solving. I would be more than happy to understand it.

Thank you very much for your time, Chris and everyone else who spends time reading and answering this thread!
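To make the 'ord' / 'docBase' idea concrete, here is a minimal sketch in plain Java with no Lucene dependency. LeafStub and SegmentDocIds are hypothetical stand-ins, not Lucene classes; they only mirror the two fields named above and assume that docBase is the number of documents in all preceding segments, so that docBase + per-segment doc ID is unique across the index as seen by one searcher.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a per-segment context; it only mirrors 'ord' and 'docBase'. */
final class LeafStub {
    final int ord;      // position of this segment in the top-level leaves list
    final int docBase;  // assumed: count of documents in all preceding segments
    final int maxDoc;   // number of documents in this segment

    LeafStub(int ord, int docBase, int maxDoc) {
        this.ord = ord;
        this.docBase = docBase;
        this.maxDoc = maxDoc;
    }

    /** Converts a per-segment doc ID into an index-wide one. */
    int globalDocId(int localDoc) {
        if (localDoc < 0 || localDoc >= maxDoc) {
            throw new IllegalArgumentException("doc " + localDoc + " is not in segment " + ord);
        }
        return docBase + localDoc;
    }
}

final class SegmentDocIds {
    private SegmentDocIds() {}

    /** Builds one stub per segment, assigning docBase cumulatively from the segment sizes. */
    static List<LeafStub> layout(int... docsPerSegment) {
        List<LeafStub> leaves = new ArrayList<>();
        int docBase = 0;
        for (int ord = 0; ord < docsPerSegment.length; ord++) {
            leaves.add(new LeafStub(ord, docBase, docsPerSegment[ord]));
            docBase += docsPerSegment[ord];
        }
        return leaves;
    }
}
```

With SegmentDocIds.layout(1, 1, 1), which models three documents committed separately, every document has per-segment ID 0, but globalDocId(0) differs per segment. Such an ID is only meaningful for the lifetime of one searcher, since segment merges renumber documents; that should be enough for de-duplicating within a single query.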
Re: Getting unique key of a document inside of a Similarity class.
> how are you defining/specifying these field weights?

I define the weights inside the query (name:SomeName^7).

> it would help if you could give a concrete example of some sample docs, a sample query, and what results you would expect ... the sample input and sample output of the system you are interested in.

Sure. Imagine we have 2 docs:

doc1 - name:DocumentOne place:34 High Street (StandardTokenizerFactory, i.e. 3 tokens created)
doc2 - name:DocumentTwo place:34 High Street (StandardTokenizerFactory, i.e. 3 tokens created)

I want the following queries to return docs with these scores:

1. name:DocumentOne^7 => doc1(score=7)
2. name:DocumentOne^7 AND place:notExist^3 => doc1(score=7)
3. place:(34\ High\ Street)^3 => doc1(score=3), doc2(score=3)
4. name:DocumentOne^7 OR place:(34\ High\ Street)^3 => doc1(score=10), doc2(score=3)

If you're curious why I need this, i.e. about my original "problem X": I need this scoring to be able to calculate a matching percentage. That's a separate topic; I have read a lot about it (including http://wiki.apache.org/lucene-java/ScoresAsPercentages) and people say it's either not doable or very complicated with Solr. So I just want to give it a try. For case #3 above the matching percentage is 100% for both docs. For case #4 it's doc1: 100% and doc2: 30%.

> it's not clear why you need any sort of unique document identification for your scoring algorithm .. from what you described, matches on fieldA should get score "A" matches on fieldB should get score "B" ... why does it matter which doc is which?

For case #3, for example, the method SimScorer.score is called 3 times for each of these documents, 6 times in total. I have added a ThreadLocal<HashSet<String>> to my custom similarity, which is cleared before every new scoring session (after each query execution). This HashSet stores strings consisting of fieldName + docID. Every time score() is called, I check this HashSet: if fieldName + docID is already there, I return 0 as the score, otherwise the field weight.

If the string contained only the field name (no docID), case #3 would return: doc1(score=3), doc2(score=0). If there were no HashSet at all, case #3 would return: doc1(score=9), doc2(score=9), since the query matched all 3 tokens for every doc.

I know that what I'm doing is a "hack", but that's the only way I've found so far to implement percentage matching. I just want to play around with it, see how it performs and decide whether to use it or not. But for that I need to uniquely identify a document while scoring :)
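The "score once per field + document" bookkeeping described above could look roughly like the sketch below. ScoreOnceTracker is a hypothetical helper, not a Lucene class, and it assumes the caller can supply the segment's docBase next to the per-segment doc ID; the ThreadLocal wrapper and the Similarity wiring are left out.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: returns a field's weight only the first time a (field, document) pair is scored. */
final class ScoreOnceTracker {
    private final Set<String> seen = new HashSet<>();

    /**
     * @param field    name of the field being scored
     * @param docBase  assumed offset of the current segment within the whole index
     * @param localDoc per-segment doc ID as passed to score()
     * @param weight   the field's weight (query boost)
     */
    float score(String field, int docBase, int localDoc, float weight) {
        // docBase + localDoc keeps keys distinct for documents living in different segments
        String key = field + ':' + (docBase + localDoc);
        // Set.add returns false when the key is already present
        return seen.add(key) ? weight : 0f;
    }

    /** Forget everything; meant to be called before each new query is scored. */
    void reset() {
        seen.clear();
    }
}
```

For case #3 with doc1 and doc2 in separate single-document segments (docBase 0 and 1, per-segment ID 0 for both), three calls per document add up to 3 for each. Passing docBase 0 for both reproduces the problem from this thread: the second document's key collides with the first and it scores 0.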
Re: Getting unique key of a document inside of a Similarity class.
> from all the examples of what you've described, i'm fairly certain all you really need is a TFIDF based Similarity where coord(), idf(), tf() and queryNorm() return 1 always, and you omitNorms from all fields.

Yeah, that's what I did in the very first iteration. It works only for cases #1 and #2. If you try queries 3 and 4 with such a Similarity, you'll get:

3. place:(34\ High\ Street)^3 => doc1(score=9), doc2(score=9)
4. name:DocumentOne^7 OR place:(34\ High\ Street)^3 => doc1(score=16), doc2(score=9)

That is not what I need. As I described above, when multiple tokens match for a field, the method SimScorer.score is called X times, where X is the number of matched tokens (in cases #3 and #4 there are 3 tokens), so the score adds up. I need to score only once in this case, regardless of the number of tokens. How can I do that?

My first idea was a HashSet keyed on fieldName, so that after scoring a field once, it doesn't score again. But then only the first document got a score (since the second and all later documents have the same field name). So I understood that I also need the docID in the key. And that worked fine until I found out (thank you for that) that the docID is segment-specific. So now I need a segmentID as well (or something similar).

> (You didn't give any examples of what you expect to happen with exclusion clauses in your BooleanQueries

For my needs I won't need exclusion clauses, but the same would happen in that case: it would score by weight, because the condition is true:

5. (NOT name:DocumentOne)^7 => doc2(score=7)
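The difference between the two scoring behaviors in this message can be checked with a small model of the arithmetic. This is only a sketch of the numbers quoted in the thread, not of how Lucene combines scores internally; FieldWeightScoring and its method names are made up for illustration.

```java
import java.util.Map;

/** Hypothetical model of the two scoring behaviors discussed in this thread. */
final class FieldWeightScoring {
    private FieldWeightScoring() {}

    /** Adds the field weight once per matched token (the "all factors return 1" behavior). */
    static float perTokenScore(Map<String, Float> weights, Map<String, Integer> matchedTokens) {
        float score = 0f;
        for (Map.Entry<String, Integer> e : matchedTokens.entrySet()) {
            Float w = weights.get(e.getKey());
            if (w != null) {
                score += w * e.getValue();
            }
        }
        return score;
    }

    /** Adds the field weight once per matched field, however many tokens matched. */
    static float perFieldScore(Map<String, Float> weights, Map<String, Integer> matchedTokens) {
        float score = 0f;
        for (Map.Entry<String, Integer> e : matchedTokens.entrySet()) {
            Float w = weights.get(e.getKey());
            if (w != null && e.getValue() > 0) {
                score += w;
            }
        }
        return score;
    }

    /** Matching percentage: the score relative to the sum of all field weights in the query. */
    static float percentage(float score, Map<String, Float> weights) {
        float total = 0f;
        for (float w : weights.values()) {
            total += w;
        }
        return total == 0f ? 0f : 100f * score / total;
    }
}
```

For query 4 (weights name=7, place=3), doc1 matches 1 token in name and 3 in place: perTokenScore gives 16 and perFieldScore gives 10. doc2 matches only the 3 place tokens: 9 versus 3, and percentage(3, weights) gives 30, in line with the figures above.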