Thank you for your answer, Chris. I will reply with inline comments as well. Please see below.

: I need to uniquely identify a document inside of a Similarity class during
: scoring. Is it possible to get value of unique key of a document at this
: point?

Can you tell us a bit more about your use case ... your problem description
is a bit vague, and sounds like it may be an "XY Problem"...

Sure, and sorry I did not do it before; I just wanted to take up as little of your valuable time as possible. In my custom Similarity class I am trying to implement scoring that is based only on the field weight and on whether the field matched, and nothing else. In other words, if a field matches the query, I want the "score" method to return only this field's weight, regardless of factors such as norms, coord, document frequencies, the fact that the field was multivalued and more than one value matched, the fact that the field was tokenized into multiple tokens and more than one token matched, and so on. As far as I know, none of the existing similarities does this.

To implement this, I am trying to score only once for each combination of a specific field and a unique document identifier. I don't care what this unique document identifier is: it can be the unique key or it can be the internal doc ID. I had my implementation working, but as I understand from your answer, it was working only for a single segment. So now I need to add a segment ID, or something like it, to my combination.
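
To illustrate the "score once per field + document" idea, here is a minimal sketch in plain Java with no Lucene types. The class name ScoreOnceSketch, the score(...) signature, and the key format are made up for illustration only; they are not Lucene API, and docKey is assumed to be some identifier that is already unique across segments.

```java
import java.util.HashSet;
import java.util.Set;

public class ScoreOnceSketch {
    // Remembers which (field, document) combinations have already been scored.
    private final Set<String> seen = new HashSet<>();

    /**
     * Returns fieldWeight the first time a (field, docKey) pair is seen
     * and 0 on every later call for the same pair.
     * docKey is a hypothetical index-wide unique document identifier.
     */
    public float score(String field, long docKey, float fieldWeight) {
        // docKey is numeric and comes last, so the combined key is unambiguous.
        String key = field + "#" + docKey;
        return seen.add(key) ? fieldWeight : 0f;
    }

    public static void main(String[] args) {
        ScoreOnceSketch sketch = new ScoreOnceSketch();
        System.out.println(sketch.score("title", 7L, 2.0f)); // first match on title
        System.out.println(sketch.score("title", 7L, 2.0f)); // repeated match, same field and doc
        System.out.println(sketch.score("body", 7L, 0.5f));  // same doc, different field
    }
}
```

The "seen" set holds per-query state in this sketch, so it would have to be created fresh (or cleared) for every query.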


Assuming the method you are referring to (you didn't give a specific
class/interface name) is SimScorer.score(doc,freq) then the javadocs say...

     doc - document id within the inverted index segment
     freq - sloppy term frequency

...so for #1, yes this is definitely the per-segment docId.

Yes, it's ExactSimScorer.score(int doc, int freq). Ah, per segment! Now I understand why it is 0 after every new commit: the Solr documentation says that new documents are written to a new segment. So question #1 is clear to me now. Thanks, Chris!


for #2: the method for providing a SimScorer to lucene is by implementing
Similarity.simScorer(...) -- that method gets as an argument an
AtomicReaderContext context, which not only has an AtomicReader for the
individual segment, but also details about that segment's role in the
larger index.

Interesting details; that may be exactly what I need. If I can uniquely identify a document using its internal doc ID plus data from the context (such as a segment ID), that would be awesome. I have checked AtomicReaderContext: it has 'ord' (the reader's ord in the top-level's leaves array) and 'docBase' (the reader's absolute doc base), which are probably what I need. Do you have any more information (maybe links to wikis) about AtomicReaderContext, DocValues, and the "low" and "top" levels, other than the javadoc in the source code? I have a high-level understanding, but it's obviously not enough for the problem I am solving, and I would be more than happy to understand it better.
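
To make the question concrete, here is a minimal sketch in plain Java (no Lucene dependency) of how the two values might be combined. It assumes that adding a segment's docBase to the per-segment doc ID gives a doc ID that is unique within one top-level reader. LeafInfo and globalDocId are hypothetical stand-ins invented for this example, not Lucene classes or methods, and the doc base values are made up.

```java
public class GlobalDocIdSketch {
    /** Hypothetical stand-in for the 'ord' and 'docBase' fields of AtomicReaderContext. */
    public static final class LeafInfo {
        public final int ord;      // position of the segment in the top-level leaves array
        public final int docBase;  // absolute doc ID of the segment's first document

        public LeafInfo(int ord, int docBase) {
            this.ord = ord;
            this.docBase = docBase;
        }
    }

    /** Assumption: top-level doc ID = segment doc base + per-segment doc ID. */
    public static int globalDocId(LeafInfo leaf, int segmentDocId) {
        return leaf.docBase + segmentDocId;
    }

    public static void main(String[] args) {
        // Two made-up segments: the first holds docs 0..99, the second starts at 100.
        LeafInfo first = new LeafInfo(0, 0);
        LeafInfo second = new LeafInfo(1, 100);

        // Both segments report per-segment doc ID 0, but the combined IDs differ.
        System.out.println(globalDocId(first, 0));
        System.out.println(globalDocId(second, 0));
    }
}
```

Such an ID would only be meaningful for one point-in-time view of the index: after a commit or a segment merge the doc bases can change, so it is not a stable identifier like the unique key.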

Thank you very much for your time, Chris, and thanks to everyone else who spends time reading and answering this thread!
