: I need to uniquely identify a document inside of a Similarity class during
: scoring. Is it possible to get value of unique key of a document at this
: point?

Can you tell us a bit more about your use case? Your problem description 
is a bit vague, and sounds like it may be an "XY Problem"...

https://people.apache.org/~hossman/#xyproblem
Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue.  Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341

: 1. Is docIds behavior described above a bug or a feature? Obviously, if it's a
: bug and I can use docID to uniquely identify a document, then my question is
: answered after this bug is fixed.
: 2. If docIds behavior described above is normal, then what is an alternative
: way of uniquely identify a document inside of a Similarity class during
: scoring? Can I get unique key of a scoring document in Similarity?

Assuming the method you are referring to (you didn't give a specific 
class/interface name) is SimScorer.score(doc,freq) then the javadocs say...

    doc - document id within the inverted index segment
    freq - sloppy term frequency

...so for #1: that's the documented behavior -- this is definitely the 
per-segment docId, not an index-wide one.

For #2: the method for providing a SimScorer to Lucene is by implementing 
Similarity.simScorer(...) -- that method gets as an argument an 
AtomicReaderContext context, which not only has an AtomicReader for the 
individual segment, but also details about that segment's role in the 
larger index.
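
To make the per-segment vs. whole-index relationship concrete, here is a 
minimal standalone sketch in plain Java (no Lucene dependency). The 
SegmentSketch class and its methods are hypothetical stand-ins, not Lucene 
APIs; as far as I know the corresponding value in Lucene 4.x is the 
docBase field on AtomicReaderContext, but check the javadocs for your 
version.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a segment's context: NOT a Lucene class.
final class SegmentSketch {
    final int docBase; // top-level id of this segment's first document
    final int maxDoc;  // number of document slots in this segment

    SegmentSketch(int docBase, int maxDoc) {
        this.docBase = docBase;
        this.maxDoc = maxDoc;
    }

    // Convert a per-segment docId into a "top level" docId for the whole index.
    int toTopLevel(int segmentDocId) {
        if (segmentDocId < 0 || segmentDocId >= maxDoc) {
            throw new IllegalArgumentException("docId out of range: " + segmentDocId);
        }
        return docBase + segmentDocId;
    }

    // Lay segments out end to end, so each docBase is the sum of the sizes before it.
    static List<SegmentSketch> layout(int... segmentSizes) {
        List<SegmentSketch> segments = new ArrayList<>();
        int base = 0;
        for (int size : segmentSizes) {
            segments.add(new SegmentSketch(base, size));
            base += size;
        }
        return segments;
    }

    public static void main(String[] args) {
        List<SegmentSketch> segments = layout(3, 2);
        // Per-segment docId 0 exists in both segments but maps to different top-level ids.
        System.out.println(segments.get(0).toTopLevel(0));
        System.out.println(segments.get(1).toTopLevel(0));
    }
}
```

The point of the sketch: the same per-segment docId shows up once per 
segment, so on its own it can't identify a document across the index.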

As far as getting the Solr uniqueKey ... it's non-trivial, and there are 
different things you could do depending on what your ultimate goal is (ie: 
see my earlier question about XY problem). My guess is that from this low 
level down in the code you want to use DocValues (aka: FieldCache in older 
versions of Lucene) on your uniqueKey field, then ask it for the field 
value of each internal docId that gets passed to your method. You can do 
that either by using the per-segment DocValues directly, or by using the 
AtomicReaderContext's base information (its docBase) to determine the 
"top level" internal docId and looking that up in the "top level" 
DocValues/FieldCache.

(the per-segment vs "top level" DocValues and internalId stuff can be kind 
of confusing -- start with whichever seems simpler based on your 
understanding of the internal lucene/solr APIs and worry about maybe 
switching to the other approach later once you have something working and 
see if it helps or hinders performance for your use cases)
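
A rough model of the two lookup strategies, again in plain Java with no 
Lucene dependency. Everything here (the KeyLookupSketch class, its arrays 
and method names) is hypothetical and only mimics the shape of the 
problem: the String arrays stand in for whatever DocValues/FieldCache 
structure you end up reading the uniqueKey from.

```java
// Hypothetical model, NOT Lucene code: each inner array holds the uniqueKey
// values of one segment, indexed by per-segment docId.
final class KeyLookupSketch {
    private final String[][] keysBySegment; // stand-in for per-segment values
    private final int[] docBases;           // first top-level docId of each segment
    private final String[] topLevelKeys;    // stand-in for "top level" values

    KeyLookupSketch(String[][] keysBySegment) {
        this.keysBySegment = keysBySegment;
        this.docBases = new int[keysBySegment.length];
        int total = 0;
        for (int i = 0; i < keysBySegment.length; i++) {
            docBases[i] = total;
            total += keysBySegment[i].length;
        }
        // Concatenate the segments so index == top-level docId.
        this.topLevelKeys = new String[total];
        for (int i = 0; i < keysBySegment.length; i++) {
            System.arraycopy(keysBySegment[i], 0, topLevelKeys,
                             docBases[i], keysBySegment[i].length);
        }
    }

    // Approach 1: read the segment's own values with the per-segment docId.
    String keyPerSegment(int segmentOrd, int segmentDocId) {
        return keysBySegment[segmentOrd][segmentDocId];
    }

    // Approach 2: add the segment's docBase, then read the top-level values.
    String keyTopLevel(int segmentOrd, int segmentDocId) {
        return topLevelKeys[docBases[segmentOrd] + segmentDocId];
    }

    public static void main(String[] args) {
        KeyLookupSketch lookup = new KeyLookupSketch(
            new String[][] { {"a", "b", "c"}, {"d", "e"} });
        // Both approaches should resolve the same document to the same key.
        System.out.println(lookup.keyPerSegment(1, 0));
        System.out.println(lookup.keyTopLevel(1, 0));
    }
}
```

Either way you end up with the same key for the same document; the 
difference is only in which structure you have to build and keep around.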

-Hoss
http://www.lucidworks.com/
