Getting unique key of a document inside of a Similarity class.

2015-02-19 Thread J-Pro

Good afternoon.

I need to uniquely identify a document inside of a Similarity class 
during scoring. Is it possible to get the value of the document's 
unique key at this point?


For some time I thought I could use the internal docID for that. The 
method score(int doc, float freq) is called after every query execution 
for each matched doc. For each indexed doc, the value is 0, 1, 2, etc. 
But this holds only when the documents are indexed in bulk, i.e. in a 
single HTTP request. When docs are indexed in separate requests, the 
docID is 0 for all documents.


To summarize, here are 2 final questions:

1. Is the docID behavior described above a bug or a feature? Obviously, 
if it's a bug and I can use the docID to uniquely identify a document, 
then my question is answered once this bug is fixed.
2. If the docID behavior described above is normal, then what is an 
alternative way to uniquely identify a document inside of a Similarity 
class during scoring? Can I get the unique key of the document being 
scored in Similarity?


FYI: I have asked 1st question in #solr IRC channel. The person named 
hoss answered the following: "you're seeing the *internal* docIds ... 
you can't assign any special meaning to them ... i believe that at the 
level of the Similarity class, these may even be per segment, which 
means that in the context of a SegmentReader they can be used to get 
things like docValues, but they don't have any meaning compared to your 
uniqueKey (for example)". This kinda makes me think that the answer to 
the 1st question is "it's a feature". But I am still not sure, and I 
don't know the answer to the 2nd question. Please help.


Thank you very much in advance.


Re: Getting unique key of a document inside of a Similarity class.

2015-02-19 Thread J-Pro
Thank you for your answer, Chris. I will reply with inline comments as 
well. Please see below.



: I need to uniquely identify a document inside of a Similarity class during
: scoring. Is it possible to get value of unique key of a document at this
: point?

Can you tell us a bit more about your usecase ... your problem description
is a bit vague, and sounds like it may be an "XY Problem"...


Sure, sorry I did not do it before, I just wanted to take a minimum of 
your valuable time. In my custom Similarity class I am trying to 
implement logic where the score calculation is based only on a field's 
weight and a field match - that's it. In other words, if a field matches 
the query, I want the "score" method to return this field's weight only, 
regardless of factors like: norms; coord; doc frequencies; the fact that 
the field was multivalued and more than one value matched; the fact that 
the field was tokenized into multiple tokens and more than one token 
matched, etc. As far as I know, there is no such similarity among the 
existing ones.
In order to implement this, I am trying to score only once for each 
combination of a specific field + unique doc identifier. And I don't 
care what this unique doc identifier is - it can be the unique key or it 
can be the internal doc ID.
I had my implementation working, but as I understood from your answer, I 
had it working only for one segment. So now I need to add a segment ID 
or something like it to my combination.




Assuming the method you are referring to (you didn't give a specific
class/interface name) is SimScorer.score(doc,freq) then the javadocs say...

 doc - document id within the inverted index segment
 freq - sloppy term frequency

...so for #1, yes this is definitely the per-segment docId.


Yes, it's ExactSimScorer.score(int doc, int freq). Ah! Per segment! Now 
I understand why it's 0 after every new commit! The Solr docs say new 
docs are written to a new segment. So question #1 is clear to me. 
Thanks, Chris!




for #2: the method for providing a SimScorer to lucene is by implementing
Similarity.simScorer(...) -- that method gets as an argument an
AtomicReaderContext context, which not only has an AtomicReader for the
individual segment, but also details about that segment's role in the
larger index.


Interesting details, that may be exactly what I need. If I can somehow 
uniquely identify a document using its internal doc id + data from 
context (like segment id or something), that would be awesome. I have 
checked AtomicReaderContext, it has 'ord' (The readers ord in the 
top-level's leaves array) and 'docBase' (The readers absolute doc base) 
- probably what I need. Do you have any more information (maybe links to 
wikis) about this AtomicReaderContext, DocValues, "low" and "top" levels 
(other than javadoc in source code)? I have a high-level understanding, 
but it's obviously not enough for the problem I am solving. I would be 
more than happy to understand it.
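
As a rough sketch of how 'docBase' relates to the per-segment doc id: 
for a given open reader, the index-wide id of a document is the 
segment's docBase plus the segment-local id. The classes below are a 
plain-Java model, not Lucene code; SegmentModel and GlobalDocIdSketch 
are hypothetical names used only for illustration. Note the assumption 
that the reader stays the same: after merges or a reopen, documents can 
be renumbered, so such an id is not a substitute for the uniqueKey.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for one index segment (not a Lucene class). */
final class SegmentModel {
    final int ord;      // position of the segment in the top-level leaves list
    final int docBase;  // number of documents in all preceding segments
    final int maxDoc;   // number of documents in this segment

    SegmentModel(int ord, int docBase, int maxDoc) {
        this.ord = ord;
        this.docBase = docBase;
        this.maxDoc = maxDoc;
    }

    /** Index-wide id for a segment-local doc id. */
    int globalDocId(int segmentDocId) {
        if (segmentDocId < 0 || segmentDocId >= maxDoc) {
            throw new IllegalArgumentException("doc id out of range: " + segmentDocId);
        }
        return docBase + segmentDocId;
    }
}

final class GlobalDocIdSketch {
    /** Builds segments from their sizes, assigning ord and docBase in order. */
    static List<SegmentModel> buildSegments(int... segmentSizes) {
        List<SegmentModel> segments = new ArrayList<>();
        int docBase = 0;
        for (int ord = 0; ord < segmentSizes.length; ord++) {
            segments.add(new SegmentModel(ord, docBase, segmentSizes[ord]));
            docBase += segmentSizes[ord];
        }
        return segments;
    }

    public static void main(String[] args) {
        // Three documents indexed in separate commits: one doc per segment,
        // so the segment-local id is 0 every time while the global id differs.
        for (SegmentModel s : buildSegments(1, 1, 1)) {
            System.out.println("ord=" + s.ord + " localDoc=0 globalDoc=" + s.globalDocId(0));
        }
    }
}
```

In this model every document has local id 0, which matches the behavior 
seen when indexing in separate requests, while docBase + local id still 
tells the documents apart.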


Thank you very much for your time, Chris and other people who spend time 
on reading/answering this thread!


Re: Getting unique key of a document inside of a Similarity class.

2015-02-19 Thread J-Pro

how are you defining/specifying these field weights?


I define weights inside of a query (name:SomeName^7).



it would help if you could give a concrete example of some sample docs, a
sample query, and what results you would expect ... the sample input and
sample output of the system you are interested in.


Sure. Imagine we have 2 docs:

doc1
-
name:DocumentOne
place:34 High Street (StandardTokenizerFactory, i.e. 3 tokens created)

doc2
-
name:DocumentTwo
place:34 High Street (StandardTokenizerFactory, i.e. 3 tokens created)

I want the following queries to return docs with these scores:

1. name:DocumentOne^7 => doc1(score=7)
2. name:DocumentOne^7 AND place:notExist^3 => doc1(score=7)
3. place:(34\ High\ Street)^3 => doc1(score=3), doc2(score=3)
4. name:DocumentOne^7 OR place:(34\ High\ Street)^3 => doc1(score=10), 
doc2(score=3)



If you're curious about why I need it, i.e. about my very initial 
"problem X": I need this scoring to be able to calculate a matching 
percentage. That's a separate topic, I read a lot about it (including 
http://wiki.apache.org/lucene-java/ScoresAsPercentages) and people say 
it's either not doable or very, very complicated with Solr. So I just 
want to give it a try. For case #3 from above, the matching percentage 
is 100% for both docs. For case #4 it's doc1:100% and doc2:30%.
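
The percentages quoted above can be reproduced with simple arithmetic, 
assuming that "matching percentage" means a document's score divided by 
the maximum achievable score, i.e. the sum of the boosts of all query 
clauses. That definition is an assumption made for this sketch, and 
MatchPercentSketch and its method are hypothetical names.

```java
final class MatchPercentSketch {
    /**
     * Percentage of the maximum possible score reached by one document.
     * The maximum is assumed to be the sum of all clause boosts.
     */
    static double matchPercent(double score, double... clauseBoosts) {
        double max = 0;
        for (double boost : clauseBoosts) {
            max += boost;
        }
        if (max <= 0) {
            throw new IllegalArgumentException("at least one positive boost is required");
        }
        return 100.0 * score / max;
    }

    public static void main(String[] args) {
        // Case #3: a single clause with boost 3, both docs score 3.
        System.out.println("case 3, doc1: " + matchPercent(3, 3));
        System.out.println("case 3, doc2: " + matchPercent(3, 3));
        // Case #4: clauses with boosts 7 and 3, doc1 scores 10, doc2 scores 3.
        System.out.println("case 4, doc1: " + matchPercent(10, 7, 3));
        System.out.println("case 4, doc2: " + matchPercent(3, 7, 3));
    }
}
```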




it's not clear why you need any sort of unique document identification for
your scoring algorithm ... from what you described, matches on fieldA should
get score "A" matches on fieldB should get score "B" ... why does it matter
which doc is which?


For case #3, for example, the method SimScorer.score is called 3 times 
for each of these documents, 6 times in total. I have added a 
ThreadLocal<HashSet<String>> to my custom similarity, which is cleared 
before every new scoring session (i.e. for each query execution). This 
HashSet stores strings consisting of fieldName + docID. Every time 
score() is called, I check this HashSet - if fieldName + docID exists, I 
return 0 as the score, otherwise the field weight.
If there was no docID in this string (only field name), then case #3 
would return the following: doc1(score=3), doc2(score=0). If there was 
no HashSet at all, case #3 would return: doc1(score=9), doc2(score=9) 
since query matched all 3 tokens for every doc.
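
A minimal sketch of the bookkeeping described above, written as plain 
Java without Lucene so that it runs on its own. OncePerFieldScorer is a 
hypothetical name, the call to reset() stands in for clearing the set 
before each query, and the doc id passed in is assumed to be unique 
across the whole index (which the per-segment id is not).

```java
import java.util.HashSet;
import java.util.Set;

final class OncePerFieldScorer {
    // One set per thread, so concurrent queries do not see each other's keys.
    private static final ThreadLocal<Set<String>> SEEN =
            ThreadLocal.withInitial(HashSet::new);

    /** Call before each query so keys from the previous query are forgotten. */
    static void reset() {
        SEEN.get().clear();
    }

    /** Returns the field weight on the first call for (field, doc), then 0. */
    static float score(String fieldName, int docId, float fieldWeight) {
        String key = fieldName + '#' + docId;
        // Set.add returns false when the key was already present.
        return SEEN.get().add(key) ? fieldWeight : 0f;
    }

    public static void main(String[] args) {
        // Case #3: three tokens of "place" match in each of two documents,
        // so score() is called three times per document.
        reset();
        for (int docId = 0; docId < 2; docId++) {
            float total = 0f;
            for (int token = 0; token < 3; token++) {
                total += score("place", docId, 3f);
            }
            System.out.println("doc " + docId + " total score: " + total);
        }
    }
}
```

Keying on the field name alone would let only the first document score, 
and dropping the set entirely would add the weight once per matched 
token, which are the two failure modes described above.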


I know that what I'm doing is a "hack", but that's the only way I've 
found so far to implement percentage matching. I just want to play 
around with it, see how it performs and decide whether to use it or not. 
But for that I need to uniquely identify a document while scoring :)


Re: Getting unique key of a document inside of a Similarity class.

2015-02-20 Thread J-Pro

from all the examples of what you've described, i'm fairly certain all you
really need is a TFIDF based Similarity where coord(), idf(), tf() and
queryNorm() return 1 always, and you omitNorms from all fields.


Yeah, that's what I did in the very first iteration. It works only for 
cases #1 and #2. If you try queries 3 and 4 with such a Similarity, 
you'll get:


3. place:(34\ High\ Street)^3 => doc1(score=9), doc2(score=9)
4. name:DocumentOne^7 OR place:(34\ High\ Street)^3 => doc1(score=16), 
doc2(score=9)


That is not what I need. As I described above, when multiple tokens 
match for a field, the method SimScorer.score is called X times, where X 
is the number of matched tokens (in cases #3 and #4 there are 3 tokens), 
so the scores add up. I need to score only once in this case, regardless 
of the number of tokens.
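
The numbers in queries 3 and 4 follow from simple addition, assuming 
each matched token contributes the field's boost exactly once when all 
other factors are 1. The sketch below is plain Java rather than Lucene 
code; FieldMatch and TokenSumSketch are hypothetical names, and it 
compares the per-token sum with the desired once-per-field sum.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical record of how one query clause matched one document. */
final class FieldMatch {
    final String field;
    final int boost;          // query-time boost, e.g. 7 for name:...^7
    final int matchedTokens;  // number of tokens of this field that matched

    FieldMatch(String field, int boost, int matchedTokens) {
        this.field = field;
        this.boost = boost;
        this.matchedTokens = matchedTokens;
    }
}

final class TokenSumSketch {
    /** Boost counted once per matched token (the unwanted behavior). */
    static int perTokenScore(List<FieldMatch> matches) {
        int total = 0;
        for (FieldMatch m : matches) {
            total += m.boost * m.matchedTokens;
        }
        return total;
    }

    /** Boost counted once per matched field (the desired behavior). */
    static int oncePerFieldScore(List<FieldMatch> matches) {
        int total = 0;
        for (FieldMatch m : matches) {
            if (m.matchedTokens > 0) {
                total += m.boost;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Query 4 against doc1: name matches one token, place matches three.
        List<FieldMatch> doc1 = Arrays.asList(
                new FieldMatch("name", 7, 1),
                new FieldMatch("place", 3, 3));
        System.out.println("per token: " + perTokenScore(doc1));
        System.out.println("once per field: " + oncePerFieldScore(doc1));
    }
}
```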


How to do it? My first idea was a HashSet based on fieldName, so that 
after scoring once, it doesn't score anymore. But in this case only the 
first document was scored (since the second and later documents have the 
same field name). So I understood that I also need the docID for that. 
And it worked fine until I found out (thank you for that) that the docID 
is segment-specific. So now I need a segmentID as well (or something 
similar).




(You didn't give any examples of what you expect to happen with exclusion
clauses in your BooleanQueries


For my needs I won't need exclusion clauses, but in this case the same 
would happen - it would score depending on weight, because condition is 
true:


5. (NOT name:DocumentOne)^7 => doc2(score=7)