Re: Unable to get offsets using AtomicReader.termPositionsEnum(Term)

Jefferson French Mon, 10 Mar 2014 18:33:28 -0700

This looks like a codec issue, but I'm not sure how to address it. I've
found that my code and Solr's TermVectorComponent end up with different
implementations of DocsAndPositionsEnum.


Mine:
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$EverythingEnum
Solr: 
org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVDocsEnum

As far as I can tell, I've only used Lucene/Solr 4.6, so I'm not sure where
the Lucene 4.1 reference comes from. I've searched through the Solr config
files and can't see where to change the codec, but shouldn't the reader use
the same codec as used when the index was created?
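
One thing that may be worth ruling out before changing codecs: judging by
their names, the two classes above read different structures.
Lucene41PostingsReader iterates the postings lists, while
CompressingTermVectorsReader iterates per-document term vectors. In a Solr
schema, termOffsets="true" stores offsets in the term vectors only; offsets
in the postings are controlled by a separate field attribute,
storeOffsetsWithPositions. Below is a sketch of a field definition that
enables both. The field name "text" and the type "text_general" are
placeholders, not taken from the actual schema, and the index would need to
be rebuilt after such a change.

```xml
<!-- Hypothetical field: name and type are placeholders for illustration. -->
<!-- termVectors/termPositions/termOffsets populate the term vectors that
     TermVectorComponent (/tvrh) reads. -->
<!-- storeOffsetsWithPositions writes offsets into the postings, which is
     what AtomicReader.termPositionsEnum(Term) reads. -->
<field name="text" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"
       storeOffsetsWithPositions="true"/>
```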


On Fri, Mar 7, 2014 at 1:37 PM, Jefferson French <jkfaus...@gmail.com> wrote:

> We have an API on top of Lucene 4.6 that I'm trying to adapt to running
> under Solr 4.6. The problem is that, although I get the correct offsets
> when the index is created by Lucene, the same method calls always return -1
> when the index is created by Solr. In the latter case I can see the
> character offsets via Luke, and I can even get them from Solr when I access
> the /tvrh search handler, which uses the TermVectorComponent class.
>
> This is roughly how I'm reading character offsets in my Lucene code:
>
>> AtomicReader reader = ...
>> Term term = ...
>> DocsAndPositionsEnum postings = reader.termPositionsEnum(term);
>> while (postings.nextDoc() != DocsAndPositionsEnum.NO_MORE_DOCS) {
>>   for (int i = 0; i < postings.freq(); i++) {
>>     postings.nextPosition(); // offsets are only defined after advancing
>>     System.out.println("start:" + postings.startOffset());
>>     System.out.println("end:" + postings.endOffset());
>>   }
>> }
>
>
> Notice that I want the values for a single term. When run against an index
> created by Solr, the above calls to startOffset() and endOffset() return
> -1. Solr's TermVectorComponent prints the correct offsets like this
> (paraphrased):
>
>> IndexReader reader = searcher.getIndexReader();
>> Terms vector = reader.getTermVector(docId, field);
>> TermsEnum termsEnum = vector.iterator(null);
>> DocsAndPositionsEnum dpEnum = null;
>> BytesRef text;
>> while ((text = termsEnum.next()) != null) {
>>   String term = text.utf8ToString();
>>   // totalTermFreq() is only valid once the enum is positioned on a term
>>   int freq = (int) termsEnum.totalTermFreq();
>>   dpEnum = termsEnum.docsAndPositions(null, dpEnum);
>>   dpEnum.nextDoc();
>>   for (int i = 0; i < freq; i++) {
>>     final int pos = dpEnum.nextPosition();
>>     System.out.println("start:" + dpEnum.startOffset());
>>     System.out.println("end:" + dpEnum.endOffset());
>>   }
>> }
>
>
> but in this case it reads the offsets for every term in one document,
> rather than for a single term across documents, which is what I want.
>
> Could anyone tell me:
>
>    1. Why I'm not able to get the offsets using my first example, and/or
>    2. A better way to get the offsets for a given term?
>
> Thanks.
>
>        Jeff
