Thank you, Robert. You are right, I was confused between the two. I also didn't know that storeOffsetsWithPositions existed. My code now works as I expected.
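For anyone finding this thread later, the two index structures Robert describes map to separate field options in Solr's schema.xml. A minimal sketch (the field names and the text_general type are illustrative, not from the original thread):

```xml
<!-- Option 1: per-document term vectors (a mini inverted index per doc),
     with positions and offsets stored inside the term vectors -->
<field name="body_tv" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

<!-- Option 2: character offsets stored in the postings, alongside each
     position; unrelated to term vectors -->
<field name="body_pos" type="text_general" indexed="true" stored="true"
       storeOffsetsWithPositions="true"/>
```

The first suits many-terms/few-docs access patterns (e.g. MoreLikeThis, /tvrh); the second suits few-terms/many-docs patterns (e.g. highlighting), and it is the one that makes startOffset()/endOffset() on a postings enum return real values.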
On Mon, Mar 10, 2014 at 11:11 PM, Robert Muir <rcm...@gmail.com> wrote:

> Hello, I think you are confused between two different index
> structures, probably because of the name of the options in Solr.
>
> 1. Indexing term vectors: this means that given a document, you can go
> look up a miniature "inverted index" just for that document. Each
> document has "term vectors" containing a term dictionary of the terms
> in that one document, and optionally things like positions and
> character offsets. This can be useful if you are examining *many
> terms* for just a few documents, for example the MoreLikeThis use
> case. In Solr this is activated with termVectors=true. To additionally
> store positions/offsets information inside the term vectors, use
> termPositions and termOffsets, respectively.
>
> 2. Indexing character offsets: this means that given a term, you can
> get the offset information "along with" each position that matched, so
> you can really think of this as a special form of a payload. This is
> useful if you are examining *many documents* for just a few terms, for
> example many highlighting use cases. In Solr this is activated with
> storeOffsetsWithPositions=true. It is unrelated to term vectors.
>
> Hopefully this helps.
>
> On Mon, Mar 10, 2014 at 9:32 PM, Jefferson French <jkfaus...@gmail.com> wrote:
>
>> This looks like a codec issue, but I'm not sure how to address it. I've
>> found that a different instance of DocsAndPositionsEnum is instantiated
>> between my code and Solr's TermVectorComponent.
>>
>> Mine: org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$EverythingEnum
>> Solr: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVDocsEnum
>>
>> As far as I can tell, I've only used Lucene/Solr 4.6, so I'm not sure
>> where the Lucene 4.1 reference comes from. I've searched through the
>> Solr config files and can't see where to change the codec, but
>> shouldn't the reader use the same codec as was used when the index was
>> created?
>>
>> On Fri, Mar 7, 2014 at 1:37 PM, Jefferson French <jkfaus...@gmail.com> wrote:
>>
>>> We have an API on top of Lucene 4.6 that I'm trying to adapt to run
>>> under Solr 4.6. The problem is that although I'm getting the correct
>>> offsets when the index is created by Lucene, the same method calls
>>> always return -1 when the index is created by Solr. In the latter case
>>> I can see the character offsets via Luke, and I can even get them from
>>> Solr when I access the /tvrh search handler, which uses the
>>> TermVectorComponent class.
>>>
>>> This is roughly how I'm reading character offsets in my Lucene code:
>>>
>>>     AtomicReader reader = ...
>>>     Term term = ...
>>>     DocsAndPositionsEnum postings = reader.termPositionsEnum(term);
>>>     while (postings.nextDoc() != DocsAndPositionsEnum.NO_MORE_DOCS) {
>>>         for (int i = 0; i < postings.freq(); i++) {
>>>             System.out.println("start:" + postings.startOffset());
>>>             System.out.println("end:" + postings.endOffset());
>>>         }
>>>     }
>>>
>>> Notice that I want the values for a single term. When run against an
>>> index created by Solr, the above calls to startOffset() and endOffset()
>>> return -1. Solr's TermVectorComponent prints the correct offsets like
>>> this (paraphrased):
>>>
>>>     IndexReader reader = searcher.getIndexReader();
>>>     Terms vector = reader.getTermVector(docId, field);
>>>     TermsEnum termsEnum = vector.iterator(termsEnum);
>>>     int freq = (int) termsEnum.totalTermFreq();
>>>     DocsAndPositionsEnum dpEnum = null;
>>>     BytesRef text;
>>>     while ((text = termsEnum.next()) != null) {
>>>         String term = text.utf8ToString();
>>>         dpEnum = termsEnum.docsAndPositions(null, dpEnum);
>>>         dpEnum.nextDoc();
>>>         for (int i = 0; i < freq; i++) {
>>>             final int pos = dpEnum.nextPosition();
>>>             System.out.println("start:" + dpEnum.startOffset());
>>>             System.out.println("end:" + dpEnum.endOffset());
>>>         }
>>>     }
>>>
>>> but in this case it is getting the offsets per doc ID, rather than for
>>> a single term, which is what I want.
>>>
>>> Could anyone tell me:
>>>
>>> 1. Why I'm not able to get the offsets using my first example, and/or
>>> 2. A better way to get the offsets for a given term?
>>>
>>> Thanks.
>>>
>>> Jeff
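For the archives: the -1 results from the first snippet are the expected behavior when the postings were indexed without offsets. When indexing with plain Lucene 4.x (rather than through a Solr schema), the opt-in happens on the FieldType at index time. A minimal sketch against the Lucene 4.6 API; the field name "body" and the use of TextField.TYPE_STORED as a template are illustrative, and an IndexWriter is assumed to be configured elsewhere:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.FieldInfo.IndexOptions;

public class OffsetsIndexingSketch {

    // Builds a document whose "body" field stores character offsets in the
    // postings, alongside each position (option 2 in Robert's explanation).
    public static Document makeDoc(String text) {
        FieldType ft = new FieldType(TextField.TYPE_STORED);
        ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
        ft.freeze();  // FieldType is mutable until frozen

        Document doc = new Document();
        doc.add(new Field("body", text, ft));
        return doc;
        // writer.addDocument(doc);  // with an IndexWriter configured elsewhere
    }
}
```

With the field indexed this way, reader.termPositionsEnum(term) returns a postings enum whose startOffset()/endOffset() carry real values, so the per-term loop from the first snippet works without touching term vectors at all.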