Unable to get offsets using AtomicReader.termPositionsEnum(Term)
We have an API on top of Lucene 4.6 that I'm trying to adapt to run under Solr 4.6. The problem is that although I get the correct offsets when the index is created by Lucene, the same method calls always return -1 when the index is created by Solr. In the latter case I can see the character offsets via Luke, and I can even get them from Solr when I access the /tvrh search handler, which uses the TermVectorComponent class.

This is roughly how I'm reading character offsets in my Lucene code:

> AtomicReader reader = ...
> Term term = ...
> DocsAndPositionsEnum postings = reader.termPositionsEnum(term);
> while (postings.nextDoc() != DocsAndPositionsEnum.NO_MORE_DOCS) {
>   for (int i = 0; i < postings.freq(); i++) {
>     postings.nextPosition();
>     System.out.println("start:" + postings.startOffset());
>     System.out.println("end:" + postings.endOffset());
>   }
> }

Notice that I want the values for a single term. When run against an index created by Solr, the above calls to startOffset() and endOffset() return -1.

Solr's TermVectorComponent prints the correct offsets like this (paraphrased):

> IndexReader reader = searcher.getIndexReader();
> Terms vector = reader.getTermVector(docId, field);
> TermsEnum termsEnum = vector.iterator(null);
> DocsAndPositionsEnum dpEnum = null;
> BytesRef text;
> while ((text = termsEnum.next()) != null) {
>   String term = text.utf8ToString();
>   int freq = (int) termsEnum.totalTermFreq();
>   dpEnum = termsEnum.docsAndPositions(null, dpEnum);
>   dpEnum.nextDoc();
>   for (int i = 0; i < freq; i++) {
>     final int pos = dpEnum.nextPosition();
>     System.out.println("start:" + dpEnum.startOffset());
>     System.out.println("end:" + dpEnum.endOffset());
>   }
> }

but in this case it gets the offsets per doc ID, rather than for a single term, which is what I want.

Could anyone tell me:

1. Why I'm not able to get the offsets using my first example, and/or
2. A better way to get the offsets for a given term?

Thanks.

Jeff
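P.S. For reference, the Solr field I'm searching is declared with term vectors enabled, roughly like this (paraphrased; the field and type names here are placeholders, not my actual schema):

```xml
<field name="content" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```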
Re: Unable to get offsets using AtomicReader.termPositionsEnum(Term)
This looks like a codec issue, but I'm not sure how to address it. I've found that a different class of DocsAndPositionsEnum is instantiated between my code and Solr's TermVectorComponent.

Mine:
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$EverythingEnum
Solr:
org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVDocsEnum

As far as I can tell, I've only used Lucene/Solr 4.6, so I'm not sure where the Lucene 4.1 reference comes from. I've searched through the Solr config files and can't see where to change the codec, but shouldn't the reader use the same codec that was used when the index was created?
Re: Unable to get offsets using AtomicReader.termPositionsEnum(Term)
Thank you, Robert. You are right, I was confused between the two. I also didn't know that "storeOffsetsWithPositions" existed. My code now works as I expected.

On Mon, Mar 10, 2014 at 11:11 PM, Robert Muir wrote:

> Hello, I think you are confused between two different index
> structures, probably because of the names of the options in Solr.
>
> 1. Indexing term vectors: given a document, you can look up a
> miniature "inverted index" just for that document. Each document has
> "term vectors": a term dictionary of the terms in that one document,
> optionally with things like positions and character offsets. This is
> useful if you are examining *many terms* for just a few documents,
> for example the MoreLikeThis use case. In Solr this is activated with
> termVectors=true; to additionally store positions/offsets information
> inside the term vectors, use termPositions and termOffsets,
> respectively.
>
> 2. Indexing character offsets: given a term, you can get the offset
> information along with each position that matched, so you can think
> of this as a special form of a payload. This is useful if you are
> examining *many documents* for just a few terms, for example many
> highlighting use cases. In Solr this is activated with
> storeOffsetsWithPositions=true. It is unrelated to term vectors.
>
> Hopefully this helps.
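To make the resolution concrete: in Solr 4.x both index structures from Robert's reply are enabled per-field in schema.xml. A sketch of the two declarations (field and type names are hypothetical, not taken from the thread):

```xml
<!-- Option 1: term vectors, a per-document mini inverted index.
     Read back via IndexReader.getTermVector(), /tvrh, MoreLikeThis. -->
<field name="content_tv" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

<!-- Option 2: offsets stored in the postings lists themselves, so
     DocsAndPositionsEnum.startOffset()/endOffset() return real values
     instead of -1 when reading postings for a single term. -->
<field name="content" type="text_general" indexed="true" stored="true"
       storeOffsetsWithPositions="true"/>
```

With storeOffsetsWithPositions="true", the first code snippet in this thread (AtomicReader.termPositionsEnum) should return real offsets; the documents would need to be re-indexed after the schema change.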