Thank you, Robert. You are right, I was confused between the two. I also didn't know that storeOffsetsWithPositions existed. My code now works as I expected.
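For anyone finding this thread later, the two index structures Robert describes map to separate field options in Solr's schema.xml. A minimal sketch (the field names and the text_general type are illustrative, not from the original thread):

```xml
<!-- Option 1: per-document term vectors (a mini inverted index per doc),
     with positions and offsets stored inside the term vectors -->
<field name="body_tv" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

<!-- Option 2: character offsets stored in the postings, alongside each
     position; unrelated to term vectors -->
<field name="body_pos" type="text_general" indexed="true" stored="true"
       storeOffsetsWithPositions="true"/>
```

The first suits many-terms/few-docs access patterns (e.g. MoreLikeThis, /tvrh); the second suits few-terms/many-docs patterns (e.g. highlighting), and it is the one that makes startOffset()/endOffset() on a postings enum return real values.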
On Mon, Mar 10, 2014 at 11:11 PM, Robert Muir <rcm...@gmail.com> wrote:

> Hello, I think you are confused between two different index
> structures, probably because of the name of the options in Solr.
>
> 1. Indexing term vectors: this means that given a document, you can go
> look up a miniature "inverted index" just for that document. Each
> document has "term vectors" containing a term dictionary of the terms
> in that one document, and optionally things like positions and
> character offsets. This can be useful if you are examining *many
> terms* for just a few documents, for example the MoreLikeThis use
> case. In Solr this is activated with termVectors=true. To additionally
> store positions/offsets information inside the term vectors, use
> termPositions and termOffsets, respectively.
>
> 2. Indexing character offsets: this means that given a term, you can
> get the offset information "along with" each position that matched, so
> you can really think of this as a special form of a payload. This is
> useful if you are examining *many documents* for just a few terms, for
> example many highlighting use cases. In Solr this is activated with
> storeOffsetsWithPositions=true. It is unrelated to term vectors.
>
> Hopefully this helps.
>
> On Mon, Mar 10, 2014 at 9:32 PM, Jefferson French <jkfaus...@gmail.com> wrote:
>
>> This looks like a codec issue, but I'm not sure how to address it. I've
>> found that a different instance of DocsAndPositionsEnum is instantiated
>> between my code and Solr's TermVectorComponent.
>>
>> Mine: org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$EverythingEnum
>> Solr: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVDocsEnum
>>
>> As far as I can tell, I've only used Lucene/Solr 4.6, so I'm not sure
>> where the Lucene 4.1 reference comes from. I've searched through the
>> Solr config files and can't see where to change the codec, but
>> shouldn't the reader use the same codec as was used when the index was
>> created?
>>
>> On Fri, Mar 7, 2014 at 1:37 PM, Jefferson French <jkfaus...@gmail.com> wrote:
>>
>>> We have an API on top of Lucene 4.6 that I'm trying to adapt to run
>>> under Solr 4.6. The problem is that although I'm getting the correct
>>> offsets when the index is created by Lucene, the same method calls
>>> always return -1 when the index is created by Solr. In the latter case
>>> I can see the character offsets via Luke, and I can even get them from
>>> Solr when I access the /tvrh search handler, which uses the
>>> TermVectorComponent class.
>>>
>>> This is roughly how I'm reading character offsets in my Lucene code:
>>>
>>>     AtomicReader reader = ...
>>>     Term term = ...
>>>     DocsAndPositionsEnum postings = reader.termPositionsEnum(term);
>>>     while (postings.nextDoc() != DocsAndPositionsEnum.NO_MORE_DOCS) {
>>>         for (int i = 0; i < postings.freq(); i++) {
>>>             System.out.println("start:" + postings.startOffset());
>>>             System.out.println("end:" + postings.endOffset());
>>>         }
>>>     }
>>>
>>> Notice that I want the values for a single term. When run against an
>>> index created by Solr, the above calls to startOffset() and endOffset()
>>> return -1. Solr's TermVectorComponent prints the correct offsets like
>>> this (paraphrased):
>>>
>>>     IndexReader reader = searcher.getIndexReader();
>>>     Terms vector = reader.getTermVector(docId, field);
>>>     TermsEnum termsEnum = vector.iterator(termsEnum);
>>>     int freq = (int) termsEnum.totalTermFreq();
>>>     DocsAndPositionsEnum dpEnum = null;
>>>     BytesRef text;
>>>     while ((text = termsEnum.next()) != null) {
>>>         String term = text.utf8ToString();
>>>         dpEnum = termsEnum.docsAndPositions(null, dpEnum);
>>>         dpEnum.nextDoc();
>>>         for (int i = 0; i < freq; i++) {
>>>             final int pos = dpEnum.nextPosition();
>>>             System.out.println("start:" + dpEnum.startOffset());
>>>             System.out.println("end:" + dpEnum.endOffset());
>>>         }
>>>     }
>>>
>>> but in this case it is getting the offsets per doc ID, rather than for
>>> a single term, which is what I want.
>>>
>>> Could anyone tell me:
>>>
>>> 1. Why I'm not able to get the offsets using my first example, and/or
>>> 2. A better way to get the offsets for a given term?
>>>
>>> Thanks.
>>>
>>> Jeff
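For the archives: the -1 results from the first snippet are the expected behavior when the postings were indexed without offsets. When indexing with plain Lucene 4.x (rather than through a Solr schema), the opt-in happens on the FieldType at index time. A minimal sketch against the Lucene 4.6 API; the field name "body" and the use of TextField.TYPE_STORED as a template are illustrative, and an IndexWriter is assumed to be configured elsewhere:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.FieldInfo.IndexOptions;

public class OffsetsIndexingSketch {

    // Builds a document whose "body" field stores character offsets in the
    // postings, alongside each position (option 2 in Robert's explanation).
    public static Document makeDoc(String text) {
        FieldType ft = new FieldType(TextField.TYPE_STORED);
        ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
        ft.freeze();  // FieldType is mutable until frozen

        Document doc = new Document();
        doc.add(new Field("body", text, ft));
        return doc;
        // writer.addDocument(doc);  // with an IndexWriter configured elsewhere
    }
}
```

With the field indexed this way, reader.termPositionsEnum(term) returns a postings enum whose startOffset()/endOffset() carry real values, so the per-term loop from the first snippet works without touching term vectors at all.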