Re: Getting a list of matching terms and offsets

Justin Lee Sun, 05 Jun 2016 11:11:23 -0700

Thanks for the responses Alex and Ahmet.

The TermVector component was the first thing I looked at, but what it gives
you is offset information for every token in the document.  I'm trying to
get a list of tokens that actually match the search query, and unless I'm
missing something, the TermVector component doesn't give you that
information.
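
To make the gap concrete, here is a minimal sketch of the naive workaround: take per-term offsets of the kind the TermVector component can report and keep only the terms that appear in the query. Everything in it is an assumption made for illustration: the dictionary shape is hypothetical and is not the actual Solr response format, the helper name matching_offsets is invented, and the query terms are assumed to have already been run through the same analysis chain as the indexed text.

```python
from typing import Dict, List, Set, Tuple

# (start, end) character offsets of one token occurrence
Offset = Tuple[int, int]


def matching_offsets(
    term_offsets: Dict[str, List[Offset]],
    query_terms: Set[str],
) -> List[Tuple[str, int, int]]:
    """Keep only occurrences of terms that are in the (already analyzed) query.

    Hypothetical helper: it treats every query term independently, so it
    knows nothing about phrases, slop, or boolean structure.
    """
    hits = [
        (term, start, end)
        for term, spans in term_offsets.items()
        if term in query_terms
        for start, end in spans
    ]
    # Order the hits by where they occur in the original text.
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical term-vector-style data for the text
# "the quick fox runs, the fox ran", assuming the analyzer
# stemmed "runs" to "run" and left "ran" alone.
doc_terms = {
    "the": [(0, 3), (20, 23)],
    "quick": [(4, 9)],
    "fox": [(10, 13), (24, 27)],
    "run": [(14, 18)],
    "ran": [(28, 31)],
}

hits = matching_offsets(doc_terms, {"fox", "run"})
```

This handles stemming only because the offsets are keyed by indexed terms, and it breaks down for a phrase query such as "fox runs": the second "fox" at offset 24 is still reported even though it is not part of any phrase match. That is the part the term vector data alone cannot answer.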


The TermSpans class does contain the right information, but again the hard
part is: how do I reliably get a list of TermSpans for the tokens that
actually match the search query?  That's why I ended up in the highlighter
source code, because the highlighter has to do just this in order to create
snippets with accurate highlighting.

Justin

On Sun, Jun 5, 2016 at 9:09 AM Ahmet Arslan <iori...@yahoo.com.invalid>
wrote:

> Hi,
>
> Maybe org.apache.lucene.search.spans.TermSpans?
>
>
>
> On Sunday, June 5, 2016 7:59 AM, Alexandre Rafalovitch <arafa...@gmail.com>
> wrote:
> It sounds like TermVector component's output:
> https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component
>
> Perhaps with additional flags enabled (e.g. tv.offsets and/or
> tv.positions).
>
> Regards,
>    Alex.
> ----
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
>
> On 5 June 2016 at 07:39, Justin Lee <lee.justi...@gmail.com> wrote:
> > Is anyone aware of a way of getting a list of the matching tokens and
> > their offsets after executing a search?  The reason I want to do this is
> > because I have the physical coordinates of each token in the original
> > document stored out of band, and I want to be able to highlight in the
> > original document.  I would really like to have Solr return the list of
> > matching tokens because then things like stemming and phrase matching
> > will work as expected.  I'm thinking of something like the highlighter
> > component, except instead of returning HTML, it would return just the
> > matching tokens and their offsets.
> >
> > I have googled high and low and can't seem to find an exact answer to
> > this question, so I have spent the last few days examining the internals
> > of the various highlighting classes in Solr and Lucene.  I think the bulk
> > of the action is in WeightedSpanTermExtractor and its interaction with
> > getBestTextFragments in the Highlighter class.  But before I spend any
> > more time on this I thought I'd ask (1) whether anyone knows of an easier
> > way of doing this, and (2) whether I'm at least barking up the right
> > tree.
> >
> > Thanks much,
> > Justin
>
