Re: Getting a list of matching terms and offsets

Justin Lee Mon, 06 Jun 2016 08:41:56 -0700

Thank you very much!  That JIRA entry led me to
https://issues.apache.org/jira/browse/SOLR-4722, which still works against
Solr 6 with a couple of modifications and should serve as the basis for
what I want to do.  You saved me a bunch of work, so thanks very much.
 (Also, it is always nice to know that people with more experience than me
took the same approach.)
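
To illustrate the kind of output being asked for in this thread, here is a minimal sketch in plain Python. It is not Solr or Lucene code and not what the SOLR-4722 patch does: the function name matching_tokens is hypothetical, the regex tokenizer stands in for a real analyzer, and fnmatch-style wildcards only approximate Solr wildcard queries. It shows the shape of the result: each token that actually matched, with its character offsets.

```python
import re
from fnmatch import fnmatchcase


def matching_tokens(text, pattern):
    """Return (token, start, end) for every token that matches the wildcard pattern.

    Assumption: tokens are runs of word characters and matching is
    case-insensitive. A real Solr analyzer chain would behave differently.
    """
    matches = []
    for m in re.finditer(r"\w+", text):
        token = m.group(0)
        # Compare lowercased token and pattern so "Hours" also matches "hour*".
        if fnmatchcase(token.lower(), pattern.lower()):
            matches.append((token, m.start(), m.end()))
    return matches


if __name__ == "__main__":
    text = "Open 24 hours; the hourly rate applies after one hour."
    for token, start, end in matching_tokens(text, "hour*"):
        # The offsets index back into the original text.
        print(token, start, end, text[start:end])
```

The offsets are what make out-of-band highlighting possible: with (start, end) pairs in hand, a client can look up the stored physical coordinates of each matched token.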


On Sun, Jun 5, 2016 at 1:09 PM Ahmet Arslan <[email protected]>
wrote:

> Hi Lee,
>
> Maybe you can find a useful starting point at
> https://issues.apache.org/jira/browse/SOLR-1397
>
> Please consider contributing once you have something working.
>
> Ahmet
>
>
>
>
> On Sunday, June 5, 2016 10:37 PM, Justin Lee <[email protected]>
> wrote:
> Thanks, yea, I looked at debug query too.  Unfortunately the output of
> debug query doesn't quite do it.  For example, if you use a wildcard query,
> it will simply explain the score associated with that wildcard query, not
> the actual matching token.  In other words, if you search for "hour*" and
> the actual matching text is "hours", debug query doesn't tell you that.
> Instead, it just reports the score associated with "hour*".
>
> The closest example I've ever found is this:
>
>
> https://lucidworks.com/blog/2013/05/09/update-accessing-words-around-a-positional-match-in-lucene-4/
>
> But this kind of approach won't let me use the full power of the Solr
> ecosystem.  I'd basically be back to dealing with Lucene directly, which I
> think is a step backwards.  I think the right approach is to write my own
> SearchComponent, using the highlighter as a starting point.  But I wanted
> to make sure there wasn't a simpler way.
>
>
> On Sun, Jun 5, 2016 at 11:30 AM Ahmet Arslan <[email protected]>
> wrote:
>
> > Well, debug query has the list of tokens that caused the match.
> > If I am not mistaken, I read an example about span queries and Spans
> > that listed the positions of the matches.
> > I cannot find the example at the moment.
> >
> > Ahmet
> >
> >
> >
> > On Sunday, June 5, 2016 9:10 PM, Justin Lee <[email protected]>
> > wrote:
> > Thanks for the responses Alex and Ahmet.
> >
> > The TermVector component was the first thing I looked at, but what it
> > gives you is offset information for every token in the document.  I'm
> > trying to get a list of tokens that actually match the search query, and
> > unless I'm missing something, the TermVector component doesn't give you
> > that information.
> >
> > The TermSpans class does contain the right information, but again the
> > hard part is: how do I reliably get a list of TermSpans for the tokens
> > that actually match the search query?  That's why I ended up in the
> > highlighter source code, because the highlighter has to do just this in
> > order to create snippets with accurate highlighting.
> >
> > Justin
> >
> >
> > On Sun, Jun 5, 2016 at 9:09 AM Ahmet Arslan <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > Maybe org.apache.lucene.search.spans.TermSpans?
> > >
> > >
> > >
> > > On Sunday, June 5, 2016 7:59 AM, Alexandre Rafalovitch <[email protected]>
> > > wrote:
> > > It sounds like TermVector component's output:
> > >
> > > https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component
> > >
> > > Perhaps with additional flags enabled (e.g. tv.offsets and/or
> > > tv.positions).
> > >
> > > Regards,
> > >    Alex.
> > > ----
> > > Newsletter and resources for Solr beginners and intermediates:
> > > http://www.solr-start.com/
> > >
> > >
> > >
> > > On 5 June 2016 at 07:39, Justin Lee <[email protected]> wrote:
> > > > > Is anyone aware of a way of getting a list of each matching token and
> > > > > their offsets after executing a search?  The reason I want to do this
> > > > > is because I have the physical coordinates of each token in the
> > > > > original document stored out of band, and I want to be able to
> > > > > highlight in the original document.  I would really like to have Solr
> > > > > return the list of matching tokens because then things like stemming
> > > > > and phrase matching will work as expected. I'm thinking of something
> > > > > like the highlighter component, except instead of returning html, it
> > > > > would return just the matching tokens and their offsets.
> > > >
> > > > > I have googled high and low and can't seem to find an exact answer to
> > > > > this question, so I have spent the last few days examining the
> > > > > internals of the various highlighting classes in Solr and Lucene.  I
> > > > > think the bulk of the action is in WeightedSpanTermExtractor and its
> > > > > interaction with getBestTextFragments in the Highlighter class.  But
> > > > > before I spend any more time on this I thought I'd ask (1) whether
> > > > > anyone knows of an easier way of doing this, and (2) whether I'm at
> > > > > least barking up the right tree.
> > > >
> > > > Thanks much,
> > > > Justin
> > >
> >
>
