RE: Extracting important multi term phrases from the text

Markus Jelsma Thu, 15 Nov 2018 15:50:38 -0800

Hello Pratik,

How about not using StopFilter at all? We got rid of it a long time ago, and 
only use it in very specific circumstances.


LUCENE-4065 is not going to be fixed any time soon. Removing StopFilter will 
introduce noise, but you could work around it with SKG. Please let us know if 
it works for you.

Regards,
Markus

 
 
-----Original message-----
> From:Pratik Patel <[email protected]>
> Sent: Thursday 15th November 2018 23:16
> To: [email protected]
> Subject: Re: Extracting important multi term phrases from the text
> 
> Hi Markus,
> 
> Thanks for the reply. I tried using ShingleFilter and it seems to
> be working. However, I am hitting an issue when it is used with
> StopFilter. StopFilter leaves a position gap for each removed word, which
> ShingleFilter fills with an underscore "_", and that pollutes the data in
> the index.
> 
> I tried setting enablePositionIncrements="false" on the stop filter, but
> that parameter only works with Lucene 4.3 or earlier. It looks like
> this is an open issue in Lucene:
> https://issues.apache.org/jira/browse/LUCENE-4065
> 
> For now, I am trying to find a workaround using PatternReplaceFilterFactory.
> 
> Regards,
> Pratik
> 
> On Thu, Nov 15, 2018 at 4:15 PM Markus Jelsma <[email protected]>
> wrote:
> 
> > Hello Pratik,
> >
> > We would use ShingleFilter for this indeed. If you only want
> > bigrams/shingles, don't forget to disable outputUnigrams and set both
> > shingle size limits to 2.
> >
> > Regards,
> > Markus
> >
> > -----Original message-----
> > > From:Pratik Patel <[email protected]>
> > > Sent: Thursday 15th November 2018 17:00
> > > To: [email protected]
> > > Subject: Extracting important multi term phrases from the text
> > >
> > > Hello Everyone,
> > >
> > > > The standard way of tokenizing in Solr divides the text on whitespace.
> > >
> > > Is there a way by which we can index multi-term phrases like "Machine
> > > Learning" instead of "Machine", "Learning"?
> > > > Is it possible to create a specific field type for such phrases which has
> > > > its own indexing pipeline? I am open to storing n-grams, but these n-grams
> > > > would span multiple terms, not just one term. In other words, I don't want
> > > > to store n-grams of the term "machine"; I want to store n-grams for a
> > > > sentence like the one below.
> > >
> > > "I like machine learning" --> "I like", "like machine", "machine
> > learning"
> > > and so on.....
> > >
> > > > It seems like Shingle Filter
> > > > (https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-ShingleFilter)
> > > > may be used for this. Is there a better alternative?
> > >
> > > I want to use this field as an input to Semantic Knowledge Graph. The
> > > plugin works great for words. But now I want to use it for phrases. Any
> > > idea around this would be really helpful.
> > >
> > > Thanks a lot!
> > >
> > > - Pratik
> > >
> >
>
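
The word-level shingling discussed in this thread can be illustrated with a
small pure-Python sketch. This is not Lucene's implementation and no one in the
thread posted it; the function names drop_stopwords and bigram_shingles are
hypothetical. It assumes bigram-only output (no unigrams, both size limits at
2) and models a removed stop word as an empty position that is filled with "_",
the default filler token of Lucene's ShingleFilter.

```python
from typing import List, Optional, Set

FILLER = "_"  # Lucene's ShingleFilter uses "_" as its default filler token


def drop_stopwords(tokens: List[str], stopwords: Set[str]) -> List[Optional[str]]:
    """Replace each stop word with None to mimic the position gap a stop filter leaves."""
    return [None if tok.lower() in stopwords else tok for tok in tokens]


def bigram_shingles(tokens: List[Optional[str]], filler: str = FILLER) -> List[str]:
    """Join each pair of adjacent positions into a two-word shingle (no unigrams)."""
    # Empty positions (None) are rendered with the filler token.
    slots = [filler if tok is None else tok for tok in tokens]
    return [f"{left} {right}" for left, right in zip(slots, slots[1:])]


# Plain shingling of the example sentence from the original question.
print(bigram_shingles("I like machine learning".split()))
# -> ['I like', 'like machine', 'machine learning']

# With a stop word removed first, the gap shows up as "_" inside the shingles.
print(bigram_shingles(drop_stopwords("I like the machine".split(), {"the"})))
# -> ['I like', 'like _', '_ machine']
```

The second call shows why combining stop word removal with shingles puts
underscore terms into the index, and why dropping the stop filter avoids them at
the cost of shingles that contain stop words.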
