Hi Markus,

Thanks for the reply. I tried using ShingleFilter and it seems to
be working. However, I am hitting an issue when it is used together with
the stop word filter (StopFilter): an underscore "_" is left in the
shingles in place of each removed stop word, and that pollutes the data
in the index.
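
To illustrate the problem, here is a toy Python model of it (not Lucene
code, just a sketch). It assumes that removed stop words leave a position
gap and that the shingle step writes a "_" filler for each gap. The
function names and the sample sentence are made up for illustration.

```python
from typing import List, Optional, Set

FILLER = "_"  # assumed filler written where a stop word was removed


def remove_stop_words(tokens: List[str], stop_words: Set[str]) -> List[Optional[str]]:
    """Replace each stop word with None so its position is kept as a gap."""
    return [None if t.lower() in stop_words else t for t in tokens]


def bigram_shingles(positions: List[Optional[str]]) -> List[str]:
    """Build 2-token shingles over adjacent positions, using FILLER for gaps."""
    shingles = []
    for left, right in zip(positions, positions[1:]):
        if left is None and right is None:
            continue  # nothing real to emit between two gaps
        shingles.append(f"{left or FILLER} {right or FILLER}")
    return shingles


tokens = "I like the machine learning".split()
print(bigram_shingles(remove_stop_words(tokens, {"the"})))
```

In this model the removed word "the" shows up as "like _" and "_ machine"
next to the clean shingles, which is the kind of entry I do not want in
the index.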

I tried setting enablePositionIncrements="false" for the stop word filter,
but that parameter only works with Lucene version 4.3 or earlier. It looks
like this is an open issue in Lucene:
https://issues.apache.org/jira/browse/LUCENE-4065

For now, I am trying to find a workaround using PatternReplaceFilterFactory.
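
The idea behind the workaround can be sketched in Python as below. This
is only an assumption about the shape of the pattern, not a working Solr
configuration: PatternReplaceFilterFactory rewrites the text of each token
rather than filtering a list, so the sketch only shows a regular
expression that recognizes a standalone "_" inside a shingle. The sample
shingles and the function name are hypothetical.

```python
import re
from typing import List

# Matches "_" only when it stands alone as a whole word in the shingle,
# so terms such as "snake_case" are left untouched.
FILLER_PATTERN = re.compile(r"(^|\s)_(\s|$)")


def drop_filler_shingles(shingles: List[str]) -> List[str]:
    """Keep only shingles that do not contain a standalone "_" filler."""
    return [s for s in shingles if not FILLER_PATTERN.search(s)]


sample = ["I like", "like _", "_ machine", "machine learning", "snake_case word"]
print(drop_filler_shingles(sample))
```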

Regards,
Pratik

On Thu, Nov 15, 2018 at 4:15 PM Markus Jelsma <markus.jel...@openindex.io>
wrote:

> Hello Pratik,
>
> We would use ShingleFilter for this indeed. If you only want
> bigrams/shingles, don't forget to disable outputUnigrams and set both
> shingle size limits to 2.
>
> Regards,
> Markus
>
> -----Original message-----
> > From:Pratik Patel <pra...@semandex.net>
> > Sent: Thursday 15th November 2018 17:00
> > To: solr-user@lucene.apache.org
> > Subject: Extracting important multi term phrases from the text
> >
> > Hello Everyone,
> >
> > The standard way of tokenizing in Solr divides the text on white space.
> >
> > Is there a way by which we can index multi-term phrases like "Machine
> > Learning" instead of "Machine", "Learning"?
> > Is it possible to create a specific field type for such phrases which has
> > its own indexing pipeline? I am open to storing n-grams but these n-grams
> > would be across terms and not just one term? In other words, I don't want
> > to store n-grams of the term "machine", I want to store n-grams for a
> > sentence like below.
> >
> > "I like machine learning" --> "I like", "like machine",
> > "machine learning" and so on.....
> >
> > It seems like Shingle Filter (
> > https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-ShingleFilter
> > ) may be used for this. Is there a better alternative?
> >
> > I want to use this field as an input to Semantic Knowledge Graph. The
> > plugin works great for words. But now I want to use it for phrases. Any
> > idea around this would be really helpful.
> >
> > Thanks a lot!
> >
> > - Pratik
> >
>
