Hi Markus,

Thanks for the reply. I tried using ShingleFilter and it seems to be working. However, I am hitting an issue when it is used together with the stop word filter. Wherever a stop word is removed, the shingles end up with an underscore "_" in its place, and that pollutes the data in the index.
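To illustrate what I am seeing, here is a rough Python sketch. It is not Solr code, only a simplified model that assumes each removed stop word leaves a position gap and that the shingling step fills the gap with "_". The function names are made up for this example, and the last step (dropping any shingle that contains the filler) is just one possible cleanup, not necessarily what a PatternReplaceFilterFactory setup would do.

```python
FILLER = "_"  # assumed filler, matching the "_" that shows up in the index


def remove_stop_words(tokens, stop_words):
    """Replace each stop word with None so its position gap is preserved."""
    return [None if t.lower() in stop_words else t for t in tokens]


def bigram_shingles(slots, filler=FILLER):
    """Build word bigrams, writing the filler wherever a token was removed."""
    filled = [filler if s is None else s for s in slots]
    return [f"{a} {b}" for a, b in zip(filled, filled[1:])]


def drop_filler_shingles(shingles, filler=FILLER):
    """Keep only shingles in which no component is the filler token."""
    return [s for s in shingles if filler not in s.split(" ")]


tokens = "machine learning of data".split()
slots = remove_stop_words(tokens, {"of"})
shingles = bigram_shingles(slots)

print(shingles)                        # bigrams, including the "_" ones
print(drop_filler_shingles(shingles))  # bigrams with the "_" ones removed
```

In this toy model the bigrams touching the removed word "of" come out as "learning _" and "_ data", which is the kind of entry I would like to keep out of the index.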
I tried setting enablePositionIncrements="false" on the stop word filter, but that parameter only works with Lucene version 4.3 or earlier. It looks like this is an open issue in Lucene:
https://issues.apache.org/jira/browse/LUCENE-4065

For now, I am trying to find a workaround using PatternReplaceFilterFactory.

Regards,
Pratik

On Thu, Nov 15, 2018 at 4:15 PM Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hello Pratik,
>
> We would use ShingleFilter for this indeed. If you only want
> bigrams/shingles, don't forget to disable outputUnigrams and set both
> shingle size limits to 2.
>
> Regards,
> Markus
>
> -----Original message-----
> > From:Pratik Patel <pra...@semandex.net>
> > Sent: Thursday 15th November 2018 17:00
> > To: solr-user@lucene.apache.org
> > Subject: Extracting important multi term phrases from the text
> >
> > Hello Everyone,
> >
> > The standard way of tokenizing in Solr would divide the text by white
> > space.
> >
> > Is there a way by which we can index multi-term phrases like "Machine
> > Learning" instead of "Machine", "Learning"?
> > Is it possible to create a specific field type for such phrases which has
> > its own indexing pipeline? I am open to storing n-grams, but these n-grams
> > would be across terms and not just within one term. In other words, I
> > don't want to store n-grams of the term "machine"; I want to store n-grams
> > for a sentence like below.
> >
> > "I like machine learning" --> "I like", "like machine", "machine learning"
> > and so on.....
> >
> > It seems like Shingle Filter (
> > https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-ShingleFilter
> > ) may be used for this. Is there a better alternative?
> >
> > I want to use this field as an input to Semantic Knowledge Graph. The
> > plugin works great for words. But now I want to use it for phrases. Any
> > idea around this would be really helpful.
> >
> > Thanks a lot!
> >
> > - Pratik