+1 for not using stopwords. I haven’t used them since 1996. When I was at Netflix, I collected some movie titles that were 100% stopwords.
https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/

wunder
Walter Underwood
[email protected]
http://observer.wunderwood.org/  (my blog)

> On Nov 15, 2018, at 3:50 PM, Markus Jelsma <[email protected]> wrote:
>
> Hello Pratik,
>
> How about not using StopFilter at all? We got rid of it a long time ago, and
> only use it in very specific circumstances.
>
> LUCENE-4065 is not going to be fixed any time soon. Removing StopFilter will
> introduce noise, but you could work around it with SKG. Please let us know if
> it works for you.
>
> Regards,
> Markus
>
> -----Original message-----
>> From:Pratik Patel <[email protected]>
>> Sent: Thursday 15th November 2018 23:16
>> To: [email protected]
>> Subject: Re: Extracting important multi term phrases from the text
>>
>> Hi Markus,
>>
>> Thanks for the reply. I tried using ShingleFilter and it seems to
>> be working. However, I am hitting an issue when it is used with
>> StopWordFilter. StopWordFilter leaves an underscore "_" for removed words
>> and it kind of screws up the data in index.
>>
>> I tried setting enablePositionIncrements="false" for stop word filter but
>> that parameter only works for lucene version 4.3 or earlier. Looks like
>> it's an open issue in lucene
>> https://issues.apache.org/jira/browse/LUCENE-4065
>>
>> For now, I am trying to find a workaround using PatternReplaceFilterFactory.
>>
>> Regards,
>> Pratik
>>
>> On Thu, Nov 15, 2018 at 4:15 PM Markus Jelsma <[email protected]>
>> wrote:
>>
>>> Hello Pratik,
>>>
>>> We would use ShingleFilter for this indeed. If you only want
>>> bigrams/shingles, don't forget to disable outputUnigrams and set both
>>> shingle size limits to 2.
>>>
>>> Regards,
>>> Markus
>>>
>>> -----Original message-----
>>>> From:Pratik Patel <[email protected]>
>>>> Sent: Thursday 15th November 2018 17:00
>>>> To: [email protected]
>>>> Subject: Extracting important multi term phrases from the text
>>>>
>>>> Hello Everyone,
>>>>
>>>> Standard way of tokenizing in solr would divide the text by white space in
>>>> solr.
>>>>
>>>> Is there a way by which we can index multi-term phrases like "Machine
>>>> Learning" instead of "Machine", "Learning"?
>>>> Is it possible to create a specific field type for such phrases which has
>>>> its own indexing pipeline? I am open to storing n-grams but these n-grams
>>>> would be across terms and not just one term? In other words, I don't want
>>>> to store n-grams of the term "machine", I want to store n-grams for a
>>>> sentence like below.
>>>>
>>>> "I like machine learning" --> "I like", "like machine", "machine learning"
>>>> and so on.....
>>>>
>>>> It seems like Shingle Filter (
>>>> https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-ShingleFilter
>>>> ) may be used for this. Is there a better alternative?
>>>>
>>>> I want to use this field as an input to Semantic Knowledge Graph. The
>>>> plugin works great for words. But now I want to use it for phrases. Any
>>>> idea around this would be really helpful.
>>>>
>>>> Thanks a lot!
>>>>
>>>> - Pratik
>>>>
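
For anyone skimming the thread, here is a rough Python sketch of the word-level shingling being discussed, plus the underscore symptom Pratik describes when stopwords are removed first. This is only an illustration, not Lucene's or Solr's implementation: the function names (remove_stopwords, shingles), the use of None to mark a removed token, and the FILLER constant are all made up for the example.

```python
from typing import List, Optional, Set

# Assumption for illustration: the placeholder printed where a token was removed,
# chosen to match the "_" Pratik reports seeing in his index.
FILLER = "_"


def remove_stopwords(tokens: List[str], stopwords: Set[str]) -> List[Optional[str]]:
    """Replace each stopword with None so its position is kept as a gap."""
    return [None if t.lower() in stopwords else t for t in tokens]


def shingles(tokens: List[Optional[str]], size: int = 2, sep: str = " ") -> List[str]:
    """Build word-level n-grams of exactly `size` tokens (no unigrams)."""
    out = []
    for i in range(len(tokens) - size + 1):
        window = tokens[i:i + size]
        # A gap left by a removed token is rendered as the filler.
        out.append(sep.join(FILLER if t is None else t for t in window))
    return out


words = "I like machine learning".split()

# Plain bigrams, as in the original question.
print(shingles(words))
# ['I like', 'like machine', 'machine learning']

# Same text with "i" treated as a stopword: the gap shows up as "_".
print(shingles(remove_stopwords(words, {"i"})))
# ['_ like', 'like machine', 'machine learning']
```

Dropping the stop filter from the chain, as Markus suggests, corresponds to the first call: every shingle is made of real words, at the cost of more noisy phrases in the index.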
