Re: Extracting important multi term phrases from the text

Walter Underwood Thu, 15 Nov 2018 15:52:57 -0800

+1 for not using stopwords. I haven’t used them since 1996. When I was at 
Netflix, I collected some movie titles that were 100% stopwords.


https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/

wunder
Walter Underwood
[email protected]
http://observer.wunderwood.org/  (my blog)

> On Nov 15, 2018, at 3:50 PM, Markus Jelsma <[email protected]> wrote:
> 
> Hello Pratik,
> 
> How about not using StopFilter at all? We got rid of it a long time ago, and 
> only use it in very specific circumstances.
> 
> LUCENE-4065 is not going to be fixed any time soon. Removing StopFilter will 
> introduce noise, but you could work around it with SKG. Please let us know if 
> it works for you.
> 
> Regards,
> Markus
> 
> -----Original message-----
>> From:Pratik Patel <[email protected]>
>> Sent: Thursday 15th November 2018 23:16
>> To: [email protected]
>> Subject: Re: Extracting important multi term phrases from the text
>> 
>> Hi Markus,
>> 
>> Thanks for the reply. I tried using ShingleFilter and it seems to
>> be working. However, I am hitting an issue when it is used with
>> StopWordFilter. StopWordFilter leaves an underscore "_" for removed words
>> and it kind of screws up the data in the index.
>> 
>> I tried setting enablePositionIncrements="false" for stop word filter but
>> that parameter only works for lucene version 4.3 or earlier. Looks like
>> it's an open issue in lucene
>> https://issues.apache.org/jira/browse/LUCENE-4065
>> 
>> For now, I am trying to find a workaround using PatternReplaceFilterFactory.
>> 
>> Regards,
>> Pratik
>> 
>> On Thu, Nov 15, 2018 at 4:15 PM Markus Jelsma <[email protected]>
>> wrote:
>> 
>>> Hello Pratik,
>>> 
>>> We would use ShingleFilter for this indeed. If you only want
>>> bigrams/shingles, don't forget to disable outputUnigrams and set both
>>> shingle size limits to 2.
>>> 
>>> Regards,
>>> Markus
>>> 
>>> -----Original message-----
>>>> From:Pratik Patel <[email protected]>
>>>> Sent: Thursday 15th November 2018 17:00
>>>> To: [email protected]
>>>> Subject: Extracting important multi term phrases from the text
>>>> 
>>>> Hello Everyone,
>>>> 
>>>> The standard way of tokenizing in Solr divides the text on whitespace.
>>>> 
>>>> Is there a way by which we can index multi-term phrases like "Machine
>>>> Learning" instead of "Machine", "Learning"?
>>>> Is it possible to create a specific field type for such phrases which has
>>>> its own indexing pipeline? I am open to storing n-grams, but these n-grams
>>>> would be across terms and not within a single term. In other words, I
>>>> don't want to store n-grams of the term "machine"; I want to store
>>>> n-grams for a sentence like the one below.
>>>> 
>>>> "I like machine learning" --> "I like", "like machine",
>>>> "machine learning" and so on.
>>>> 
>>>> It seems like Shingle Filter
>>>> (https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-ShingleFilter)
>>>> may be used for this. Is there a better alternative?
>>>> 
>>>> I want to use this field as an input to Semantic Knowledge Graph. The
>>>> plugin works great for words. But now I want to use it for phrases. Any
>>>> idea around this would be really helpful.
>>>> 
>>>> Thanks a lot!
>>>> 
>>>> - Pratik
>>>> 
>>> 
>>
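
For anyone following the thread: below is a minimal sketch, in plain Python rather than Solr configuration, of the word-level bigrams asked about in the original question, and of how a placeholder can end up inside them when stop words are dropped but their positions are kept. The function names and the FILLER constant are illustrative assumptions, not Lucene or Solr APIs, and the logic only approximates what ShingleFilter does.

```python
from typing import List, Optional, Sequence, Set

# Assumption: "_" stands in for the placeholder described in the thread.
FILLER = "_"


def tokenize(text: str) -> List[str]:
    """Split on whitespace only, like a very simple tokenizer."""
    return text.split()


def remove_stopwords(tokens: Sequence[str], stopwords: Set[str]) -> List[Optional[str]]:
    """Replace stop words with None so each removed word keeps its position."""
    return [None if t.lower() in stopwords else t for t in tokens]


def bigrams(tokens: Sequence[Optional[str]]) -> List[str]:
    """Build two-word shingles with no unigrams (shingle size fixed at 2).

    A position left empty by stop word removal is written as FILLER.
    Pairs made of two empty positions are skipped.
    """
    shingles = []
    for left, right in zip(tokens, tokens[1:]):
        if left is None and right is None:
            continue
        shingles.append(f"{left or FILLER} {right or FILLER}")
    return shingles


# Without stop word removal: the shingles from the original question.
print(bigrams(tokenize("I like machine learning")))
# ['I like', 'like machine', 'machine learning']

# With stop word removal: placeholders leak into the shingles.
kept = remove_stopwords(tokenize("state of the art"), {"of", "the"})
print(bigrams(kept))
# ['state _', '_ art']
```

The second call shows why dropping the stop filter, as suggested above, sidesteps the problem: with no removed positions there is nothing for a placeholder to fill.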
