Re: Of, To, and Other Small Words

Walter Underwood Tue, 15 Jul 2014 07:24:12 -0700

If you want to keep stopwords, take the stopword filter out of your analysis 
chain.
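
A minimal sketch of what that could look like, assuming the text_general field type quoted below with both StopFilterFactory lines removed. The name text_general_nostop is a made-up label, and the NGram, stemming, and synonym filters from the original are left out only to keep the example short:

```xml
<!-- Hypothetical field type: same tokenizer and lowercasing as the quoted
     text_general, but with no StopFilterFactory in either chain, so words
     such as "of" and "to" are indexed and searched like any other token. -->
<fieldType name="text_general_nostop" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Existing documents need to be re-indexed after a change to the index-time chain.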


wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/


On Jul 15, 2014, at 1:36 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:

> https://github.com/apache/lucene-solr/blob/lucene_solr_4_9_0/lucene/analysis/common/src/java/org/apache/lucene/analysis/core/StopAnalyzer.java#L51
> 
> If you don't set the attribute in XML file, it falls back to the
> default definitions.
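> 
> As a rough illustration (not taken from any shipped schema), these two
> declarations behave differently:
> 
> ```xml
> <!-- words attribute set: stop words are read from the named file -->
> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> 
> <!-- words attribute omitted: the factory falls back to the built-in
>      English set defined in StopAnalyzer, which includes "of" and "to" -->
> <filter class="solr.StopFilterFactory" ignoreCase="true"/>
> ```
> 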
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
> 
> 
> On Tue, Jul 15, 2014 at 3:16 PM, Aman Tandon <amantandon...@gmail.com> wrote:
>> Hi jack,
>> 
>> 
>> it will use the internal *Lucene hardwired list* of stop words
>> 
>> 
>> I am unaware of this. Could you please provide more information about
>> it?
>> 
>> 
>> With Regards
>> Aman Tandon
>> 
>> 
>> On Tue, Jul 15, 2014 at 7:21 AM, Alexandre Rafalovitch <arafa...@gmail.com>
>> wrote:
>> 
>>> You could try experimenting with CommonGramsFilterFactory and
>>> CommonGramsQueryFilterFactory (slightly different). There are actually a lot
>>> of cool analyzers bundled with Solr. You can find a full list on my site
>>> at: http://www.solr-start.com/info/analyzers
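>>> 
>>> A minimal sketch of pairing the two CommonGrams factories, assuming the
>>> common words ("of", "to", and so on) are listed in a file here called
>>> commonwords.txt; both that file name and the field type name
>>> text_commongrams are made up for illustration:
>>> 
>>> ```xml
>>> <fieldType name="text_commongrams" class="solr.TextField" positionIncrementGap="100">
>>>   <analyzer type="index">
>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>     <!-- index time: adds word pairs containing a common word next to the single tokens -->
>>>     <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
>>>   </analyzer>
>>>   <analyzer type="query">
>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>     <!-- query time: the query-side variant, meant to match the pairs built at index time -->
>>>     <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
>>>   </analyzer>
>>> </fieldType>
>>> ```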
>>> 
>>> Regards,
>>>   Alex.
>>> Personal: http://www.outerthoughts.com/ and @arafalov
>>> Solr resources: http://www.solr-start.com/ and @solrstart
>>> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>>> 
>>> 
>>> On Tue, Jul 15, 2014 at 8:42 AM, Teague James <teag...@insystechinc.com>
>>> wrote:
>>>> Alex,
>>>> 
>>>> Thanks! Great suggestion. I figured out that it was the
>>>> EdgeNGramFilterFactory. Taking that out of the mix did it.
>>>> 
>>>> -Teague
>>>> 
>>>> -----Original Message-----
>>>> From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
>>>> Sent: Monday, July 14, 2014 9:14 PM
>>>> To: solr-user
>>>> Subject: Re: Of, To, and Other Small Words
>>>> 
>>>> Have you tried the Admin UI's Analysis screen? It will show you
>>>> what happens to the text as it progresses through the tokenizers and
>>>> filters. No need to reindex.
>>>> 
>>>> Regards,
>>>>   Alex.
>>>> Personal: http://www.outerthoughts.com/ and @arafalov
>>>> Solr resources: http://www.solr-start.com/ and @solrstart
>>>> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>>>> 
>>>> 
>>>> On Tue, Jul 15, 2014 at 8:10 AM, Teague James <teag...@insystechinc.com> wrote:
>>>>> Hi Anshum,
>>>>> 
>>>>> Thanks for replying and suggesting this, but the field type I am using
>>>>> (a modified text_general) in my schema has the file set to 'stopwords.txt'.
>>>>> 
>>>>> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>>>>>   <analyzer type="index">
>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>>>>>     <!-- in this example, we will only use synonyms at query time
>>>>>     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>-->
>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>     <!-- CHANGE: The NGramFilterFactory was added to provide partial word search.
>>>>>     This can be changed to EdgeNGramFilterFactory side="front" to only match
>>>>>     front sided partial searches if matching any part of a word is undesirable.-->
>>>>>     <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="10" />
>>>>>     <!-- CHANGE: The PorterStemFilterFactory was added to allow matches for 'cat' and 'cats' by searching for 'cat' -->
>>>>>     <filter class="solr.PorterStemFilterFactory"/>
>>>>>   </analyzer>
>>>>>   <analyzer type="query">
>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>>>>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>     <!-- CHANGE: The PorterStemFilterFactory was added to allow matches for 'cat' and 'cats' by searching for 'cat' -->
>>>>>     <filter class="solr.PorterStemFilterFactory"/>
>>>>>   </analyzer>
>>>>> </fieldType>
>>>>> 
>>>>> Just to be double sure, I cleared the list in stopwords_en.txt,
>>>>> restarted Solr, re-indexed, and searched, still with zero results. Any other
>>>>> suggestions on where I might be able to control this behavior?
>>>>> 
>>>>> -Teague
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Anshum Gupta [mailto:ans...@anshumgupta.net]
>>>>> Sent: Monday, July 14, 2014 4:04 PM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Re: Of, To, and Other Small Words
>>>>> 
>>>>> Hi Teague,
>>>>> 
>>>>> The StopFilterFactory (which I think you're using) by default uses
>>>>> lang/stopwords_en.txt (which wouldn't be empty if you check).
>>>>> What you're looking at is stopwords.txt. You could either empty that
>>>>> file out or change the field type for your field.
>>>>> 
>>>>> 
>>>>> On Mon, Jul 14, 2014 at 12:53 PM, Teague James <teag...@insystechinc.com> wrote:
>>>>>> Hello all,
>>>>>> 
>>>>>> I am working with Solr 4.9.0 and am searching for phrases that
>>>>>> contain words like "of" or "to" that Solr seems to be ignoring at
>>>>>> index time.
>>>>>> Here's what I tried:
>>>>>> 
>>>>>> curl "http://localhost/solr/update?commit=true" -H "Content-Type: text/xml" \
>>>>>>   --data-binary '<add><doc><field name="id">100</field><field name="content">blah blah blah knowledge of science blah blah blah</field></doc></add>'
>>>>>> 
>>>>>> Then, using a browser:
>>>>>> 
>>>>>> http://localhost/solr/collection1/select?q="knowledge+of+science"&fq=id:100
>>>>>> 
>>>>>> I get zero hits. Search for "knowledge" or "science" and I'll get hits.
>>>>>> "knowledge of" or "of science" and I get zero hits. I don't want to
>>>>>> use proximity if I can avoid it, as this may introduce too many
>>>>>> undesirable results. Stopwords.txt is blank, yet clearly Solr is
>>>>>> ignoring "of" and "to"
>>>>>> and possibly more words that I have not discovered through testing
>>>>>> yet. Is there some other configuration file that contains these small
>>>>>> words? Is there any way to force Solr to pay attention to them and
>>>>>> not drop them from the phrase? Any advice is appreciated! Thanks!
>>>>>> 
>>>>>> -Teague
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> 
>>>>> Anshum Gupta
>>>>> http://www.anshumgupta.net
>>>>> 
>>>> 
>>>
