Re: Of, To, and Other Small Words

Alexandre Rafalovitch Tue, 15 Jul 2014 01:38:24 -0700

https://github.com/apache/lucene-solr/blob/lucene_solr_4_9_0/lucene/analysis/common/src/java/org/apache/lucene/analysis/core/StopAnalyzer.java#L51


If you don't set the "words" attribute on the StopFilterFactory in the XML
schema file, it falls back to the default stop word list defined at the link
above.
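
As a rough illustration (a sketch only, not taken from anyone's actual
schema), the difference would look something like this inside a field type's
analyzer. The first filter has no "words" attribute, so it should use the
built-in English list from StopAnalyzer, which includes "of" and "to". The
second names a file, so only the entries in that file are removed. The file
name stopwords.txt is simply the one mentioned earlier in this thread.

```xml
<!-- No "words" attribute: falls back to Lucene's hardwired English stop set
     (StopAnalyzer), which contains "of", "to", "the", and similar words -->
<filter class="solr.StopFilterFactory" ignoreCase="true"/>

<!-- "words" attribute set: only the entries listed in stopwords.txt are
     removed; an empty file means no stop words are dropped by this filter -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
```

Either way, the index-time and query-time analyzers need to agree, and
documents need to be reindexed after the index-time chain changes.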
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On Tue, Jul 15, 2014 at 3:16 PM, Aman Tandon <amantandon...@gmail.com> wrote:
> Hi Jack,
>
>
> it will use the internal *Lucene hardwired list* of stop words
>
>
> I am unaware of this. Could you please provide more information about
> this?
>
>
> With Regards
> Aman Tandon
>
>
> On Tue, Jul 15, 2014 at 7:21 AM, Alexandre Rafalovitch <arafa...@gmail.com>
> wrote:
>
>> You could try experimenting with CommonGramsFilterFactory and
>> CommonGramsQueryFilter (slightly different). There are actually a lot
>> of cool analyzers bundled with Solr. You can find the full list on my site
>> at: http://www.solr-start.com/info/analyzers
>>
>> Regards,
>>    Alex.
>>
>>
>> On Tue, Jul 15, 2014 at 8:42 AM, Teague James <teag...@insystechinc.com>
>> wrote:
>> > Alex,
>> >
>> > Thanks! Great suggestion. I figured out that it was the
>> > EdgeNGramFilterFactory. Taking that out of the mix did it.
>> >
>> > -Teague
>> >
>> > -----Original Message-----
>> > From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
>> > Sent: Monday, July 14, 2014 9:14 PM
>> > To: solr-user
>> > Subject: Re: Of, To, and Other Small Words
>> >
>> > Have you tried the Admin UI's Analysis screen? It will show you
>> > what happens to the text as it progresses through the tokenizers and
>> > filters. No need to reindex.
>> >
>> > Regards,
>> >    Alex.
>> >
>> >
>> > On Tue, Jul 15, 2014 at 8:10 AM, Teague James <teag...@insystechinc.com> wrote:
>> >> Hi Anshum,
>> >>
>> >> Thanks for replying and suggesting this, but the field type I am using
>> >> (a modified text_general) in my schema has the file set to 'stopwords.txt'.
>> >>
>> >>         <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>> >>                 <analyzer type="index">
>> >>                         <tokenizer class="solr.StandardTokenizerFactory"/>
>> >>                         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>> >>                         <!-- in this example, we will only use synonyms at query time
>> >>                         <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>-->
>> >>                         <filter class="solr.LowerCaseFilterFactory"/>
>> >>                         <!-- CHANGE: The NGramFilterFactory was added to provide partial word search. This can be changed to
>> >>                         EdgeNGramFilterFactory side="front" to only match front sided partial searches if matching any
>> >>                         part of a word is undesirable.-->
>> >>                         <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="10" />
>> >>                         <!-- CHANGE: The PorterStemFilterFactory was added to allow matches for 'cat' and 'cats' by searching for 'cat' -->
>> >>                         <filter class="solr.PorterStemFilterFactory"/>
>> >>                 </analyzer>
>> >>                 <analyzer type="query">
>> >>                         <tokenizer class="solr.StandardTokenizerFactory"/>
>> >>                         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>> >>                         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>> >>                         <filter class="solr.LowerCaseFilterFactory"/>
>> >>                         <!-- CHANGE: The PorterStemFilterFactory was added to allow matches for 'cat' and 'cats' by searching for 'cat' -->
>> >>                         <filter class="solr.PorterStemFilterFactory"/>
>> >>                 </analyzer>
>> >>         </fieldType>
>> >>
>> >> Just to be double sure I cleared the list in stopwords_en.txt,
>> >> restarted Solr, re-indexed, and searched with still zero results. Any other
>> >> suggestions on where I might be able to control this behavior?
>> >>
>> >> -Teague
>> >>
>> >>
>> >> -----Original Message-----
>> >> From: Anshum Gupta [mailto:ans...@anshumgupta.net]
>> >> Sent: Monday, July 14, 2014 4:04 PM
>> >> To: solr-user@lucene.apache.org
>> >> Subject: Re: Of, To, and Other Small Words
>> >>
>> >> Hi Teague,
>> >>
>> >> The StopFilterFactory (which I think you're using) by default uses
>> >> lang/stopwords_en.txt (which wouldn't be empty if you check).
>> >> What you're looking at is stopwords.txt. You could either empty that
>> >> file out or change the field type for your field.
>> >>
>> >>
>> >> On Mon, Jul 14, 2014 at 12:53 PM, Teague James <teag...@insystechinc.com> wrote:
>> >>> Hello all,
>> >>>
>> >>> I am working with Solr 4.9.0 and am searching for phrases that
>> >>> contain words like "of" or "to" that Solr seems to be ignoring at
>> >>> index time. Here's what I tried:
>> >>>
>> >>> curl "http://localhost/solr/update?commit=true" -H "Content-Type: text/xml" \
>> >>>   --data-binary '<add><doc><field name="id">100</field><field name="content">blah blah blah knowledge of science blah blah blah</field></doc></add>'
>> >>>
>> >>> Then, using a browser:
>> >>>
>> >>> http://localhost/solr/collection1/select?q="knowledge+of+science"&fq=id:100
>> >>>
>> >>> I get zero hits. Search for "knowledge" or "science" and I'll get hits.
>> >>> Search for "knowledge of" or "of science" and I get zero hits. I don't
>> >>> want to use proximity if I can avoid it, as this may introduce too many
>> >>> undesirable results. Stopwords.txt is blank, yet clearly Solr is
>> >>> ignoring "of" and "to" and possibly more words that I have not
>> >>> discovered through testing yet. Is there some other configuration file
>> >>> that contains these small
>> >>> words? Is there any way to force Solr to pay attention to them and
>> >>> not drop them from the phrase? Any advice is appreciated! Thanks!
>> >>>
>> >>> -Teague
>> >>>
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >>
>> >> Anshum Gupta
>> >> http://www.anshumgupta.net
>> >>
>> >
>>
