If you want to keep stopwords, take the stopword filter out of your analysis chain.
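
As a rough sketch of what that could look like, here is a variant of the text_general type quoted below with solr.StopFilterFactory removed from both the index and query analyzers. The type name "text_keep_stopwords" is made up for illustration. The n-gram filter is left out as well, since the quoted thread reports that removing it fixed the problem. That would be consistent with a minGramSize of 3 discarding two-letter tokens such as "of" and "to", though I have not verified it against this exact setup.

```xml
<!-- Sketch only, not a tested configuration.
     "text_keep_stopwords" is a hypothetical name; the factories are the
     ones already used in the text_general type quoted in this thread. -->
<fieldType name="text_keep_stopwords" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- No solr.StopFilterFactory here, so "of" and "to" stay in the index -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Keep the query chain consistent with the index chain -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

After changing the index-time analyzer you need to reindex. The Admin UI's Analysis screen is a quick way to check that "knowledge of science" now produces three tokens on both sides.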
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/

On Jul 15, 2014, at 1:36 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:

> https://github.com/apache/lucene-solr/blob/lucene_solr_4_9_0/lucene/analysis/common/src/java/org/apache/lucene/analysis/core/StopAnalyzer.java#L51
>
> If you don't set the attribute in the XML file, it falls back to the
> default definitions.
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>
> On Tue, Jul 15, 2014 at 3:16 PM, Aman Tandon <amantandon...@gmail.com> wrote:
>> Hi Jack,
>>
>> it will use the internal *Lucene hardwired list* of stop words
>>
>> I am unaware of this. Could you please provide more information about
>> it?
>>
>> With Regards
>> Aman Tandon
>>
>> On Tue, Jul 15, 2014 at 7:21 AM, Alexandre Rafalovitch <arafa...@gmail.com>
>> wrote:
>>
>>> You could try experimenting with CommonGramsFilterFactory and
>>> CommonGramsQueryFilter (slightly different). There are actually a lot
>>> of cool analyzers bundled with Solr. You can find the full list on my site
>>> at: http://www.solr-start.com/info/analyzers
>>>
>>> Regards,
>>> Alex.
>>>
>>> On Tue, Jul 15, 2014 at 8:42 AM, Teague James <teag...@insystechinc.com>
>>> wrote:
>>>> Alex,
>>>>
>>>> Thanks! Great suggestion. I figured out that it was the
>>>> EdgeNGramFilterFactory. Taking that out of the mix did it.
>>>>
>>>> -Teague
>>>>
>>>> -----Original Message-----
>>>> From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
>>>> Sent: Monday, July 14, 2014 9:14 PM
>>>> To: solr-user
>>>> Subject: Re: Of, To, and Other Small Words
>>>>
>>>> Have you tried the Admin UI's Analyze screen? It will show you
>>>> what happens to the text as it progresses through the tokenizers and
>>>> filters. No need to reindex.
>>>>
>>>> Regards,
>>>> Alex.
>>>>
>>>> On Tue, Jul 15, 2014 at 8:10 AM, Teague James <teag...@insystechinc.com>
>>>> wrote:
>>>>> Hi Anshum,
>>>>>
>>>>> Thanks for replying and suggesting this, but the field type I am using
>>>>> (a modified text_general) in my schema has the file set to 'stopwords.txt'.
>>>>>
>>>>> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>>>>>   <analyzer type="index">
>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>>>>>     <!-- in this example, we will only use synonyms at query time
>>>>>     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>-->
>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>     <!-- CHANGE: The NGramFilterFactory was added to provide partial word search. This can be changed to
>>>>>     EdgeNGramFilterFactory side="front" to only match front sided partial searches if matching any
>>>>>     part of a word is undesirable.-->
>>>>>     <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="10" />
>>>>>     <!-- CHANGE: The PorterStemFilterFactory was added to allow matches for 'cat' and 'cats' by searching for 'cat' -->
>>>>>     <filter class="solr.PorterStemFilterFactory"/>
>>>>>   </analyzer>
>>>>>   <analyzer type="query">
>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>>>>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>     <!-- CHANGE: The PorterStemFilterFactory was added to allow matches for 'cat' and 'cats' by searching for 'cat' -->
>>>>>     <filter class="solr.PorterStemFilterFactory"/>
>>>>>   </analyzer>
>>>>> </fieldType>
>>>>>
>>>>> Just to be double sure, I cleared the list in stopwords_en.txt,
>>>>> restarted Solr, re-indexed, and searched, still with zero results. Any other
>>>>> suggestions on where I might be able to control this behavior?
>>>>>
>>>>> -Teague
>>>>>
>>>>> -----Original Message-----
>>>>> From: Anshum Gupta [mailto:ans...@anshumgupta.net]
>>>>> Sent: Monday, July 14, 2014 4:04 PM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Re: Of, To, and Other Small Words
>>>>>
>>>>> Hi Teague,
>>>>>
>>>>> The StopFilterFactory (which I think you're using) by default uses
>>>>> lang/stopwords_en.txt (which wouldn't be empty if you check).
>>>>> What you're looking at is stopwords.txt. You could either empty that
>>>>> file out or change the field type for your field.
>>>>>
>>>>> On Mon, Jul 14, 2014 at 12:53 PM, Teague James <teag...@insystechinc.com> wrote:
>>>>>> Hello all,
>>>>>>
>>>>>> I am working with Solr 4.9.0 and am searching for phrases that
>>>>>> contain words like "of" or "to" that Solr seems to be ignoring at
>>>>>> index time. Here's what I tried:
>>>>>>
>>>>>> curl http://localhost/solr/update?commit=true -H "Content-Type: text/xml"
>>>>>> --data-binary '<add><doc><field name="id">100</field><field
>>>>>> name="content">blah blah blah knowledge of science blah blah
>>>>>> blah</field></doc></add>'
>>>>>>
>>>>>> Then, using a browser:
>>>>>>
>>>>>> http://localhost/solr/collection1/select?q="knowledge+of+science"&fq=id:100
>>>>>>
>>>>>> I get zero hits. Search for "knowledge" or "science" and I'll get hits.
>>>>>> "knowledge of" or "of science" and I get zero hits. I don't want to
>>>>>> use proximity if I can avoid it, as this may introduce too many
>>>>>> undesirable results. Stopwords.txt is blank, yet clearly Solr is
>>>>>> ignoring "of" and "to" and possibly more words that I have not
>>>>>> discovered through testing yet. Is there some other configuration
>>>>>> file that contains these small words? Is there any way to force Solr
>>>>>> to pay attention to them and not drop them from the phrase? Any
>>>>>> advice is appreciated! Thanks!
>>>>>>
>>>>>> -Teague
>>>>>
>>>>> --
>>>>> Anshum Gupta
>>>>> http://www.anshumgupta.net