On Jan 7, 2008 5:15 PM, Benjamin Higgins <[EMAIL PROTECTED]> wrote:
> Hi all, I am using a mostly out-of-the-box install of Solr to search
> through our code repositories.  I've run into a funny problem where
> searches for camelCased text return no results unless the casing
> matches exactly.
>
> For example, a query for "getElementById" returns 364 results, but
> "getelementbyid" returns 0.
>
> Not every casing is a problem, though.  For example, "function" and
> "Function" return the same number of results, as do "FUNCTION" and
> "FUNCtion" (6,278 with my docs).  However, "funcTION" returns only a
> few results, and only where the word is actually split up
> (e.g. "func tion")!
>
> So it seems that something may be splitting words wherever the case
> changes in the middle of them!
>
> How can I get this to stop?

Remove WordDelimiterFilter.
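
A minimal sketch of what that change could look like, assuming the
filter is dropped from both the index and query analyzers of the "text"
type quoted below and everything else is left alone. The commented-out
index-time synonym filter is omitted here for brevity; nothing else is
new.

```xml
<!-- Sketch only: the "text" type from the quoted schema.xml with
     WordDelimiterFilterFactory removed from both analyzers. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <!-- Lowercasing is what makes matching case-insensitive. -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

Changing the index analyzer only affects documents indexed afterwards,
so existing documents would need to be reindexed. Another option, if
your Solr version honors the attribute, may be to keep the filter and
set splitOnCaseChange="0" in both analyzers, which should keep the
splitting on punctuation while leaving camelCase words intact.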

It's funny though, since WordDelimiterFilter should not have caused
this to happen (a query of getelementbyid should have matched a doc
with getElementById).

-Yonik

> Thanks!
>
> Ben
>
>
> Here's the definition for the text field type in my schema.xml:
>
>     <fieldType name="text" class="solr.TextField"
> positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <!-- in this example, we will only use synonyms at query time
>         <filter class="solr.SynonymFilterFactory"
> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>         -->
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt"/>
>         <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.EnglishPorterFilterFactory"
> protected="protwords.txt"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory"
> synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt"/>
>         <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.EnglishPorterFilterFactory"
> protected="protwords.txt"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>     </fieldType>
>
>
