I have been reading threads all day regarding this topic and nothing
seems to work the way it says it should. :)  I appreciate any and all
help in this matter.

Solr 4 is working perfectly for me in all regards, with this one exception.

My requirement from Solr4 is very simple.  I am storing a document
like a job description in a text_general field.

I have added a SynonymFilterFactory filter so that I can map C++
=> cplusplus and c# => csharp during indexing and querying.

Here is the field definition:

    <fieldType name="text_general" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory"
                synonyms="punctuation-whitelist.txt"
                ignoreCase="true" expand="false"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory"
                synonyms="punctuation-whitelist.txt"
                ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Here are the contents of punctuation-whitelist.txt:

    c++ => cplusplus
    C# => csharp

I have only one document indexed for this test. When I search for
resume_text:C++, I get the following result, which is the same result
I get when I search for just resume_text:c.

You can see from the highlighting that Solr is matching on the "C" only:


<response>
        <lst name="responseHeader">
                <int name="status">0</int>
                <int name="QTime">20</int>
        </lst>
        <result name="response" numFound="1" start="0" maxScore="0.16273327">
                <doc>
                        <arr name="resume_text">
                                <str>C++ Developer with c# experience, including .net</str>
                        </arr>
                </doc>
        </result>
        <lst name="highlighting">
                <lst name="208645">
                        <arr name="resume_text">
                                <str>&lt;em&gt;C&lt;/em&gt;++ Developer with &lt;em&gt;c&lt;/em&gt;# experience, including .net</str>
                        </arr>
                </lst>
        </lst>
</response>

If I use the Analysis tool in the Solr Web UI, putting "C#" or "C++"
into the Index or Query boxes translates to just "C" in all filters
and tokenizers in the analysis output.

Can someone please explain the _Best_ way to accomplish what I am
trying to do, which is accurately index, search and highlight text
with words like C++ and C#.  I am looking for the "right way" and it's
okay if I have started down the wrong path.
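
For reference, here is an untested sketch of one alternative, based on
the Analysis output above. Since "C++" and "C#" are already reduced to
"C" by the tokenizer, the synonym filter never sees the original text,
so the rewrite could instead happen before tokenization with a
charFilter. solr.MappingCharFilterFactory is a stock Solr factory, but
the file name mapping-punctuation.txt and its entries are assumptions
of mine, and I have not confirmed that highlighting behaves the way I
want with this setup.

    <!-- Sketch only. Assumes a hypothetical mapping-punctuation.txt in
         the same conf directory, containing lines such as:
           "C++" => "cplusplus"
           "c++" => "cplusplus"
           "C#"  => "csharp"
           "c#"  => "csharp"
         As far as I know the mapping is case-sensitive, so both cases
         are listed. -->
    <fieldType name="text_general" class="solr.TextField"
               positionIncrementGap="100">
      <!-- One analyzer without a type applies to both index and query -->
      <analyzer>
        <!-- Runs on the raw text, before the tokenizer can drop + and # -->
        <charFilter class="solr.MappingCharFilterFactory"
                    mapping="mapping-punctuation.txt"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

With something like this, the document would need to be reindexed, and
the Analysis tool should show whether "C++" now comes out as
"cplusplus" on both the Index and Query sides.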

:)

Thank you.
Dave
