Your first definition of text_fr looks correct and should work
as expected. I tested it and it worked fine ("mémé" was highlighted).

What was the output of HTMLStripCharFilterFactory in analysis.jsp?
In my analysis.jsp, I got "ça va mémé ?".
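
For comparison, here is a rough Python sketch of what I would expect the
char filter plus whitespace tokenizer to produce for your document. This is
only an approximation and not Solr's actual implementation: strip_html is a
hypothetical helper, and it assumes tags can simply be dropped (Solr's filter
handles offsets and malformed markup differently).

```python
import html
import re


def strip_html(text: str) -> str:
    """Hypothetical stand-in for an HTML-stripping char filter."""
    # Drop anything that looks like a tag (naive; real HTML needs a parser).
    without_tags = re.sub(r"<[^>]+>", "", text)
    # Decode character references such as &#231; and &#233;.
    return html.unescape(without_tags)


doc = "<html><body>&#231;a va m&#233;m&#233; ?</body></html>"
stripped = strip_html(doc)

# Whitespace tokenization, analogous to WhitespaceTokenizerFactory.
tokens = stripped.split()

print(stripped)
print(tokens)
```

If the char filter stage in your analysis.jsp still shows &#233; instead of é,
the entity is not being decoded on your side, which would explain why
'mémé' does not match.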

Koji


Kundig, Andreas wrote:
Hello

I indexed an HTML document that uses decimal HTML entity encodings: the character é (e
with an acute accent) is encoded as &#233;. The exact content of the document is:

<html><body>&#231;a va m&#233;m&#233; ?</body></html>

A search for 'mémé' returns no documents. If I put the line above into the Solr admin's
analysis.jsp, it doesn't match mémé either. There is only a match if I replace
&#233; with é.

This is how I configured the fieldType:

<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

I tried avoiding the problem by using the MappingCharFilterFactory:

<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

I put the file mapping.txt in the conf directory. It contains just this:

"&#233;" => "é"

This doesn't work either. How can I get this to work?
(I am using Solr 1.4.0)

thank you
Andréas Kündig




--
http://www.rondhuit.com/en/
