Re: Strip html

Chris Hostetter Thu, 31 May 2012 15:04:37 -0700

: I make a transformation XSLT which return :
: ---------------------------------------
: si les ruches d’abeilles prouvent la
:                   monarchie, les fourmillières, les troupes d’éléphants ou
: de castors prouvent la république.
: ---------------------------------------
: i put this html in solr:  $doc->addField('body_strip_html', $body_norm);   
        ...
: But this don't work!
: I want to return this xml files (look exemple) if i search "castor".


I'm confused.

a) you said you've already transformed your input XML into plain text -- 
so I don't see why you need HTML stripping at all.
b) your current problem doesn't seem to have anything to do with HTML or 
XML ... you're asking why a document containing "castors" (plural) doesn't 
match a query for "castor" (singular), but the field type you say you are 
using has a very simple analyzer that doesn't do any stemming of any kind...

>>        <analyzer>
>>                <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>                <tokenizer class="solr.StandardTokenizerFactory"/>
>>        </analyzer>

..since there is no HTML in your input, HTMLStripCharFilterFactory is a 
no-op.  That leaves StandardTokenizerFactory, which only splits the text 
into tokens, so "castors" is indexed exactly as written.

It seems like all you need to do is add a stemmer (and for efficiency: 
remove the HTMLStripCharFilterFactory).  I'm no expert, but it looks like 
you are indexing French, so I would suggest using a French stemmer...

https://wiki.apache.org/solr/LanguageAnalysis#French
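
As a rough illustration of that suggestion, a field type along these lines 
might work. This is only a sketch: the name "text_fr" is a placeholder 
chosen for the example, the exact filter chain is an assumption rather 
than something taken from the original schema, and the wiki page above 
is the better guide to which filters your Solr version ships.

```xml
<!-- Sketch only: "text_fr" is a placeholder name, not from the original schema -->
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- strips elided articles, e.g. the d' in "d'abeilles" (assumes the default article list is acceptable) -->
    <filter class="solr.ElisionFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- light French stemmer, intended to reduce plurals such as "castors" toward "castor" -->
    <filter class="solr.FrenchLightStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the same analyzer is applied at index and query time, documents 
would need to be reindexed after a change like this before a query for 
"castor" could match them.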



-Hoss
