The bottom line is that you need your own code to detect the "choice" tag and map it to the desired choice, and that code has to run before you "strip" the HTML.

So, given:

    <choice>
        <orig>C</orig>
        <reg>c</reg>
    </choice>astors

Your code has to replace the whole "<choice>...</choice>" element, including the whitespace inside it, with the text content of either the "<orig>" or the "<reg>" element - but not both.

Otherwise, stripping the HTML (whether in PHP or in Solr) will preserve the whitespace between "</reg>" and "</choice>", which is what was separating the "c" from "astors".
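
A small PHP check of this behaviour, assuming the markup looks like the sample above (the variable names are illustrative only). strip_tags removes the tags but leaves the text and whitespace between them untouched:

```php
<?php
// Sample markup: a <choice> element directly followed by the rest of the word.
$markup = "<choice>\n    <orig>C</orig>\n    <reg>c</reg>\n</choice>astors";

// strip_tags drops the tags only; the newlines and spaces between them remain.
$stripped = strip_tags($markup);

// Collapse runs of whitespace so the effect is easy to see.
$normalized = trim(preg_replace('/\s+/', ' ', $stripped));

echo $normalized, "\n"; // prints: C c astors
```

Both alternatives survive, and the whitespace keeps "c" apart from "astors".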

In short, your PHP code should not use strip_tags; instead it must replace each "<choice>...</choice>" element. Do keep the HTML stripping in the Solr schema to remove the rest of the HTML.
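
One way to do the replacement in PHP, as a minimal sketch rather than a definitive implementation. The function name resolve_choice and its $prefer parameter are made up for illustration. It assumes "<choice>" elements are not nested and that a regular expression is good enough for your markup; a real XML parser would be more robust.

```php
<?php
// Hypothetical helper: collapse each <choice>...</choice> element to the text
// of one child element. $prefer is 'reg' or 'orig'.
function resolve_choice($markup, $prefer = 'reg')
{
    $tag = preg_quote($prefer, '~');

    return preg_replace_callback(
        // Match one whole <choice> element, including inner whitespace.
        '~<choice\b[^>]*>(.*?)</choice>~s',
        function (array $m) use ($tag) {
            // Look for the preferred child inside this <choice>.
            $child = '~<' . $tag . '\b[^>]*>(.*?)</' . $tag . '>~s';
            if (preg_match($child, $m[1], $hit)) {
                return $hit[1];
            }
            // Preferred child missing: leave the element untouched.
            return $m[0];
        },
        $markup
    );
}

$markup = "<choice>\n    <orig>C</orig>\n    <reg>c</reg>\n</choice>astors";

echo resolve_choice($markup), "\n";         // prints: castors
echo resolve_choice($markup, 'orig'), "\n"; // prints: Castors
```

Because the whole element is replaced, the whitespace between "</reg>" and "</choice>" goes away with it, and the remaining tags can be left for Solr to strip.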

-- Jack Krupansky
-----Original Message-----
From: Tigunn
Sent: Friday, June 01, 2012 9:27 AM
To: solr-user@lucene.apache.org
Subject: Re: Strip html

Thanks for your answers. Unfortunately, I can't try this before Monday.

First, my Solr settings.
In schema.xml:

In my PHP code:
in a loop over all the XML documents in my eXist-db database (an XML database
which stores XML files)


An example of an XML document:


I follow these steps:
1 - I transform the XML to HTML with an XSL stylesheet (not mine, but I can
change the XSL stylesheets to generate text without HTML; I want to try that).
For information, XSLT 1.0 returns this for the example:

Notice that the word "castors" is broken up by HTML tags.


2 - I want to strip the HTML tags before indexing.
I tried this in PHP:      $body_norm = strip_tags($body_norm);
With the current fieldType defined in schema.xml, the result is wrong.
But I want to try
What do you think?

--
View this message in context: http://lucene.472066.n3.nabble.com/Strip-html-tp3987051p3987253.html
Sent from the Solr - User mailing list archive at Nabble.com.