Re: Strip html

Jack Krupansky Fri, 01 Jun 2012 07:39:03 -0700

The bottom line is that you will need your own code to detect the "choice" tag and map it to the desired choice, and you will have to do that before you "strip" the HTML.


So, given:


                                   <choice>
                                       <orig>C</orig>
                                       <reg>c</reg>
                                   </choice>astors

Your code will have to remove "<choice>...</choice>" and replace it with the element content of the "<orig>" or "<reg>" element - but not both.

Otherwise, "Strip" HTML (either in PHP or Solr) will preserve the whitespace between "</reg>" and "</choice>", which was causing the "c" to be separate from "astors".

In short, your PHP code should not use strip_tags; it only has to replace the "<choice>...</choice>" element. Do keep the strip HTML in the Solr schema to remove the rest of the HTML.
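
As a rough illustration of that replacement step, here is a minimal PHP sketch. The function name resolve_choices and its $prefer parameter are made up for this example, and it assumes "<choice>" elements carry no attributes and are never nested. A DOM-based approach would be more robust than regular expressions. The final strip_tags call only stands in for the stripping that Solr would do at index time.

```php
<?php
// Hypothetical helper (not from the original thread): collapse each
// <choice>...</choice> element to the text of ONE of its children.
// Assumes attribute-free, non-nested <choice> elements.
function resolve_choices(string $html, string $prefer = 'reg'): string
{
    $tag = preg_quote($prefer, '~');
    $result = preg_replace_callback(
        '~<choice>(.*?)</choice>~s',
        function (array $m) use ($tag): string {
            // Keep only the preferred child (<reg> by default).
            if (preg_match('~<' . $tag . '>(.*?)</' . $tag . '>~s', $m[1], $hit)) {
                return trim($hit[1]);
            }
            // Preferred child missing: fall back to whatever text is there.
            return trim(strip_tags($m[1]));
        },
        $html
    );
    // preg_replace_callback returns null on a regex error; keep the input then.
    return $result ?? $html;
}

$body = "<p>les <choice>\n    <orig>C</orig>\n    <reg>c</reg>\n</choice>astors</p>";

// Resolve the choices first, then strip the remaining markup
// (here with strip_tags, standing in for the Solr-side HTML stripping).
echo strip_tags(resolve_choices($body)), "\n";
```

Because the whole "<choice>" element, including the whitespace between its children, is replaced by a single trimmed value, "c" and "astors" end up adjacent before any tag stripping happens. Passing 'orig' as the second argument would keep the original spelling instead.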


-- Jack Krupansky

-----Original Message-----
From: Tigunn

Sent: Friday, June 01, 2012 9:27 AM
To: solr-user@lucene.apache.org
Subject: Re: Strip html

Thanks for your answers. Unfortunately, I can't try before Monday.

First, my Solr settings:
In schema.xml:

In my PHP:
in a loop over all the XML documents in my eXist-db database (an XML database which
stores XML files)

An example of an XML doc:

I follow these steps:
1 - I transform the XML to HTML with an XSL sheet (not mine, but I can change
the XSL sheets to generate text without HTML: I want to try).
For information, XSLT 1.0 returns for the example:

You can notice: the word "castors" is broken by an HTML tag

2 - I want to strip the HTML tags before indexing.
I try in PHP:      $body_norm = strip_tags($body_norm);
With the current fieldType defined in schema.xml it's wrong.
But I want to try
What do you think about it?

--

View this message in context:
http://lucene.472066.n3.nabble.com/Strip-html-tp3987051p3987253.html
Sent from the Solr - User mailing list archive at Nabble.com.
