The bottom line is that you will need your own code to detect the "choice"
tag and map it to the desired choice, and you will have to do that before
you "strip" the HTML.
So, given:
<choice>
<orig>C</orig>
<reg>c</reg>
</choice>astors
Your code will have to remove "<choice>...</choice>" and replace it with the
element content of the "<orig>" or "<reg>" element - but not both.
Otherwise, "Strip" HTML (either in PHP or Solr) will preserve the white
space between "</reg>" and "</choice>", which is what was causing the "c" to
be separate from "astors".
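To make the whitespace problem concrete, here is a minimal sketch in Python.
The function naive_strip_tags is a hypothetical stand-in for a tag stripper,
not the actual implementation of PHP's strip_tags or of Solr's HTML
stripping, which may differ in detail. It only illustrates that removing the
tags leaves the line breaks between them in place:

```python
import re


def naive_strip_tags(markup: str) -> str:
    # Hypothetical stand-in for a tag stripper: delete anything that looks
    # like a tag, and leave all text and whitespace between tags untouched.
    return re.sub(r"<[^>]+>", "", markup)


sample = "<choice>\n<orig>C</orig>\n<reg>c</reg>\n</choice>astors"
stripped = naive_strip_tags(sample)

print(repr(stripped))    # '\nC\nc\nastors'
print(stripped.split())  # ['C', 'c', 'astors']
```

A whitespace tokenizer then sees "C", "c" and "astors" as three separate
tokens instead of one word.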
In short, your PHP code should not call strip_tags; instead, it must replace
each "<choice>...</choice>" element itself. Do keep the strip HTML step in
the Solr schema to remove the rest of the HTML.
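As a rough illustration of that replacement step, here is a sketch in
Python. The helper name resolve_choice and its keep parameter are made up
for this example. It assumes every "<choice>" element contains the child you
want to keep and that "<choice>" elements are not nested; a real XML parser
would be more robust than a regular expression. The same pattern should be
usable from PHP with preg_replace and the "s" modifier:

```python
import re


def resolve_choice(markup: str, keep: str = "reg") -> str:
    # Hypothetical helper: replace each <choice>...</choice> element with the
    # text content of exactly one child, either <reg> or <orig>.
    tag = re.escape(keep)
    pattern = re.compile(
        r"<choice\b[^>]*>.*?<" + tag + r"\b[^>]*>(.*?)</" + tag + r">.*?</choice>",
        re.DOTALL,  # let "." match the line breaks inside the element
    )
    return pattern.sub(lambda m: m.group(1), markup)


sample = "<choice>\n<orig>C</orig>\n<reg>c</reg>\n</choice>astors"

print(resolve_choice(sample, "reg"))   # castors
print(resolve_choice(sample, "orig"))  # Castors
```

Because the whole element, including its internal line breaks, is replaced
by the kept text, the result joins directly onto "astors".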
-- Jack Krupansky
-----Original Message-----
From: Tigunn
Sent: Friday, June 01, 2012 9:27 AM
To: solr-user@lucene.apache.org
Subject: Re: Strip html
Thanks for your answers. Unfortunately, I can't try this before Monday.
First, my Solr settings:
In schema.xml:
In my PHP:
in a loop over all the XML documents in my eXist-db database (an XML database
which stores XML files)
An example of an XML doc:
I follow these steps:
1 - I transform the XML to HTML with an XSL stylesheet (not mine, but I can
change the XSL stylesheets to generate text without HTML: I want to try that).
For information, XSLT 1.0 returns this for the example:
You can see that the word "castors" is broken up by HTML tags.
2 - I want to strip the HTML tags before indexing.
I tried in PHP: $body_norm = strip_tags($body_norm);
With the current fieldType defined in schema.xml, the result is wrong.
But I want to try
What do you think?
--
View this message in context:
http://lucene.472066.n3.nabble.com/Strip-html-tp3987051p3987253.html
Sent from the Solr - User mailing list archive at Nabble.com.