Solr Newbie question: doubts about how to handle html content

Marcio Pinto Motta Thu, 05 Oct 2006 04:19:18 -0700

Solr Newbie question: doubts about html content



My current problem is finding the best approach to handling content that
contains HTML code.



I have some docs that may or may not contain HTML tags.



In my first attempt, I defined a field "text" in my schema.xml:



 <field name="text" type="text" indexed="true" stored="true"/>

and sent docs with the HTML placed directly inside the field:

 <field name="text"> <br><p>   A Brasil Telecom … <br/><br/><br/></field>


But the docs that contain HTML code threw an error when I tried to send them
to Solr.



In my second attempt, I wrapped the content in a CDATA section,
"<![CDATA[<br><p>   A Brasil Telecom … <br/><br/><br/>]]>". This time I could
send the docs to Solr, and I could search for "<br>" and retrieve the doc.
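
To illustrate the difference between the first two attempts, here is a minimal Python sketch (my own illustration, not Solr code). It assumes the error in the first attempt came from the update message not being well-formed XML, since <br> and <p> are never closed. The sample string is a shortened stand-in for the real content.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Shortened stand-in for the real field content (illustrative only).
html = "<br><p>   A Brasil Telecom <br/><br/><br/>"

# Attempt 1: raw HTML dropped straight into the XML field.
raw_doc = '<field name="text">' + html + '</field>'
# Attempt 2: the same HTML wrapped in a CDATA section.
cdata_doc = '<field name="text"><![CDATA[' + html + ']]></field>'
# Equivalent alternative: escape &, < and > as XML entities.
escaped_doc = '<field name="text">' + escape(html) + '</field>'


def field_text(xml_string):
    """Return the field's text, or None if the XML is not well-formed."""
    try:
        return ET.fromstring(xml_string).text
    except ET.ParseError:
        return None


print(field_text(raw_doc))      # None: <br> and <p> are never closed
print(field_text(cdata_doc))    # the original HTML string
print(field_text(escaped_doc))  # the original HTML string
```

Both the CDATA form and the entity-escaped form carry exactly the same text to the parser; they are two spellings of the same content.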



But looking at the source of the result page, I see

<str name="text">&lt;br&gt;&lt;p&gt;  A Brasil Telecom ... </str>

so the HTML code was "changed": the angle brackets came back as &lt; and
&gt; entities.
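
As a hedged side note, the entities in the page source are ordinary XML escaping, so any XML parser reading the response should hand back the original characters. A minimal Python sketch, using a shortened stand-in for the response fragment above:

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the fragment seen in the result page source.
response_fragment = '<str name="text">&lt;br&gt;&lt;p&gt;  A Brasil Telecom </str>'

# Parsing the XML undoes the entity escaping automatically.
stored_value = ET.fromstring(response_fragment).text
print(stored_value)
```

The escaping is only visible when the raw response is viewed as text; a client that parses the XML sees the tags as they were sent.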





My third approach is to create 2 fields in my schema:



- One with the original content

- One with no HTML code, which will be indexed.
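
A rough sketch of how the two values could be prepared on the client side before posting, in Python. The field names "text_html" and "text_plain" are hypothetical and would have to be declared in schema.xml; the tag stripping uses Python's standard html.parser and is not a Solr feature.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(html):
    """Return the fragment's text with tags removed and whitespace collapsed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def build_add_doc(doc_id, html):
    """Build an <add><doc> update message with both field variants.

    "text_html" and "text_plain" are hypothetical field names.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    values = (
        ("id", doc_id),
        ("text_html", html),               # original content, kept as is
        ("text_plain", strip_html(html)),  # tag-free text for searching
    )
    for name, value in values:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes <, > and & on output
    return ET.tostring(add, encoding="unicode")


message = build_add_doc("1", "<br><p>   A Brasil Telecom <br/><br/><br/>")
print(message)
```

In this sketch the original field would presumably be stored but not searched, and the plain field would be the one indexed.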



But I don't know how to preserve the HTML content in the new field. My
question is:

How can I put these docs in Solr, search them, and retrieve the original HTML
content?



Thanks for your attention.



BR,



Marcio
