On Oct 5, 2006, at 7:17 AM, Marcio Pinto Motta wrote:
My "current" problem is finding the best approach to handle content that
contains HTML code.



I have some docs that may or may not have HTML tags.



In my first attempt, I defined a field "text" in my schema.xml:



 <field name="text" type="text" indexed="true" stored="true"/>
<field name="texto"> <br><p> A Brasil Telecom … <br/><br/><br/>]] ></field>


But some docs that have HTML code threw an error when I tried to send them
to Solr.

You must use CDATA or encode entities that have special meaning in XML. I assume you're building the XML to POST to Solr as simply a string. You definitely need to take encoding into consideration to avoid invalid XML. I suspect whatever language you're communicating to Solr with has reasonable XML utilities you can leverage.
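
As a minimal sketch of the encoding step, assuming Python is the client language (the thread does not say which one is in use), the standard library's `xml.etree.ElementTree` escapes `<`, `>` and `&` automatically when it serializes text. The `<add><doc>` wrapper follows Solr's XML update format; the helper name `build_add_xml` and the sample string are only illustrative.

```python
import xml.etree.ElementTree as ET


def build_add_xml(fields):
    """Build a Solr <add> message from a dict of field name -> value.

    ElementTree escapes <, > and & in element text during serialization,
    so HTML content cannot break the surrounding XML.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


# Illustrative content, not the real document from the thread.
html = "<br><p> A Brasil Telecom <br/><br/>"
payload = build_add_xml({"text": html})
print(payload)
```

Building the message through an XML API rather than string concatenation means the escaping rules never have to be handled by hand.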

In my second attempt, I put "<![CDATA[<br><p>   A Brasil Telecom …
<br/><br/><br/>]]>" and I could send the docs to Solr, and I could
search for "<br>" and retrieve the doc.



But consulting the result page source,  as you can see,

<str name="text">

&lt;br&gt;&lt;p&gt;  A Brasil Telecom ... </str>

the html code was "changed".

It wasn't "changed" per se... but rather it was encoded. If you use an XML API to read the response you would not see these encoded characters.
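
To illustrate, here is a small sketch, again assuming a Python client, that parses a fragment shaped like the response above. The `<str name="text">` element is taken from that response; the enclosing `<doc>` element and the variable names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Fragment shaped like the response shown above; the <doc> wrapper is assumed.
response = (
    '<doc><str name="text">'
    "&lt;br&gt;&lt;p&gt;  A Brasil Telecom ... "
    "</str></doc>"
)

root = ET.fromstring(response)
# The parser decodes &lt; and &gt; back into < and > when reading the text.
stored = root.find("str[@name='text']").text
print(stored)
```

The value in `stored` is the original markup, so the encoding only exists in the serialized response, not in the data.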

. One with original content

. One with no html code, which will be indexed.



But I don't know how to preserve this html content in my new field. My
question is:

How can I put these docs in Solr, search them, and retrieve the original <html>
content?

What are your search needs? Are you really going to be searching on "<br>"? If so, you need to consider the analysis of the text sent to Solr carefully (look at the admin page analysis utility for insight). Regardless of what gets indexed, you can always store and retrieve the original text as long as the field is marked as stored.
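
If the two-field approach is still wanted, one option is to strip the markup on the client before sending the document. The following is only a sketch under that assumption, using Python's standard `html.parser`; the field name `text_original` and the helper `strip_html` are hypothetical, and the schema would need a matching stored field.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    """Return the text content of `html` with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


# Illustrative content, not the real document from the thread.
original = "<br><p> A Brasil Telecom <br/><br/>"
fields = {
    "text_original": original,     # hypothetical stored-only field
    "text": strip_html(original),  # tag-free text to be indexed
}
print(fields["text"])
```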

        Erik
