Re: Solr Newbie question: doubts about how to handle html content

Erik Hatcher Thu, 05 Oct 2006 05:47:59 -0700


On Oct 5, 2006, at 7:17 AM, Marcio Pinto Motta wrote:

My "current" problem is to know the best approach to handle content which
has html code.
I have some docs that may or may not have html tags.



For my first attempt, I defined a field "text" in my schema.xml:



 <field name="text" type="text" indexed="true" stored="true"/>
<field name="texto"> A Brasil Telecom … ]]></field>
But some docs that have html code throw an error when I try to send them
to Solr.

You must use CDATA or encode entities that have special meaning in XML. I assume you're building the XML to POST to Solr as simply a string. You definitely need to take encoding into consideration to avoid invalid XML. I suspect whatever language you're communicating to Solr with has reasonable XML utilities you can leverage.
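
As a rough illustration of leaning on an XML utility instead of string concatenation, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The "id" field, the sample text, and the helper name build_add_xml are assumptions made for the example; only the "text" field comes from the schema quoted above.

```python
import xml.etree.ElementTree as ET

def build_add_xml(doc_id, html_text):
    """Build a Solr <add> message; the serializer escapes <, > and & for us."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # "id" is a hypothetical unique-key field, not taken from the original post.
    id_field = ET.SubElement(doc, "field", name="id")
    id_field.text = doc_id
    # "text" is the field defined in the schema.xml snippet above.
    text_field = ET.SubElement(doc, "field", name="text")
    text_field.text = html_text
    return ET.tostring(add, encoding="unicode")

# Sample content containing characters that are special in XML.
payload = build_add_xml("doc1", "<br><p>A Brasil Telecom & more</p>")
print(payload)
```

The resulting string is well-formed XML regardless of what markup the document body contains, so it can be POSTed to Solr without a hand-written CDATA wrapper.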

For my second attempt, I put "<![CDATA[<br><p>   A Brasil Telecom … ]]>" and I could send the docs to Solr, and I could make a
search for "<br>" and retrieve the doc.



But consulting the result page source, as you can see,

<str name="text">

&lt;br&gt;&lt;p&gt;  A Brasil Telecom ... </str>

the html code was "changed".

It wasn't "changed" per se... but rather it was encoded. If you use an XML API to read the response you would not see these encoded characters.
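
To make that concrete, here is a small Python sketch that parses a response fragment like the one quoted above with xml.etree.ElementTree. The fragment is a simplified, single-line stand-in for the real response, which would wrap it in more elements.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the <str> element seen in the result page source.
response_fragment = (
    '<str name="text">&lt;br&gt;&lt;p&gt;  A Brasil Telecom ... </str>'
)

element = ET.fromstring(response_fragment)
# The parser decodes &lt; and &gt; back into the literal characters.
original = element.text
print(original)
```

The value read through the parser starts with the literal "<br><p>" again; the entities only exist in the serialized XML on the wire.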

. One with original content

. One with no html code, which will be indexed.



But I don't know how to preserve this html content in my new field. My
question is:

How do I put these docs in Solr, search them, and retrieve the original <html>
content?

What are your searching needs? Are you really going to be searching on "<br>"? If so, you need to consider the analysis of the text sent to Solr carefully (look at the admin page analysis utility for insight). Regardless of what gets indexed, you can always store and retrieve the original text as long as the field is marked as stored.
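
If the two-field idea from the question is the route taken, one possible client-side approach is to strip the tags before sending, keeping the untouched original alongside. The Python sketch below uses the standard library's html.parser; the field names "text_original" and "text_plain" and the helper names are hypothetical, and this is only one way to do it, not something Solr requires.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(html_text):
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    # Join the text nodes and collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())

original = "<br><p>   A Brasil Telecom</p>"
fields = {
    "text_original": original,          # hypothetical stored-only field
    "text_plain": strip_html(original), # hypothetical indexed field
}
print(fields["text_plain"])
```

The original markup goes into the stored field unchanged, so it can be returned with the search results, while queries run against the tag-free copy.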


        Erik
