Re: escaping HTML tags within XML file

pulkitsinghal Sun, 25 Sep 2011 11:53:23 -0700

Assuming the XML carries the HTML as values inside fully formed tags, like so:
<node><HTML></HTML></node>, then I think using an "HTML" field type in
schema.xml (that is, one whose analyzer strips markup, for example via
solr.HTMLStripCharFilterFactory) for indexing/storing will let you run
meaningful searches on the content of the HTML without the HTML syntax itself
getting in the way.
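
As a rough sketch of what such a field type could look like in schema.xml
(the names "text_html" and "body_html" are made up for illustration, and the
tokenizer/filter choice is only an assumption, so adjust to your own schema):

```xml
<!-- Hypothetical field type: HTML markup is stripped during analysis only -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Removes tags and decodes entities before tokenizing -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field: searchable, but the raw HTML is not kept -->
<field name="body_html" type="text_html" indexed="true" stored="false"/>
```

Note that a char filter only changes the tokens that get indexed. If you set
stored="true" instead, the stored value is still the original HTML.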

If you have absolutely no need for the complete stored HTML when presenting
results to the user, then stripping out the markup at index time makes sense.
This will also affect highlighting on that field, because snippets are built
from the stored value, so just know your requirements.

If you don't want to present the field at all, then don't store it: just index
it with the right field type (HTML) so that searches find the right document.
Just because a field helps in finding the doc doesn't mean folks always want
to present or store it.

With the Data Import Handler, an HTML-stripping transformer
(HTMLStripTransformer) is available, so the markup is removed before the
indexer gets its hands on things. I can't be sure that is how you get your
data into Solr, though.
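
If you do use DIH, the config might look something like the sketch below. The
file path, entity name, column name and XPath expressions are placeholders I
made up, and it assumes the HTML arrives as text inside the element (escaped
or wrapped in CDATA):

```xml
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- Hypothetical entity: one Solr document per /docs/node element -->
    <entity name="page"
            processor="XPathEntityProcessor"
            url="/path/to/docs.xml"
            forEach="/docs/node"
            transformer="HTMLStripTransformer">
      <!-- stripHTML="true" asks the transformer to strip this column -->
      <field column="body" xpath="/docs/node/body" stripHTML="true"/>
    </entity>
  </document>
</dataConfig>
```

Unlike an analysis-time char filter, the transformer changes the value itself,
so whatever gets stored for that field is the stripped text as well.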

- Pulkit

On Sep 25, 2011, at 8:00 AM, okayndc <bodymo...@gmail.com> wrote:

> Hello,
> 
> Was wondering if it is necessary to escape HTML tags within an XML file for
> indexing?  If so, seems like large XML files with tons of HTML tags could
> get really messy (using CDATA).
> Has this been your experience?  Do you escape the HTML tags? If so, what
> technique do you use? Or do you leave the HTML tags in place without
> escaping them?
> 
> Thanks!
