Re: Parsing/indexing XML data

Yonik Seeley Fri, 14 Apr 2006 06:08:04 -0700

On 4/14/06, Ken Krugler <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I've got some fields that will contain embedded XML. Two questions
> relating to that:
>
> 1. It appears as though I'll need to XML-escape the field data, as
> otherwise Solr complains about finding a start tag (one of the embedded
> tags) before it finds the end tag for a field.
>
> Is this an expected constraint?


It's not just embedded XML you have to escape. Even in normal text
you need to escape XML reserved characters such as '<' or '&'.
With most text-based protocols, there is no real way around escaping.
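
As a rough illustration of that escaping step, here is a minimal Python
sketch that builds one <field> element for a Solr add document. The helper
name solr_field and the field name "body" are made up for the example; only
the standard-library escape/quoteattr calls are real.

```python
from xml.sax.saxutils import escape, quoteattr

def solr_field(name, value):
    """Render one <field> element (hypothetical helper, not part of Solr).

    escape() replaces &, < and > in the value; quoteattr() quotes and
    escapes the attribute value.
    """
    return "<field name=%s>%s</field>" % (quoteattr(name), escape(value))

# The embedded markup and the bare ampersand are both escaped.
print(solr_field("body", "<b>AT&T</b> results"))
```

The same call works for plain text that merely contains a stray '<' or '&',
which is why the escaping is needed whether or not the field holds XML.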

> And is XML-escaping the data the best way to handle it? This is kind
> of related to question #2...

If you are doing the escaping yourself rather than relying on an XML
library, wrapping the value in a CDATA section might be easier for you.
That avoids having to escape each reserved character individually.
http://www.w3schools.com/xml/xml_cdata.asp
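
A minimal sketch of the CDATA approach in Python, assuming you build the
XML by hand. The helper name cdata is made up for the example. One caveat
it handles: a CDATA section cannot itself contain the sequence ']]>', so
that sequence has to be split across two sections.

```python
import xml.etree.ElementTree as ET

def cdata(text):
    """Wrap text in a CDATA section (hypothetical helper).

    Any ']]>' inside the text would end the section early, so it is
    split into ']]' and '>' across two adjacent CDATA sections.
    """
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

raw = "<b>AT&T</b> and a tricky ]]> sequence"
xml_doc = "<field>" + cdata(raw) + "</field>"

# An XML parser hands back the original text unchanged.
parsed = ET.fromstring(xml_doc).text
print(parsed == raw)
```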

> 2. What would be the easiest way to ignore XML tag data while
> indexing these types of XML-containing fields?

The HTMLStrip* tokenizers might work for you, especially if the fields
aren't well-formed XML or HTML.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
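
To give a feel for what tag stripping does to the field text, here is a
small Python sketch. It is not Solr's HTMLStrip implementation, just a
stand-in built on the standard-library HTMLParser, which also tolerates
markup that isn't well formed. The names TagStripper and strip_tags are
made up for the example.

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only the text between tags (illustrative stand-in)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(markup):
    parser = TagStripper()
    parser.feed(markup)
    parser.close()
    # Join chunks with a space so text from adjacent elements does not
    # run together, then collapse runs of whitespace.
    return " ".join(" ".join(parser.chunks).split())

# The unclosed <p> is tolerated; only the text survives.
print(strip_tags("<h1>Solr</h1><p>fast search"))
```

Note that joining chunks with a space also splits a word that has a tag in
the middle of it, which is a simplification in this sketch.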

> It seems like I could
> define a new field type (e.g. text_xml) and set the associated
> tokenizer class to something new that I create. Though I'd have to
> un-escape the data (ick) before parsing it to skip tags.

You wouldn't have to un-escape the data before parsing because the XML
parser in Solr will have already done that.
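
The same effect can be seen with any XML parser. This Python sketch uses
the standard-library ElementTree rather than Solr's own parser, and the
field name "body" is just an example: once the document is parsed, the
field value is the original, un-escaped markup.

```python
import xml.etree.ElementTree as ET

# An add document as it would be sent, with the field value escaped.
update = (
    '<add><doc>'
    '<field name="body">&lt;b&gt;AT&amp;T&lt;/b&gt; results</field>'
    '</doc></add>'
)

field = ET.fromstring(update).find("doc/field")
value = field.text

# The parser has already turned the entities back into '<', '>' and '&',
# so code further down the chain sees the raw embedded markup.
print(value)
```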

-Yonik
