Re: Parsing/indexing XML data

Ken Krugler Fri, 14 Apr 2006 08:22:52 -0700

Hi Yonik,

Thanks for the fast response.

 > I've got some fields that will contain embedded XML. Two questions
 relating to that:

 1. It appears as though I'll need to XML-escape the field data, as
 otherwise Solr complains about finding a start tag (one of the embedded
 tags) before it finds the end tag for a field.

 Is this an expected constraint?


It's not just embedded XML you have to escape... even in normal text
you need to escape XML reserved characters such as '<' or '&'.
For most text protocols, there is not much of a way around escaping stuff.

OK, just curious. What made it a bit confusing is that for my API, if
the input is well-formed XML, then my XML parser is happy - it's only
reserved XML characters that need to be escaped.

 > And is XML-escaping the data the best way to handle it? This is kind
 of related to question #2...


If you are doing the escaping yourself rather than relying on an XML
library, wrapping things in a CDATA section might be easier for you.
That would prevent the necessity of escaping every character.
http://www.w3schools.com/xml/xml_cdata.asp
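
For reference, a minimal sketch of the CDATA approach in plain Java. The
class and method names, and the field name "body", are made up for
illustration and are not part of Solr. The one sequence that cannot
appear inside a CDATA section is "]]>", so the sketch splits it across
two adjacent sections.

```java
final class CdataSketch {
    // Wraps text in a CDATA section so '<' and '&' need no escaping.
    // Any "]]>" in the text would end the section early, so it is split:
    // the first section ends after "]]" and a new one starts before ">".
    static String wrapInCdata(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    public static void main(String[] args) {
        // "body" is a hypothetical field name.
        System.out.println("<field name=\"body\">"
                + wrapInCdata("<p>a & b</p>") + "</field>");
    }
}
```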

I'm using the org.apache.commons.lang.StringEscapeUtils class, which
seems to work pretty well for this.
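
To show what the escaping amounts to, here is a rough stand-in written
against the JDK only. It is not the Commons Lang implementation (there
the call would be StringEscapeUtils.escapeXml); the class name below is
made up, and it only handles the five predefined XML entities.

```java
final class XmlEscapeSketch {
    // Replaces the five XML reserved characters with their predefined
    // entities; every other character is copied through unchanged.
    static String escapeXml(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Embedded markup becomes plain character data for the outer document.
        System.out.println(escapeXml("<b>a & b</b>"));
    }
}
```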

 > 2. What would be the easiest way to ignore XML tag data while
 > indexing these types of XML-containing fields?

The HTMLStrip* tokenizers might work for you, esp if the fields aren't
well formed XML or HTML.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
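
As a rough illustration of "skip the tags, keep the text", here is a
naive sketch in plain Java. It is not how the HTMLStrip* tokenizers are
implemented, the names are made up, and it does not handle comments,
CDATA sections, or '>' inside attribute values.

```java
final class TagSkipSketch {
    // Drops everything between '<' and the next '>', and puts a space
    // where each tag was so neighbouring words do not run together.
    static String stripTags(String xml) {
        StringBuilder out = new StringBuilder(xml.length());
        boolean inTag = false;
        for (int i = 0; i < xml.length(); i++) {
            char c = xml.charAt(i);
            if (c == '<') {
                inTag = true;
                out.append(' ');
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Only the words "Hello" and "world" survive, separated by whitespace.
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
    }
}
```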

 It seems like I could
 define a new field type (e.g. text_xml) and set the associated
 tokenizer class to something new that I create. Though I'd have to
 un-escape the data (ick) before parsing it to skip tags.


You wouldn't have to un-escape the data before parsing because the XML
parser in Solr will have already done that.
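
A small sketch of why that is, using the JDK's DOM parser as a stand-in
(this is not Solr's own parsing code, and the class name and the field
name "body" are made up): any XML parser hands back the character data
with the entities already resolved.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class ParsedFieldSketch {
    // Parses a single-element document and returns its text content.
    // Entities such as &lt; and &amp; are resolved by the parser itself.
    static String fieldText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        // The escaped field value comes back as the original markup.
        String posted = "<field name=\"body\">&lt;p&gt;a &amp; b&lt;/p&gt;</field>";
        System.out.println(fieldText(posted));
    }
}
```

So a custom tokenizer would see the original "<p>a & b</p>" text and
only has to worry about skipping the tags.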


Great advice - let me give that a try.

Thanks!

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
