Hi Yonik,
Thanks for the fast response.
> I've got some fields that will contain embedded XML. Two questions
relating to that:
1. It appears as though I'll need to XML-escape the field data, as
otherwise Solr complains about finding a start tag (one of the embedded
tags) before it finds the end tag for a field.
Is this an expected constraint?
It's not just embedded XML you have to escape... even in normal text
you need to escape XML reserved characters such as '<' or '&'.
For most text protocols, there is not much of a way around escaping stuff.
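To make the escaping point concrete, here is a minimal sketch of what it does to a field value before it goes into an update message. The class and method names (XmlEscapeSketch, escapeXml, field) are made up for illustration and are not part of Solr or Commons Lang; in practice an XML library or a helper such as StringEscapeUtils would normally do this job.

```java
// Hypothetical helper for illustration only; not a Solr or Commons Lang API.
final class XmlEscapeSketch {

    /** Replaces the five XML reserved characters with their predefined entities. */
    static String escapeXml(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    /** Builds one <field> element as it would appear inside a Solr <add><doc>. */
    static String field(String name, String value) {
        return "<field name=\"" + escapeXml(name) + "\">" + escapeXml(value) + "</field>";
    }

    public static void main(String[] args) {
        // The embedded <para> tags and the bare '&' are escaped, so the XML
        // parser sees them as text rather than as markup inside <field>.
        System.out.println(field("body", "<para>Tom & Jerry</para>"));
    }
}
```

Run as written, this prints `<field name="body">&lt;para&gt;Tom &amp; Jerry&lt;/para&gt;</field>`, which is why embedded tags no longer confuse the parser.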
OK, just curious. What made it a bit confusing is that for my API, if
the input is well-formed XML, then my XML parser is happy - it's only
reserved XML characters that need to be escaped.
> And is XML-escaping the data the best way to handle it? This is kind
of related to question #2...
If you are doing the escaping yourself rather than relying on an XML
library, wrapping things in a CDATA section might be easier for you.
That would avoid the need to escape each reserved character.
http://www.w3schools.com/xml/xml_cdata.asp
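As a rough sketch of the CDATA approach, assuming the field value is assembled by hand as a string: the one thing that still needs care is the sequence "]]>", which would end the section early. The class and method names below (CdataSketch, wrapInCdata) are hypothetical.

```java
// Hypothetical helper for illustration only; not a Solr API.
final class CdataSketch {

    /**
     * Wraps text in a CDATA section. Any "]]>" inside the text is split
     * across two adjacent sections so it cannot terminate the first one.
     */
    static String wrapInCdata(String text) {
        // Close the section after "]]" and reopen it before ">".
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    public static void main(String[] args) {
        // '<' and '&' pass through untouched inside CDATA.
        System.out.println(wrapInCdata("<b>bold</b> & more"));
        // The embedded terminator is split into two sections.
        System.out.println(wrapInCdata("a]]>b"));
    }
}
```

The trade-off compared with entity escaping is that the payload stays readable on the wire, at the cost of having to handle that one terminator sequence.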
I'm using the org.apache.commons.lang.StringEscapeUtils class, which
seems to work pretty well for this.
> 2. What would be the easiest way to ignore XML tag data while
> indexing these types of XML-containing fields?
The HTMLStrip* tokenizers might work for you, especially if the fields
aren't well-formed XML or HTML.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
It seems like I could
define a new field type (e.g. text_xml) and set the associated
tokenizer class to something new that I create. Though I'd have to
un-escape the data (ick) before parsing it to skip tags.
You wouldn't have to un-escape the data before parsing because the XML
parser in Solr will have already done that.
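A small sketch of why no un-escaping step is needed: any XML parser hands back the field text with entities already resolved, so whatever runs afterwards sees the original tags. This uses the JDK's DOM parser as a stand-in (it is not Solr's own parsing code), and the stripTags method is a naive regex placeholder for a real tag-skipping tokenizer, not how the HTMLStrip* tokenizers are implemented. All names here are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; stands in for "XML parser, then analysis".
final class UnescapeSketch {

    /** Parses a single <field> element and returns its text as the parser reports it. */
    static String fieldText(String fieldXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(fieldXml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse field XML", e);
        }
    }

    /** Naive tag skipping: replaces anything that looks like a tag with a space. */
    static String stripTags(String text) {
        return text.replaceAll("<[^>]*>", " ").trim();
    }

    public static void main(String[] args) {
        String wire = "<field name=\"body\">&lt;para&gt;Tom &amp; Jerry&lt;/para&gt;</field>";
        String seen = fieldText(wire);
        System.out.println(seen);            // entities already resolved by the parser
        System.out.println(stripTags(seen)); // text left after skipping the tags
    }
}
```

With the input above, the parser returns `<para>Tom & Jerry</para>` and the naive stripper leaves `Tom & Jerry`, so a custom tokenizer would only ever need to deal with the unescaped form.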
Great advice - let me give that a try.
Thanks!
-- Ken
--
Ken Krugler
Krugle, Inc.