On 4/14/06, Ken Krugler <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I've got some fields that will contain embedded XML. Two questions
> relating to that:
>
> 1. It appears as though I'll need to XML-escape the field data, as
> otherwise Solr complains about find a start tag (one of the embedded
> tags) before it finds the end tag for a field.
>
> Is this an expected constraint?

It's not just embedded XML you have to escape... even in normal text
you need to escape XML reserved characters such as '<' or '&'. For
most text protocols, there isn't really a way around escaping.

> And is XML-escaping the data the best way to handle it? This is kind
> of related to question #2...

If you are doing the escaping yourself rather than relying on an XML
library, wrapping things in a CDATA section might be easier for you.
That would avoid having to escape each reserved character individually.
http://www.w3schools.com/xml/xml_cdata.asp

> 2. What would be the easiest way to ignore XML tag data while
> indexing these types of XML-containing fields?

The HTMLStrip* tokenizers might work for you, especially if the fields
aren't well-formed XML or HTML.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

> It seems like I could
> define a new field type (e.g. text_xml) and set the associated
> tokenizer class to something new that I create. Though I'd have to
> un-escape the data (ick) before parsing it to skip tags.

You wouldn't have to un-escape the data before parsing because the XML
parser in Solr will have already done that.

-Yonik
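
To make the escaping and CDATA options above concrete, here is a minimal
Python sketch. It is only an illustration under some assumptions: the
field name "body", the sample text, and the helper names field_escaped
and field_cdata are all made up for this example, and Python's standard
XML parser stands in for the one inside Solr. It shows that either
approach hands the parser's consumer the original, un-escaped text.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET


def field_escaped(name, value):
    """Build a <field> element, escaping '&', '<' and '>' in the value."""
    return "<field name=%s>%s</field>" % (quoteattr(name), escape(value))


def field_cdata(name, value):
    """Build a <field> element whose value is wrapped in a CDATA section."""
    # A CDATA section cannot contain "]]>", so split that sequence
    # across two adjacent CDATA sections.
    safe = value.replace("]]>", "]]]]><![CDATA[>")
    return "<field name=%s><![CDATA[%s]]></field>" % (quoteattr(name), safe)


# Hypothetical field content with embedded markup and a bare ampersand.
raw = "AT&T <b>bold</b> text"

escaped_doc = "<add><doc>" + field_escaped("body", raw) + "</doc></add>"
cdata_doc = "<add><doc>" + field_cdata("body", raw) + "</doc></add>"

# After parsing, both forms yield the original string again; nothing
# downstream of the parser needs to un-escape it.
parsed_escaped = ET.fromstring(escaped_doc).find("doc/field").text
parsed_cdata = ET.fromstring(cdata_doc).find("doc/field").text
```

The one catch with CDATA is the "]]>" sequence, which is why the sketch
splits it; with plain escaping there is no such special case.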