I have a MySQL DB that I have Solr indexing and all is well.

However, one field I need to index is a text field that contains XML
stored in the DB. I read up on DIH Transformers a bit and I am
wondering: is there a way to have Solr's DIH either transform the XML
data or strip the XML out of the field as it indexes it, leaving only
the textual data in Solr's index?

This XML field is the body content of web site articles (don't ask
why, not my choice :-/), and it also has a lot of CDATA sections
wrapping HTML inside the XML. I want Solr to index this data, minus
all the markup.

Should I be using a RegexTransformer to strip tags (this feels like
the wrong approach) or would HTMLStripTransformer work? Is there an
XMLTransformer I don't know about?



I'm not sure about the CDATA sections, but HTMLStripTransformer behaves
like solr.HTMLStripCharFilterFactory (see
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory),
so it can be used to strip XML tags as well.
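
For illustration, here is a minimal sketch of how that could look in the
DIH data-config.xml. The table name (articles), column names (id, body),
and the JDBC URL and credentials are all hypothetical placeholders, not
taken from your setup. The transformer is enabled on the entity, and
stripHTML="true" marks the field it should be applied to:

```xml
<dataConfig>
  <!-- Hypothetical connection details: replace with your own -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb"
              user="solr_user"
              password="changeme"/>
  <document>
    <!-- HTMLStripTransformer runs on each row returned by the query -->
    <entity name="article"
            transformer="HTMLStripTransformer"
            query="SELECT id, body FROM articles">
      <field column="id" name="id"/>
      <!-- stripHTML="true" asks the transformer to remove markup from this column -->
      <field column="body" name="body" stripHTML="true"/>
    </entity>
  </document>
</dataConfig>
```

How the HTML wrapped in CDATA sections comes out is something I have not
verified, so it is worth running a small import and checking the indexed
text for a few articles.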
