Hello, first post here... I spent some time researching this but can't seem to find the answer I am looking for.
I have a MySQL DB that Solr is indexing, and all is well. However, one field I need to index is a text field that contains XML stored in the DB. I read up on DIH transformers a bit, and I am wondering: is there a way to have Solr's DIH either transform the XML or strip the markup out of the field as it indexes it, leaving only the textual data in Solr's index?

This XML field is the body content of website articles (don't ask why, not my choice :-/), and it also contains a lot of CDATA sections wrapping HTML inside the XML. I want Solr to index this data, minus all the markup.

Should I be using a RegexTransformer to strip tags (this feels like the wrong approach), or would HTMLStripTransformer work? Is there an XMLTransformer I don't know about?

I have been reading http://wiki.apache.org/solr/DataImportHandler but I feel like I am missing something that would make this work. My dataConfig is barebones at the moment. Any help is greatly appreciated.

Thanks,
Matt
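
P.S. To make the question concrete, here is a minimal sketch of the HTMLStripTransformer variant I am asking about, based on my reading of the wiki page above. The database name, table name, column names, and credentials are placeholders made up for illustration, not my real schema, and I have not verified how this behaves on the CDATA-wrapped HTML, so please treat it as an untested assumption rather than a working config.

```xml
<dataConfig>
  <!-- Placeholder connection details; substitute the real MySQL URL and credentials -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb"
              user="solr_user"
              password="changeme"/>
  <document>
    <!-- "articles", "id", "title" and "body" are hypothetical names -->
    <!-- transformer="HTMLStripTransformer" enables the transformer for this entity -->
    <entity name="article"
            transformer="HTMLStripTransformer"
            query="SELECT id, title, body FROM articles">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <!-- stripHTML="true" asks the transformer to remove markup from this column only -->
      <field column="body" name="body" stripHTML="true"/>
    </entity>
  </document>
</dataConfig>
```

What I can't tell from the docs is whether this is enough for XML with embedded CDATA, or whether I would also need a RegexTransformer pass on the same field.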