DIH Transform XML?

Matt Galvin Fri, 22 Apr 2011 10:59:04 -0700

Hello,

First post here. I spent some time researching this but can't seem
to find the answer I am looking for.


I have a MySQL DB that I have Solr indexing and all is well.

However, one field I need to index is a text field that contains XML
stored in the DB. I read up on DIH Transformers a bit and I am
wondering: is there a way to have the Solr DIH either transform the XML
data or strip the XML out of the field as it indexes it, leaving only
the textual data in Solr's index?

This XML field is the body content of web site articles (don't ask
why, not my choice :-/) and it also has a lot of CDATA sections wrapping
HTML in the XML. I want Solr to index this data, minus all the markup.
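
To make the desired result concrete, here is a minimal sketch, written in
Python and run outside Solr, of the two-stage stripping described above:
parse the XML first (which unwraps the CDATA), then strip the HTML that was
inside it. The sample document, its element names, and the function name
strip_markup are hypothetical, and this is not a DIH feature or the way DIH
does it; it only illustrates the input and the output wanted.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(xml_body):
    """Return the plain text of an XML body whose CDATA wraps HTML."""
    # Stage 1: parse the XML; CDATA content comes back as ordinary text.
    root = ET.fromstring(xml_body)
    embedded_html = " ".join(root.itertext())
    # Stage 2: drop the HTML tags that were wrapped in the CDATA.
    collector = TextCollector()
    collector.feed(embedded_html)
    collector.close()
    # Stage 3: collapse runs of whitespace left behind by removed tags.
    return " ".join("".join(collector.parts).split())


# Hypothetical article body, shaped like the field described above.
sample = (
    "<article><title>Hello</title>"
    "<body><![CDATA[<p>Some <b>bold</b> text.</p>]]></body></article>"
)
print(strip_markup(sample))  # prints: Hello Some bold text.
```

The point of the sketch is that a single tag-stripping pass is not enough
for this field: the HTML inside the CDATA only becomes visible as markup
after the outer XML has been parsed.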

Should I be using a RegexTransformer to strip tags (this feels like
the wrong approach) or would HTMLStripTransformer work? Is there an
XMLTransformer I don't know about?

I have been reading this:

http://wiki.apache.org/solr/DataImportHandler

but I feel like I am missing something that would make this work.

My dataConfig is bare-bones at the moment.

Any help is greatly appreciated.

Thanks,

Matt
