DIH (the DataImportHandler) does some good stuff, but it doesn't handle malformed input very robustly (and really, how could it intuit what "the right thing" is?). I'd consider SolrJ coupled with a "forgiving" HTML parser, e.g. NekoHTML: http://sourceforge.net/projects/nekohtml/
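
To make the "forgiving parser" idea concrete, here is a rough sketch of the extraction half only. It is an illustration under stated assumptions, not a tested recipe: it uses the lenient HTML parser that ships with the JDK (javax.swing.text.html) as a stand-in for NekoHTML so that it runs without extra jars, and the class name LenientHtmlText and the method extract are made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: pulls the visible text out of possibly invalid HTML.
// The JDK parser is used here as a stand-in for NekoHTML (an assumption made
// to keep the sketch dependency-free); it tolerates unclosed and misnested tags.
class LenientHtmlText {

    static String extract(String html) {
        StringBuilder out = new StringBuilder();

        // The callback receives each run of text the parser manages to recover.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (out.length() > 0) {
                    out.append(' ');   // separate text runs with a single space
                }
                out.append(data);
            }
        };

        try {
            // 'true' tells the parser to ignore any charset declared in the page.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Deliberately broken markup: <b>, <p> and <div> are never closed.
        String broken = "<html><body><p>Hello <b>world<p>second<div>unclosed</body>";
        System.out.println(extract(broken));
    }
}
```

The recovered text would then go into a SolrJ SolrInputDocument (for instance via addField with a field name from your own schema) and be sent to Solr from your client code. That keeps the decision about what to do with a bad page in your program rather than in DIH.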

Best,
Erick

On Sun, Nov 21, 2010 at 7:46 PM, lee carroll <lee.a.carr...@googlemail.com> wrote:

> Hi,
>
> Can a URL-based datasource in DIH return non-XML? My pages being indexed
> are written by many authors and will often be invalid XHTML. Can DIH cope
> with this, or will I need another approach?
>
> Thanks in advance,
> Lee C