Hi Erick,

Thank you for the response. Just for completeness of the thread: I'm going to process the XHTML offline. Another approach could be to set up a web service, callable from DIH, that returns XML produced by an HTML parser. However, for my purposes it's just as easy to use curl and Perl and then use DIH.
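
For anyone finding this thread later, here is a minimal sketch of the offline step, written in Python rather than Perl purely for illustration. It assumes the page has already been fetched (e.g. with curl) and that a title plus the visible text is all that needs indexing. The function name `html_to_xml` and the element names `page`, `url`, `title` and `body` are made up for this example; they are not anything DIH requires.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TextExtractor(HTMLParser):
    """Collects the title and visible text from possibly invalid HTML."""

    SKIP = {"script", "style"}  # content of these tags is never indexed

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.body_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text:
            return
        target = self.title_parts if self._in_title else self.body_parts
        target.append(text)


def html_to_xml(url, html):
    """Return a well-formed XML string for one page (hypothetical schema)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()

    # ElementTree escapes &, < and > when serialising, so the output
    # stays well-formed even though the input was not.
    page = ET.Element("page")
    ET.SubElement(page, "url").text = url
    ET.SubElement(page, "title").text = " ".join(parser.title_parts)
    ET.SubElement(page, "body").text = " ".join(parser.body_parts)
    return ET.tostring(page, encoding="unicode")
```

The resulting files could then, I assume, be read by DIH with an XPathEntityProcessor using something like forEach="/page"; I have not tested that part. The lenient parser never raises on unclosed tags or a bare "&", which is the whole point of doing this step outside DIH.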
cheers
Lee

On 22 November 2010 12:59, Erick Erickson <erickerick...@gmail.com> wrote:

> DIH does some good stuff, but it doesn't handle bad input very robustly
> (actually, how could it intuit what "the right thing" is?). I'd consider
> SolrJ coupled with a "forgiving" HTML parser, e.g.
> http://sourceforge.net/projects/nekohtml/
>
> Best
> Erick
>
> On Sun, Nov 21, 2010 at 7:46 PM, lee carroll
> <lee.a.carr...@googlemail.com> wrote:
>
> > Hi,
> >
> > Can a URL-based datasource in DIH return non-XML? My pages being indexed
> > are written by many authors and will often be invalid XHTML. Can DIH
> > cope with this, or will I need another approach?
> >
> > thanks in advance
> > Lee C