DIH (the DataImportHandler) does some good stuff, but it doesn't handle
malformed input very robustly (and really, how could it intuit what "the
right thing" is?). I'd consider writing a small SolrJ client coupled with a
"forgiving" HTML parser, e.g. NekoHTML:
http://sourceforge.net/projects/nekohtml/
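
To make the "forgiving parser" half of that suggestion concrete, here is a
minimal sketch. It is only an illustration under some assumptions: it uses
the lenient HTML parser that ships with the JDK (javax.swing.text.html) as a
stand-in for NekoHTML so that it runs without extra jars, and the class and
method names (ForgivingHtmlText, extractText) are made up for this example.
The SolrJ half is only described in a comment, since it needs the SolrJ jar
and a running Solr server.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; not part of Solr, SolrJ, or NekoHTML.
class ForgivingHtmlText {

    // Parse possibly invalid HTML and return the text it contains,
    // with runs of whitespace collapsed to single spaces.
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Separate chunks with a space so words in adjacent elements
                // do not run together. The trade-off: a word with markup in
                // the middle of it is split in two.
                text.append(data).append(' ');
            }
        };
        try {
            // The third argument tells the parser to ignore any charset
            // declared in the markup; the input is already decoded text.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Unclosed <b> and <p>, no <html> or <body>: not valid XHTML.
        String messy = "<p>Hello <b>world<p>second paragraph";
        System.out.println(extractText(messy));
        // With SolrJ on the classpath, this text would be added as a field
        // of a SolrInputDocument and sent to the Solr server from here.
    }
}
```

The point of doing it this way is that your own client code decides what to
do with each broken page (skip it, index what could be recovered, log it),
rather than having the import fail on the first page that isn't well-formed.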

Best
Erick

On Sun, Nov 21, 2010 at 7:46 PM, lee carroll
<lee.a.carr...@googlemail.com> wrote:

> Hi,
>
> Can a URL-based datasource in DIH return non-XML? My pages being indexed
> are written by many authors and will often be invalid XHTML. Can DIH cope
> with this, or will I need another approach?
>
> thanks in advance Lee C
>
