Hi Erick,

Thank you for the response. Just for completeness of the thread: I'm going
to process the XHTML off-line. Another approach would be to set up a web
service that DIH could call and that returns XML produced by an HTML parser.
However, for my purposes it's just as easy to use curl and Perl and then use
DIH.
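
For anyone finding this thread later, here is a rough sketch of the off-line
clean-up step. It is written in Python rather than Perl purely for
illustration, and it is not the script I actually run. It assumes the page
has already been fetched (e.g. with curl), and the element names (page, url,
title, body) and the function name html_to_xml are made up for the example;
the DIH config would need XPaths that match whatever names you choose.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TextExtractor(HTMLParser):
    """Collects the title and visible text from possibly invalid HTML."""

    SKIP = {"script", "style"}  # content of these tags is not indexed

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.body_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        text = data.strip()
        if not text:
            return
        target = self.title_parts if self._in_title else self.body_parts
        target.append(text)


def html_to_xml(html, url):
    """Return a small well-formed XML document for one page.

    The element names are hypothetical; unclosed or badly nested tags
    in the input are tolerated because only text is extracted.
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()

    page = ET.Element("page")
    ET.SubElement(page, "url").text = url
    ET.SubElement(page, "title").text = " ".join(parser.title_parts)
    ET.SubElement(page, "body").text = " ".join(parser.body_parts)
    # ElementTree escapes &, < and > so the output is well-formed XML.
    return ET.tostring(page, encoding="unicode")
```

The output for each page can be written to a file and picked up by DIH as
ordinary XML. Only the text is kept, so any markup structure in the original
page is lost.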

cheers Lee

On 22 November 2010 12:59, Erick Erickson <erickerick...@gmail.com> wrote:

> DIH does some good stuff, but it doesn't handle bad input very robustly
> (actually, how could it intuit what "the right thing" is?). I'd consider
> SolrJ coupled with a "forgiving" HTML parser, e.g.
> http://sourceforge.net/projects/nekohtml/
>
> Best
> Erick
>
> On Sun, Nov 21, 2010 at 7:46 PM, lee carroll
> <lee.a.carr...@googlemail.com> wrote:
>
> > Hi,
> >
> > Can a URL-based data source in DIH return non-XML content? The pages being
> > indexed are written by many authors and will often be invalid XHTML. Can
> > DIH cope with this, or will I need another approach?
> >
> > thanks in advance Lee C
> >
>
