Re: Can a URL based datasource in DIH return non xml

2010-11-22 Thread lee carroll
Hi Erik, Thank you for the response. Just for completeness of the thread: I'm going to process the XHTML offline. Another approach could be to set up a web service that DIH could call and that returns XML produced by an HTML parser. However, for my purposes it's just as easy to use curl and Perl and then use
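
For readers following the thread, the "HTML parser that returns XML" idea can be sketched roughly as below. This is only an illustration, not what was actually used here: it relies on Python's standard-library `html.parser`, and the names `HtmlToXml`, `html_to_xml`, and the wrapping `<doc>` root element are hypothetical choices made for the example. Tag names are passed through unchecked, so treat it as a starting point rather than a robust converter.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# HTML elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Conservative check for attribute names that are safe to emit as XML.
SAFE_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_.-]*$")


class HtmlToXml(HTMLParser):
    """Hypothetical helper: re-emits lenient HTML as well-formed XML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # output fragments
        self.stack = []  # currently open (non-void) elements

    def handle_starttag(self, tag, attrs):
        # dict() drops duplicate attributes, which XML does not allow.
        clean = {k: v for k, v in dict(attrs).items() if SAFE_NAME.match(k)}
        attr_text = "".join(f" {k}={quoteattr(v or '')}" for k, v in clean.items())
        if tag in VOID_TAGS:
            self.out.append(f"<{tag}{attr_text}/>")
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: ignore it
        # Close any elements left open inside the one being closed.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))

    def finish(self):
        # Close whatever is still open at end of input.
        while self.stack:
            self.out.append(f"</{self.stack.pop()}>")


def html_to_xml(html):
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    parser.finish()
    # Wrap in a single root so the result is one XML document.
    return "<doc>" + "".join(parser.out) + "</doc>"


if __name__ == "__main__":
    xml_text = html_to_xml("<p>Tom & Jerry<br><b>bold<i>both</b>")
    ET.fromstring(xml_text)  # raises ParseError if not well-formed
    print(xml_text)
```

Output of this kind could be written to a file by an offline step, or served by a small web service, so that a URL-based data source only ever receives well-formed XML.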

Re: Can a URL based datasource in DIH return non xml

2010-11-22 Thread Erick Erickson
DIH does some good stuff, but it doesn't handle bad input very robustly (actually, how could it intuit what "the right thing" is?). I'd consider SolrJ coupled with a "forgiving" HTML parser, e.g. http://sourceforge.net/projects/nekohtml/ Best Erick On Su