We are using Heritrix, the Internet Archive’s open source crawler, which is very easy to extend. We have augmented it with a custom parser to crawl some specific data formats and written our own processors (Heritrix’s terminology for extensions) to link together different data sources and to output XML files in the right format to feed to Solr. We have not yet created an automated path to feed the XML files into Solr, but we plan to.
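
To give a rough idea of what "XML in the right format" means, here is a minimal sketch of building a Solr `<add>` update message. It is written in Python for brevity and is not our actual Heritrix processor; the function name `to_solr_add_xml`, the sample record, and the field names (`id`, `title`, `links`) are all made up for illustration and would need to match your own Solr schema.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(records):
    """Build a Solr <add> update message from a list of field dicts."""
    add = ET.Element("add")
    for record in records:
        doc = ET.SubElement(add, "doc")
        for name, value in record.items():
            # A list value becomes repeated <field> elements (a multi-valued field).
            values = value if isinstance(value, (list, tuple)) else [value]
            for v in values:
                field = ET.SubElement(doc, "field", name=name)
                field.text = str(v)
    # ElementTree escapes &, < and > in the field text for us.
    return ET.tostring(add, encoding="unicode")


# Hypothetical crawl record; the field names are assumptions, not a real schema.
records = [
    {
        "id": "http://example.org/a",
        "title": "Page A",
        "links": ["http://example.org/b"],
    },
]
xml_payload = to_solr_add_xml(records)
```

The resulting string is what would be sent to Solr's XML update handler, followed by a commit; that posting step is the part we have not automated yet.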
On 3/5/09 3:32 PM, "Tony Wang" <ivyt...@gmail.com> wrote:

Hi,

I wonder if there's any open source crawler product that could be integrated with Solr. What crawler do you guys use? Or did you code one yourselves? I have been trying to find solutions for Nutch/Solr integration, but haven't had any luck yet. Could someone shed some light?

Thanks!
Tony

--
Are you RCholic? www.RCholic.com
温 良 恭 俭 让 仁 义 礼 智 信