We use Heritrix too. We tried Nutch first, but it was not finding all
of the documents that it was supposed to. When Nutch and Heritrix were
both set to crawl our own site to a depth of three, Nutch missed some
pages that were linked directly from the seed. We ended up with 10%-20%
fewer pages in the Nutch crawl than in the Heritrix crawl.

It is pretty easy to add custom writers to Heritrix. We write our crawls
to MySQL and then ingest into Solr from there. It would not be hard,
however, to write a Heritrix writer that writes directly to Solr.
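
To give a rough idea of the ingest step, here is a minimal Python sketch
(not our actual code) that turns crawled pages into a Solr XML update
message and posts it to the update handler. The field names (id, url,
title, content) and the update URL are assumptions and have to match your
own Solr schema and install; reading the rows out of MySQL is left out, so
the pages are just plain dicts here.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_solr_add_xml(pages):
    """Build a Solr <add> update message from an iterable of dicts.

    Each dict maps a field name to a value. The field names are
    hypothetical and must exist in the target Solr schema.
    """
    add = ET.Element("add")
    for page in pages:
        doc = ET.SubElement(add, "doc")
        for name, value in page.items():
            if value is None:
                continue  # skip empty columns rather than index "None"
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, <, > on output
    return ET.tostring(add, encoding="utf-8")


def post_to_solr(xml_bytes, update_url):
    """POST an XML update message to Solr's update handler.

    update_url is an assumption, e.g. "http://localhost:8983/solr/update".
    """
    request = urllib.request.Request(
        update_url,
        data=xml_bytes,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

You would call build_solr_add_xml() on a batch of rows, pass the result to
post_to_solr(), and then post b"<commit/>" the same way so that the new
documents become searchable.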

-Sean

Baalman, Laura A. (ARC-TI)[QSS GROUP INC] wrote:
> We are using Heritrix, the Internet Archive’s open source crawler, which is 
> very easy to extend. We have augmented it with a custom parser to crawl some 
> specific data formats and coded our own processors (Heritrix’s terminology 
> for extensions) to link together different data sources as well as to output 
> xmls in the right format to feed to solr. We have not yet created an 
> automated path to feed the xmls into solr but we plan to.
>
> ~LB
>
>
>
> On 3/5/09 3:32 PM, "Tony Wang" <ivyt...@gmail.com> wrote:
>
> Hi,
>
> I wonder if there's any open source crawler product that could be integrated
> with Solr. What crawler do you guys use? or you coded one by yourself? I
> have been trying to find out solutions for Nutch/Solr integration, but
> haven't got any luck yet.
>
> Could someone shed me some light?
>
> thanks!
>
> Tony
>
> --
> Are you RCholic? www.RCholic.com
> 温 良 恭 俭 让 仁 义 礼 智 信
>
>   
