Sean - I found Heritrix is pretty easy to set up. I am testing it on my server here http://66.197.161.133:8081, and am trying to create crawl jobs. As for the 'Heritrix writer', can the crawl results be written to XML, or do you think inserting them into MySQL would be better? And where can I find documentation on creating a Heritrix writer? I really want to make it work with Solr.
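To make the XML option concrete, here is a rough sketch of the kind of output I have in mind. It is not Heritrix code: it only assumes Solr's XML update message format (an <add> element wrapping <doc> elements with named <field> children), and the function name to_solr_add_xml and the field names id, url and content are hypothetical and would have to match the actual Solr schema.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(pages):
    """Build a Solr XML update message (<add><doc>...</doc></add>).

    `pages` is a list of dicts mapping field names to string values.
    The field names are illustrative assumptions; they must match the
    fields defined in the target Solr schema.
    """
    add = ET.Element("add")
    for page in pages:
        doc = ET.SubElement(add, "doc")
        for name, value in page.items():
            field = ET.SubElement(doc, "field", name=name)
            # ElementTree escapes &, < and > when the tree is serialized.
            field.text = value
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical crawl result for one page.
    pages = [
        {"id": "1", "url": "http://example.com/", "content": "Hello & welcome"},
    ]
    print(to_solr_add_xml(pages))
```

The resulting string could then be posted to Solr's update handler, or written to a file and fed in later, which is roughly the XML route as opposed to going through MySQL first.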
Thanks!

Tony

On Fri, Mar 6, 2009 at 8:08 AM, Sean Timm <tim...@aol.com> wrote:

> We too use Heritrix. We tried Nutch first but Nutch was not finding all
> of the documents that it was supposed to. When Nutch and Heritrix were
> both set to crawl our own site to a depth of three, Nutch missed some
> pages that were linked directly from the seed. We ended up with 10%-20%
> fewer pages in the Nutch crawl.
>
> It is pretty easy to add custom writers to Heritrix. We write our crawls
> to MySQL and then ingest into Solr from there. It would not be hard to
> write a Heritrix writer that writes directly to Solr however.
>
> -Sean
>
> Baalman, Laura A. (ARC-TI)[QSS GROUP INC] wrote:
> > We are using Heritrix, the Internet Archive’s open source crawler, which
> > is very easy to extend. We have augmented it with a custom parser to crawl
> > some specific data formats and coded our own processors (Heritrix’s
> > terminology for extensions) to link together different data sources as well
> > as to output xmls in the right format to feed to solr. We have not yet
> > created an automated path to feed the xmls into solr but we plan to.
> >
> > ~LB
> >
> > On 3/5/09 3:32 PM, "Tony Wang" <ivyt...@gmail.com> wrote:
> >
> > Hi,
> >
> > I wonder if there's any open source crawler product that could be integrated
> > with Solr. What crawler do you guys use? or you coded one by yourself? I
> > have been trying to find out solutions for Nutch/Solr integration, but
> > haven't got any luck yet.
> >
> > Could someone shed me some light?
> >
> > thanks!
> >
> > Tony
> >
> > --
> > Are you RCholic? www.RCholic.com
> > 温 良 恭 俭 让 仁 义 礼 智 信

--
Are you RCholic? www.RCholic.com
温 良 恭 俭 让 仁 义 礼 智 信