See http://crawler.archive.org/faq.html#new_writer

Other Heritrix questions should probably go to the Heritrix list.
-Sean

Tony Wang wrote:
> Sean -
>
> I found Heritrix pretty easy to set up. I am testing it on my server here,
> http://66.197.161.133:8081, and am trying to create crawl jobs. As for the
> 'Heritrix writer', could you write the crawl results to XML, or do you
> think inserting into MySQL would be better? And where can I find
> documentation for creating a Heritrix writer? I really want to make it
> work for Solr.
>
> Thanks!
> Tony
>
> On Fri, Mar 6, 2009 at 8:08 AM, Sean Timm <tim...@aol.com> wrote:
>
>> We too use Heritrix. We tried Nutch first but Nutch was not finding all
>> of the documents that it was supposed to. When Nutch and Heritrix were
>> both set to crawl our own site to a depth of three, Nutch missed some
>> pages that were linked directly from the seed. We ended up with 10%-20%
>> fewer pages in the Nutch crawl.
>>
>> It is pretty easy to add custom writers to Heritrix. We write our crawls
>> to MySQL and then ingest into Solr from there. It would not be hard to
>> write a Heritrix writer that writes directly to Solr, however.
>>
>> -Sean
>>
>> Baalman, Laura A. (ARC-TI)[QSS GROUP INC] wrote:
>>
>>> We are using Heritrix, the Internet Archive’s open source crawler,
>>> which is very easy to extend. We have augmented it with a custom parser
>>> to crawl some specific data formats and coded our own processors
>>> (Heritrix’s terminology for extensions) to link together different data
>>> sources as well as to output XML in the right format to feed to Solr.
>>> We have not yet created an automated path to feed the XML into Solr,
>>> but we plan to.
>>>
>>> ~LB
>>>
>>> On 3/5/09 3:32 PM, "Tony Wang" <ivyt...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I wonder if there's any open source crawler product that could be
>>> integrated with Solr. What crawler do you guys use? Or did you code one
>>> yourselves? I have been trying to find solutions for Nutch/Solr
>>> integration, but haven't had any luck yet.
>>>
>>> Could someone shed some light?
>>>
>>> Thanks!
>>>
>>> Tony
>>>
>>> --
>>> Are you RCholic? www.RCholic.com
>>> 温 良 恭 俭 让 仁 义 礼 智 信
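
For anyone reading this thread later: the "output XML in the right format to feed to Solr" step mentioned above can be sketched roughly as follows. This is only an illustration, not anyone's actual code from the thread. It assumes each crawled page has already been reduced to a dict of field values (for example, rows read back from the MySQL table), and the field names (id, title, content) are hypothetical; they would have to match whatever your Solr schema defines. It only builds the <add> update message; sending it to Solr's update handler is left out.

```python
from xml.etree import ElementTree as ET


def to_solr_add_xml(pages):
    """Build a Solr <add> update message from crawled page records.

    `pages` is an iterable of dicts mapping field name -> value.
    The field names are hypothetical and must match the Solr schema.
    """
    add = ET.Element("add")
    for page in pages:
        doc = ET.SubElement(add, "doc")
        for name, value in page.items():
            field = ET.SubElement(doc, "field", {"name": name})
            # ElementTree escapes &, < and > when serializing the text.
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    # Made-up sample record, just to show the shape of the output.
    sample = [{
        "id": "http://example.com/",
        "title": "Example & Co",
        "content": "Hello <world>",
    }]
    print(to_solr_add_xml(sample))
```

Building the message with an XML library rather than string concatenation matters here because crawled text routinely contains characters such as & and <, which would otherwise produce an update message Solr rejects.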