You might also want to look at the Heritrix crawler:

        http://crawler.archive.org/

I have written three crawlers in the past, all for RSS feeds, and it is not easy.
Happy to provide tips and help if you want to go down that route.

François

On Apr 8, 2011, at 1:53 AM, Andrea Campi wrote:

> On Fri, Apr 8, 2011 at 6:23 AM, Jens Mueller 
> <supidupi...@googlemail.com> wrote:
> 
>> Hello all,
>> 
>> thanks for your generous help.
>> 
>> I think I now know everything: (What I want to do is to build a web
>> crawler and index the documents found). I will start with the setup as
>> suggested by
>> 
>> 
> Writing a web crawler from scratch is... ambitious.
> Have you looked at Nutch (http://nutch.apache.org/)?  It uses Solr for
> indexing, it may help you get a head start.
> If you've never used Hadoop before it may take some getting used to, but I
> have helped a customer implement it and helped a couple of their devs
> (medium-seniority) get up to speed, and it didn't take them too long to get
> used to it.
> 
> Andrea
