Hi everyone! I've been following this thread and I realized we've built something similar to Crawl Anywhere. The main difference is that our project is oriented toward digital libraries and digital repositories — specifically, collecting metadata from multiple sources, enriching it, and storing it in multiple destinations. For now, I can only share an article about the project, because the code is still on our development machines and testing servers. If everything goes well, we plan to make it open source in the near future. I'd be glad to hear your comments and opinions about it. There is no need to be polite. Thanks in advance.
Best regards,
Nestor

On Wed, Mar 2, 2011 at 11:46 AM, Dominique Bejean <dominique.bej...@eolya.fr> wrote:

> Hi,
>
> No, it doesn't. It looks like an Apache HttpClient 3.x limitation:
> https://issues.apache.org/jira/browse/HTTPCLIENT-579
>
> Dominique
>
> Le 02/03/11 15:04, Thumuluri, Sai a écrit :
>
>> Dominique, does your crawler support NTLM2 authentication? We have content
>> under SiteMinder which uses NTLM2, and that is posing challenges with Nutch.
>>
>> -----Original Message-----
>> From: Dominique Bejean [mailto:dominique.bej...@eolya.fr]
>> Sent: Wednesday, March 02, 2011 6:22 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: [ANNOUNCE] Web Crawler
>>
>> Aditya,
>>
>> The crawler is not open source and won't be in the near future. Anyway,
>> I have to change the license, because it can be used for any personal or
>> commercial project.
>>
>> Sincerely,
>>
>> Dominique
>>
>> Le 02/03/11 10:02, findbestopensource a écrit :
>>
>>> Hello Dominique Bejean,
>>>
>>> Good job.
>>>
>>> We identified almost 8 open source web crawlers:
>>> http://www.findbestopensource.com/tagged/webcrawler
>>> I don't know how far yours would differ from the rest.
>>>
>>> Your license states that it is not open source, but it is free for
>>> personal use.
>>>
>>> Regards,
>>> Aditya
>>> www.findbestopensource.com
>>>
>>>
>>> On Wed, Mar 2, 2011 at 5:55 AM, Dominique Bejean
>>> <dominique.bej...@eolya.fr> wrote:
>>>
>>> Hi,
>>>
>>> I would like to announce Crawl Anywhere. Crawl-Anywhere is a Java
>>> web crawler. It includes:
>>>
>>> * a crawler
>>> * a document processing pipeline
>>> * a Solr indexer
>>>
>>> The crawler has a web administration interface for managing the web
>>> sites to be crawled. Each web site crawl is configured with many
>>> possible parameters (not all mandatory):
>>>
>>> * number of simultaneous items crawled per site
>>> * recrawl period rules based on item type (HTML, PDF, ...)
>>> * item type inclusion / exclusion rules
>>> * item path inclusion / exclusion / strategy rules
>>> * max depth
>>> * web site authentication
>>> * language
>>> * country
>>> * tags
>>> * collections
>>> * ...
>>>
>>> The pipeline includes various ready-to-use stages (text
>>> extraction, language detection, Solr ready-to-index XML writer, ...).
>>>
>>> Everything is very configurable and extensible, either by scripting
>>> or Java coding.
>>>
>>> With scripting, you can help the crawler handle JavaScript links,
>>> or help the pipeline extract relevant titles and clean up the HTML
>>> pages (remove menus, headers, footers, ...).
>>>
>>> With Java coding, you can develop your own pipeline stages.
>>>
>>> The Crawl Anywhere web site provides good explanations and
>>> screenshots. Everything is documented in a wiki.
>>>
>>> The current version is 1.1.4. You can download and try it out here:
>>> www.crawl-anywhere.com
>>>
>>>
>>> Regards,
>>>
>>> Dominique
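
Since the Crawl Anywhere code is not public, its actual stage API isn't documented here. Purely as an illustration of the "develop your own pipeline stage in Java" idea mentioned above, here is a hypothetical sketch — the `PipelineStage` and `Document` names are invented for this example, not taken from the real product:

```java
// Hypothetical sketch of a document-processing pipeline stage.
// The interface and class names below are invented for illustration;
// Crawl Anywhere's real API is not public.

import java.util.HashMap;
import java.util.Map;

// A crawled item: a URL plus extracted fields (title, body, language, ...).
class Document {
    final String url;
    final Map<String, String> fields = new HashMap<>();
    Document(String url) { this.url = url; }
}

// A pipeline stage transforms or enriches a document before indexing.
interface PipelineStage {
    void process(Document doc);
}

// Example stage: derive a "title" field from the raw HTML <title> element.
class TitleExtractionStage implements PipelineStage {
    @Override
    public void process(Document doc) {
        String html = doc.fields.getOrDefault("rawHtml", "");
        int start = html.indexOf("<title>");
        int end = html.indexOf("</title>");
        if (start >= 0 && end > start) {
            doc.fields.put("title", html.substring(start + 7, end).trim());
        }
    }
}

public class PipelineDemo {
    public static void main(String[] args) {
        Document doc = new Document("http://example.com/");
        doc.fields.put("rawHtml",
            "<html><head><title> Example Page </title></head></html>");
        new TitleExtractionStage().process(doc);
        System.out.println(doc.fields.get("title")); // prints "Example Page"
    }
}
```

A real stage would presumably be registered in the pipeline configuration and run between text extraction and the Solr XML writer; the point of the sketch is only that each stage sees one document at a time and mutates or adds fields.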