I'm far from an expert on this sort of thing, but my personal experience a year ago was that Nutch-1 was easier to use, and the blog post I link below suggests that the abstraction layer in Nutch-2 costs real time. I expect that Nutch-2 has matured some since then, but going with Nutch-1 is not a bad choice.
http://digitalpebble.blogspot.com/2013/09/nutch-fight-17-vs-221.html

There are other dogs in this fight, as shown by the SolrEcosystem wiki page:
https://wiki.apache.org/solr/SolrEcosystem

- Apache ManifoldCF has a crawler for web pages and a GUI to configure and start things that must be done by hand for Nutch (unless there is a front-end I don't know about). Web crawling is not the primary reason ManifoldCF exists.
- Heritrix is a good crawler, dedicated to handling broad and incremental crawling well.
- Norconex Collectors is sort of a toolkit for building such crawlers.
- Aspire (by Search Technologies) seems a bit complex, but has a web crawler. Again, it's more of a toolkit for building such crawlers.

I sure wish I knew which one to go with ;)

Dan Davis, Systems/Applications Architect (Contractor),
Office of Computer and Communications Systems,
National Library of Medicine, NIH

-----Original Message-----
From: Emir Arnautovic [mailto:emir.arnauto...@sematext.com]
Sent: Tuesday, February 16, 2016 10:58 AM
To: solr-user@lucene.apache.org
Subject: Re: Which open-source crawler to use with SolrJ and Postgresql ?

Markus,
The ticket I ran into is for Nutch2, and NUTCH-2197 is for Nutch1. I haven't been using Nutch for a while, so I cannot recommend a version.

Thanks,
Emir

On 16.02.2016 16:37, Markus Jelsma wrote:
> Nutch has Solr 5 cloud support in trunk, i committed it earlier this month.
> https://issues.apache.org/jira/browse/NUTCH-2197
>
> Markus
>
> -----Original message-----
>> From:Emir Arnautovic <emir.arnauto...@sematext.com>
>> Sent: Tuesday 16th February 2016 16:26
>> To: solr-user@lucene.apache.org
>> Subject: Re: Which open-source crawler to use with SolrJ and Postgresql ?
>>
>> Hi,
>> It is most common to use Nutch as crawler, but it seems that it still
>> does not have support for SolrCloud (if I am reading this ticket
>> correctly https://issues.apache.org/jira/browse/NUTCH-1662). Anyway,
>> I would recommend Nutch with standard http client.
>>
>> Regards,
>> Emir
>>
>> On 16.02.2016 16:02, Victor D'agostino wrote:
>>> Hi
>>>
>>> I am building a Solr 5 architecture with 3 Solr nodes and 1 zookeeper.
>>> The database backend is postgresql 9 on RHEL 6.
>>>
>>> I am looking for a free open-source crawler which use SolrJ.
>>>
>>> What do you guys recommend ?
>>>
>>> Best regards
>>> Victor d'Agostino
>>
>> --
>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>> Solr & Elasticsearch Support * http://sematext.com/