I'm far, far from an expert on this sort of thing, but my personal experience 
a year ago was that Nutch-1 was easier to use, and the blog post I link below 
suggests that the abstraction layer in Nutch-2 carries a real performance 
cost. I expect that Nutch-2 has matured some since then, but going with 
Nutch-1 is not a bad choice.

http://digitalpebble.blogspot.com/2013/09/nutch-fight-17-vs-221.html

There are other dogs in this fight, as shown by the SolrEcosystem wiki page:

https://wiki.apache.org/solr/SolrEcosystem

- Apache ManifoldCF has a crawler for web pages and a GUI to configure and 
start things that must be done by hand for Nutch (unless there is a front-end I 
don't know about). Web crawling is not the primary reason ManifoldCF exists, 
though.
- Heritrix is a good crawler, dedicated to handling broad and incremental 
crawling well.
- Norconex Collectors is sort of a toolkit for building such crawlers.
- Aspire (by Search Technologies) seems a bit complex, but has a web crawler. 
Again, it's more of a toolkit for building such crawlers.

I sure wish I knew which one to go with ;)

Dan Davis, Systems/Applications Architect (Contractor),
Office of Computer and Communications Systems,
National Library of Medicine, NIH



-----Original Message-----
From: Emir Arnautovic [mailto:emir.arnauto...@sematext.com] 
Sent: Tuesday, February 16, 2016 10:58 AM
To: solr-user@lucene.apache.org
Subject: Re: Which open-source crawler to use with SolrJ and Postgresql ?

Markus,
The ticket I ran into is for Nutch 2, and NUTCH-2197 is for Nutch 1.

I haven't used Nutch for a while, so I cannot recommend a version.

Thanks,
Emir

On 16.02.2016 16:37, Markus Jelsma wrote:
> Nutch has Solr 5 cloud support in trunk; I committed it earlier this month.
> https://issues.apache.org/jira/browse/NUTCH-2197
>
> Markus
>   
> -----Original message-----
>> From:Emir Arnautovic <emir.arnauto...@sematext.com>
>> Sent: Tuesday 16th February 2016 16:26
>> To: solr-user@lucene.apache.org
>> Subject: Re: Which open-source crawler to use with SolrJ and Postgresql ?
>>
>> Hi,
>> It is most common to use Nutch as the crawler, but it seems that it still 
>> does not have support for SolrCloud (if I am reading this ticket 
>> correctly: https://issues.apache.org/jira/browse/NUTCH-1662). Anyway, 
>> I would recommend Nutch with the standard HTTP client.
>>
>> Regards,
>> Emir
>>
>> On 16.02.2016 16:02, Victor D'agostino wrote:
>>> Hi
>>>
>>> I am building a Solr 5 architecture with 3 Solr nodes and 1 zookeeper.
>>> The database backend is postgresql 9 on RHEL 6.
>>>
>>> I am looking for a free open-source crawler which uses SolrJ.
>>>
>>> What do you guys recommend ?
>>>
>>> Best regards
>>> Victor d'Agostino
>>>
>>>
>> --
>> Monitoring * Alerting * Anomaly Detection * Centralized Log 
>> Management Solr & Elasticsearch Support * http://sematext.com/
>>
>>

