Thanks Jack!
I will give it a try, even though I finally have a Nutch configuration that does exactly what I want, except for keeping an eye on updated and deleted documents.
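What I am still missing is the maintenance side: when a crawl reveals that a page has been removed, a delete has to be sent to Solr. A minimal SolrJ sketch of that step, assuming our uniqueKey field holds the page URL (the URLs here are placeholders):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class DeleteRemovedPage {
  public static void main(String[] args) throws Exception {
    // Core URL is a placeholder; we index the page URL as the uniqueKey.
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    solr.deleteById("http://www.example.com/removed-page.html");
    solr.commit();
  }
}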
Erlend

On 19.01.11 16.52, Jack Krupansky wrote:
Take a look at Apache ManifoldCF (incubating, close to a 0.1 release): http://incubator.apache.org/connectors/

In addition to a fairly sophisticated general web crawler which maintains the state of crawled web pages, it has a file system crawler and crawlers for a variety of document repositories. It has an output connector that sends documents and delete requests to Solr Cell.

-- Jack Krupansky

-----Original Message-----
From: Erlend Garåsen
Sent: Wednesday, January 19, 2011 4:29 AM
To: solr-user@lucene.apache.org
Subject: How to keep a maintained index with crawled data

We need a crawler for all web pages outside our CMS, but one crucial feature seems to be missing in many of them: a way to detect changes in these documents. Say that you have run a daily crawler job for two months, looking for new web pages to crawl in order to keep the Solr index updated. But suddenly a lot of pages were either changed or deleted, and now you have an outdated Solr index. In other words, we need to detect removed web pages and trigger a delete command to Solr. We also need to detect web pages which have been modified in order to update the Solr index.

To me it seems that the Aperture web crawler is the only one with such features. The crawler handler has methods for modified and removed documents (see the sketch below):
http://sourceforge.net/apps/trac/aperture/wiki/Crawlers

Or is it possible to do similar things with other crawlers such as Nutch?

Many thanks in advance for all kinds of suggestions!

Erlend
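P.S. A rough sketch of how those two handler callbacks could be wired to Solr. The callback names follow the Aperture wiki page above, but the exact signatures and package paths are my assumptions, not verified against a release:

import org.apache.solr.client.solrj.SolrServer;
import org.semanticdesktop.aperture.accessor.DataObject;
import org.semanticdesktop.aperture.crawler.Crawler;

// Sketch only: callback names per the Aperture wiki; signatures and
// package paths are assumptions.
public class SolrMaintenanceHandler {
  private final SolrServer solr;

  public SolrMaintenanceHandler(SolrServer solr) {
    this.solr = solr;
  }

  // A previously crawled page has changed: re-add it so Solr
  // overwrites the old document (same uniqueKey).
  public void objectChanged(Crawler crawler, DataObject object) throws Exception {
    // extract the updated content from 'object' and solr.add(...) it here
  }

  // A previously crawled page is gone: delete it, assuming the
  // uniqueKey field holds the page URL.
  public void objectRemoved(Crawler crawler, String url) throws Exception {
    solr.deleteById(url);
    solr.commit();
  }
}

Committing after every single delete is only for readability; in practice the deletes would be batched and committed once when the crawl finishes.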
--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050