Thanks Jack!
I will give it a try, even though I finally have a Nutch configuration
that does exactly what I want it to do (except keeping an eye on updated
and deleted documents).
Erlend
On 19.01.11 16.52, Jack Krupansky wrote:
Take a look at Apache ManifoldCF (incubating, close to 0.1 release):
http://incubator.apache.org/connectors/
In addition to a fairly sophisticated general web crawler, which maintains
the state of crawled web pages, it has a file system crawler and crawlers for
a variety of document repositories