I'm looking for a web crawler to use with Solr. The objective is to
crawl about a dozen public web sites regarding a specific topic.
After a lot of googling, I came across Heritrix, which seems to be the
most robust and best-supported open source crawler out there. Heritrix
has an integration with Nutch (NutchWax), but not with Solr. I'm
wondering if anybody can share any experience using Heritrix with Solr.
It seems that there are three options for integration:
1. Write a custom Heritrix "Writer" class which submits documents to
Solr for indexing.
2. Write an ARC-to-Solr input XML format converter to import the ARC
files.
3. Use the filesystem mirror writer and then another program to walk
the downloaded files.
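
To make option 3 concrete, here is a rough sketch of the "another
program" part, assuming the mirror writer has left one file per fetched
page under a directory. It walks that directory and builds a Solr
`<add>` XML message with one `<doc>` per file. The function name
`build_solr_add_xml` and the field names `id` and `text` are my own
placeholders and would have to match whatever the Solr schema actually
defines. It does not strip HTML or send anything to Solr; the resulting
XML would still need to be POSTed to Solr's update handler.

```python
import os
import re
import xml.etree.ElementTree as ET

# Characters that are not allowed in XML 1.0 and would make the message invalid.
_INVALID_XML_CHARS = re.compile("[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def build_solr_add_xml(mirror_root, id_field="id", text_field="text"):
    """Walk mirror_root and return a Solr <add> message, one <doc> per file.

    id_field and text_field are hypothetical field names; they must match
    the fields declared in the target Solr schema.
    """
    add = ET.Element("add")
    for dirpath, dirnames, filenames in os.walk(mirror_root):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Decode leniently: crawled pages are not guaranteed to be UTF-8.
            with open(path, "r", encoding="utf-8", errors="replace") as fh:
                body = _INVALID_XML_CHARS.sub("", fh.read())
            # Use the path relative to the mirror root as the document id.
            rel = os.path.relpath(path, mirror_root).replace(os.sep, "/")
            doc = ET.SubElement(add, "doc")
            ET.SubElement(doc, "field", name=id_field).text = rel
            ET.SubElement(doc, "field", name=text_field).text = body
    return ET.tostring(add, encoding="unicode")
```

Calling `build_solr_add_xml("/path/to/mirror")` returns the message as a
string. For a dozen sites it would probably be better to send the
documents in batches rather than as one large message.
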
Has anybody looked into this, or does anybody have suggestions for an
alternative approach? The optimal answer would be "You dummy, just use XXX to
crawl your web sites - there's no 'integration' required at all. Can
you believe the temerity? What a poltroon."
Yours in Revolution,
George