Heritrix and Solr

George Everitt Thu, 22 Nov 2007 07:42:14 -0800

I'm looking for a web crawler to use with Solr. The objective is to crawl about a dozen public web sites regarding a specific topic.

After a lot of googling, I came across Heritrix, which seems to be the most robust, well-supported open source crawler out there. Heritrix has an integration with Nutch (NutchWax), but not with Solr. I'm wondering if anybody can share any experience using Heritrix with Solr.


It seems that there are three options for integration:

1. Write a custom Heritrix "Writer" class which submits documents to Solr for indexing.
2. Write an ARC to Solr input XML format converter to import the ARC files.
3. Use the filesystem mirror writer and then another program to walk the downloaded files.
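
To make option 3 concrete, here is a rough sketch in Python of the "walk the downloaded files" step. It is not taken from Heritrix or Solr; the function name `build_solr_add_xml`, the field names `id` and `text`, and the list of file extensions are all assumptions for illustration, and the field names would have to match whatever the Solr schema defines. It only builds a Solr `<add>` update message and does no HTML tag stripping.

```python
import os
import xml.etree.ElementTree as ET


def build_solr_add_xml(mirror_root, extensions=(".html", ".htm")):
    """Walk a mirrored crawl directory and return a Solr <add> XML string.

    Hypothetical helper: one <doc> is emitted per matching file.
    """
    add = ET.Element("add")
    for dirpath, _dirnames, filenames in os.walk(mirror_root):
        for name in sorted(filenames):
            if not name.lower().endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            # Replace undecodable bytes rather than failing on odd encodings.
            with open(path, encoding="utf-8", errors="replace") as fh:
                content = fh.read()
            doc = ET.SubElement(add, "doc")
            # "id" and "text" are assumed field names; adjust to the schema.
            rel = os.path.relpath(path, mirror_root).replace(os.sep, "/")
            ET.SubElement(doc, "field", name="id").text = rel
            # Raw page source is stored as-is; ElementTree escapes the markup.
            ET.SubElement(doc, "field", name="text").text = content
    return ET.tostring(add, encoding="unicode")
```

The returned string could then be POSTed to Solr's XML update handler and followed by a `<commit/>` message; the exact URL depends on the installation. Option 2 would look much the same, with the directory walk replaced by a reader for ARC records.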

Has anybody looked into this, or does anyone have suggestions for an alternative approach? The optimal answer would be "You dummy, just use XXX to crawl your web sites - there's no 'integration' required at all. Can you believe the temerity? What a poltroon."


Yours in Revolution,
George
