I have a similar requirement: I need to move to a good crawler. 
Currently I am using my own custom crawler to crawl some public domains, 
and I use Lucene to index all the downloaded pages. After a lot of research 
I came across JSpider with Lucene.
  I also looked at Nutch for the crawling job, but I don't think that is 
feasible.
   
  - BR

"A. Banji Oyebisi" <[EMAIL PROTECTED]> wrote:
  I am interested in this too. Any ideas?
    
A. Banji Oyebisi  Choicegen, LLC.  Email: [EMAIL PROTECTED]  Web URL: 
http://www.choicegen.com  Choicegen... Helping you make better choices!





George Everitt wrote:
I'm looking for a web crawler to use with Solr.  The objective is to crawl 
about a dozen public web sites regarding a specific topic. 

After a lot of googling, I came across Heritrix, which seems to be the most 
robust, well-supported open source crawler out there.  Heritrix has an 
integration with Nutch (NutchWax), but not with Solr.  I'm wondering if 
anybody can share any experience using Heritrix with Solr. 

It seems that there are three options for integration: 

1. Write a custom Heritrix "Writer" class which submits documents to Solr for 
indexing. 
2. Write an ARC-to-Solr input XML format converter to import the ARC files. 
3. Use the filesystem mirror writer and then another program to walk the 
downloaded files. 
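
As a rough illustration of option 3, here is a minimal Python sketch that walks a directory of mirrored files and builds a Solr XML update message (`<add><doc>...</doc></add>`). The field names `id` and `text`, the update URL, and both function names are assumptions for illustration only; they would have to match the actual Solr schema and deployment, and real pages would need HTML stripping and removal of characters that are not valid in XML before indexing.

```python
import os
import urllib.request
import xml.etree.ElementTree as ET


def build_solr_add_xml(mirror_root):
    """Walk mirror_root and return a Solr <add> update message as bytes."""
    add = ET.Element("add")
    for dirpath, _dirnames, filenames in os.walk(mirror_root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Undecodable bytes are replaced so one bad page does not abort the walk.
            with open(path, encoding="utf-8", errors="replace") as fh:
                text = fh.read()
            doc = ET.SubElement(add, "doc")
            # "id" and "text" are assumed field names, not taken from any real schema.
            ET.SubElement(doc, "field", name="id").text = os.path.relpath(path, mirror_root)
            ET.SubElement(doc, "field", name="text").text = text
    return ET.tostring(add, encoding="utf-8")


def post_to_solr(xml_bytes, update_url="http://localhost:8983/solr/update"):
    """POST an XML update message to Solr; update_url is an assumed location."""
    request = urllib.request.Request(
        update_url,
        data=xml_bytes,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

A run would call `post_to_solr(build_solr_add_xml("/path/to/mirror"))` and then `post_to_solr(b"<commit/>")` to make the documents searchable. Building one message for the whole mirror keeps the sketch short; a dozen sites would more likely be sent in batches.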

Has anybody looked into this, or does anyone have suggestions for an alternative 
approach?  The optimal answer would be "You dummy, just use XXX to crawl your 
web sites - there's no 'integration' required at all.   Can you believe the 
temerity?   What a poltroon." 

Yours in Revolution, 
George 













       