Please visit the Nutch project. It is a powerful crawler and can integrate with Solr.
http://nutch.apache.org/

> Hi Solr users,
>
> I hope you can help. We are migrating our intranet web site management
> system to Windows 2008 and need a replacement for Index Server to do the
> text searching. I am trying to establish if Lucene and Solr is a feasible
> replacement, but I cannot find the answers to these questions:
>
> 1. Can Solr be set up to recursively index a folder containing an
> indeterminate and variable large number of subfolders, containing files of
> all types: XML, HTML, PDF, DOC, spreadsheets, powerpoint presentations,
> text files etc. If so, how?
> 2. Can Solr be queried over the web and return a list of files that match a
> search query entered by a user, and also return the abstracts for these
> files, as well as 'hit highlighting'. If so, how?
> 3. Can Solr be run as a service (like Index Server) that automatically
> detects changes to the files within the indexed folder and updates the
> index? If so, how?
>
> Thanks for your help
>
> Cathy Hemsley
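
To make questions 1 and 2 a little more concrete, here is a rough Python sketch. It is not how Nutch works internally and it is not a complete indexer. It only shows the two building blocks: walking a folder tree to find files worth indexing, and building a Solr query URL that asks for highlighted snippets. The extension list, the `content` field name, and the base URL are assumptions for illustration; the real values depend on your schema and Solr setup.

```python
import os
from urllib.parse import urlencode

# Hypothetical extension list; adjust to the file types in your folder.
INDEXABLE = {".xml", ".html", ".pdf", ".doc", ".xls", ".ppt", ".txt"}


def iter_documents(root):
    """Yield paths of indexable files under root, descending into every subfolder."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() in INDEXABLE:
                yield os.path.join(dirpath, name)


def build_search_url(base_url, user_query, field="content", rows=10):
    """Build a Solr select URL that requests highlighted snippets.

    'field' is an assumed name for the field holding the extracted text.
    """
    params = {
        "q": user_query,   # the user's search terms
        "hl": "true",      # turn on hit highlighting
        "hl.fl": field,    # field(s) to generate snippets from
        "rows": rows,      # number of results to return
        "wt": "json",      # response format
    }
    return base_url.rstrip("/") + "/select?" + urlencode(params)
```

Each path returned by `iter_documents` would still have to be sent to Solr for text extraction and indexing, for example through Solr's extracting request handler if it is enabled in your installation. For question 3, one simple option (again an assumption, not a built-in service) is to re-run the walk on a schedule and re-send files whose modification time has changed.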