Nutch is an excellent option.  It should feel very comfortable for people 
migrating away from the Google appliances.

Apache Droids is another possible approach, and I've seen people use 
Heritrix or Apache ManifoldCF for various use cases (usually in combination 
with other requirements where the extra overhead was worth the trouble).

I think the simplest approach will be Nutch…it's absolutely worth taking a 
shot at it.
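
If it helps, a minimal Nutch 1.x crawl-and-index run looks roughly like the 
sketch below. The seed URL, depth, topN, and Solr URL are all placeholders 
for your own setup, and this assumes Nutch and Solr are already installed 
and running:

```shell
# Sketch of a Nutch 1.x crawl that feeds Solr. Assumes a working
# Nutch install (run from its home directory) and a Solr instance
# at the URL below -- adjust both for your environment.

# 1. Put your site roots in a seed list (one URL per line)
mkdir -p urls
echo "http://www.richmond.edu/" > urls/seed.txt

# 2. Crawl to depth 3 and index into Solr in one shot, using the
#    all-in-one "crawl" command shipped with Nutch 1.x
bin/nutch crawl urls -dir crawl -solr http://localhost:8983/solr/ \
    -depth 3 -topN 50000
```

You'd still want to tune regex-urlfilter.txt so the crawl stays inside your 
~130 sites, but that's the general shape of it.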

DO NOT write a crawler!  That is a rabbit hole you do not want to go 
down :)
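
On the key matches question: Solr's Query Elevation Component covers that. 
You pin documents to queries in an elevate.xml file; the query text and doc 
IDs below are made-up examples, not anything from your setup:

```xml
<!-- elevate.xml: pin specific documents to the top for given queries.
     Doc ids must match your uniqueKey field (here, the page URL). -->
<elevate>
  <query text="admissions">
    <doc id="http://admissions.richmond.edu/" />
  </query>
</elevate>
```

Roughly the same idea as GSA KeyMatches, so the migration there should be 
pretty painless.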



On Oct 30, 2013, at 10:54 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hi Eric,
> 
> We have also helped a government institution replace their expensive 
> GSA with open source software. In our case we used Apache Nutch 1.7 to crawl 
> the websites and index to Apache Solr. It is very effective, robust, and 
> scales easily with Hadoop if you need to. Nutch may not be the easiest tool 
> for the job, but it is very stable, feature rich, and has an active community 
> here at Apache.
> 
> Cheers,
> 
> -----Original message-----
>> From:Palmer, Eric <epal...@richmond.edu>
>> Sent: Wednesday 30th October 2013 18:48
>> To: solr-user@lucene.apache.org
>> Subject: Replacing Google Mini Search Appliance with Solr?
>> 
>> Hello all,
>> 
>> Been lurking on the list for awhile.
>> 
>> The two Google Mini search appliances we use to index our public web sites 
>> are at end of life. Google is no longer selling the Mini appliances, and 
>> buying the big appliance is not cost beneficial.
>> 
>> http://search.richmond.edu/
>> 
>> We would run a Solr replacement on Linux (CentOS, Red Hat, or similar) with 
>> OpenJDK or Oracle Java.
>> 
>> Background
>> ==========
>> ~130 sites
>> only ~12,000 pages (at a depth of 3)
>> probably ~40,000 pages if we go to a depth of 4
>> 
>> We use key matches a lot. In Solr terms these are elevated documents 
>> (elevations).
>> 
>> We would code a search query form in PHP and wrap it in our design 
>> (http://www.richmond.edu)
>> 
>> I have played with and love lucidworks and know that their $ solution works 
>> for our use cases but the cost model is not attractive for such a small 
>> collection.
>> 
>> So with Solr, what are my open source options, and what are people's 
>> experiences crawling and indexing web sites with Solr plus a crawler? I 
>> understand Solr does not ship with a crawler, so getting one working would 
>> have to be the first step.
>> 
>> We can code in Java, PHP, Python etc. if we have to, but we don't want to 
>> write a crawler if we can avoid it.
>> 
>> Thanks in advance for any information.
>> 
>> --
>> Eric Palmer
>> Web Services
>> U of Richmond
>> 
>> 
