Markus and Jason, thanks for the info.
I will start to research Nutch. Writing a crawler? Agreed, it is a rabbit hole.

--
Eric Palmer
Web Services
U of Richmond

To report technical issues, obtain technical support or make requests for enhancements please visit http://web.richmond.edu/contact/technical-support.html

On 10/30/13 2:53 PM, "Jason Hellman" <jhell...@innoventsolutions.com> wrote:

>Nutch is an excellent option. It should feel very comfortable for people
>migrating away from the Google appliances.
>
>Apache Droids is another possible approach, and I've found people
>using Heritrix or Manifold for various use cases (usually in
>combination with other use cases where the extra overhead was worth the
>trouble).
>
>I think the simplest approach will be Nutch... it's absolutely worth taking
>a shot at it.
>
>DO NOT write a crawler! That is a rabbit hole you do not want to peer
>down into :)
>
>
>
>On Oct 30, 2013, at 10:54 AM, Markus Jelsma <markus.jel...@openindex.io>
>wrote:
>
>> Hi Eric,
>>
>> We have also helped a government institution replace their expensive
>> GSA with open source software. In our case we use Apache Nutch 1.7 to
>> crawl the websites and index to Apache Solr. It is very effective,
>> robust and scales easily with Hadoop if you have to. Nutch may not be
>> the easiest tool for the job, but it is very stable, feature-rich and
>> has an active community here at Apache.
>>
>> Cheers,
>>
>> -----Original message-----
>>> From: Palmer, Eric <epal...@richmond.edu>
>>> Sent: Wednesday 30th October 2013 18:48
>>> To: solr-user@lucene.apache.org
>>> Subject: Replacing Google Mini Search Appliance with Solr?
>>>
>>> Hello all,
>>>
>>> Been lurking on the list for a while.
>>>
>>> We are looking at replacing two end-of-life Google Mini search
>>> appliances used to index our public web sites. Google is no longer
>>> selling the Mini appliances, and buying the big appliance is not
>>> cost beneficial.
>>>
>>> http://search.richmond.edu/
>>>
>>> We would run a Solr replacement on Linux (CentOS, RedHat, or similar)
>>> with OpenJDK or Oracle Java.
>>>
>>> Background
>>> ==========
>>> ~130 sites
>>> only ~12,000 pages (at a depth of 3)
>>> probably ~40,000 pages if we go to a depth of 4
>>>
>>> We use key matches a lot. In Solr terms these are elevated documents
>>> (elevations).
>>>
>>> We would code a search query form in PHP and wrap it into our design
>>> (http://www.richmond.edu)
>>>
>>> I have played with and love LucidWorks, and I know that their paid
>>> solution works for our use cases, but the cost model is not attractive
>>> for such a small collection.
>>>
>>> So with Solr, what are my open source options, and what are people's
>>> experiences crawling and indexing web sites with Solr plus a crawler?
>>> I understand Solr does not ship with a crawler, so getting one working
>>> would be first up.
>>>
>>> We can code in Java, PHP, Python etc. if we have to, but we don't want
>>> to write a crawler if we can avoid it.
>>>
>>> Thanks in advance for any information.
>>>
>>> --
>>> Eric Palmer
>>> Web Services
>>> U of Richmond
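[On the "key matches" point: Solr's equivalent is the QueryElevationComponent, which reads an elevate.xml file. A minimal sketch follows; the query text and document id are made-up examples, not from the thread.]

```xml
<!-- elevate.xml: pin specific documents to the top of results for a
     given query. Query text and doc id below are hypothetical. -->
<elevate>
  <query text="admissions">
    <doc id="http://www.richmond.edu/admissions/" />
  </query>
</elevate>
```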
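
[For readers finding this thread later: the Nutch-to-Solr workflow Markus describes can be sketched roughly as below. The seed URL, paths, Solr address, and the -topN value are illustrative assumptions, not details from the thread.]

```shell
# Sketch: prepare a seed list for a Nutch crawl that posts pages to Solr.
mkdir -p urls
echo "http://www.richmond.edu/" > urls/seed.txt

# With Nutch 1.7 unpacked in the current directory and Solr running
# locally, the one-shot crawl command would look roughly like this
# (shown as a comment, not executed here):
#   bin/nutch crawl urls -solr http://localhost:8983/solr -depth 3 -topN 50000

cat urls/seed.txt
```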