Markus and Jason

thanks for the info.

I will start researching Nutch. I agree that writing a crawler is a rabbit
hole.


-- 
Eric Palmer

Web Services
U of Richmond

To report technical issues, obtain technical support or make requests for
enhancements please visit
http://web.richmond.edu/contact/technical-support.html





On 10/30/13 2:53 PM, "Jason Hellman" <jhell...@innoventsolutions.com>
wrote:

>Nutch is an excellent option.  It should feel very comfortable for people
>migrating away from the Google appliances.
>
>Apache Droids is another possible way to approach it, and I've found people
>using Heritrix or Manifold for various use cases (and usually in
>combination with other use cases where the extra overhead was worth the
>trouble).
>
>I think the simplest approach will be Nutch; it's absolutely worth taking a
>shot at it.
>
>DO NOT write a crawler!  That is a rabbit hole you do not want to peer
>down into :)
>
>
>
>On Oct 30, 2013, at 10:54 AM, Markus Jelsma <markus.jel...@openindex.io>
>wrote:
>
>> Hi Eric,
>> 
>> We have also helped a government institution replace their
>>expensive GSA with open source software. In our case we used Apache Nutch
>>1.7 to crawl the websites and index to Apache Solr. It is very
>>effective, robust, and scales easily with Hadoop if you have to. Nutch
>>may not be the easiest tool for the job, but it is very stable, feature
>>rich, and has an active community here at Apache.
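
The Nutch 1.x crawl cycle described above boils down to a few commands; a hedged sketch of one crawl round (the paths, seed directory, and Solr URL are assumptions, and Nutch 1.7 also ships a bin/crawl wrapper script that automates this loop):

```shell
# seed URLs live in urls/seed.txt; one inject/generate/fetch round shown
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
SEGMENT=crawl/segments/$(ls -t crawl/segments | head -1)
bin/nutch fetch "$SEGMENT"
bin/nutch parse "$SEGMENT"
bin/nutch updatedb crawl/crawldb "$SEGMENT"
# push the parsed pages into Solr
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb -dir crawl/segments
```

Repeating the generate/fetch/parse/updatedb steps deepens the crawl by one level each round, which maps directly to the crawl-depth numbers discussed below.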
>> 
>> Cheers,
>> 
>> -----Original message-----
>>> From:Palmer, Eric <epal...@richmond.edu>
>>> Sent: Wednesday 30th October 2013 18:48
>>> To: solr-user@lucene.apache.org
>>> Subject: Replacing Google Mini Search Appliance with Solr?
>>> 
>>> Hello all,
>>> 
>>> Been lurking on the list for a while.
>>> 
>>> Our two Google Mini search appliances, which index our public web
>>>sites, are at end of life. Google is no longer selling the Mini
>>>appliances, and buying the big appliance is not cost
>>>beneficial.
>>> 
>>> http://search.richmond.edu/
>>> 
>>> We would run a Solr replacement on Linux (CentOS, RedHat, or similar)
>>>with OpenJDK or Oracle Java.
>>> 
>>> Background
>>> ==========
>>> ~130 sites
>>> only ~12,000 pages (at a depth of 3)
>>> probably ~40,000 pages if we go to a depth of 4
>>> 
>>> We use key matches a lot. In solr terms these are elevated documents
>>>(elevations)
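
For reference, key matches map to Solr's QueryElevationComponent, which pins chosen documents to the top of results for a given query via an elevate.xml file. A hypothetical fragment (the query text and document ID below are placeholders, not real configuration):

```xml
<!-- elevate.xml: pin specific documents to the top for a given query -->
<elevate>
  <query text="admissions">
    <doc id="http://admissions.richmond.edu/" />
  </query>
</elevate>
```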
>>> 
>>> We would code a search query form in php and wrap it into our design
>>>(http://www.richmond.edu)
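
A server-side search form like that essentially sends an HTTP GET to Solr's /select handler and renders the response. A minimal sketch in Python of building the request URL (the same request shape applies from PHP); the host, core name, and parameters are assumptions:

```python
from urllib.parse import urlencode

def build_select_url(base_url, query, rows=10):
    """Build a Solr /select URL; base_url and core name are assumptions."""
    params = urlencode({
        "q": query,            # the user's search terms from the form
        "defType": "edismax",  # forgiving query parser for free-text input
        "rows": rows,          # number of results per page
        "wt": "json",          # JSON response for easy rendering
    })
    return f"{base_url}/select?{params}"

# Example: query a hypothetical 'web' core on a local Solr instance
print(build_select_url("http://localhost:8983/solr/web", "admissions deadlines"))
```

The front end then only needs to decode the JSON and drop the hits into the site template.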
>>> 
>>> I have played with and love lucidworks and know that their $ solution
>>>works for our use cases but the cost model is not attractive for such a
>>>small collection.
>>> 
>>> So with Solr, what are my open source options, and what are people's
>>>experiences crawling and indexing web sites with Solr plus a crawler? I
>>>understand Solr does not ship with a crawler, so getting one working
>>>would be the first task.
>>> 
>>> We can code in Java, PHP, Python etc. if we have to, but we don't want
>>>to write a crawler if we can avoid it.
>>> 
>>> thanks in advance for any information.
>>> 
>>> --
>>> Eric Palmer
>>> Web Services
>>> U of Richmond
>>> 
>>> 
>