Re: SOLR + Nutch set up (UNCLASSIFIED)

Walter Underwood Wed, 03 Aug 2016 17:04:10 -0700

Ah, the difference between open source and a product. With Ultraseek, we chose 
a solid, stable algorithm that worked well for 3000 customers. In open source, 
it is a research project for every single customer.


I love open source. I’ve brought Solr into Netflix and Chegg. But there is a 
clear difference between developer-driven and customer-driven software.

I first learned about bounded binary exponential backoff in the 
Digital/Intel/Xerox (“DIX”) Ethernet spec in 1980. It is a solid algorithm for 
events with a Poisson distribution, like packet arrival times or web page next 
change times. There is no need for configuring algorithms here, especially 
configurations that lead to an unstable estimate. The only meaningful choices 
are the minimum revisit time, the maximum revisit time, and the number of bins. 
Those will be different for CNN (a launch customer for Ultraseek) or Sun 
documentation (another launch customer). CNN news articles change minute by 
minute, while new Sun documentation appeared weekly or monthly.
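
To make those three choices concrete, here is a minimal sketch of a bounded 
binary exponential backoff for revisit times. It is an illustration only, not 
Ultraseek's code. The class name, field names, and the deterministic doubling 
are all assumptions, and it resets to the minimum interval whenever the page 
has changed.

```python
from dataclasses import dataclass


@dataclass
class RevisitSchedule:
    """Hypothetical revisit scheduler; names and structure are illustrative."""
    min_interval: float  # shortest revisit time, in seconds
    max_interval: float  # longest revisit time, in seconds
    bins: int            # number of backoff steps (0 .. bins - 1)
    step: int = 0        # current backoff step for this page

    def interval(self) -> float:
        # Double the minimum interval once per step, capped at the maximum.
        return min(self.min_interval * 2 ** self.step, self.max_interval)

    def record_fetch(self, changed: bool) -> float:
        """Update the step after a fetch and return the next revisit interval."""
        if changed:
            # Assumption: a change resets the estimate instead of shrinking it slowly.
            self.step = 0
        else:
            # Unchanged page: back off one more step, bounded by the last bin.
            self.step = min(self.step + 1, self.bins - 1)
        return self.interval()


def demo():
    # One-minute minimum, one-hour maximum, eight bins.
    s = RevisitSchedule(min_interval=60, max_interval=3600, bins=8)
    return [s.record_fetch(changed) for changed in (False, False, False, True, False)]
```

A news site would get a small min_interval and few bins. A documentation site 
would get a larger min_interval and a max_interval measured in weeks.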

Sorry for the rant, but “you can fix the algorithm yourself” almost always 
means a bad installation, an unhappy admin, and another black eye for open 
source.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Aug 3, 2016, at 4:07 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> 
> Depending on your settings, Nutch does this as well. It is even possible to 
> set up different increment/decrement values per MIME type. 
> The algorithms are pluggable and overridable at any point of interest. You 
> can go all the way.  
> 
> -----Original message-----
>> From:Walter Underwood <wun...@wunderwood.org>
>> Sent: Wednesday 3rd August 2016 20:03
>> To: solr-user@lucene.apache.org
>> Subject: Re: SOLR + Nutch set up (UNCLASSIFIED)
>> 
>> That’s good news.
>> 
>> It should reset the interval estimate on page change instead of slowly 
>> shortening it.
>> 
>> I’m pretty sure that Ultraseek used a bounded exponential backoff when the 
>> page had not changed.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Aug 3, 2016, at 10:51 AM, Marco Scalone <marcoscal...@gmail.com> wrote:
>>> 
>>> Nutch also has adaptive strategy:
>>> 
>>>> This class implements an adaptive re-fetch algorithm. This works as
>>>> follows:
>>>>
>>>>  - for pages that have changed since the last fetchTime, decrease their
>>>>  fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
>>>>  - for pages that haven't changed since the last fetchTime, increase
>>>>  their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
>>>>  - If SYNC_DELTA property is true, then:
>>>>     - calculate a delta = fetchTime - modifiedTime
>>>>     - try to synchronize with the time of change, by shifting the next
>>>>     fetchTime by a fraction of the difference between the last modification
>>>>     time and the last fetch time. I.e. the next fetch time will be set to
>>>>     fetchTime + fetchInterval - delta * SYNC_DELTA_RATE
>>>>     - if the adjusted fetch interval is bigger than the delta, then
>>>>     fetchInterval = delta.
>>>>  - the minimum value of fetchInterval may not be smaller than
>>>>  MIN_INTERVAL (default is 1 minute).
>>>>  - the maximum value of fetchInterval may not be bigger than
>>>>  MAX_INTERVAL (default is 365 days).
>>>> 
>>>> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize
>>>> the algorithm, so that the fetch interval either increases or decreases
>>>> infinitely, with little relevance to the page changes. Please use
>>>> main(String[])
>>>> <https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html#main%28java.lang.String[]%29>
>>>> method to test the values before applying them in a production system.
>>>> 
>>> 
>>> From:
>>> https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
>>> 
>>> 
>>> 2016-08-03 14:45 GMT-03:00 Walter Underwood <wun...@wunderwood.org>:
>>> 
>>>> I’m pretty sure Nutch uses a batch crawler instead of the adaptive crawler
>>>> in Ultraseek.
>>>> 
>>>> I think we were the only people who built an adaptive crawler for
>>>> enterprise use. I tried to get Ultraseek open-sourced. I made the argument
>>>> to Mike Lynch. He looked at me like I had three heads and didn’t even
>>>> answer me.
>>>> 
>>>> Ultraseek also has great support for sites that need login. If you use
>>>> that, you’ll need to find a way to do that with another crawler.
>>>> 
>>>> wunder
>>>> Walter Underwood
>>>> Former Ultraseek Principal Engineer
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/  (my blog)
>>>> 
>>>> 
>>>>> On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US)
>>>>> <kris.t.musshorn....@mail.mil> wrote:
>>>>> 
>>>>> CLASSIFICATION: UNCLASSIFIED
>>>>> 
>>>>> We are currently using Ultraseek and looking to deprecate it in favor of
>>>>> Solr/Nutch.
>>>>> Ultraseek runs all the time, auto-detects when pages have changed, and
>>>>> automatically reindexes them.
>>>>> Is this possible with Solr/Nutch?
>>>>> 
>>>>> Thanks,
>>>>> Kris
>>>>> 
>>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>> Kris T. Musshorn
>>>>> FileMaker Developer - Contractor - Catapult Technology Inc.
>>>>> US Army Research Lab
>>>>> Aberdeen Proving Ground
>>>>> Application Management & Development Branch
>>>>> 410-278-7251
>>>>> kris.t.musshorn....@mail.mil
>>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>> 
>>>>> 
>>>>> 
>>>>> CLASSIFICATION: UNCLASSIFIED
>>>> 
>>>> 
>> 
>>