Hello,

I built my own crawler in Python, as I couldn't find the Nutch
documentation (not complaining, I probably didn't look hard enough).
I use BeautifulSoup, because the site is mostly based on Python/Django,
and we like Python.

Writing one was good for us because we spent most of our time figuring
out "what" to write: how to fetch pages, which to choose, what data to
store, etc.  It was an awesome exercise that really narrowed the
definition of our project.  It helped us define our Solr schema and
other parts of the project during development.  If we had known exactly
what sort of data to crawl, and exactly what we intended to save, I'm
sure we would have pushed harder at figuring out Nutch.
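
In case it helps anyone going the same route, the "which pages to
choose" step comes down to something like the sketch below.  This is
not our actual code: it uses the standard library's html.parser in
place of BeautifulSoup so that it runs without extra installs, and the
names LinkCollector and extract_links are made up for illustration.  It
assumes you only want links on the same host as the page being parsed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute same-host links in page order, without duplicates."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    seen = set()
    links = []
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment.
        url = urljoin(base_url, href).split("#", 1)[0]
        if urlparse(url).netloc == host and url not in seen:
            seen.add(url)
            links.append(url)
    return links
```

Fetching, politeness delays and deciding what to store are left out on
purpose; those were the parts that depended most on our own project.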

If I were to refactor, I would give Heritrix and Nutch a good look now.

cheers
gene

Gene Campbell
http://www.picante.co.nz
gene at picante point co point nz

http://www.travelbeen.com - "the social search engine for travel"



On Tue, Mar 10, 2009 at 11:14 PM, Andrzej Bialecki <a...@getopt.org> wrote:
> Sean Timm wrote:
>>
>> We too use Heritrix. We tried Nutch first but Nutch was not finding all
>> of the documents that it was supposed to. When Nutch and Heritrix were
>> both set to crawl our own site to a depth of three, Nutch missed some
>> pages that were linked directly from the seed. We ended up with 10%-20%
>> fewer pages in the Nutch crawl.
>
> FWIW, from a private conversation with Sean it seems that this was likely
> related to the default configuration in Nutch, which collects only the first
> 1000 outlinks from a page. This is an arbitrary and configurable limit,
> introduced as a way to limit the impact of spam pages and to limit the size
> of LinkDb. If a page hits this limit then indeed the symptoms that you
> observe are missing (dropped) links.
>
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
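
To make the point about the outlink limit concrete, here is a toy
Python sketch of what a per-page cap does.  It is not Nutch code: the
function name cap_outlinks, the example URLs and the treatment of a
negative value as "no limit" are all assumptions for illustration.
Only the figure of 1000 comes from the message above.

```python
def cap_outlinks(outlinks, max_outlinks=1000):
    """Split outlinks into (kept, dropped) using a per-page cap.

    A negative cap means "no limit" (an assumption of this sketch).
    """
    if max_outlinks < 0:
        return list(outlinks), []
    return list(outlinks[:max_outlinks]), list(outlinks[max_outlinks:])


# A hypothetical seed page that links to 1200 documents.
page_links = ["http://example.com/doc/%d" % i for i in range(1200)]
kept, dropped = cap_outlinks(page_links)
print(len(kept), len(dropped))  # prints "1000 200"
```

Everything past the first 1000 links is never seen by the crawler, so
pages linked directly from the seed can still go missing.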
