Hello, I built my own crawler with Python, as I couldn't find much Nutch documentation (not complaining, I probably didn't look hard enough). I use BeautifulSoup, because the site is mostly based on Python/Django, and we like Python.
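The core of it is pretty small. Roughly something like this (just a sketch of the general approach, not our actual code; the seed URL, page limit, and what you store per page are placeholders):

    import urllib2
    from urlparse import urljoin
    from BeautifulSoup import BeautifulSoup

    def crawl(seed, max_pages=100):
        seen, queue = set(), [seed]
        while queue and len(seen) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urllib2.urlopen(url).read()
            except Exception:
                continue  # skip pages we can't fetch
            soup = BeautifulSoup(html)
            # ... pull out and store whatever fields you care about (title, text, etc.)
            for a in soup.findAll('a', href=True):
                link = urljoin(url, a['href'])
                if link.startswith(seed):  # stay on our own site
                    queue.append(link)
        return seen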
Writing one was good for us because we spent most of our time figuring out "what" to write: how to fetch pages, which to choose, what data to store, etc. It was an awesome exercise that really narrowed the definition of our project. It helped us define our Solr schema and other parts of the project during development. If we had known exactly what sort of data to crawl, and exactly what we intended to save, I'm sure we would have pushed harder at figuring out Nutch. If I were to refactor, I would give Heritrix and Nutch a good look now.

cheers
gene

Gene Campbell
http://www.picante.co.nz
gene at picante point co point nz
http://www.travelbeen.com - "the social search engine for travel"

On Tue, Mar 10, 2009 at 11:14 PM, Andrzej Bialecki <a...@getopt.org> wrote:
> Sean Timm wrote:
>>
>> We too use Heritrix. We tried Nutch first, but Nutch was not finding all
>> of the documents that it was supposed to. When Nutch and Heritrix were
>> both set to crawl our own site to a depth of three, Nutch missed some
>> pages that were linked directly from the seed. We ended up with 10%-20%
>> fewer pages in the Nutch crawl.
>
> FWIW, from a private conversation with Sean it seems that this was likely
> related to the default configuration in Nutch, which collects only the
> first 1000 outlinks from a page. This is an arbitrary and configurable
> limit, introduced as a way to limit the impact of spam pages and to limit
> the size of LinkDb. If a page hits this limit then indeed the symptoms
> that you observe are missing (dropped) links.
>
> --
> Best regards,
> Andrzej Bialecki
> Information Retrieval, Semantic Web; Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
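P.S. For anyone who does run into the limit Andrzej describes: if I remember right, the relevant knob is db.max.outlinks.per.page, which you can override in conf/nutch-site.xml, along the lines of the snippet below (check nutch-default.xml for the exact name and default in your version):

    <property>
      <name>db.max.outlinks.per.page</name>
      <!-- a negative value means no limit; raise or remove the cap if large pages are dropping links -->
      <value>-1</value>
    </property>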