Thorsten: First of all, I read your lab idea with great interest, as I am in need of such a crawler. However, there are certain things I would like to discuss. I am not sure which forum is appropriate for this, so I will do my idea shooting here first; please tell me where I should post further comments.
A vertical search engine that focuses on a specific set of data (using Solr, for example, because it provides maximum field flexibility) would greatly benefit from such a crawler. The next big Technorati or the next big event-finding solution could use your crawler to crawl feeds via a feed plugin (maybe Nutch plugins) or to scrape websites for event info using some XPath/XQuery machinery (personally I think XPath is a pain in the a... :-)

What I worry about are the issues such a crawler has to deal with: updating crawls, how many threads per host, scale, etc. All the maintainer's headaches! I know you will reuse as much code as you can from Nutch and are not planning to re-invent the wheel. But wouldn't it be much easier to jump into Sami's idea and make it better and more stand-alone, while still benefiting from the Nutch community? I wonder, wouldn't it be easier to pursue a route where the Nutch crawler becomes a standalone crawler? No? I read a post about it on the list.

I would like to hear more about how your plan will evolve in terms of Druids, and why not join forces with Sami and co.?

Regards

On 2/7/07, Thorsten Scherler <[EMAIL PROTECTED]> wrote:
On Wed, 2007-02-07 at 18:03 +0200, Sami Siren wrote:
> rubdabadub wrote:
> > Hi:
> >
> > Are there relatively stand-alone crawlers that are
> > suitable/customizable for Solr? Has anyone done any trials? I have
> > seen some discussion about the Cocoon crawler.. was that successful?
>
> There's also an integration path available for Nutch[1] that I plan to
> integrate after 0.9.0 is out.

Sounds very nice, I just finished reading it. Thanks.

Today I submitted a proposal for an Apache Labs project called Apache Druids.
http://mail-archives.apache.org/mod_mbox/labs-labs/200702.mbox/browser

The basic idea is to create a flexible crawler framework. The core should be a simple crawler which could easily be extended by plugins. So if a project/app needs special processing for a crawled URL, one could write a plugin to implement the functionality.

salu2

> --
> Sami Siren
>
> [1] http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html

--
Thorsten Scherler
thorsten.at.apache.org
Open Source Java & XML consulting, training and solutions
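P.S. To make the plugin idea concrete for myself, here is a minimal sketch of what a plugin-extensible crawler core might look like. This is purely hypothetical: `CrawlPlugin`, `FeedPlugin`, and `CrawlerCore` are names I made up for illustration, not actual Druids or Nutch APIs. The point is just that the core dispatches each fetched URL to whichever plugin claims it, and falls through to a default otherwise.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical extension point: a plugin claims URLs and processes their content.
interface CrawlPlugin {
    boolean accepts(String url);                 // should this plugin handle the URL?
    String process(String url, String content);  // transform the fetched content
}

// Example plugin: tags content fetched from RSS feed URLs.
class FeedPlugin implements CrawlPlugin {
    public boolean accepts(String url) { return url.endsWith(".rss"); }
    public String process(String url, String content) { return "feed:" + content; }
}

// Minimal core: iterates registered plugins, first match wins.
class CrawlerCore {
    private final List<CrawlPlugin> plugins = new ArrayList<CrawlPlugin>();
    void register(CrawlPlugin p) { plugins.add(p); }
    String handle(String url, String content) {
        for (CrawlPlugin p : plugins) {
            if (p.accepts(url)) return p.process(url, content);
        }
        return content; // no plugin claimed it: pass through unchanged
    }
}

public class Demo {
    public static void main(String[] args) {
        CrawlerCore core = new CrawlerCore();
        core.register(new FeedPlugin());
        System.out.println(core.handle("http://example.com/news.rss", "<rss/>"));
        // prints "feed:<rss/>"
    }
}
```

A scraping plugin for event pages would slot in the same way, with `accepts()` matching the target site and `process()` running the XPath extraction.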