On Thu, 2007-02-08 at 14:40 +0100, rubdabadub wrote:
> Thorsten:
>
> First of all, I read your lab idea with great interest, as I am in need
> of such a crawler. However, there are certain things that I would like
> to discuss. I am not sure which forum is appropriate for this, but I
> will do my idea shooting here first; please tell me where I should
> post further comments.
Since it is not an official lab project yet, I am unsure myself, but I
think we should discuss details on [EMAIL PROTECTED] Please reply to
the labs ML.

> A vertical search engine that focuses on a specific set of data, i.e.
> uses Solr for example because it provides the maximum field
> flexibility, would greatly benefit from such a crawler. I.e. the next
> big Technorati or the next big event-finding solution could use your
> crawler to crawl feeds using a feed plugin (maybe Nutch plugins) or
> scrape websites for event info using some XPath/XQuery stuff
> (personally I think XPath is a pain in the a... :-)

These, like you pointed out, are surely some use cases for the crawler
in combination with plugins. Another is the wget-like crawl that an
application can use to export a static site (e.g. from a CMS, etc.).

> What I worry about are the issues that have to do with:
>
> - updating crawls

Actually, if you only look at the crawl itself, there is no difference
between an update crawl and any other crawl.

> - how many threads per host should be configurable.
> - scale etc.

You mean a crawl cluster?

> All the maintainer's headaches!

That is why droids is a labs proposal.
http://labs.apache.org/bylaws.html
All Apache committers have write access, and when a lab is promoted,
the files are moved over to the incubation area.

> I know you will use as much code as you can from Nutch, plus you are
> not planning to re-invent the wheel. But wouldn't it be much easier
> to jump into Sami's idea and make it better and more stand-alone, and
> still benefit from the Nutch community?

I will start a thread on nutch-dev and see whether or not it is
possible to extract the crawler from the core, but the main idea is to
keep droids simple.
Imagine something like the following pseudo code:

  public void crawl(String url) throws IOException {
    // resolve the stream
    InputStream stream = new URL(url).openStream();
    // look up the plugin that is registered for this kind of stream
    Plugin plugin = lookupPlugin(stream);
    // extract links (link pattern matcher)
    Link[] links = plugin.extractLinks(stream);
    // match pattern plugins for storing/excluding links
    links = plugin.handleLinks(links);
    // pass the stream to the plugin for further processing
    plugin.main(stream);
  }

> I wonder, wouldn't it be easy to push/pursue a route where the Nutch
> crawler becomes a standalone crawler? No? I read a post about it on
> the list.
> Can you provide some links to get some background information? TIA.
> I would like to hear more about how your plan will evolve in terms of
> droids, and why not join forces with Sami and co.?

I am more familiar with Solr than Nutch, I have to admit. Like I said,
all committers have write access on droids and everybody is welcome to
join the effort. Who knows, maybe the first droid will be a standalone
Nutch crawler with plugin extension points, if some Nutch committer
joins the lab.

Thanks, rubdabadub, for your feedback.

salu2

> Regards
>
> On 2/7/07, Thorsten Scherler <[EMAIL PROTECTED]> wrote:
> > On Wed, 2007-02-07 at 18:03 +0200, Sami Siren wrote:
> > > rubdabadub wrote:
> > > > Hi:
> > > >
> > > > Are there relatively stand-alone crawlers that are
> > > > suitable/customizable for Solr? Has anyone done any trials? I
> > > > have seen some discussion about the Cocoon crawler... was that
> > > > successful?
> > >
> > > There's also an integration path available for Nutch[1] that I
> > > plan to integrate after 0.9.0 is out.
> >
> > Sounds very nice, I just finished reading it. Thanks.
> >
> > Today I submitted a proposal for an Apache Labs project called
> > Apache Droids.
> >
> > http://mail-archives.apache.org/mod_mbox/labs-labs/200702.mbox/browser
> >
> > The basic idea is to create a flexible crawler framework.
> > The core should be a simple crawler which could be easily extended
> > by plugins. So if a project/app needs special processing for a
> > crawled URL, one could write a plugin to implement the
> > functionality.
> >
> > salu2
> >
> > > --
> > > Sami Siren
> > >
> > > [1] http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html
> >
> > --
> > Thorsten Scherler                thorsten.at.apache.org
> > Open Source Java & XML consulting, training and solutions

--
Thorsten Scherler                thorsten.at.apache.org
Open Source Java & XML consulting, training and solutions
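PS: To make the plugin idea from the pseudo code above a bit more
concrete, here is a minimal sketch of what such a plugin contract could
look like. Only the names Plugin, extractLinks, handleLinks and main
come from the pseudo code; everything else (HtmlLinkPlugin, the
regex-based extraction, String for links) is an assumption for
illustration, not the actual droids API:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical plugin contract, modeled on the pseudo code above.
interface Plugin {
    // find candidate links in the raw content
    String[] extractLinks(InputStream stream) throws Exception;
    // decide which links to keep (store/exclude patterns)
    String[] handleLinks(String[] links);
    // content-specific processing (indexing, export, ...)
    void main(InputStream stream) throws Exception;
}

// A toy HTML plugin: pulls href="..." values out of the stream and
// keeps only http(s) links.
class HtmlLinkPlugin implements Plugin {
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    public String[] extractLinks(InputStream stream) throws Exception {
        String content = new String(stream.readAllBytes(), "UTF-8");
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(content);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links.toArray(new String[0]);
    }

    public String[] handleLinks(String[] links) {
        List<String> kept = new ArrayList<>();
        for (String link : links) {
            if (link.startsWith("http")) {
                kept.add(link);
            }
        }
        return kept.toArray(new String[0]);
    }

    public void main(InputStream stream) {
        // a real plugin would index, store, or transform the content here
    }
}

public class PluginSketch {
    public static void main(String[] args) throws Exception {
        String html = "<a href=\"http://example.org/a\">a</a>"
                    + "<a href=\"mailto:x@y\">x</a>";
        Plugin plugin = new HtmlLinkPlugin();
        String[] links = plugin.handleLinks(
            plugin.extractLinks(
                new ByteArrayInputStream(html.getBytes("UTF-8"))));
        for (String link : links) {
            System.out.println(link);  // prints only the http link
        }
    }
}
```

The point of the interface split is that the core crawl loop never
needs to know what kind of content it is fetching; swapping in a feed
plugin or a scraper plugin would not change the loop at all.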