Re: Crawl Anywhere -

2013-05-22 Thread Dominique Bejean
Hi, Crawl-Anywhere includes a customizable document processing pipeline. Crawl-Anywhere can also cache the original crawled pages and documents in a MongoDB database. Best regards, Dominique On 11/02/13 06:16, SivaKarthik wrote: Dear Erick, Thanks for your reply. Yes, Nutch can meet m

Re: Crawl Anywhere -

2013-05-22 Thread Dominique Bejean
Hi, I didn't see this question. Yes, I confirm that Crawl-Anywhere can crawl in a distributed environment. If you have several huge web sites to crawl, you can dispatch the crawling across several crawler engines. However, a single web site can only be crawled by one crawler engine at a time. This lim
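The dispatch rule described above (several engines, but each web site handled by exactly one engine at a time) can be sketched roughly as follows. This is only an illustration, not Crawl-Anywhere's actual API or configuration: the function name `dispatch_sites`, the engine names, and the per-site page estimates are all hypothetical.

```python
def dispatch_sites(sites, engines):
    """Assign every site to exactly one engine (hypothetical helper).

    sites   -- dict mapping site name to an estimated page count (assumed input)
    engines -- list of engine names (assumed input)
    Returns a dict mapping site name to the engine that crawls it.
    """
    if not engines:
        raise ValueError("at least one crawler engine is required")

    load = {engine: 0 for engine in engines}  # estimated pages per engine
    assignment = {}

    # Place the largest sites first, each on the currently least-loaded engine.
    # A site is never split: it is assigned to one engine only.
    for site, pages in sorted(sites.items(), key=lambda kv: kv[1], reverse=True):
        engine = min(engines, key=lambda e: load[e])
        assignment[site] = engine
        load[engine] += pages

    return assignment


if __name__ == "__main__":
    example_sites = {"site-a.example": 100, "site-b.example": 60, "site-c.example": 50}
    print(dispatch_sites(example_sites, ["engine-1", "engine-2"]))
```

Because a site is never split, adding engines only helps when there are several sites to crawl; one very large site still occupies a single engine.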

Re: Crawl Anywhere -

2013-02-11 Thread O. Klein
Yes, you can run CA on different machines. In "Manage" you have to set the target and engine for this to work. I've never done this, so you will have to contact the developer for more details. SivaKarthik wrote > Hi All, > In our project, we need to download millions of pages... > so is there a

Re: Crawl Anywhere -

2013-02-11 Thread Jan Høydahl
Have a look at Nutch2; it is decoupled from HDFS and can store docs in e.g. HBase or another NoSQL store. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 11 Feb 2013, at 06:16, SivaKarthik wrote: > Dear Erick, > Thanks for your rel

Re: Crawl Anywhere -

2013-02-10 Thread Markus.Mirsberger
Hi, did you try Heritrix? The documents are stored as HTML inside a WARC file, which can be post-processed easily. Cheers, Markus On 11.02.2013 12:16, SivaKarthik wrote: Dear Erick, Thanks for your reply. Yes, Nutch can meet my requirement, but the problem is, I want to store th

Re: Crawl Anywhere -

2013-02-10 Thread SUJIT PAL
Hi Siva, You will probably get a better reply if you head over to the Nutch mailing list [http://nutch.apache.org/mailing_lists.html] and ask there. Nutch 2.1 may be what you are looking for (it stores pages in a NoSQL database). Regards, Sujit On Feb 10, 2013, at 9:16 PM, SivaKarthik wrote: > De