Hi,
Crawl-Anywhere includes a customizable document processing pipeline.
Crawl-Anywhere can also cache the original crawled pages and documents in a
MongoDB database.
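If you want to look at that cache directly, a quick script against MongoDB is
enough. Below is a minimal sketch in Python with pymongo; note that the
database, collection and field names are only placeholders for illustration,
the real ones depend on your Crawl-Anywhere configuration:

# Sketch: read one cached page back out of the MongoDB cache.
# "crawler_cache", "pages", "url", "fetch_date" and "content" are
# assumed names, not the actual Crawl-Anywhere schema.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
pages = client["crawler_cache"]["pages"]

doc = pages.find_one({"url": "http://example.com/"})
if doc is not None:
    print(doc.get("fetch_date"), len(doc.get("content", b"")))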
Best regards.
Dominique
On 11/02/13 06:16, SivaKarthik wrote:
Dear Erick,
Thanks for your reply..
Yes, Nutch can meet my requirement...
Hi,
I didn't see this question.
Yes, I confirm Crawl-Anywhere can crawl in a distributed environment.
If you have several huge web sites to crawl, you can dispatch the crawling
across several crawler engines. However, a single web site can only be
crawled by one crawler engine at a time.
This limitation...
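To illustrate the dispatch model (this is only a toy sketch, not
Crawl-Anywhere code, and the engine ids are made up): each web site is
assigned to exactly one crawler engine, for example by hashing its hostname,
so two engines never end up crawling the same site:

# Toy sketch of the "one site -> one engine" dispatch rule.
from urllib.parse import urlparse
import hashlib

ENGINES = ["engine-1", "engine-2", "engine-3"]  # made-up engine ids

def engine_for(seed_url):
    host = urlparse(seed_url).hostname or ""
    digest = int(hashlib.md5(host.encode("utf-8")).hexdigest(), 16)
    return ENGINES[digest % len(ENGINES)]

for seed in ["http://site-a.com/", "http://site-b.org/", "http://site-a.com/page2"]:
    print(seed, "->", engine_for(seed))  # both site-a URLs go to the same engine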
Yes, you can run CA on different machines.
In "Manage" you have to set target and engine for this to work.
I've never done this, so you have to contact the developer for more details.
SivaKarthik wrote:
> Hi All,
> In our project, we need to download millions of pages...
> so is there a
Have a look at Nutch 2; it is decoupled from HDFS and can store docs in,
e.g., HBase or another NoSQL store.
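Once the pages are in HBase you can read them back with any HBase client.
Here is a rough sketch with Python and happybase (it goes through the HBase
Thrift gateway); the table name "webpage" and the column layout are
placeholders here, the real schema comes from the gora-hbase-mapping.xml
you use:

# Sketch: scan a few rows of the Nutch page table in HBase.
# Table/column names are assumptions -- check your Gora mapping.
import happybase

conn = happybase.Connection("localhost")  # HBase Thrift server
table = conn.table("webpage")             # assumed table name

for row_key, data in table.scan(limit=5):
    print(row_key)
    for column, value in data.items():
        print("  ", column, len(value), "bytes")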
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
On Feb 11, 2013, at 06:16, SivaKarthik wrote:
> Dear Erick,
> Thanks for your reply..
Hi,
Did you try Heritrix?
The documents are stored as HTML inside a WARC file, which can be
post-processed easily.
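For example, with Python and the warcio library, reading the HTML responses
back out of a WARC takes only a few lines (the file name below is just a
placeholder):

from warcio.archiveiterator import ArchiveIterator

# Print the URL and size of every HTML response in the archive.
with open("crawl.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read()
            print(url, len(body), "bytes")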
Cheers,
Markus
On 11.02.2013 12:16, SivaKarthik wrote:
Dear Erick,
Thanks for your reply..
Yes, Nutch can meet my requirement...
but the problem is, I want to store the crawled pages...
Hi Siva,
You will probably get a better reply if you head over to the Nutch mailing list
[http://nutch.apache.org/mailing_lists.html] and ask there.
Nutch 2.1 may be what you are looking for (it stores pages in a NoSQL database).
Regards,
Sujit
On Feb 10, 2013, at 9:16 PM, SivaKarthik wrote:
> Dear Erick,