Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
… On Jun 1, 2017, at 3:31 PM, David Choi wrote: In the meantime, I have found that a better solution for the moment is to test on a site that allows users to crawl their site. On Thu, Jun 1, 2017 at 5:26 PM David Choi …

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
In the meantime, I have found that a better solution for the moment is to test on a site that allows users to crawl their site. On Thu, Jun 1, 2017 at 5:26 PM David Choi wrote: I think you misunderstand; the argument was about stealing content. Sorry, but I think you need to read what people …

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
… the URLs visited, errors, duplicates, etc. The output of the crawl goes to Solr. That is how we did it with Ultraseek (before Solr existed). wunder / Walter Underwood / wun...@wunderwood.org / http://observer.wunderwood.org/ (my blog) On Jun 1, 2017, …
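
The bookkeeping Walter describes (track the URLs visited, skip duplicates, log errors, send the rest to Solr) can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the thread; the Solr URL, the core name "webdocs", and the field names are assumptions and depend on the target schema.

    import json
    import urllib.request

    # Assumed Solr endpoint; core name "webdocs" is an example, not from the thread.
    SOLR_UPDATE = "http://localhost:8983/solr/webdocs/update?commit=true"

    visited = set()   # URLs already crawled, used to skip duplicates
    errors = []       # (url, error message) pairs for the crawl log

    def crawl_one(url):
        """Fetch a single page, skipping URLs that were already visited."""
        if url in visited:
            return None                     # duplicate, already crawled
        visited.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read().decode("utf-8", errors="replace")
        except Exception as exc:
            errors.append((url, str(exc)))  # record the error, keep crawling
            return None

    def send_to_solr(url, body):
        """Post one crawled page to Solr's JSON update endpoint."""
        doc = [{"id": url, "content_txt": body}]  # field names depend on the schema
        req = urllib.request.Request(
            SOLR_UPDATE,
            data=json.dumps(doc).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)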

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
… PM Dave wrote: And I mean that in the context of stealing content from sites that explicitly declare they don't want to be crawled. Robots.txt is to be followed. On Jun 1, 2017, at 5:31 PM, David Choi wrote: Hello, I w…
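
Honoring robots.txt before fetching, as Dave insists, takes only the Python standard library. A minimal sketch, where the user agent string "MyCrawler" is an example and not from the thread:

    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    def allowed_by_robots(url, user_agent="MyCrawler"):
        """Return True if the site's robots.txt permits fetching this URL."""
        parts = urlparse(url)
        rp = RobotFileParser()
        rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        rp.read()                            # download and parse robots.txt
        return rp.can_fetch(user_agent, url)

A crawler would call allowed_by_robots(url) before every fetch and simply skip URLs the site has disallowed.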

Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
Hello, I was wondering if anyone could guide me on how to crawl the web and ignore robots.txt, since I cannot index some big sites. Or if someone could point out how to get around it. I read somewhere about a protocol.plugin.check.robots setting, but that was for Nutch. The way I index is bin/post -c g…
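
For context on the indexing step the question ends with: below is a rough sketch of fetching one HTML page and handing it to Solr's extract handler (Solr Cell), which is roughly analogous to what the post tool does with web pages. This is an illustration only; the Solr URL and the core name "mycore" are assumptions, the extraction contrib must be enabled on the core, and the fetch should still be paired with a robots.txt check like the one sketched above.

    import urllib.request
    from urllib.parse import quote

    # Assumed Solr location and core name; adjust for the actual install.
    SOLR = "http://localhost:8983/solr/mycore"

    def index_html(url):
        """Fetch a page and let Solr's extract handler parse and index it."""
        html = urllib.request.urlopen(url, timeout=10).read()
        endpoint = (
            f"{SOLR}/update/extract"
            f"?literal.id={quote(url, safe='')}&commit=true"
        )
        req = urllib.request.Request(
            endpoint, data=html, headers={"Content-Type": "text/html"}
        )
        urllib.request.urlopen(req)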