And I mean that in the context of stealing content from sites that explicitly declare they don't want to be crawled. Robots.txt is to be followed.
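If you want to see exactly what a site disallows, its robots.txt is public and easy to inspect. A quick check (assuming curl is installed):

  curl -s https://en.wikipedia.org/robots.txt | head -50

Any path matched by a Disallow rule for your crawler's user agent (or for *) is off limits.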
> On Jun 1, 2017, at 5:31 PM, David Choi <choi.davi...@gmail.com> wrote:
>
> Hello,
>
> I was wondering if anyone could guide me on how to crawl the web and
> ignore the robots.txt, since I cannot index some big sites. Or if someone
> could point out how to get around it. I read somewhere about a
> protocol.plugin.check.robots
> but that was for Nutch.
>
> The way I index is
> bin/post -c gettingstarted https://en.wikipedia.org/
>
> but I can't index the site, I'm guessing because of the robots.txt.
> I can index with
> bin/post -c gettingstarted http://lucene.apache.org/solr
>
> which I am guessing allows it. I was also wondering how to find the name of
> the crawler bin/post uses.
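As for the last question: one empirical way to find the name bin/post identifies itself with is to point it at a local listener and read the headers of the request it sends. A rough sketch, assuming netcat is available and the gettingstarted collection exists (the port number is arbitrary):

  # terminal 1: listen on port 8000 and dump whatever arrives
  nc -l 8000          # some netcat builds need: nc -l -p 8000

  # terminal 2: aim the crawler at the listener
  bin/post -c gettingstarted http://localhost:8000/

Whatever User-Agent header shows up in terminal 1 (possibly Java's default, if the tool doesn't set its own) is the name sites see, and the name their robots.txt rules can target.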