Oh well, I guess it's OK if a corporation does it but not someone wanting to learn more about the field. I have actually written a crawler before, as well as the inverted index that underlies how Solr works, but I thought Solr's architecture was better suited for scaling.
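That said, honoring robots.txt from a hand-written crawler is only a few lines of code. Here is a minimal sketch using Python's standard urllib.robotparser; the user-agent string "MyLearningCrawler" is a placeholder, since the name bin/post identifies itself with isn't stated anywhere in this thread:

    from urllib.robotparser import RobotFileParser

    # Fetch and parse the site's robots.txt once, then test URLs against it.
    robots = RobotFileParser()
    robots.set_url("https://en.wikipedia.org/robots.txt")
    robots.read()

    url = "https://en.wikipedia.org/wiki/Apache_Solr"
    # "MyLearningCrawler" is a placeholder agent name, not a real crawler.
    if robots.can_fetch("MyLearningCrawler", url):
        print("allowed to crawl", url)
    else:
        print("robots.txt disallows", url, "for this user-agent")

For reference, a site that wants to block all crawling entirely serves just two lines: "User-agent: *" followed by "Disallow: /".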
On Thu, Jun 1, 2017 at 4:47 PM Dave <hastings.recurs...@gmail.com> wrote:

> And I mean that in the context of stealing content from sites that
> explicitly declare they don't want to be crawled. Robots.txt is to be
> followed.
>
> > On Jun 1, 2017, at 5:31 PM, David Choi <choi.davi...@gmail.com> wrote:
> >
> > Hello,
> >
> > I was wondering if anyone could guide me on how to crawl the web and
> > ignore the robots.txt, since I cannot index some big sites. Or if
> > someone could point out how to get around it. I read somewhere about a
> > protocol.plugin.check.robots property, but that was for Nutch.
> >
> > The way I index is
> > bin/post -c gettingstarted https://en.wikipedia.org/
> >
> > but I can't index the site, I'm guessing because of the robots.txt.
> > I can index with
> > bin/post -c gettingstarted http://lucene.apache.org/solr
> >
> > which I am guessing allows it. I was also wondering how to find the
> > name of the crawler bin/post uses.