If you are not capable of even writing your own indexing code, let alone a crawler, I would prefer that you just stop now. No one is going to help you with this request, or at least I'd hope not.
> On Jun 1, 2017, at 5:31 PM, David Choi <choi.davi...@gmail.com> wrote:
>
> Hello,
>
> I was wondering if anyone could guide me on how to crawl the web and
> ignore the robots.txt, since I cannot index some big sites. Or if someone
> could point out how to get around it. I read somewhere about a
> protocol.plugin.check.robots property, but that was for Nutch.
>
> The way I index is
>
>     bin/post -c gettingstarted https://en.wikipedia.org/
>
> but I can't index the site, I'm guessing because of the robots.txt.
> I can index with
>
>     bin/post -c gettingstarted http://lucene.apache.org/solr
>
> which I am guessing allows it. I was also wondering how to find the name
> of the crawler bin/post uses.
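
For what it's worth, you can at least confirm that robots.txt is what is blocking you before guessing. A minimal sketch using Python's standard urllib.robotparser (the URLs are the ones from your own examples; "*" stands in for whatever user-agent the crawler actually sends, since I don't know what bin/post identifies itself as):

    # Sketch: check whether a site's robots.txt permits fetching a given URL.
    # Uses only Python's standard library.
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://en.wikipedia.org/robots.txt")
    rp.read()

    # True if a generic crawler ("*") may fetch the page, False if disallowed.
    print(rp.can_fetch("*", "https://en.wikipedia.org/wiki/Main_Page"))

If can_fetch returns False for the generic agent, the site is disallowing crawlers outright, and the right move is to respect that rather than work around it.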