If you are not capable of even writing your own indexing code, let alone a 
crawler, I would prefer that you just stop now.  No one is going to help you 
with this request, or at least I'd hope not. 

> On Jun 1, 2017, at 5:31 PM, David Choi <choi.davi...@gmail.com> wrote:
> 
> Hello,
> 
>   I was wondering if anyone could guide me on how to crawl the web while
> ignoring robots.txt, since I cannot index some big sites. Or if someone
> could point out how to get around it. I read somewhere about a
> protocol.plugin.check.robots
> setting, but that was for Nutch.
> 
> The way I index is
> bin/post -c gettingstarted https://en.wikipedia.org/
> 
> but I can't index that site, I'm guessing because of its robots.txt.
> I can index with
> bin/post -c gettingstarted http://lucene.apache.org/solr
> 
> which I am guessing is allowed. I was also wondering how to find the name
> of the crawler that bin/post uses.
