I can help. We can chat in some freenode chatroom in an hour or so.
Let me know where you hang out.
Thanks
Vivek
On 6/1/17 5:45 PM, Dave wrote:
If you are not capable of even writing your own indexing code, let alone a
crawler, I would prefer that you just stop now. No one is going to help you
with this request, or at least I'd hope not.
On Jun 1, 2017, at 5:31 PM, David Choi <choi.davi...@gmail.com> wrote:
Hello,
I was wondering if anyone could guide me on how to crawl the web and
ignore the robots.txt, since I cannot index some big sites. Or if someone
could point out how to get around it. I read somewhere about a
protocol.plugin.check.robots
setting, but that was for Nutch.
The way I index is
bin/post -c gettingstarted https://en.wikipedia.org/
but I can't index that site, I'm guessing because of its robots.txt.
I can index with
bin/post -c gettingstarted http://lucene.apache.org/solr
which I am guessing allows it. I was also wondering how to find the name of
the crawler bin/post uses.
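
A quick way to confirm whether robots.txt is really what blocks a given URL is to evaluate the rules directly. The sketch below uses Python's standard urllib.robotparser module. The user-agent string "ExampleCrawler", the sample rules, and the example.org URLs are placeholders chosen for illustration; they are not what bin/post sends and not what any real site serves.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in ("https://example.org/public/page",
                "https://example.org/private/page"):
        verdict = is_allowed(SAMPLE_ROBOTS_TXT, "ExampleCrawler", url)
        print(url, "allowed" if verdict else "disallowed")
```

To test a live site instead of a pasted string, RobotFileParser also offers set_url() followed by read(), which downloads the site's robots.txt before can_fetch() is called. Substitute whatever user-agent your crawler actually reports.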