Re: Solr Web Crawler - Robots.txt

Vivek Pathak Thu, 01 Jun 2017 14:47:20 -0700

I can help. We can chat in a freenode channel in an hour or so. Let me know where you hang out.


Thanks


Vivek


On 6/1/17 5:45 PM, Dave wrote:

If you are not capable of even writing your own indexing code, let alone a
crawler, I would prefer that you just stop now.  No one is going to help you
with this request, at least I'd hope not.

On Jun 1, 2017, at 5:31 PM, David Choi <choi.davi...@gmail.com> wrote:

Hello,

   I was wondering if anyone could guide me on how to crawl the web and
ignore robots.txt, since I cannot index some big sites. Or if someone
could point out how to get around it. I read somewhere about a
protocol.plugin.check.robots
setting, but that was for Nutch.

The way I index is
bin/post -c gettingstarted https://en.wikipedia.org/

but I can't index the site, I'm guessing because of the robots.txt.
I can index with
bin/post -c gettingstarted http://lucene.apache.org/solr

which I am guessing allows it. I was also wondering how to find the name of
the crawler bin/post uses.
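
For anyone following along, one way to confirm the guess above is to check what a given robots.txt actually permits before pointing a crawler at the site. Below is a minimal sketch using Python's standard urllib.robotparser. The sample rules, the "example-crawler" user agent, and the example.org URLs are all made up for illustration; they are not the real Wikipedia rules or the user agent that bin/post sends.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for path in ("/public/page", "/private/page"):
        url = "https://example.org" + path
        print(url, is_allowed(SAMPLE_ROBOTS, "example-crawler", url))
```

To test a live site instead, RobotFileParser also offers set_url() and read() to download the file, after which can_fetch() works the same way.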
