Let’s not get snarky right away, especially when you are wrong.

Corporations do not generally ignore robots.txt. I worked on a commercial web 
spider for ten years. Occasionally, our customers did need to bypass portions 
of robots.txt. That was usually because of a poorly-maintained web server, or 
because our spider could safely crawl some content that would cause problems 
for other crawlers.

If you want to learn crawling, don’t start by breaking the conventions of good 
web citizenship. Instead, start with sitemap.xml and crawl the preferred 
portions of a site.

https://www.sitemaps.org/index.html
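
For example, pulling the URL list out of a sitemap takes only the Python
standard library. This is an untested sketch and the sitemap URL is a
placeholder; point it at whatever site you choose:

import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://example.com/sitemap.xml"   # placeholder
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urllib.request.urlopen(SITEMAP_URL) as resp:
    tree = ET.parse(resp)

# A sitemap index points at more sitemaps; a plain sitemap lists page URLs.
# Either way, the <loc> elements hold the URLs the site wants crawled.
for loc in tree.getroot().findall(".//sm:loc", NS):
    print(loc.text.strip())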

If the site blocks you, find a different site to learn on.

I like the looks of “Scrapy”, written in Python. I haven’t used it for anything 
big, but I’d start with that for learning.

https://scrapy.org/
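
A bare-bones spider built on Scrapy's SitemapSpider might look roughly like
this. It is only a sketch I haven't run; the domain and the CSS selector are
placeholders, and the settings just show the polite behavior you want while
learning:

import scrapy
from scrapy.spiders import SitemapSpider

class LearningSpider(SitemapSpider):
    name = "learning"
    # Start from the sitemap instead of fighting robots.txt.
    sitemap_urls = ["https://example.com/sitemap.xml"]   # placeholder
    custom_settings = {
        "ROBOTSTXT_OBEY": True,    # honor robots.txt
        "DOWNLOAD_DELAY": 1.0,     # be polite; don't hammer the server
    }

    def parse(self, response):
        # Yield whatever fields you want to index later.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }

Run it with "scrapy runspider learning_spider.py -o pages.json" and you have
output you can feed to Solr afterward.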

If you want to learn on a site with a lot of content, try ours, chegg.com. But 
if your crawler gets out of hand and crawls too fast, we'll block it. Any other 
site will do the same.

I would not base the crawler directly on Solr. A crawler needs a dedicated 
database to record the URLs visited, errors, duplicates, etc. The output of the 
crawl goes to Solr. That is how we did it with Ultraseek (before Solr existed).
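
Roughly, the split looks like this. It is an illustrative sketch only: the
core name matches the gettingstarted examples below, but the table layout and
the Solr field names are just placeholders:

import hashlib
import sqlite3
import requests

SOLR_UPDATE = "http://localhost:8983/solr/gettingstarted/update?commit=true"

# Crawl state lives in its own database, not in Solr.
db = sqlite3.connect("crawl_state.db")
db.execute("""CREATE TABLE IF NOT EXISTS pages (
    url TEXT PRIMARY KEY,
    status INTEGER,
    content_hash TEXT
)""")

def record_and_index(url, status, body):
    content_hash = hashlib.sha1(body.encode("utf-8")).hexdigest()
    # Duplicate detection: has the same content been seen under another URL?
    dup = db.execute("SELECT 1 FROM pages WHERE content_hash = ?",
                     (content_hash,)).fetchone()
    db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?)",
               (url, status, content_hash))
    db.commit()
    # Only successful, non-duplicate fetches go to Solr.
    if status == 200 and not dup:
        requests.post(SOLR_UPDATE, json=[{"id": url, "content_txt": body}])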

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jun 1, 2017, at 3:01 PM, David Choi <choi.davi...@gmail.com> wrote:
> 
> Oh well, I guess it's OK if a corporation does it but not someone wanting to
> learn more about the field. I have actually written a crawler before, as well
> as an inverted index (which is how Solr works), but I just thought its
> architecture was better suited for scaling.
> 
> On Thu, Jun 1, 2017 at 4:47 PM Dave <hastings.recurs...@gmail.com> wrote:
> 
>> And I mean that in the context of stealing content from sites that
>> explicitly declare they don't want to be crawled. Robots.txt is to be
>> followed.
>> 
>>> On Jun 1, 2017, at 5:31 PM, David Choi <choi.davi...@gmail.com> wrote:
>>> 
>>> Hello,
>>> 
>>>  I was wondering if anyone could guide me on how to crawl the web and
>>> ignore robots.txt, since I cannot index some big sites. Or if someone
>>> could point out how to get around it. I read somewhere about a
>>> protocol.plugin.check.robots setting,
>>> but that was for Nutch.
>>> 
>>> The way I index is
>>> bin/post -c gettingstarted https://en.wikipedia.org/
>>> 
>>> but I can't index the site, I'm guessing because of the robots.txt.
>>> I can index with
>>> bin/post -c gettingstarted http://lucene.apache.org/solr
>>> 
>>> which I am guessing allows it. I was also wondering how to find the name
>> of
>>> the crawler bin/post uses.
>> 
