Nutch was built for that, but it is a pain to use. I’m still sad that I 
couldn’t get Mike Lynch to open source Ultraseek. So easy and much more 
powerful than Nutch.

Ignoring robots.txt is often a bad idea. You may get into a REST API or into a 
calendar that generates an unending number of valid, different pages. Or the 
combinatorial explosion of diffs between revisions of a wiki page. Those are 
really fun.
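
If you do want your crawler to respect robots.txt, Python's standard library
already includes a parser for it, and sites often use robots.txt precisely to
wall off traps like these. A minimal sketch, with a made-up user agent name
and example URLs:

from urllib import robotparser

# Hypothetical crawler name and target URL, purely for illustration.
USER_AGENT = "LearningCrawler"
TARGET = "https://example.com/some/page"

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

if rp.can_fetch(USER_AGENT, TARGET):
    print("allowed:", TARGET)
else:
    print("disallowed by robots.txt:", TARGET)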

There are some web servers that put a session ID in the path, so you get an
endless set of URLs for the exact same page. We called those “black holes”
because they sucked spiders in and never let them out.
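
One common defense is to normalize URLs before adding them to the crawl
frontier, stripping session IDs from both the path and the query string. A
rough sketch; the parameter names below are examples only, not a complete
list:

import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Example parameter names; real crawlers keep longer, site-specific lists.
SESSION_PARAMS = {"jsessionid", "phpsessid", "sid", "sessionid"}

def normalize(url):
    parts = urlsplit(url)
    # Strip session IDs embedded in the path, e.g. /page;jsessionid=ABC123
    path = re.sub(r";jsessionid=[^/?#]*", "", parts.path, flags=re.IGNORECASE)
    # Strip session IDs passed as query parameters.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() not in SESSION_PARAMS]
    # Drop the fragment entirely; it never changes the fetched page.
    return urlunsplit((parts.scheme, parts.netloc.lower(), path,
                       urlencode(query), ""))

print(normalize("http://example.com/cal;jsessionid=ABC123?sid=42&month=6"))
# -> http://example.com/cal?month=6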

The comments in the Wikipedia robots.txt are instructive. For example, they
allow access to the documentation for the REST API (Allow: /api/rest_v1/?doc),
then disallow the other paths in the API (Disallow: /api).

https://en.wikipedia.org/robots.txt
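
To see how that Allow/Disallow pair interacts, you can feed just those two
rules (not the full Wikipedia file) to the standard-library parser:

from urllib import robotparser

# Only the two rules quoted above, not Wikipedia's full robots.txt.
rules = """\
User-agent: *
Allow: /api/rest_v1/?doc
Disallow: /api
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://en.wikipedia.org/api/rest_v1/?doc"))
# True  (the API documentation page is allowed)
print(rp.can_fetch("*", "https://en.wikipedia.org/api/rest_v1/page"))
# False (everything else under /api is disallowed)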

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jun 1, 2017, at 4:58 PM, Mike Drob <md...@apache.org> wrote:
> 
> Isn't this exactly what Apache Nutch was built for?
> 
> On Thu, Jun 1, 2017 at 6:56 PM, David Choi <choi.davi...@gmail.com> wrote:
> 
>> In any case after digging further I have found where it checks for
>> robots.txt. Thanks!
>> 
>> On Thu, Jun 1, 2017 at 5:34 PM Walter Underwood <wun...@wunderwood.org>
>> wrote:
>> 
>>> Which was exactly what I suggested.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>> 
>>>> On Jun 1, 2017, at 3:31 PM, David Choi <choi.davi...@gmail.com> wrote:
>>>> 
>>>> In the meantime I have found that a better solution for the moment is
>>>> to test on a site that allows users to crawl their site.
>>>> 
>>>> On Thu, Jun 1, 2017 at 5:26 PM David Choi <choi.davi...@gmail.com>
>>>> wrote:
>>>> 
>>>>> I think you misunderstand; the argument was about stealing content.
>>>>> Sorry, but I think you need to read what people write before making
>>>>> bold statements.
>>>>> 
>>>>> On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood <wun...@wunderwood.org>
>>>>> wrote:
>>>>> 
>>>>>> Let’s not get snarky right away, especially when you are wrong.
>>>>>> 
>>>>>> Corporations do not generally ignore robots.txt. I worked on a
>>>>>> commercial web spider for ten years. Occasionally, our customers did
>>>>>> need to bypass portions of robots.txt. That was usually because of a
>>>>>> poorly-maintained web server, or because our spider could safely crawl
>>>>>> some content that would cause problems for other crawlers.
>>>>>> 
>>>>>> If you want to learn crawling, don’t start by breaking the conventions
>>>>>> of good web citizenship. Instead, start with sitemap.xml and crawl the
>>>>>> preferred portions of a site.
>>>>>> 
>>>>>> https://www.sitemaps.org/index.html
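
A sketch of that sitemap-first approach using only the standard library; the
sitemap URL is a placeholder, and it does not handle sitemap index files or
gzipped sitemaps:

import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Placeholder URL; real sites publish the location in robots.txt or at /sitemap.xml
with urllib.request.urlopen("https://example.com/sitemap.xml") as resp:
    tree = ET.parse(resp)

# Print every URL the site explicitly asks crawlers to visit.
for loc in tree.findall(".//sm:url/sm:loc", NS):
    print(loc.text.strip())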
>>>>>> 
>>>>>> If the site blocks you, find a different site to learn on.
>>>>>> 
>>>>>> I like the looks of “Scrapy”, written in Python. I haven’t used it for
>>>>>> anything big, but I’d start with that for learning.
>>>>>> 
>>>>>> https://scrapy.org/
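
For reference, a bare-bones Scrapy spider looks roughly like this; the spider
name, domain, and selectors are placeholders, and you would run it with
"scrapy runspider spider_file.py":

import scrapy

class LearningSpider(scrapy.Spider):
    # All of these values are placeholders.
    name = "learning"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/"]
    custom_settings = {"ROBOTSTXT_OBEY": True}  # respect robots.txt

    def parse(self, response):
        # Record the page title, then follow in-domain links.
        yield {"url": response.url, "title": response.css("title::text").get()}
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)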
>>>>>> 
>>>>>> If you want to learn on a site with a lot of content, try ours,
>>>>>> chegg.com. But if your crawler gets out of hand, crawling too fast,
>>>>>> we’ll block it. Any other site will do the same.
>>>>>> 
>>>>>> I would not base the crawler directly on Solr. A crawler needs a
>>>>>> dedicated database to record the URLs visited, errors, duplicates,
>>>>>> etc. The output of the crawl goes to Solr. That is how we did it with
>>>>>> Ultraseek (before Solr existed).
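
To make that concrete, here is a sketch of the bookkeeping side using SQLite;
the schema and field names are invented for illustration:

import sqlite3

# Hypothetical crawl-state database, separate from the search index.
db = sqlite3.connect("crawl_state.db")
db.execute("""CREATE TABLE IF NOT EXISTS urls (
                  url        TEXT PRIMARY KEY,
                  status     TEXT,      -- queued / fetched / error
                  http_code  INTEGER,
                  fetched_at TEXT)""")

def enqueue(url):
    # INSERT OR IGNORE gives cheap de-duplication of already-seen URLs.
    db.execute("INSERT OR IGNORE INTO urls (url, status) VALUES (?, 'queued')",
               (url,))
    db.commit()

def mark_fetched(url, http_code):
    db.execute("UPDATE urls SET status = 'fetched', http_code = ?, "
               "fetched_at = datetime('now') WHERE url = ?", (http_code, url))
    db.commit()

The fetched documents themselves are then posted to Solr separately (for
example with bin/post or the /update handler); Solr never has to store the
crawl state.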
>>>>>> 
>>>>>> wunder
>>>>>> Walter Underwood
>>>>>> wun...@wunderwood.org
>>>>>> http://observer.wunderwood.org/  (my blog)
>>>>>> 
>>>>>> 
>>>>>>> On Jun 1, 2017, at 3:01 PM, David Choi <choi.davi...@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>> Oh well, I guess it’s OK if a corporation does it, but not someone
>>>>>>> wanting to learn more about the field. I have actually written a
>>>>>>> crawler before, as well as, you know, an inverted index like the one
>>>>>>> Solr uses, but I just thought Solr’s architecture was better suited
>>>>>>> for scaling.
>>>>>>> 
>>>>>>> On Thu, Jun 1, 2017 at 4:47 PM Dave <hastings.recurs...@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> And I mean that in the context of stealing content from sites that
>>>>>>>> explicitly declare they don't want to be crawled. Robots.txt is to
>>>>>>>> be followed.
>>>>>>>> 
>>>>>>>>> On Jun 1, 2017, at 5:31 PM, David Choi <choi.davi...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> Hello,
>>>>>>>>> 
>>>>>>>>> I was wondering if anyone could guide me on how to crawl the web
>>>>>>>>> and ignore the robots.txt, since I cannot index some big sites, or
>>>>>>>>> if someone could point out how to get around it. I read somewhere
>>>>>>>>> about a protocol.plugin.check.robots setting, but that was for
>>>>>>>>> Nutch.
>>>>>>>>> 
>>>>>>>>> The way I index is
>>>>>>>>> bin/post -c gettingstarted https://en.wikipedia.org/
>>>>>>>>> 
>>>>>>>>> but I can't index the site, I'm guessing because of the robots.txt.
>>>>>>>>> I can index with
>>>>>>>>> bin/post -c gettingstarted http://lucene.apache.org/solr
>>>>>>>>> 
>>>>>>>>> which I am guessing allows it. I was also wondering how to find
>>>>>>>>> the name of the crawler bin/post uses.
>>>>>>>> 
>>>>>> 
>>>>>> 
>>> 
>>> 
>> 
