Isn't this exactly what Apache Nutch was built for?

On Thu, Jun 1, 2017 at 6:56 PM, David Choi <choi.davi...@gmail.com> wrote:
> In any case, after digging further I have found where it checks for
> robots.txt. Thanks!
>
> On Thu, Jun 1, 2017 at 5:34 PM Walter Underwood <wun...@wunderwood.org>
> wrote:
>
> > Which was exactly what I suggested.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/ (my blog)
> >
> >
> > > On Jun 1, 2017, at 3:31 PM, David Choi <choi.davi...@gmail.com> wrote:
> > >
> > > In the meantime I have found that a better solution for now is to test
> > > on a site that allows users to crawl it.
> > >
> > > On Thu, Jun 1, 2017 at 5:26 PM David Choi <choi.davi...@gmail.com>
> > > wrote:
> > >
> > >> I think you misunderstand; the argument was about stealing content.
> > >> Sorry, but I think you need to read what people write before making
> > >> bold statements.
> > >>
> > >> On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood <wun...@wunderwood.org>
> > >> wrote:
> > >>
> > >>> Let’s not get snarky right away, especially when you are wrong.
> > >>>
> > >>> Corporations do not generally ignore robots.txt. I worked on a
> > >>> commercial web spider for ten years. Occasionally, our customers did
> > >>> need to bypass portions of robots.txt. That was usually because of a
> > >>> poorly-maintained web server, or because our spider could safely
> > >>> crawl some content that would cause problems for other crawlers.
> > >>>
> > >>> If you want to learn crawling, don’t start by breaking the
> > >>> conventions of good web citizenship. Instead, start with sitemap.xml
> > >>> and crawl the preferred portions of a site.
> > >>>
> > >>> https://www.sitemaps.org/index.html
> > >>>
> > >>> If the site blocks you, find a different site to learn on.
> > >>>
> > >>> I like the looks of “Scrapy”, written in Python. I haven’t used it
> > >>> for anything big, but I’d start with that for learning.
> > >>>
> > >>> https://scrapy.org/
> > >>>
> > >>> If you want to learn on a site with a lot of content, try ours,
> > >>> chegg.com. But if your crawler gets out of hand, crawling too fast,
> > >>> we’ll block it. Any other site will do the same.
> > >>>
> > >>> I would not base the crawler directly on Solr. A crawler needs a
> > >>> dedicated database to record the URLs visited, errors, duplicates,
> > >>> etc. The output of the crawl goes to Solr. That is how we did it
> > >>> with Ultraseek (before Solr existed).
> > >>>
> > >>> wunder
> > >>> Walter Underwood
> > >>> wun...@wunderwood.org
> > >>> http://observer.wunderwood.org/ (my blog)
> > >>>
> > >>>
> > >>>> On Jun 1, 2017, at 3:01 PM, David Choi <choi.davi...@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>> Oh well, I guess it’s OK if a corporation does it but not someone
> > >>>> wanting to learn more about the field. I have actually written a
> > >>>> crawler before, as well as an inverted index like the one Solr is
> > >>>> built on, but I thought Solr's architecture was better suited for
> > >>>> scaling.
> > >>>>
> > >>>> On Thu, Jun 1, 2017 at 4:47 PM Dave <hastings.recurs...@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> And I mean that in the context of stealing content from sites that
> > >>>>> explicitly declare they don't want to be crawled. Robots.txt is to
> > >>>>> be followed.
> > >>>>>
> > >>>>>> On Jun 1, 2017, at 5:31 PM, David Choi <choi.davi...@gmail.com>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>> Hello,
> > >>>>>>
> > >>>>>> I was wondering if anyone could guide me on how to crawl the web
> > >>>>>> and ignore robots.txt, since I cannot index some big sites. Or if
> > >>>>>> someone could point out how to get around it. I read somewhere
> > >>>>>> about a protocol.plugin.check.robots setting, but that was for
> > >>>>>> Nutch.
> > >>>>>>
> > >>>>>> The way I index is
> > >>>>>> bin/post -c gettingstarted https://en.wikipedia.org/
> > >>>>>>
> > >>>>>> but I can't index that site, I'm guessing because of its
> > >>>>>> robots.txt. I can index with
> > >>>>>> bin/post -c gettingstarted http://lucene.apache.org/solr
> > >>>>>>
> > >>>>>> which I am guessing allows it. I was also wondering how to find
> > >>>>>> the name of the crawler bin/post uses.
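To make the sitemap-first route Walter suggests concrete, a minimal Scrapy
spider could look something like the sketch below. The spider name, the
sitemap URL, and the extracted fields are placeholders for illustration,
not tested against any particular site; ROBOTSTXT_OBEY and DOWNLOAD_DELAY
keep the crawl polite.

    from scrapy.spiders import SitemapSpider

    class LearningSpider(SitemapSpider):
        name = "learning_spider"
        # Placeholder sitemap; point this at a site that permits crawling.
        sitemap_urls = ["https://example.com/sitemap.xml"]
        custom_settings = {
            "ROBOTSTXT_OBEY": True,   # respect robots.txt
            "DOWNLOAD_DELAY": 1.0,    # throttle requests, stay a good citizen
        }

        def parse(self, response):
            # One record per page; a pipeline (or bin/post) can feed these to Solr.
            yield {
                "id": response.url,
                "title": (response.css("title::text").get() or "").strip(),
            }

Running it with scrapy runspider learning_spider.py -o pages.json writes the
records to a JSON file that can be posted to Solr in a separate step, which
keeps the crawl and the index decoupled.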
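And here is a rough sketch of the split Walter describes between the
crawler's own bookkeeping and Solr: a small SQLite table acts as the crawl
log (visited URLs, HTTP status, errors), and only the extracted documents go
to Solr's JSON update endpoint. The collection name, table layout, and seed
URL are assumptions; link discovery, queueing, and politeness are left out.

    import re
    import sqlite3
    import requests

    SOLR_UPDATE = "http://localhost:8983/solr/gettingstarted/update"  # assumed collection

    db = sqlite3.connect("crawl.db")
    db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, status INTEGER, error TEXT)")

    def crawl_one(url):
        """Fetch one URL, record the outcome in the crawl DB, and index the page in Solr."""
        if db.execute("SELECT 1 FROM pages WHERE url = ?", (url,)).fetchone():
            return  # already visited; the crawl DB, not Solr, tracks that
        status, error = None, None
        try:
            resp = requests.get(url, timeout=10, headers={"User-Agent": "learning-crawler"})
            status = resp.status_code
            if resp.ok:
                match = re.search(r"<title[^>]*>(.*?)</title>", resp.text, re.I | re.S)
                doc = {"id": url, "title": match.group(1).strip() if match else url}
                # Solr's /update endpoint accepts a JSON list of documents.
                requests.post(SOLR_UPDATE, json=[doc], params={"commit": "true"}, timeout=10)
        except requests.RequestException as exc:
            error = str(exc)
        db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?)", (url, status, error))
        db.commit()

    crawl_one("https://example.com/")  # placeholder seed URL

Keeping the visited/error state outside Solr means a crawl can be resumed or
audited without touching the index, which is the point Walter makes about how
Ultraseek did it.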