Oh well, I guess it's OK if a corporation does it but not someone wanting to
learn more about the field. I have actually written a crawler before, and
implemented the kind of inverted index Solr is built on; I just thought
Solr's architecture was better suited for scaling.
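
For anyone not familiar with the term, here is a rough sketch of what I mean
by an inverted index. It is only a toy Python version, not how Solr/Lucene
actually implement it:

# Toy inverted index: map each term to the set of document ids containing it.
# Lucene/Solr do far more (analyzers, positional postings, compression),
# but this is the basic idea.
from collections import defaultdict

docs = {
    1: "Solr is built on Lucene",
    2: "Lucene maintains an inverted index",
    3: "a crawler feeds documents into the index",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():  # naive whitespace tokenizer
        index[term].add(doc_id)

print(sorted(index["index"]))  # -> [2, 3]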

On Thu, Jun 1, 2017 at 4:47 PM Dave <hastings.recurs...@gmail.com> wrote:

> And I mean that in the context of stealing content from sites that
> explicitly declare they don't want to be crawled. Robots.txt is to be
> followed.
>
> > On Jun 1, 2017, at 5:31 PM, David Choi <choi.davi...@gmail.com> wrote:
> >
> > Hello,
> >
> >   I was wondering if anyone could guide me on how to crawl the web and
> > ignore the robots.txt, since I cannot index some big sites, or if someone
> > could point out how to get around it. I read somewhere about a
> > protocol.plugin.check.robots setting, but that was for Nutch.
> >
> > The way I index is
> > bin/post -c gettingstarted https://en.wikipedia.org/
> >
> > but I can't index that site, I'm guessing because of its robots.txt.
> > I can index with
> > bin/post -c gettingstarted http://lucene.apache.org/solr
> >
> > which I am guessing allows it. I was also wondering how to find the name
> > of the crawler bin/post uses.
>
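
For what it's worth, this is roughly how I check what a site's robots.txt
permits before fetching, using Python's standard urllib.robotparser. The
"ExampleBot" agent string is just a placeholder; I still don't know what user
agent bin/post actually identifies itself as:

# Check whether a given user agent may fetch a URL under the site's robots.txt.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt")
rp.read()

# "ExampleBot" is only a placeholder user agent name.
print(rp.can_fetch("ExampleBot", "https://en.wikipedia.org/wiki/Apache_Solr"))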
