In the meantime, I have found that a better solution for the moment is to
test on a site that allows users to crawl it.
On Thu, Jun 1, 2017 at 5:26 PM David Choi wrote:
> I think you misunderstand; the argument was about stealing content. Sorry,
> but I think you need to read what people [...]
>
> [...] the URLs visited, errors, duplicates, etc. The output of
> the crawl goes to Solr. That is how we did it with Ultraseek (before Solr
> existed).
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
>
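To make the crawler-plus-Solr setup described above a little more concrete,
here is a minimal Python sketch of a crawl loop that logs visited URLs,
errors, and duplicates and posts each fetched page to Solr's JSON update
handler. The Solr URL, the collection name ("webpages"), and the field names
are placeholders for illustration only.

import json
import urllib.error
import urllib.request

# Placeholder Solr endpoint and collection name.
SOLR_UPDATE = "http://localhost:8983/solr/webpages/update/json/docs?commit=true"

def index_into_solr(doc):
    # Send one JSON document to Solr's update handler.
    req = urllib.request.Request(
        SOLR_UPDATE,
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).read()

def crawl(seed_urls):
    visited = set()
    for url in seed_urls:
        if url in visited:
            print("duplicate:", url)      # log duplicates
            continue
        visited.add(url)
        try:
            html = urllib.request.urlopen(url, timeout=10).read()
        except urllib.error.URLError as err:
            print("error:", url, err)     # log fetch errors
            continue
        print("visited:", url)            # log visited URLs
        index_into_solr({"id": url, "url": url,
                         "content": html.decode("utf-8", "replace")})

crawl(["https://example.com/"])

Link extraction, politeness delays, and proper logging to a file are left out
to keep the sketch short.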
On Jun 1, 2017, [...] PM Dave wrote:
> And I mean that in the context of stealing content from sites that
> explicitly declare they don't want to be crawled. Robots.txt is to be
> followed.
>
> > On Jun 1, 2017, at 5:31 PM, David Choi wrote:
> >
> > Hello,
> >
> > [...]
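On the robots.txt point above: the check can be done with Python's standard
library before any page is fetched. A small sketch, with the user agent
string and URL as placeholders:

from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

AGENT = "MyCrawler"   # placeholder user agent

def allowed(url):
    # Fetch the site's robots.txt and ask whether this agent may crawl the URL.
    parts = urlsplit(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(AGENT, url)

print(allowed("https://example.com/some/page"))

If can_fetch() returns False, the crawler simply skips that URL.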
Hello,
I was wondering if anyone could guide me on how to crawl the web and
ignore robots.txt, since I cannot index some big sites. Or could someone
point out how to get around it? I read somewhere about a
protocol.plugin.check.robots
setting, but that was for Nutch.
The way I index is
bin/post -c g
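For what it's worth, one way to keep a bin/post-based workflow while staying
on sites that permit crawling is to have a small script save fetched pages
into a local directory and then point bin/post at that directory (something
like bin/post -c <collection> pages/). A rough Python sketch, with the
directory name and URLs as placeholders:

import os
import urllib.request
from urllib.parse import quote

OUT_DIR = "pages"   # placeholder directory for bin/post to index later

def save_page(url):
    os.makedirs(OUT_DIR, exist_ok=True)
    html = urllib.request.urlopen(url, timeout=10).read()
    # Derive a filesystem-safe file name from the URL.
    name = quote(url, safe="") + ".html"
    with open(os.path.join(OUT_DIR, name), "wb") as f:
        f.write(html)

for u in ["https://example.com/", "https://example.com/about"]:
    save_page(u)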