On 02/06/2017 00:56, Doug Turnbull wrote:
Scrapy is fantastic, and I use it to scrape search results pages for clients to
take quality snapshots for relevance work.

+1 for Scrapy; it was built by a team at Mydeco.com while we were building their search backend and has gone from strength to strength since.

Cheers

Charlie

Ignoring robots.txt sometimes comes up legitimately: a staging site might be
telling Google not to crawl it, but nobody minds a developer crawling it for
internal purposes.
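If that's the situation, Scrapy makes it a single settings toggle; a minimal
sketch, and only for a staging site you control:

    # settings.py -- assumption: this is your own staging site, not someone else's
    ROBOTSTXT_OBEY = False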

Doug
On Thu, Jun 1, 2017 at 6:34 PM Walter Underwood <wun...@wunderwood.org>
wrote:

Which was exactly what I suggested.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Jun 1, 2017, at 3:31 PM, David Choi <choi.davi...@gmail.com> wrote:

In the meantime, I have found that a better solution for now is to test on a
site that allows users to crawl it.

On Thu, Jun 1, 2017 at 5:26 PM David Choi <choi.davi...@gmail.com>
wrote:

I think you misunderstand; the argument was about stealing content. Sorry,
but I think you need to read what people write before making bold
statements.

On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood <wun...@wunderwood.org>
wrote:

Let’s not get snarky right away, especially when you are wrong.

Corporations do not generally ignore robots.txt. I worked on a commercial
web spider for ten years. Occasionally, our customers did need to bypass
portions of robots.txt. That was usually because of a poorly-maintained web
server, or because our spider could safely crawl some content that would
cause problems for other crawlers.

If you want to learn crawling, don’t start by breaking the conventions of
good web citizenship. Instead, start with sitemap.xml and crawl the
preferred portions of a site.

https://www.sitemaps.org/index.html
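As a rough sketch of that approach in Python (the sitemap URL is a
placeholder), you can pull the preferred URLs straight out of sitemap.xml
with the standard library:

    import urllib.request
    import xml.etree.ElementTree as ET

    SITEMAP = "https://www.example.com/sitemap.xml"  # placeholder URL
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    # Fetch the sitemap and list the URLs the site asks crawlers to prefer.
    with urllib.request.urlopen(SITEMAP) as resp:
        tree = ET.parse(resp)

    for loc in tree.findall(".//sm:url/sm:loc", NS):
        print(loc.text.strip())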

If the site blocks you, find a different site to learn on.

I like the looks of “Scrapy”, written in Python. I haven’t used it for
anything big, but I’d start with that for learning.

https://scrapy.org/
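A minimal spider sketch for learning (the spider name, start URL, and
selectors are placeholders, and the delay is there to stay polite):

    import scrapy

    class LearningSpider(scrapy.Spider):
        name = "learning"
        start_urls = ["https://example.com/"]      # placeholder site
        custom_settings = {"DOWNLOAD_DELAY": 1.0}  # don't hammer the server

        def parse(self, response):
            # Emit one record per page, then follow the links on it.
            yield {
                "url": response.url,
                "title": response.css("title::text").get(),
            }
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)

Run it with something like: scrapy runspider learning_spider.py -o pages.json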

If you want to learn on a site with a lot of content, try ours, chegg.com.
But if your crawler gets out of hand, crawling too fast, we’ll block it.
Any other site will do the same.

I would not base the crawler directly on Solr. A crawler needs a dedicated
database to record the URLs visited, errors, duplicates, etc. The output of
the crawl goes to Solr. That is how we did it with Ultraseek (before Solr
existed).
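A rough sketch of that split, assuming a local Solr core named
"gettingstarted" and a throwaway SQLite file for the crawl bookkeeping:

    import sqlite3
    import requests

    SOLR_UPDATE = "http://localhost:8983/solr/gettingstarted/update?commit=true"

    # The crawl database: one row per URL with fetch status and any error.
    db = sqlite3.connect("crawl.db")
    db.execute("""CREATE TABLE IF NOT EXISTS urls
                  (url TEXT PRIMARY KEY, status INTEGER, error TEXT)""")

    def record(url, status, error=None):
        db.execute("INSERT OR REPLACE INTO urls VALUES (?, ?, ?)",
                   (url, status, error))
        db.commit()

    def already_seen(url):
        return db.execute("SELECT 1 FROM urls WHERE url = ?",
                          (url,)).fetchone() is not None

    def index(docs):
        # Only the extracted documents go to Solr; bookkeeping stays in SQLite.
        requests.post(SOLR_UPDATE, json=docs).raise_for_status()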

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Jun 1, 2017, at 3:01 PM, David Choi <choi.davi...@gmail.com>
wrote:

Oh well, I guess it's OK if a corporation does it but not someone wanting to
learn more about the field. I have actually written a crawler before, as
well as, you know, the inverted index that underlies how Solr works; I just
thought its architecture was better suited for scaling.

On Thu, Jun 1, 2017 at 4:47 PM Dave <hastings.recurs...@gmail.com>
wrote:

And I mean that in the context of stealing content from sites that
explicitly declare they don't want to be crawled. Robots.txt is to be
followed.
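Checking that before you fetch is a quick sketch with Python's standard
library (the user agent string here is just a placeholder):

    from urllib import robotparser

    rp = robotparser.RobotFileParser("https://en.wikipedia.org/robots.txt")
    rp.read()
    # True if this agent may fetch the URL according to the site's robots.txt.
    print(rp.can_fetch("MyLearningBot", "https://en.wikipedia.org/"))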

On Jun 1, 2017, at 5:31 PM, David Choi <choi.davi...@gmail.com>
wrote:

Hello,

I was wondering if anyone could guide me on how to crawl the web and
ignore robots.txt, since I cannot index some big sites. Or if someone
could point out how to get around it. I read somewhere about a
protocol.plugin.check.robots setting, but that was for Nutch.

The way I index is
bin/post -c gettingstarted https://en.wikipedia.org/

but I can't index that site, I'm guessing because of the robots.txt.
I can index with
bin/post -c gettingstarted http://lucene.apache.org/solr

which I am guessing allows it. I was also wondering how to find the name of
the crawler bin/post uses.










--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
