On 02/06/2017 00:56, Doug Turnbull wrote:
Scrapy is fantastic, and I use it to scrape search results pages for clients to
take quality snapshots for relevance work.

+1 for Scrapy; it was built by a team at Mydeco.com while we were building their search backend and has gone from strength to strength since.

Cheers

Charlie

Ignoring robots.txt sometimes comes up legitimately: a staging site might be
telling Google not to crawl it, but nobody minds a developer crawling it for
internal purposes.
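If that's the situation, Scrapy makes it a single settings toggle; a minimal
sketch, and only for a staging site you control:

    # settings.py -- assumption: this is your own staging site, not someone else's
    ROBOTSTXT_OBEY = False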

Doug
On Thu, Jun 1, 2017 at 6:34 PM Walter Underwood <wun...@wunderwood.org>
wrote:

Which was exactly what I suggested.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Jun 1, 2017, at 3:31 PM, David Choi <choi.davi...@gmail.com> wrote:

In the meantime, I have found that a better solution for now is to test on a
site that allows users to crawl it.

On Thu, Jun 1, 2017 at 5:26 PM David Choi <choi.davi...@gmail.com>
wrote:

I think you misunderstand; the argument was about stealing content. Sorry,
but I think you need to read what people write before making bold
statements.

On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood <wun...@wunderwood.org>
wrote:

Let’s not get snarky right away, especially when you are wrong.

Corporations do not generally ignore robots.txt. I worked on a commercial
web spider for ten years. Occasionally, our customers did need to bypass
portions of robots.txt. That was usually because of a poorly-maintained web
server, or because our spider could safely crawl some content that would
cause problems for other crawlers.

If you want to learn crawling, don’t start by breaking the conventions of
good web citizenship. Instead, start with sitemap.xml and crawl the
preferred portions of a site.

https://www.sitemaps.org/index.html
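As a rough sketch of that approach in Python (the sitemap URL is a
placeholder), you can pull the preferred URLs straight out of sitemap.xml
with the standard library:

    import urllib.request
    import xml.etree.ElementTree as ET

    SITEMAP = "https://www.example.com/sitemap.xml"  # placeholder URL
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    # Fetch the sitemap and list the URLs the site asks crawlers to prefer.
    with urllib.request.urlopen(SITEMAP) as resp:
        tree = ET.parse(resp)

    for loc in tree.findall(".//sm:url/sm:loc", NS):
        print(loc.text.strip())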

If the site blocks you, find a different site to learn on.

I like the looks of “Scrapy”, written in Python. I haven’t used it for
anything big, but I’d start with that for learning.

https://scrapy.org/
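A minimal spider sketch for learning (the spider name, start URL, and
selectors are placeholders, and the delay is there to stay polite):

    import scrapy

    class LearningSpider(scrapy.Spider):
        name = "learning"
        start_urls = ["https://example.com/"]      # placeholder site
        custom_settings = {"DOWNLOAD_DELAY": 1.0}  # don't hammer the server

        def parse(self, response):
            # Emit one record per page, then follow the links on it.
            yield {
                "url": response.url,
                "title": response.css("title::text").get(),
            }
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)

Run it with something like: scrapy runspider learning_spider.py -o pages.json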

If you want to learn on a site with a lot of content, try ours, chegg.com.
But if your crawler gets out of hand, crawling too fast, we’ll block it.
Any other site will do the same.

I would not base the crawler directly on Solr. A crawler needs a dedicated
database to record the URLs visited, errors, duplicates, etc. The output of
the crawl goes to Solr. That is how we did it with Ultraseek (before Solr
existed).
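A rough sketch of that split, assuming a local Solr core named
"gettingstarted" and a throwaway SQLite file for the crawl bookkeeping:

    import sqlite3
    import requests

    SOLR_UPDATE = "http://localhost:8983/solr/gettingstarted/update?commit=true"

    # The crawl database: one row per URL with fetch status and any error.
    db = sqlite3.connect("crawl.db")
    db.execute("""CREATE TABLE IF NOT EXISTS urls
                  (url TEXT PRIMARY KEY, status INTEGER, error TEXT)""")

    def record(url, status, error=None):
        db.execute("INSERT OR REPLACE INTO urls VALUES (?, ?, ?)",
                   (url, status, error))
        db.commit()

    def already_seen(url):
        return db.execute("SELECT 1 FROM urls WHERE url = ?",
                          (url,)).fetchone() is not None

    def index(docs):
        # Only the extracted documents go to Solr; bookkeeping stays in SQLite.
        requests.post(SOLR_UPDATE, json=docs).raise_for_status()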

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Jun 1, 2017, at 3:01 PM, David Choi <choi.davi...@gmail.com>
wrote:

Oh well, I guess it's OK if a corporation does it but not someone wanting to
learn more about the field. I have actually written a crawler before, as
well as, you know, the inverted index that underlies how Solr works; I just
thought its architecture was better suited for scaling.

On Thu, Jun 1, 2017 at 4:47 PM Dave <hastings.recurs...@gmail.com>
wrote:

And I mean that in the context of stealing content from sites that
explicitly declare they don't want to be crawled. Robots.txt is to be
followed.
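Checking that before you fetch is a quick sketch with Python's standard
library (the user agent string here is just a placeholder):

    from urllib import robotparser

    rp = robotparser.RobotFileParser("https://en.wikipedia.org/robots.txt")
    rp.read()
    # True if this agent may fetch the URL according to the site's robots.txt.
    print(rp.can_fetch("MyLearningBot", "https://en.wikipedia.org/"))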

On Jun 1, 2017, at 5:31 PM, David Choi <choi.davi...@gmail.com>
wrote:

Hello,

I was wondering if anyone could guide me on how to crawl the web and
ignore robots.txt, since I cannot index some big sites. Or if someone
could point out how to get around it. I read somewhere about a
protocol.plugin.check.robots setting, but that was for Nutch.

The way I index is
bin/post -c gettingstarted https://en.wikipedia.org/

but I can't index that site, I'm guessing because of the robots.txt.
I can index with
bin/post -c gettingstarted http://lucene.apache.org/solr

which I am guessing allows it. I was also wondering how to find the name of
the crawler bin/post uses.










--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
