Re: Solr Web Crawler - Robots.txt

2020-03-01 Thread Jan Høydahl
bin/post is not a crawler, just a small Java class that collects links from HTML pages using SolrCell. It respects very basic robots.txt rules but is far from the full spec. It is just a local prototyping tool, not meant for production use. Jan Høydahl
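For illustration, a minimal sketch in Python of that same "very basic" robots.txt check, using the standard library's urllib.robotparser (this is just the idea, not bin/post's actual Java code):

    from urllib.robotparser import RobotFileParser

    # Fetch and parse the site's robots.txt
    rp = RobotFileParser("https://example.com/robots.txt")
    rp.read()

    # Ask whether a generic user agent may fetch a given URL
    if rp.can_fetch("*", "https://example.com/some/page.html"):
        print("allowed to crawl")
    else:
        print("disallowed by robots.txt")

This honors only Allow/Disallow rules per user agent; a fuller implementation would also handle things like crawl-delay and sitemap directives.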

Re: Solr Web Crawler - Robots.txt

2020-03-01 Thread Mutuhprasannth
Have you found out the name of the crawler used by Solr bin/post, or how to ignore robots.txt in the Solr post tool?

Re: Solr Web Crawler - Robots.txt

2020-03-01 Thread Mutuhprasannth
Hi David Choi, have you found out the name of the crawler used by Solr bin/post?

Re: Solr Web Crawler - Robots.txt

2017-06-02 Thread Charlie Hull
On 02/06/2017 00:56, Doug Turnbull wrote: Scrapy is fantastic and I use it to scrape search results pages for clients to take quality snapshots for relevance work. +1 for Scrapy; it was built by a team at Mydeco.com while we were building their search backend and has gone from strength to strength.

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Walter Underwood
Nutch was built for that, but it is a pain to use. I’m still sad that I couldn’t get Mike Lynch to open source Ultraseek. So easy and much more powerful than Nutch. Ignoring robots.txt is often a bad idea. You may get into a REST API or into a calendar that generates an unending number of valid URLs.
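For what it's worth, a common guard against such crawler traps is a hard cap on crawl depth and on pages per host. A rough Python sketch; the names and limits here are illustrative, not from any particular crawler:

    from collections import deque
    from urllib.parse import urlparse

    MAX_DEPTH = 3
    MAX_PAGES_PER_HOST = 1000

    def crawl(seed, fetch_links):
        # fetch_links(url) -> list of outgoing URLs, supplied by the caller
        seen = set()
        pages_per_host = {}
        queue = deque([(seed, 0)])
        while queue:
            url, depth = queue.popleft()
            if url in seen or depth > MAX_DEPTH:
                continue
            host = urlparse(url).netloc
            if pages_per_host.get(host, 0) >= MAX_PAGES_PER_HOST:
                continue  # likely an unending URL space; stop expanding this host
            seen.add(url)
            pages_per_host[host] = pages_per_host.get(host, 0) + 1
            for link in fetch_links(url):
                queue.append((link, depth + 1))
        return seen

Without some budget like this, a calendar's "next month" link alone will keep a crawler busy forever.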

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Mike Drob
Isn't this exactly what Apache Nutch was built for?

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
In any case, after digging further I have found where it checks for robots.txt. Thanks!

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Doug Turnbull
Scrapy is fantastic and I use it to scrape search results pages for clients to take quality snapshots for relevance work. Ignoring robots.txt sometimes legitimately comes up because a staging site might be telling Google not to crawl, but nobody cares about a developer crawling it for internal purposes. Doug
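For anyone curious, a minimal Scrapy spider along those lines might look like this; the URL and CSS selectors are placeholders, while ROBOTSTXT_OBEY is a real Scrapy setting (on by default in new projects):

    import scrapy

    class StagingSnapshotSpider(scrapy.Spider):
        name = "staging_snapshot"
        start_urls = ["http://staging.example.com/search?q=test"]
        # Only sensible on a site you own, e.g. a staging box that
        # blocks Google but is fine with internal crawling.
        custom_settings = {"ROBOTSTXT_OBEY": False}

        def parse(self, response):
            # Capture each result's title and link for a relevance snapshot
            for result in response.css("div.result"):
                yield {
                    "title": result.css("h3::text").get(),
                    "url": result.css("a::attr(href)").get(),
                }

You can run a one-file spider like this with: scrapy runspider spider.py -o snapshot.json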

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Walter Underwood
Which was exactly what I suggested. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog)

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
In the meantime I have found that a better solution for the moment is to test on a site that allows users to crawl it.

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
I think you misunderstand; the argument was about stealing content. Sorry, but I think you need to read what people write before making bold statements.

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Walter Underwood
Let’s not get snarky right away, especially when you are wrong. Corporations do not generally ignore robots.txt. I worked on a commercial web spider for ten years. Occasionally, our customers did need to bypass portions of robots.txt. That was usually because of a poorly-maintained web server.

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
Oh well, I guess it’s OK if a corporation does it but not someone wanting to learn more about the field. I have actually written a crawler before, as well as the kind of inverted index that underpins how Solr works, but I just thought Solr’s architecture was better suited for scaling.

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Dave
And I mean that in the context of stealing content from sites that explicitly declare they don't want to be crawled. Robots.txt is to be followed.

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Vivek Pathak
I can help. We can chat in some freenode chatroom in an hour or so. Let me know where you hang out. Thanks, Vivek

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Dave
If you are not capable of even writing your own indexing code, let alone a crawler, I would prefer that you just stop now. No one is going to help you with this request, at least I'd hope not.

Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
Hello, I was wondering if anyone could guide me on how to crawl the web and ignore robots.txt, since I cannot index some big sites. Or if someone could point out how to get around it. I read somewhere about a protocol.plugin.check.robots property, but that was for Nutch. The way I index is bin/post -c g
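For context, a toy version of the fetch-and-post loop such a setup performs, sketched in Python against Solr's JSON update handler; the collection name, seed URL, and field names are placeholders, and this is not how bin/post works internally:

    import requests
    from urllib.robotparser import RobotFileParser

    SOLR_UPDATE = "http://localhost:8983/solr/mycollection/update/json/docs"
    SEED = "https://example.com/"

    # Check robots.txt before fetching anything
    rp = RobotFileParser(SEED + "robots.txt")
    rp.read()

    if rp.can_fetch("*", SEED):
        page = requests.get(SEED, timeout=10)
        doc = {"id": SEED, "content_txt": page.text}
        # commit per request only for demo purposes; batch commits in real use
        requests.post(SOLR_UPDATE, json=doc, params={"commit": "true"}, timeout=10)
    else:
        print("robots.txt disallows fetching", SEED)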