Re: Solr response error 403 when I try to index medium.com articles

Jack Krupansky Wed, 30 Mar 2016 13:13:44 -0700

You could use the curl command to read a URL on Medium.com. That would let
you examine and control the headers to experiment.
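
The same experiment can also be scripted. The sketch below uses only the Python standard library and compares the status code returned for different User-Agent headers. The agent strings and the names `build_request` and `probe` are illustrative assumptions; nothing here is known about which headers Medium actually accepts or rejects.

```python
import urllib.error
import urllib.request

# URL taken from the original question.
ARTICLE_URL = (
    "https://medium.com/@producthunt/"
    "10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1"
)

# Hypothetical User-Agent values to compare. None means "send the
# library default". These are examples only, not known-good values.
CANDIDATE_AGENTS = {
    "library-default": None,
    "browser-like": "Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0",
}


def build_request(url, user_agent=None):
    """Build a GET request, optionally overriding the User-Agent header."""
    headers = {}
    if user_agent is not None:
        headers["User-Agent"] = user_agent
    return urllib.request.Request(url, headers=headers)


def probe(url, user_agent=None, timeout=10):
    """Return the HTTP status code, or None if the connection itself failed."""
    request = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (including 403) arrive here.
        return err.code
    except OSError:
        # DNS failure, timeout, refused connection, and similar.
        return None


if __name__ == "__main__":
    for label, agent in CANDIDATE_AGENTS.items():
        print(label, probe(ARTICLE_URL, agent))
```

If the status differs between the two agents, the User-Agent header is a likely factor; if both return 403, the cause is probably something else, such as the source IP address or the request rate.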


Google is able to index Medium.

Check the URL and make sure it's not on one of the paths disallowed by
medium.com/robots.txt (the one you gave seems fine):

User-Agent: *
Disallow: /_/
Disallow: /m/
Disallow: /me/
Disallow: /@me$
Disallow: /@me/
Disallow: /*/*/edit
Sitemap: https://medium.com/sitemap/sitemap.xml
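
A rough way to run that check in code, assuming the rules quoted above are current, is Python's standard `urllib.robotparser`. Note that this parser does plain prefix matching and does not interpret the `*` and `$` wildcards in the last few rules, so treat the result as an approximation. The `/me/settings` path is a made-up example of a disallowed URL.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt excerpt quoted in this message.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /_/
Disallow: /m/
Disallow: /me/
Disallow: /@me$
Disallow: /@me/
Disallow: /*/*/edit
Sitemap: https://medium.com/sitemap/sitemap.xml
"""


def is_allowed(url, user_agent="*"):
    """Check a URL against the quoted rules (prefix matching only)."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    article = (
        "https://medium.com/@producthunt/"
        "10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1"
    )
    print(is_allowed(article))
    # Hypothetical URL under a disallowed prefix, for comparison.
    print(is_allowed("https://medium.com/me/settings"))
```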



-- Jack Krupansky

On Wed, Mar 30, 2016 at 1:05 PM, Chris Hostetter <hossman_luc...@fucit.org>
wrote:

>
> 403 means "forbidden"
>
> Something about the request Solr is sending -- or something about the IP
> address Solr is connecting from when talking to medium.com -- is causing
> the medium.com web server to reject the request.
>
> This is something that servers may choose to do if they detect (via
> headers, or missing headers, or reverse ip lookup, or other
> distinctive nuances of how the connection was made) that the
> client connecting to their server isn't a "human browser" (ie: firefox,
> chrome, safari) and is a Robot that they don't want to cooperate with (ie:
> they might be happy to serve their pages to the google-bot crawler, but not
> to some third-party they've never heard of).
>
> The specifics of how/why you might get a 403 for any given url are hard to
> debug -- it might literally depend on how many requests you've sent to that
> domain in the past X hours.
>
> In general Solr's ContentStream indexing from remote hosts isn't intended
> to be a super robust solution for crawling arbitrary websites on the web
> -- if that's your goal, then i would suggest you look into running a more
> robust crawler (nutch, droids, Lucidworks Fusion, etc...) that has more
> features and debugging options (notably: rate limiting) and use that code
> to fetch the content, then push it to Solr.
>
>
> : Date: Tue, 29 Mar 2016 20:54:52 -0300
> : From: Jeferson dos Anjos <jefersonan...@packdocs.com>
> : Reply-To: solr-user@lucene.apache.org
> : To: solr-user@lucene.apache.org
> : Subject: Solr response error 403 when I try to index medium.com articles
> :
> : I'm trying to index some Medium pages, but I get a 403 error. I
> : believe it is because Medium does not accept Solr's user-agent. Has
> : anyone experienced this? Do you know how to change it?
> :
> : I appreciate any help
> :
> : <lst name="responseHeader">
> : <int name="status">500</int>
> : <int name="QTime">94</int>
> : </lst>
> : <lst name="error">
> : <str name="msg">
> : Server returned HTTP response code: 403 for URL: https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1
> : </str>
> : <str name="trace">
> : java.io.IOException: Server returned HTTP response code: 403 for URL: https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1
> :   at sun.reflect.GeneratedConstructorAccessor314.newInstance(Unknown Source)
> :   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
> :   at java.lang.reflect.Constructor.newInstance(Unknown Source)
> :   at sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
> :   at sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
> :   at java.security.AccessController.doPrivileged(Native Method)
> :   at sun.net.www.protocol.http.HttpURLConnection.getChainedException(Unknown Source)
> :   at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
> :   at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
> :   at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
> :   at org.apache.solr.common.util.ContentStreamBase$URLStream.getStream(ContentStreamBase.java:87)
> :   at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:158)
> :   at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> :   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
> :   at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:291)
> :   at org.apache.solr.core.SolrCore.execute(SolrCore.java:2006)
> :   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
> :   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)
> :   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:204)
> :   at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
> :   at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
> :   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
> :   at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
> :   at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
> :   at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
> :   at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
> :   at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
> :   at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
> :   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
> :   at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
> :   at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
> :   at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
> :   at org.eclipse.jetty.server.Server.handle(Server.java:368)
> :   at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
> :   at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
> :   at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
> :   at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
> :   at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
> :   at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
> :   at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
> :   at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
> :   at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> :   at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> :   at java.lang.Thread.run(Unknown Source)
> : Caused by: java.io.IOException: Server returned HTTP response code: 403 for URL: https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1
> :   at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
> :   at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
> :   at sun.net.www.protocol.http.HttpURLConnection.getHeaderField(Unknown Source)
> :   at java.net.URLConnection.getContentType(Unknown Source)
> :   at sun.net.www.protocol.https.HttpsURLConnectionImpl.getContentType(Unknown Source)
> :   at org.apache.solr.common.util.ContentStreamBase$URLStream.getStream(ContentStreamBase.java:84)
> :   ... 33 more
> : </str>
> : <int name="code">500</int>
> : </lst>
> : </response>
> :
> :
> : Jeferson M. dos Anjos
> : CEO of Packdocs
> : P.S.: Keep your files alive with Packdocs (www.packdocs.com)
> :
>
> -Hoss
> http://www.lucidworks.com/
>
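
The approach Hoss describes (fetch the content with a separate, rate-limited client, then push it to Solr) can be sketched roughly as follows. This is a minimal illustration and not a replacement for a real crawler: the class and function names, the `id` and `content` field names, and the Solr update URL are all assumptions, and the raw HTML is posted as-is rather than passed through the text extraction that Solr's extracting handler would perform.

```python
import json
import time
import urllib.request


class MinIntervalLimiter:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the limiter can be tested
        self._sleep = sleep
        self._last = None      # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def build_solr_update(solr_update_url, doc_id, text):
    """Build a JSON update request; field names are assumed, not from the thread."""
    body = json.dumps([{"id": doc_id, "content": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_update_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def crawl(urls, solr_update_url, limiter):
    """Fetch each URL no faster than the limiter allows, then post it to Solr."""
    for url in urls:
        limiter.wait()
        with urllib.request.urlopen(url, timeout=10) as response:
            text = response.read().decode("utf-8", errors="replace")
        update = build_solr_update(solr_update_url, url, text)
        with urllib.request.urlopen(update, timeout=10):
            pass  # a 2xx response means Solr accepted the document
```

A call might look like `crawl(urls, "http://localhost:8983/solr/articles/update?commit=true", MinIntervalLimiter(5.0))`, where the core name `articles` and the five-second interval are placeholders to adjust for your setup.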
