Re: Solr response error 403 when I try to index medium.com articles

Jeferson dos Anjos Wed, 30 Mar 2016 17:38:54 -0700

Great, but is there any way to change the request header Solr sends, so I can set the User-Agent?

2016-03-30 17:13 GMT-03:00 Jack Krupansky <[email protected]>:


> You could use the curl command to read a URL on Medium.com. That would let
> you examine and control the headers to experiment.
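The same experiment can be sketched in Python with only the standard library. This is a minimal sketch, not part of the original advice: the function names, the timeout, and both User-Agent strings are assumptions for illustration ("Java/1.8.0" is only a stand-in for the kind of default agent a Java HTTP client sends).

```python
import urllib.error
import urllib.request

# Hypothetical agent strings for experimentation; neither comes from the thread.
JAVA_LIKE_UA = "Java/1.8.0"
BROWSER_LIKE_UA = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"


def build_request(url, user_agent):
    """Return a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(url, user_agent, timeout=10):
    """Send the request and return the HTTP status code, including 4xx/5xx."""
    request = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses; the code is what we want to see.
        return err.code


def compare_user_agents(url, user_agents=(JAVA_LIKE_UA, BROWSER_LIKE_UA)):
    """Map each User-Agent to the status code the server returns for it."""
    return {agent: fetch_status(url, agent) for agent in user_agents}
```

Calling compare_user_agents() with the article URL would show whether the status code changes with the User-Agent alone. If both agents get a 403, the cause is more likely the IP address or request rate than the header.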
>
> Google is able to index Medium.
>
> Check the URL and make sure it's not on one of the paths disallowed by
> medium.com/robots.txt (the one you gave seems fine):
>
> User-Agent: *
> Disallow: /_/
> Disallow: /m/
> Disallow: /me/
> Disallow: /@me$
> Disallow: /@me/
> Disallow: /*/*/edit
> Sitemap: https://medium.com/sitemap/sitemap.xml
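The check described above can be sketched as a small Python script. This is a simplified illustration under stated assumptions: it handles only the Disallow lines quoted above, uses the common wildcard conventions ("*" matches any run of characters, a trailing "$" anchors the end of the path), and ignores Allow rules and per-agent groups. The function names are hypothetical.

```python
import re
from urllib.parse import urlparse

# Disallow rules copied from the robots.txt excerpt quoted above.
DISALLOW_RULES = ["/_/", "/m/", "/me/", "/@me$", "/@me/", "/*/*/edit"]


def rule_to_regex(rule):
    """Translate one Disallow pattern into a compiled regular expression."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape literal parts and let each '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(url, rules=DISALLOW_RULES):
    """Return True if the URL's path matches any Disallow pattern."""
    path = urlparse(url).path or "/"
    # re.match anchors at the start of the path, so rules act as prefixes.
    return any(rule_to_regex(rule).match(path) for rule in rules)
```

Under these rules the article URL from the original report is not disallowed, which agrees with the remark that it "seems fine".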
>
>
>
> -- Jack Krupansky
>
> On Wed, Mar 30, 2016 at 1:05 PM, Chris Hostetter <[email protected]> wrote:
>
> >
> > 403 means "forbidden"
> >
> > Something about the request Solr is sending -- or something about the IP
> > address Solr is connecting from when talking to medium.com -- is causing
> > the medium.com web server to reject the request.
> >
> > This is something that servers may choose to do if they detect (via
> > headers, or missing headers, or reverse IP lookup, or other
> > distinctive nuances of how the connection was made) that the
> > client connecting to their server isn't a "human browser" (e.g. Firefox,
> > Chrome, Safari) and is a robot that they don't want to cooperate with
> > (i.e. they might be happy to serve their pages to the Googlebot crawler,
> > but not to some third party they've never heard of).
> >
> > The specifics of how/why you might get a 403 for any given URL are hard
> > to debug -- it might literally depend on how many requests you've sent
> > to that domain in the past X hours.
> >
> > In general Solr's ContentStream indexing from remote hosts isn't intended
> > to be a super robust solution for crawling arbitrary websites on the web
> > -- if that's your goal, then I would suggest you look into running a more
> > robust crawler (Nutch, Droids, Lucidworks Fusion, etc.) that has more
> > features and debugging options (notably: rate limiting) and use that code
> > to fetch the content, then push it to Solr.
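The "fetch with something else, then push to Solr" pattern with rate limiting can be sketched roughly as follows. This is a minimal sketch under assumptions, not a replacement for a real crawler: the class and function names, the "content" field name, and the idea of passing `fetch` and `push` in as callables are all hypothetical. The only Solr-specific assumption is that the update handler accepts a JSON array of documents.

```python
import json
import time


class RateLimiter:
    """Enforce a minimum interval, in seconds, between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def solr_update_body(docs):
    """Serialize documents as a JSON array for Solr's update handler."""
    return json.dumps(list(docs)).encode("utf-8")


def crawl_and_index(urls, fetch, push, limiter):
    """Fetch each URL politely, then hand one batch of documents to push()."""
    docs = []
    for url in urls:
        limiter.wait()  # space out requests to the remote site
        # "content" is a hypothetical field name; use whatever the schema defines.
        docs.append({"id": url, "content": fetch(url)})
    push(solr_update_body(docs))
    return len(docs)
```

Here `fetch` would be whatever retrieves the page text (with its own headers and error handling), and `push` would POST the body to the collection's update endpoint with a JSON content type. Keeping both as parameters makes the rate-limiting logic testable without network access.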
> >
> >
> > : Date: Tue, 29 Mar 2016 20:54:52 -0300
> > : From: Jeferson dos Anjos <[email protected]>
> > : Reply-To: [email protected]
> > : To: [email protected]
> > : Subject: Solr response error 403 when I try to index medium.com articles
> > :
> > : I'm trying to index some pages from Medium, but I get a 403 error. I
> > : believe it is because Medium does not accept Solr's User-Agent. Has
> > : anyone experienced this? Do you know how to change it?
> > :
> > : I appreciate any help
> > :
> > : <lst name="responseHeader">
> > : <int name="status">500</int>
> > : <int name="QTime">94</int>
> > : </lst>
> > : <lst name="error">
> > : <str name="msg">
> > : Server returned HTTP response code: 403 for URL: https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1
> > : </str>
> > : <str name="trace">
> > : java.io.IOException: Server returned HTTP response code: 403 for URL: https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1
> > :   at sun.reflect.GeneratedConstructorAccessor314.newInstance(Unknown Source)
> > :   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
> > :   at java.lang.reflect.Constructor.newInstance(Unknown Source)
> > :   at sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
> > :   at sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
> > :   at java.security.AccessController.doPrivileged(Native Method)
> > :   at sun.net.www.protocol.http.HttpURLConnection.getChainedException(Unknown Source)
> > :   at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
> > :   at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
> > :   at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
> > :   at org.apache.solr.common.util.ContentStreamBase$URLStream.getStream(ContentStreamBase.java:87)
> > :   at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:158)
> > :   at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> > :   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
> > :   at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:291)
> > :   at org.apache.solr.core.SolrCore.execute(SolrCore.java:2006)
> > :   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
> > :   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)
> > :   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:204)
> > :   at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
> > :   at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
> > :   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
> > :   at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
> > :   at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
> > :   at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
> > :   at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
> > :   at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
> > :   at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
> > :   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
> > :   at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
> > :   at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
> > :   at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
> > :   at org.eclipse.jetty.server.Server.handle(Server.java:368)
> > :   at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
> > :   at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
> > :   at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
> > :   at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
> > :   at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
> > :   at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
> > :   at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
> > :   at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
> > :   at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> > :   at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> > :   at java.lang.Thread.run(Unknown Source)
> > : Caused by: java.io.IOException: Server returned HTTP response code: 403 for URL: https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1
> > :   at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
> > :   at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
> > :   at sun.net.www.protocol.http.HttpURLConnection.getHeaderField(Unknown Source)
> > :   at java.net.URLConnection.getContentType(Unknown Source)
> > :   at sun.net.www.protocol.https.HttpsURLConnectionImpl.getContentType(Unknown Source)
> > :   at org.apache.solr.common.util.ContentStreamBase$URLStream.getStream(ContentStreamBase.java:84)
> > :   ... 33 more
> > : </str>
> > : <int name="code">500</int>
> > : </lst>
> > : </response>
> > :
> > :
> > : Jeferson M. dos Anjos
> > : CEO of Packdocs
> > : P.S.: Keep your files alive with Packdocs (www.packdocs.com)
> > :
> >
> > -Hoss
> > http://www.lucidworks.com/
> >
>



-- 
Jeferson M. dos Anjos
CEO of Packdocs
P.S.: Keep your files alive with Packdocs (www.packdocs.com)
