403 means "Forbidden". Something about the request Solr is sending -- or something about the IP address Solr is connecting from when talking to medium.com -- is causing the medium.com web server to reject the request.
This is something that servers may choose to do if they detect (via headers, or missing headers, or reverse IP lookup, or other distinctive nuances of how the connection was made) that the client connecting to their server isn't a "human browser" (e.g. Firefox, Chrome, Safari) and is a robot that they don't want to cooperate with (e.g. they might be happy to serve their pages to the Googlebot crawler, but not to some third party they've never heard of).

The specifics of how/why you might get a 403 for any given URL are hard to debug -- it might literally depend on how many requests you've sent to that domain in the past X hours.

In general, Solr's ContentStream indexing from remote hosts isn't intended to be a super robust solution for crawling arbitrary websites on the web -- if that's your goal, then I would suggest you look into running a more robust crawler (Nutch, Droids, Lucidworks Fusion, etc.) that has more features and debugging options (notably: rate limiting), using it to fetch the content, and then pushing that content to Solr.

: Date: Tue, 29 Mar 2016 20:54:52 -0300
: From: Jeferson dos Anjos <jefersonan...@packdocs.com>
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Solr response error 403 when I try to index medium.com articles
:
: I'm trying to index some pages of the medium. But I get error 403. I
: believe it is because the medium does not accept the user-agent solr. Has
: anyone ever experienced this? You know how to change?
:
: I appreciate any help
:
: <lst name="responseHeader">
: <int name="status">500</int>
: <int name="QTime">94</int>
: </lst>
: <lst name="error">
: <str name="msg">
: Server returned HTTP response code: 403 for URL:
: https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1
: </str>
: <str name="trace">
: java.io.IOException: Server returned HTTP response code: 403 for URL:
: https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1
: at sun.reflect.GeneratedConstructorAccessor314.newInstance(Unknown Source)
: at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
: at java.lang.reflect.Constructor.newInstance(Unknown Source)
: at sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
: at sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
: at java.security.AccessController.doPrivileged(Native Method)
: at sun.net.www.protocol.http.HttpURLConnection.getChainedException(Unknown Source)
: at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
: at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
: at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
: at org.apache.solr.common.util.ContentStreamBase$URLStream.getStream(ContentStreamBase.java:87)
: at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:158)
: at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
: at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
: at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:291)
: at org.apache.solr.core.SolrCore.execute(SolrCore.java:2006)
: at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
: at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)
: at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:204)
: at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
: at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
: at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
: at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
: at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
: at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
: at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
: at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
: at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
: at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
: at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
: at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
: at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
: at org.eclipse.jetty.server.Server.handle(Server.java:368)
: at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
: at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
: at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
: at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
: at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
: at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
: at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
: at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
: at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
: at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
: at java.lang.Thread.run(Unknown Source)
: Caused by: java.io.IOException: Server returned HTTP response code: 403 for URL:
: https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1
: at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
: at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
: at sun.net.www.protocol.http.HttpURLConnection.getHeaderField(Unknown Source)
: at java.net.URLConnection.getContentType(Unknown Source)
: at sun.net.www.protocol.https.HttpsURLConnectionImpl.getContentType(Unknown Source)
: at org.apache.solr.common.util.ContentStreamBase$URLStream.getStream(ContentStreamBase.java:84)
: ... 33 more
: </str>
: <int name="code">500</int>
: </lst>
: </response>
:
:
: Jeferson M. dos Anjos
: CEO of Packdocs
: P.S.: Keep your files alive with Packdocs (www.packdocs.com)
:

-Hoss
http://www.lucidworks.com/
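
For anyone who wants to try the "fetch the content yourself, then push it to Solr" route without a full crawler, here is a minimal Python sketch of the idea. It is not how Nutch or any of the crawlers mentioned above work internally. The `RateLimiter` class, the `fetch` and `build_solr_update` helpers, the collection name, and the document field names are all hypothetical names chosen for illustration, and it assumes your Solr version accepts a JSON array of documents POSTed to the `/update` handler. It sends an explicit User-Agent and spaces requests out, which are two of the things a remote server may be reacting to; neither is guaranteed to avoid a 403.

```python
import json
import time
import urllib.request


class RateLimiter:
    """Enforce a minimum delay between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch(url, limiter, user_agent):
    """Fetch a page politely: rate limited, with an explicit User-Agent.

    Raises urllib.error.HTTPError if the server answers 403 (or any
    other 4xx/5xx status), so the caller can log it and move on.
    """
    limiter.wait()
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def build_solr_update(solr_url, collection, docs):
    """Build a POST request that sends docs as a JSON array to /update.

    The collection name and the field names inside docs are assumptions;
    they must match your own schema.
    """
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        f"{solr_url}/{collection}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A run would create one `RateLimiter` (say, one request every few seconds), call `fetch` for each URL, build documents such as `{"id": url, "content": text}`, and pass the request returned by `build_solr_update` to `urllib.request.urlopen`. The fetched body is raw HTML, so in practice you would extract the text before indexing it.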