Great, but is there any way to change the headers Solr sends, so that I can set the User-Agent?

2016-03-30 17:13 GMT-03:00 Jack Krupansky <jack.krupan...@gmail.com>:
> You could use the curl command to read a URL on Medium.com. That would let
> you examine and control the headers to experiment.
>
> Google is able to index Medium.
>
> Check the URL and make sure it's not on one of the paths disallowed by
> medium.com/robots.txt (the one you gave seems fine):
>
> User-Agent: *
> Disallow: /_/
> Disallow: /m/
> Disallow: /me/
> Disallow: /@me$
> Disallow: /@me/
> Disallow: /*/*/edit
> Sitemap: https://medium.com/sitemap/sitemap.xml
>
> -- Jack Krupansky
>
> On Wed, Mar 30, 2016 at 1:05 PM, Chris Hostetter <hossman_luc...@fucit.org>
> wrote:
>
> > 403 means "forbidden"
> >
> > Something about the request Solr is sending -- or something about the IP
> > address Solr is connecting from when talking to medium.com -- is causing
> > the medium.com web server to reject the request.
> >
> > This is something that servers may choose to do if they detect (via
> > headers, or missing headers, or reverse IP lookup, or other
> > distinctive nuances of how the connection was made) that the
> > client connecting to their server isn't a "human browser" (ie: firefox,
> > chrome, safari) and is a robot that they don't want to cooperate with
> > (ie: they might be happy to serve their pages to the google-bot crawler,
> > but not to some third party they've never heard of).
> >
> > The specifics of how/why you might get a 403 for any given URL are hard
> > to debug -- it might literally depend on how many requests you've sent
> > to that domain in the past X hours.
> >
> > In general Solr's ContentStream indexing from remote hosts isn't intended
> > to be a super robust solution for crawling arbitrary websites on the web
> > -- if that's your goal, then I would suggest you look into running a more
> > robust crawler (nutch, droids, Lucidworks Fusion, etc...) that has more
> > features and debugging options (notably: rate limiting) and use that code
> > to fetch the content, then push it to Solr.
> >
> > : Date: Tue, 29 Mar 2016 20:54:52 -0300
> > : From: Jeferson dos Anjos <jefersonan...@packdocs.com>
> > : Reply-To: solr-user@lucene.apache.org
> > : To: solr-user@lucene.apache.org
> > : Subject: Solr response error 403 when I try to index medium.com articles
> > :
> > : I'm trying to index some pages from Medium, but I get a 403 error. I
> > : believe it is because Medium does not accept Solr's user-agent. Has
> > : anyone experienced this? Do you know how to change it?
> > :
> > : I appreciate any help
> > :
> > : <lst name="responseHeader">
> > : <int name="status">500</int>
> > : <int name="QTime">94</int>
> > : </lst>
> > : <lst name="error">
> > : <str name="msg">
> > : Server returned HTTP response code: 403 for URL:
> > : https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1
> > : </str>
> > : <str name="trace">
> > : java.io.IOException: Server returned HTTP response code: 403 for URL:
> > : https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1
> > : at sun.reflect.GeneratedConstructorAccessor314.newInstance(Unknown Source)
> > : at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
> > : at java.lang.reflect.Constructor.newInstance(Unknown Source)
> > : at sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
> > : at sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
> > : at java.security.AccessController.doPrivileged(Native Method)
> > : at sun.net.www.protocol.http.HttpURLConnection.getChainedException(Unknown Source)
> > : at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
> > : at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
> > : at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
> > : at org.apache.solr.common.util.ContentStreamBase$URLStream.getStream(ContentStreamBase.java:87)
> > : at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:158)
> > : at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> > : at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
> > : at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:291)
> > : at org.apache.solr.core.SolrCore.execute(SolrCore.java:2006)
> > : at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
> > : at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)
> > : at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:204)
> > : at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
> > : at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
> > : at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
> > : at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
> > : at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
> > : at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
> > : at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
> > : at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
> > : at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
> > : at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
> > : at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
> > : at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
> > : at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
> > : at org.eclipse.jetty.server.Server.handle(Server.java:368)
> > : at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
> > : at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
> > : at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
> > : at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
> > : at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
> > : at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
> > : at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
> > : at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
> > : at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> > : at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> > : at java.lang.Thread.run(Unknown Source)
> > : Caused by: java.io.IOException: Server returned HTTP response code: 403 for URL:
> > : https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1
> > : at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
> > : at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
> > : at sun.net.www.protocol.http.HttpURLConnection.getHeaderField(Unknown Source)
> > : at java.net.URLConnection.getContentType(Unknown Source)
> > : at sun.net.www.protocol.https.HttpsURLConnectionImpl.getContentType(Unknown Source)
> > : at org.apache.solr.common.util.ContentStreamBase$URLStream.getStream(ContentStreamBase.java:84)
> > : ... 33 more
> > : </str>
> > : <int name="code">500</int>
> > : </lst>
> > : </response>
> > :
> > :
> > : Jeferson M. dos Anjos
> > : CEO of Packdocs
> > : P.S.: Keep your files alive with Packdocs (www.packdocs.com)
> > :
> >
> > -Hoss
> > http://www.lucidworks.com/
>

--
Jeferson M. dos Anjos
CEO of Packdocs
P.S.: Keep your files alive with Packdocs (www.packdocs.com)
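
For reference, here is a minimal sketch of the "fetch the content yourself, then push it to Solr" approach suggested earlier in the thread, with the User-Agent set on the client side instead of inside Solr. It is only an illustration under assumptions: the User-Agent string, the Solr URL, the core name `collection1`, and the function names are all hypothetical, and a different User-Agent is not guaranteed to avoid the 403, since the server may also be looking at the IP address or the request rate.

```python
import urllib.parse
import urllib.request

# Hypothetical values, not taken from the thread; adjust for your own setup.
USER_AGENT = "Mozilla/5.0 (compatible; example-indexer/0.1)"
SOLR_EXTRACT_URL = "http://localhost:8983/solr/collection1/update/extract"


def build_fetch_request(url, user_agent=USER_AGENT):
    """Build a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def build_solr_request(doc_id, body, content_type="text/html",
                       solr_url=SOLR_EXTRACT_URL):
    """Build a POST that sends already-fetched bytes to the extracting handler."""
    params = urllib.parse.urlencode({"literal.id": doc_id, "commit": "true"})
    return urllib.request.Request(
        solr_url + "?" + params,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )


def fetch_and_index(url):
    """Fetch the page from this client, then push the raw bytes to Solr.

    urlopen raises urllib.error.HTTPError if either server answers with an
    error status (for example the 403 discussed in this thread).
    """
    with urllib.request.urlopen(build_fetch_request(url), timeout=30) as resp:
        body = resp.read()
        content_type = resp.headers.get("Content-Type", "text/html")
    solr_request = build_solr_request(url, body, content_type)
    with urllib.request.urlopen(solr_request, timeout=30) as resp:
        return resp.status
```

Calling `fetch_and_index("https://medium.com/...")` performs both steps. Because the page is fetched outside Solr, the headers can be tested first with curl, as suggested above, and the same values reused here.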