I think there may actually be a bug here. I was not able to crawl another web site with it either.
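For anyone who wants to poke at this outside of post.jar: the stack trace shows makeDom handing the fetched page to javax.xml.parsers.DocumentBuilder, which is a strict XML parser. Below is a minimal sketch of that failure mode, not the SimplePostTool code itself. The class name PrologCheck, the method parseError and the sample strings are made up for illustration. It only shows that text in front of the first tag is enough to make the parser reject the input; it does not tell us what the crawled site actually returned.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Solr: feeds a string to the JDK's strict XML parser.
class PrologCheck {

    /** Returns null if the text parses as XML, otherwise a description of the parse error. */
    static String parseError(String body) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing "[Fatal Error]" to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8)));
            return null;
        } catch (SAXParseException e) {
            // Same exception type as in the reported stack trace.
            return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                    + ": " + e.getMessage();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return e.toString();
        }
    }

    public static void main(String[] args) {
        // Made-up inputs: one well-formed document, one with text before the root element.
        String wellFormed = "<html><body/></html>";
        String leadingText = "oops<html><body/></html>";

        System.out.println("well-formed : " + parseError(wellFormed));
        System.out.println("leading text: " + parseError(leadingText));
    }
}
```

Saving the page that post.jar fetched and running it through a check like this would show whether the response itself starts with something other than markup.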
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 11 Apr 2019, at 18:55, Erick Erickson <erickerick...@gmail.com> wrote:
>
> You are sending malformed XML to Solr. This can be something as silly as
> having extra spaces at the beginning. I’d capture the page being sent to Solr
> and put it in a formatter to check it….
>
> Best,
> Erick
>
>> On Apr 11, 2019, at 3:49 AM, Shivprasad Shetty <shivpras...@orioninc.com>
>> wrote:
>>
>> Hello Team,
>>
>> I am working on solr for the first time and got the setup
>> done. Now I have created a core using command line and want to perform
>> webcrawl of a third party site.
>> If I try it with individual links, I am able to do the crawl and index it to
>> the core. This was done using
>>
>> java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -jar post.jar http://www.example.com
>>
>> Now what I intend to do is to give a url and using the recursive option
>> (-Drecursive) and let it crawl the entire site.
>> Note that I am pointing to a website that has around 125 pages and I am
>> using the below command
>>
>> java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -Drecursive=yes -jar post.jar http://www.example.com
>>
>> and
>>
>> java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -Drecursive=2 -jar post.jar http://www.example.com
>>
>> and I am getting the below error message.
>> Error:
>>
>> POSTed web resource http://www.example.com (depth: 0)
>> [Fatal Error] :1:1: Content is not allowed in prolog.
>> Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
>>     at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1252)
>>     at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:616)
>>     at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
>>     at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
>>     at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
>>     at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
>> Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
>>     at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
>>     at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>>     at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
>>     at org.apache.solr.util.SimplePostTool.makeDom(SimplePostTool.java:1061)
>>     at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1232)
>>     ... 5 more
>>
>> I would be very grateful if anyone could get me to solve this issue I have
>> been trying to fix for a couple of days.
>>
>> Regards,
>> ShivprasadS
>>
>> Confidentiality Notice: This e-mail message, including any attachments, is
>> for the sole use of the intended recipient(s) and may contain confidential
>> and privileged information. Any unauthorized review, use, disclosure or
>> distribution is prohibited. If you are not the intended recipient, please
>> contact the sender by reply e-mail, delete and then destroy all copies of
>> the original message.
>