I think there may actually be a bug here. I was not able to crawl a different 
web site with it either.
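
For what it's worth, the trace quoted below ends in makeDom, called from 
getLinksFromWebPage, so the exception appears to be raised on the client side 
while post.jar parses some bytes as XML. Here is a minimal sketch of how that 
message arises with the JDK parser. The class name PrologCheck and both sample 
bodies are made up for illustration; they are not taken from SimplePostTool or 
from a real Solr response.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of Solr: feeds a byte array to the JDK DOM
// parser and reports where parsing failed, if it failed.
class PrologCheck {

    /** Returns null if the bytes parse as XML, otherwise a description of the failure. */
    static String parseError(byte[] body) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new ByteArrayInputStream(body));
            return null;
        } catch (SAXParseException e) {
            // Same fields that show up in the stack trace: line, column, message.
            return e.getLineNumber() + ":" + e.getColumnNumber() + ": " + e.getMessage();
        } catch (Exception e) {
            // Parser configuration or I/O problems: report them rather than hide them.
            return e.toString();
        }
    }

    public static void main(String[] args) {
        // Assumed sample bodies: one well-formed XML document, one that is not XML at all.
        byte[] xml = "<response><str>ok</str></response>".getBytes(StandardCharsets.UTF_8);
        byte[] notXml = "{\"status\": 0}".getBytes(StandardCharsets.UTF_8);

        System.out.println("XML body:     " + parseError(xml));
        System.out.println("non-XML body: " + parseError(notXml));
    }
}
```

On a stock JDK the second call should fail at the very first character, which 
matches the lineNumber: 1; columnNumber: 1 in the trace. So it may be worth 
capturing the bytes that post.jar is trying to parse at that point and 
checking whether they are XML at all.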

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 11 Apr 2019, at 18:55, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> You are sending malformed XML to Solr. This can be something as silly as 
> having extra spaces at the beginning. I’d capture the page being sent to Solr 
> and put it in a formatter to check it.
> 
> Best,
> Erick
> 
>> On Apr 11, 2019, at 3:49 AM, Shivprasad Shetty <shivpras...@orioninc.com> 
>> wrote:
>> 
>> Hello Team,
>> 
>> I am working with Solr for the first time and have the setup done. I have 
>> created a core from the command line and want to crawl a third-party web 
>> site.
>> If I post individual links, I am able to crawl each page and index it into 
>> the core. This was done using:
>> java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -jar post.jar 
>> http://www.example.com
>> 
>> What I now want to do is give it a URL and use the recursive option 
>> (-Drecursive) to let it crawl the entire site.
>> Note that I am pointing at a web site that has around 125 pages, and I am 
>> using the commands below:
>> java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -Drecursive=yes 
>> -jar post.jar http://www.example.com
>> and
>> java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -Drecursive=2 
>> -jar post.jar http://www.example.com
>> 
>> and I am getting the error message below:
>> 
>> POSTed web resource http://www.example.com (depth: 0)
>> [Fatal Error] :1:1: Content is not allowed in prolog.
>> Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
>>       at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1252)
>>       at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:616)
>>       at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
>>       at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
>>       at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
>>       at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
>> Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
>>       at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
>>       at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>>       at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
>>       at org.apache.solr.util.SimplePostTool.makeDom(SimplePostTool.java:1061)
>>       at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1232)
>>       ... 5 more
>> 
>> I would be very grateful if anyone could help me solve this issue, which I 
>> have been trying to fix for a couple of days.
>> 
>> 
>> Regards,
>> ShivprasadS
> 
