Hello Team,
I am working with Solr for the first time and have the setup done.
I have created a core from the command line and now want to crawl a
third-party website.
When I post individual links, the crawl works and the page is indexed into
the core. This was done using:
java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -jar post.jar http://www.example.com
What I want to do now is pass a single URL with the recursive option
(-Drecursive) and let the tool crawl the entire site.
The website I am pointing to has around 125 pages. I tried both of the
commands below:
java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -Drecursive=yes -jar post.jar http://www.example.com
and
java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -Drecursive=2 -jar post.jar http://www.example.com
Both fail with the error message below.
Error:
POSTed web resource http://www.example.com (depth: 0)
[Fatal Error] :1:1: Content is not allowed in prolog.
Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
    at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1252)
    at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:616)
    at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
    at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
    at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
    at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
    at org.apache.solr.util.SimplePostTool.makeDom(SimplePostTool.java:1061)
    at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1232)
    ... 5 more
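For anyone who wants to see the parser message outside of Solr: the sketch below feeds a non-XML body to the same JDK DocumentBuilder that appears in the trace. The class and method names are hypothetical and not part of post.jar. It assumes, without confirming, that the crawler was handed something other than XML while extracting links; it only shows that such input is rejected on the first line, as in the trace above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;

// Hypothetical demo class; it is not part of Solr or post.jar.
final class PrologErrorDemo {

    // Returns "ok" if the body parses as XML, otherwise "line:column"
    // of the position where the XML parser gave up.
    static String parsePosition(String body) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(
                new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8)));
            return "ok";
        } catch (SAXParseException e) {
            // Same exception type as in the stack trace above.
            return e.getLineNumber() + ":" + e.getColumnNumber();
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected here.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Well-formed XML is accepted.
        System.out.println(parsePosition("<response><str name=\"a\">b</str></response>"));
        // A JSON body (assumed example input) is rejected on line 1.
        System.out.println(parsePosition("{\"responseHeader\":{\"status\":0}}"));
    }
}
```

With the JDK's default error handler, the parser may also print a "[Fatal Error]" line to stderr for the second call.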
I would be very grateful if anyone could help me solve this issue. I have been
trying to fix it for a couple of days.
Regards,
ShivprasadS