Toby,

Your mention of "-recursive" causing a problem reminded me of a simple crawl (of the 7.0 Ref Guide) using bin/post that I was trying to get to work the other day and couldn't.

The order of the parameters seems to make a difference in which error you get (this is using 7.1):

1. "./bin/post -c gettingstarted -delay 10 https://lucene.apache.org/solr/guide/7_0 -recursive" yields the stack trace in the previous message:

   POSTed web resource https://lucene.apache.org/solr/guide/7_0 (depth: 0)
   [Fatal Error] :1:1: Content is not allowed in prolog.
   Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
       at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1252)
       at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:616)
       at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
       at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
       at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
       at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
   Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
       at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
       at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
       at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
       at org.apache.solr.util.SimplePostTool.makeDom(SimplePostTool.java:1061)
       at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1232)
       ... 5 more

2. "./bin/post -c gettingstarted -delay 10 -recursive https://lucene.apache.org/solr/guide/7_0" yields:

   No files, directories, URLs, -d strings, or stdin were specified.
   See './bin/post -h' for usage instructions.

3. "./bin/post -c gettingstarted http://lucene.apache.org/solr/guide/7_0 -recursive -delay 10" yields:

   Unrecognized argument: 10
   If this was intended to be a data file, it does not exist relative to /Applications/Solr/solr-7.1.0

4. "./bin/post -c gettingstarted -delay 10 https://lucene.apache.org/solr/guide/7_0" successfully gets the document, but only the single page at that URL. It does not extract any of the content of the page besides the title and metadata Tika adds.

I'd say we should probably file a JIRA for it. If the parsing is wrong (as it seems to me to be), that's a different problem, but the fact you can't use recursive at all is a bug AFAICT.

Cassandra

On Fri, Oct 27, 2017 at 11:03 AM, toby1851 <t...@paccrat.org> wrote:
> Amrit Sarkar wrote
>> The above is SAXParse, runtime exception. Nothing can be done at Solr end
>> except curating your own data.
>
> I'm trying to replace a solr-4.6.0 system (which has been working
> brilliantly for 3 years!) with solr-7.1.0. I'm running into this exact same
> problem.
>
> I do not believe it is a data curation problem. (Even if it were, it's very
> unfriendly just to bomb out with a stack trace. And it's seriously annoying
> that there's a 14 line error message about a parsing problem, but it
> entirely neglects to mention what it was trying to parse! Was it a file, a
> URL...?)
>
> Anyway, the symptoms I'm seeing are that a simple "post -c foo https://..."
> works fine. But the moment I turn on recursion, it fails before fetching a
> second page. It doesn't matter what the first page is. Really: when I made
> no progress with the site that I'm actually trying to index, I tried another
> of my sites, then Google, then eBay... In every case, I get something like
> this:
>
> $ post -c mycollection https://www.ebay.co.uk -recursive 1 -delay 10
> ...
> POSTed web resource https://www.ebay.co.uk (depth: 0)
> ... [ 10s delay ]
> [Fatal Error] :1:1: Content is not allowed in prolog.
> ...
>
> I've been looking at the code, and also what's going on with strace. As far as
> I can see, at the point where the exception occurs, we are parsing data (a
> copy of the page, presumably) that has come from the solr server itself.
> That appears to be a chunk of JSON with embedded XML. The inner XML does
> look to at least start correctly. The fact that we're getting an error at
> line 1 column 1 every single time makes me suspect that we're feeding the
> wrong thing to the SAX parser.
>
> Anyway, I'm going to go and look at nutch as I need something working very
> soon.
>
> But could somebody who is familiar with this code take another look?
>
> Cheers,
>
> Toby.
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
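
P.S. Toby's hunch that the wrong thing is being fed to the parser would fit the symptom. Here is a minimal sketch of that failure mode. It is not SimplePostTool's actual code: the class name PrologDemo, the helper parseError, and both sample strings are made up for illustration, and the JSON is only a stand-in for whatever the server really returned. It hands a string to the JDK's DocumentBuilder (the same class that appears in the trace) and reports where parsing fails, if it does.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;

// Hypothetical demo class; not part of Solr.
class PrologDemo {

    // Returns null if the text parses as XML, otherwise "line:column: message".
    static String parseError(String text) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new ByteArrayInputStream(
                    text.getBytes(StandardCharsets.UTF_8)));
            return null;
        } catch (SAXParseException e) {
            // The JDK's default parser may also print its own
            // "[Fatal Error]" line to stderr before this is reached.
            return e.getLineNumber() + ":" + e.getColumnNumber()
                    + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Well-formed XML: expected to parse cleanly.
        System.out.println(parseError("<html><body/></html>"));
        // Made-up JSON wrapper around XML: the first character is '{',
        // so the parser should give up before it reaches any markup.
        System.out.println(parseError(
                "{\"responseHeader\":{\"status\":0},\"x\":\"<html/>\"}"));
    }
}
```

If the second call reports the same "Content is not allowed in prolog" message as the trace above, that would be consistent with the crawler parsing the JSON response rather than the page's XHTML, though it does not prove that is what happens in SimplePostTool.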