Amrit Sarkar wrote:
> The above is SAXParse, runtime exception. Nothing can be done at Solr end
> except curating your own data.

I'm trying to replace a solr-4.6.0 system (which has been working
brilliantly for 3 years!) with solr-7.1.0. I'm running into this exact same
problem.

I do not believe it is a data curation problem. (Even if it were, it's very
unfriendly just to bomb out with a stack trace. And it's seriously annoying
that there's a 14-line error message about a parsing problem that entirely
neglects to mention what it was trying to parse! Was it a file, a URL...?)

Anyway, the symptoms I'm seeing are that a simple "post -c foo https://..."
works fine. But the moment I turn on recursion, it fails before fetching a
second page. It doesn't matter what the first page is. Really: when I made
no progress with the site I'm actually trying to index, I tried another of
my sites, then Google, then eBay... In every case, I get something like
this:

$ post -c mycollection https://www.ebay.co.uk -recursive 1 -delay 10
...
POSTed web resource https://www.ebay.co.uk (depth: 0)
... [ 10s delay ]
[Fatal Error] :1:1: Content is not allowed in prolog.
...
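
(As I describe below, the thing being parsed at that point seems to be the
response that comes back from Solr for the page that was just fetched. In
case it helps anyone reproduce this, here is a rough sketch of how that
response could be captured by hand for inspection. I'm assuming the page
goes through the stock /update/extract handler with extractOnly=true; the
host, port, collection name and page.html are placeholders, not necessarily
what the tool actually uses.)

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DumpExtractResponse {
    public static void main(String[] args) throws Exception {
        // Assumptions: host, port, collection name and page.html are
        // placeholders for whatever the crawl actually fetched.
        byte[] page = Files.readAllBytes(Paths.get("page.html"));

        URL url = new URL("http://localhost:8983/solr/mycollection/update/extract"
                + "?extractOnly=true");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "text/html");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(page);
        }

        // Dump the raw body so it's obvious whether it starts with '{' (JSON)
        // or '<' (XML) -- i.e. what the SAX parser is actually being handed.
        try (InputStream in = conn.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                System.out.write(buf, 0, n);
            }
        }
        System.out.flush();
    }
}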

I've been looking at the code, and also at what's going on with strace. As
far as I can see, at the point where the exception occurs, we are parsing
data (a copy of the page, presumably) that has come back from the Solr
server itself. That appears to be a chunk of JSON with embedded XML. The
inner XML does at least look to start correctly. The fact that we get an
error at line 1, column 1 every single time makes me suspect that we're
feeding the wrong thing to the SAX parser.
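
To make that concrete, here is a minimal sketch of what I mean. It is not
the actual SimplePostTool code, and the response shape and field name are
guesses on my part, but it shows the general idea: the embedded XML has to
be pulled out of the JSON wrapper and unescaped before it goes anywhere
near the SAX parser, because feeding the wrapper in directly fails at line
1, column 1 (the first character is '{', not '<').

import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ExtractResponseSketch {

    // Hypothetical response: a JSON wrapper whose value holds the fetched
    // page as an escaped XML/XHTML string (the field name "file" is made up).
    static final String SAMPLE_RESPONSE =
            "{\"responseHeader\":{\"status\":0},"
          + "\"file\":\"<?xml version=\\\"1.0\\\" encoding=\\\"UTF-8\\\"?>"
          + "<html><body><a href=\\\"/next\\\">next</a></body></html>\"}";

    public static void main(String[] args) throws Exception {
        // Handing SAMPLE_RESPONSE itself to the parser reproduces the error:
        // "Content is not allowed in prolog." at line 1, column 1.
        String embeddedXml = unwrap(SAMPLE_RESPONSE);

        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(embeddedXml)),
                new DefaultHandler());
        System.out.println("inner XML parses fine on its own");
    }

    // Crude illustration only: grab the first XML-looking value out of the
    // JSON wrapper and unescape it. A real fix would use a JSON parser, or
    // ask Solr for an XML response in the first place.
    static String unwrap(String json) {
        int start = json.indexOf("<?xml");
        int end = json.lastIndexOf('"');
        return json.substring(start, end)
                   .replace("\\\"", "\"")
                   .replace("\\\\", "\\");
    }
}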

Anyway, I'm going to go and look at Nutch, as I need something working very
soon.

But could somebody who is familiar with this code take another look? 

Cheers,

Toby.



