You should look at Apache Nutch, which has Solr client support. It has
all the index options you need and ships a schema for building a Solr
collection with all the fields required for indexing.
We have used it and it works well; it also supports sitemap.xml, which
simplifies indexing.
On Fri, Apr 12, 2019 at 6:43 AM Ja
I think there may actually be a bug. I was not able to crawl some other web
site either.
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
> 11. apr. 2019 kl. 18:55 skrev Erick Erickson :
>
> You are sending malformed XML to Solr. This can be something as silly as
> ha
You are sending malformed XML to Solr. This can be something as silly as having
extra spaces at the beginning. I’d capture the page being sent to Solr and put
it in a formatter to check it.
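
For what it's worth, here is a minimal sketch of that check in Python, assuming you have already captured the payload as bytes. The function name check_well_formed and the sample payloads are made up for illustration and are not part of Solr or the post tool. It only reports whether Python's standard XML parser accepts the content and where it first fails; Solr's own parser may word the error differently.

```python
import xml.etree.ElementTree as ET


def check_well_formed(payload: bytes) -> str:
    """Return "ok" if payload parses as XML, else where the parser gave up."""
    try:
        ET.fromstring(payload)
    except ET.ParseError as err:
        # err.position is a (line, column) tuple pointing at the failure.
        line, column = err.position
        return f"not well-formed at line {line}, column {column}: {err}"
    return "ok"


if __name__ == "__main__":
    # Hypothetical payloads: a small update message, and the same kind of
    # message with stray spaces in front of the XML declaration.
    good = b"<add><doc><field name='id'>1</field></doc></add>"
    padded = b"  <?xml version='1.0'?><add/>"
    print(check_well_formed(good))
    print(check_well_formed(padded))
```

The second payload is the "extra spaces at the beginning" case: the XML declaration is only allowed at the very start of the document, so the parser rejects it.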
Best,
Erick
> On Apr 11, 2019, at 3:49 AM, Shivprasad Shetty
> wrote:
>
> Hello Team,
>
>
>
One of the files that the post tool identified as XML is not XML, possibly
a 404 error page or some such. So it tries to parse the file and sees
non-XML content right at the start. Or, if you are sure it is an XML file,
maybe there is a BOM (byte order mark). Either way, try to isolate the
specific file.
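
One way to isolate the file is to look at the first few bytes of each candidate. The following is a rough sketch, assuming the files sit in a local folder and use an .xml extension. The names describe_start and find_suspects are invented for this example, and the heuristics (BOM, leading whitespace, HTML-looking start) are just the cases mentioned in this thread, not an exhaustive validity check.

```python
import codecs
from pathlib import Path

# Byte order marks to look for (UTF-32 is left out to keep the sketch short).
BOMS = {
    codecs.BOM_UTF8: "UTF-8",
    codecs.BOM_UTF16_LE: "UTF-16 LE",
    codecs.BOM_UTF16_BE: "UTF-16 BE",
}


def describe_start(data: bytes) -> str:
    """Classify the first bytes of a file that is supposed to be XML."""
    for bom, name in BOMS.items():
        if data.startswith(bom):
            return f"starts with a {name} BOM"
    stripped = data.lstrip()  # drops leading ASCII whitespace
    if not stripped:
        return "empty"
    if not stripped.startswith(b"<"):
        return "does not start with '<'"
    if stripped[:15].lower().startswith((b"<!doctype html", b"<html")):
        return "looks like HTML"
    if len(stripped) != len(data):
        return "leading whitespace before the first tag"
    return "looks like XML"


def find_suspects(folder: str) -> list[tuple[str, str]]:
    """Return (file name, verdict) for every *.xml file that looks off."""
    suspects = []
    for path in sorted(Path(folder).glob("*.xml")):
        with path.open("rb") as handle:
            verdict = describe_start(handle.read(64))
        if verdict != "looks like XML":
            suspects.append((path.name, verdict))
    return suspects


if __name__ == "__main__":
    # "." is a placeholder; point it at the folder you are posting from.
    for name, verdict in find_suspects("."):
        print(f"{name}: {verdict}")
```

A file that passes this check can still be malformed further in, so this only narrows down the obvious offenders such as a saved error page.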
On a bigger picture