Hello Majisha,

Nutch's Solr indexing plugin has support for stripping non-UTF-8 character
code points from the input, but it does so only on the content field, if I
remember correctly.
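
To illustrate the kind of stripping meant here, below is a minimal sketch in
Java. It is not the plugin's actual code: the class and method names
(CodepointStripper, strip) are hypothetical, and the choice to drop Unicode
noncharacters plus unpaired surrogates is an assumption about what such a
filter would reasonably remove.

```java
// Hypothetical helper, not taken from Nutch: removes code points that
// cannot be represented in well-formed UTF-8 text destined for Solr.
final class CodepointStripper {
    private CodepointStripper() {}

    /** Returns the input without Unicode noncharacters and unpaired surrogates. */
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (isNonCharacter(cp) || isLoneSurrogate(cp)) {
                continue; // skip this code point
            }
            out.appendCodePoint(cp);
        }
        return out.toString();
    }

    /** U+FDD0..U+FDEF and the last two code points of every plane. */
    static boolean isNonCharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    /** codePointAt only returns a surrogate value when it is unpaired. */
    static boolean isLoneSurrogate(int cp) {
        return cp >= 0xD800 && cp <= 0xDFFF;
    }
}
```

A filter like this works on already-decoded Java strings, so it would only
help for the fields it is actually applied to.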

However, that stripping method was not built with the invalid middle byte
exception in mind, and I had not seen that exception even once before Solr
5.x. We are upgrading parts of our infrastructure to Solr 5.x and were hit by
this too.

Can you confirm that it is the content field sent by Nutch that causes the 
problem?
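
One way to narrow this down, assuming you have the raw bytes of a dumped
file or field value at hand, is to check whether they decode as strict
UTF-8. The sketch below uses only the standard java.nio.charset API; the
class and method names (Utf8Check, isValidUtf8) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reports whether a byte array is well-formed UTF-8.
final class Utf8Check {
    private Utf8Check() {}

    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting
        // U+FFFD for malformed sequences.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Running this over the content of each dumped file should show whether the
malformed bytes are in the content or somewhere else.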

Markus

On Sunday 22 March 2015 16:04:07 Majisha Parambath wrote:
> Hello,
> 
> As part of an assignment, we initially crawled and collected  NSF and NASA
> Polar Datasets using Nutch. We used the nutch dump command to dump out the
> segments that were created as part of the crawl.
> Now we have to index this data into Solr. I am using java -jar post.jar
> filename to post to solr however after the execution I do not see my file
> indexed and checking the log I found exceptions which I am attaching with
> this mail.
> 
> Could you please let me know if I am missing something?
> 
> Thanks and regards,
> *Majisha Namath Parambath*
> *Graduate Student, M.S in Computer Science*
> *Viterbi School of Engineering*
> *University of Southern California, Los Angeles*
