Hello Majisha,

Nutch's Solr indexing plugin supports stripping non-UTF-8 character code points from the input but, if I remember correctly, it does so only on the content field.
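
For illustration, here is a minimal sketch of that kind of stripping. It is not the actual Nutch code: the class and method names are made up, and it assumes the field value is already a decoded Java String. It drops Unicode noncharacters, unpaired surrogates, and control characters other than tab, newline and carriage return.

```java
// Hypothetical helper, not taken from Nutch: names are for illustration only.
class StripSketch {

    /** Returns the input without noncharacters, lone surrogates and most control characters. */
    static String stripNonCharCodepoints(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);

            // Noncharacters: U+FDD0..U+FDEF plus the last two code points of every plane
            // (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFF).
            boolean nonChar = (cp & 0xFFFE) == 0xFFFE || (cp >= 0xFDD0 && cp <= 0xFDEF);

            // Control characters below U+0020, except tab, newline and carriage return.
            boolean control = cp < 0x20 && cp != '\t' && cp != '\n' && cp != '\r';

            // An unpaired surrogate cannot be encoded as valid UTF-8.
            boolean loneSurrogate = cp >= 0xD800 && cp <= 0xDFFF;

            if (!nonChar && !control && !loneSurrogate) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

Because a filter like this works on strings that have already been decoded, it cannot repair byte sequences that were invalid UTF-8 in the data before decoding.
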
However, that stripping method was not built with the invalid middle byte exception in mind, and I have not seen that exception even once before Solr 5.x. We are upgrading parts of our infrastructure to Solr 5.x and were hit by this too.

Can you confirm that it is the content field sent by Nutch that causes the problem?

Markus

On Sunday 22 March 2015 16:04:07 Majisha Parambath wrote:
> Hello,
>
> As part of an assignment, we initially crawled and collected NSF and NASA
> Polar Datasets using Nutch. We used the nutch dump command to dump out the
> segments that were created as part of the crawl.
> Now we have to index this data into Solr. I am using java -jar post.jar
> filename to post to solr however after the execution I do not see my file
> indexed and checking the log I found exceptions which I am attaching with
> this mail.
>
> Could you please let me know if I am missing something?
>
> Thanks and regards,
> *Majisha Namath Parambath*
> *Graduate Student, M.S in Computer Science*
> *Viterbi School of Engineering*
> *University of Southern California, Los Angeles*