On 9 October 2012 17:42, Patrick Oliver Glauner <patrick.oliver.glau...@cern.ch> wrote:
> Hello everybody
>
> Meanwhile, I checked this issue in detail: we use pdftotext to extract text
> from our PDFs (<http://cds.cern.ch/>). Some generated text files contain
> \uFFFF and \uD835.
>
> unicode(text, 'utf-8') does not throw any exception for these texts.
> Subsequently, Solr throws an exception when these are sent to the indexer.

Off-topic, but this is because the Unicode escape sequence '\uxxxx' is not
being interpreted here. You have to do that explicitly. Here is an example
with '\u2018', the opening quote (I did not have a font which covered
'\ud835'). Please note the difference between:

    print unicode('\u2018')
    \u2018

and

    print unicode('\u2018').decode('unicode-escape')
    ‘

Regards,
Gora
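
For anyone trying this on Python 3, here is a rough sketch of the same distinction. The helper name interpret_escapes is made up for illustration, and it assumes the input is ASCII-only text containing literal \uXXXX sequences. The last part shows that a lone \ud835, once interpreted, cannot be encoded as UTF-8 in Python 3. Whether that is what the indexer objects to is only a guess.

```python
def interpret_escapes(text):
    """Turn literal \\uXXXX sequences in ASCII-only text into the characters they name.

    Hypothetical helper, not part of any library.
    """
    # The unicode_escape codec decodes bytes, so encode the ASCII-only text first.
    return text.encode("ascii").decode("unicode_escape")


literal = r"\u2018"                   # six characters: \ u 2 0 1 8
interpreted = interpret_escapes(literal)

print(len(literal))                   # 6
print(len(interpreted))               # 1
print(hex(ord(interpreted)))          # 0x2018

# \ud835 is one half of a surrogate pair; on its own it is not valid in UTF-8.
lone = interpret_escapes(r"\ud835")
try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)                      # False
```

The code point is printed as a hex number rather than as the character itself, so the output does not depend on the terminal font or encoding.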