On 9 October 2012 17:42, Patrick Oliver Glauner
<patrick.oliver.glau...@cern.ch> wrote:
> Hello everybody
>
> Meanwhile, I checked this issue in detail: we use pdftotext to extract text 
> from our PDFs (<http://cds.cern.ch/>). Some generated text files contain 
> \uFFFF and \uD835.
>
> unicode(text, 'utf-8') does not throw any exception for these texts. 
> Subsequently, Solr throws an exception when these are sent to the indexer.

Off-topic, but this is because the Unicode escape sequence
'\uxxxx' is not being interpreted here: unicode() leaves it
as six plain ASCII characters. You have to decode it
explicitly. Here is an example with '\u2018', the opening
single quote (I did not have a font which covered '\ud835').
Please note the difference between:

    print unicode('\u2018')
    \u2018

and

    print unicode('\u2018').decode('unicode-escape')
    ‘
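
The same difference as a self-contained sketch, written so that it
should also run on Python 3. The names raw, literal and decoded are
only illustrative, and starting from a byte string is my assumption
about how the text arrives:

```python
from __future__ import print_function  # has no effect on Python 3

# Six ASCII bytes: backslash, 'u', '2', '0', '1', '8'.
raw = b'\\u2018'

# A plain ASCII decode keeps the escape sequence as literal text.
literal = raw.decode('ascii')

# The 'unicode-escape' codec interprets the escape sequence.
decoded = raw.decode('unicode-escape')

print(len(literal), repr(literal))
print(len(decoded), repr(decoded))
```

This only helps if the text really contains the literal escape
sequence rather than the character itself.
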
Regards,
Gora
