Hello everybody,

Meanwhile, I checked this issue in detail: we use pdftotext to extract text from our PDFs (<http://cds.cern.ch/>). Some of the generated text files contain \uFFFF and \uD835.
unicode(text, 'utf-8') does not throw any exception for these texts, but Solr subsequently throws an exception when they are sent to the indexer. Therefore, I wrote a little function to remove these characters after the unicode() call:

+def remove_invalid_solr_characters(utext):
+    for char in CFG_SOLR_INVALID_CHAR_REPLACEMENTS:
+        try:
+            utext = utext.replace(char, CFG_SOLR_INVALID_CHAR_REPLACEMENTS[char])
+        except:
+            pass
+    return utext

with:

+CFG_SOLR_INVALID_CHAR_REPLACEMENTS = {
+    u'\uFFFF': u"",
+    u'\uD835': u""
+}

This works well in our production environment.

Cheers,
Patrick

________________________________________
From: Patrick Oliver Glauner [patrick.oliver.glau...@cern.ch]
Sent: Friday, September 28, 2012 10:36 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing in Solr: invalid UTF-8

Thank you. I will check our textification process and see how to improve it.

Patrick

________________________________________
From: Michael McCandless [luc...@mikemccandless.com]
Sent: Wednesday, September 26, 2012 5:45 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing in Solr: invalid UTF-8

Python's unicode function takes an optional (keyword) "errors" argument, telling it what to do when an invalid UTF-8 byte sequence is seen. The default (errors='strict') is to throw the exceptions you're seeing, but you can also pass errors='replace' or errors='ignore'. See http://docs.python.org/howto/unicode.html for details.

However, I agree with Robert: you should dig into why the process you use to extract full text from your binary documents is producing invalid UTF-8 in the first place; something is wrong with that process.

Mike McCandless

http://blog.mikemccandless.com

On Tue, Sep 25, 2012 at 10:44 PM, Robert Muir <rcm...@gmail.com> wrote:
> On Tue, Sep 25, 2012 at 2:02 PM, Patrick Oliver Glauner
> <patrick.oliver.glau...@cern.ch> wrote:
>> Hi
>>
>> Thanks. But I see that 0xd835 is missing in this list (see my exceptions).
>>
>> What's the best way to get rid of all of them in Python? I am new to
>> Unicode in Python, but I am sure that this use case is quite frequent.
>>
>
> I don't really know Python either, so I could be wrong here, but are
> you just taking these binary .PDF and .DOC files, treating them as
> UTF-8 text and sending them to Solr?
>
> If so, I don't think that will work very well. Maybe instead try
> parsing these binary files with something like Tika to get at the
> actual content and send that? (It seems some people have developed
> Python integration for this, e.g.
> http://redmine.djity.net/projects/pythontika/wiki)
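For the Tika route suggested above, a minimal sketch that sidesteps a Python binding is to shell out to the Tika application jar; tika-app's --text mode writes the extracted plain text to stdout. The jar path and file name here are assumptions, and the lenient decode at the end is a safety net rather than part of Tika itself:

    import subprocess

    def extract_text_with_tika(path, tika_jar='tika-app.jar'):
        # tika_jar is a hypothetical local path to the Tika application jar;
        # '--text' asks Tika to print the document's plain-text content.
        proc = subprocess.Popen(['java', '-jar', tika_jar, '--text', path],
                                stdout=subprocess.PIPE)
        out, _ = proc.communicate()
        # Tika emits UTF-8; decode leniently in case of stray bytes.
        return unicode(out, 'utf-8', errors='replace')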
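Mike's "errors" argument applies at the point where the raw bytes are first decoded; a minimal sketch (the file name is hypothetical):

    # Python 2: decode pdftotext output leniently instead of strictly.
    raw = open('fulltext.txt', 'rb').read()
    # errors='replace' substitutes U+FFFD for undecodable byte sequences;
    # errors='ignore' drops them silently.
    text = unicode(raw, 'utf-8', errors='replace')

Note that, as Patrick observes above, unicode(text, 'utf-8') accepts the \uD835 sequences without complaint, so a lenient decode alone does not remove them; a separate cleanup step is still needed.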
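Patrick's replacement table handles the two code points seen so far. A more general sketch, assuming the underlying constraint is that Solr's XML request handler rejects any character outside the XML 1.0 valid range (which covers lone surrogates such as \uD835 as well as \uFFFF), could strip the whole range at once; the names below are hypothetical:

    import re

    # Code points disallowed by XML 1.0: control characters other than
    # tab/LF/CR, the surrogate block, and the non-characters U+FFFE/U+FFFF.
    INVALID_XML_CHARS = re.compile(
        u'[\x00-\x08\x0B\x0C\x0E-\x1F\uD800-\uDFFF\uFFFE\uFFFF]')

    def remove_invalid_xml_characters(utext):
        # Drop every code point Solr's XML parser will not accept.
        return INVALID_XML_CHARS.sub(u'', utext)

This avoids maintaining a replacement table that has to grow each time a new offending code point shows up in the pdftotext output.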