Hello everybody,

Meanwhile, I checked this issue in detail: we use pdftotext to extract text 
from our PDFs (<http://cds.cern.ch/>). Some of the generated text files contain 
\uFFFF (a noncharacter) and \uD835 (a lone surrogate).

unicode(text, 'utf-8') does not raise any exception for these texts, but 
Solr then throws an exception when they are sent to the indexer.
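
To illustrate, here is a minimal sketch (Python 2, as in our code) of why 
the decode step passes: the UTF-8 bytes for U+FFFF decode cleanly, but the 
resulting code point is one that Solr later rejects.

    raw = '\xef\xbf\xbf'            # the UTF-8 encoding of U+FFFF
    utext = unicode(raw, 'utf-8')   # no exception raised
    print repr(utext)               # u'\uffff' -- Solr rejects this later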

Therefore, I wrote a little function to remove them after the unicode() call:
def remove_invalid_solr_characters(utext):
    # Replace each code point that Solr rejects; utext is a unicode object.
    # No try/except needed: unicode.replace() does not raise here.
    for char, replacement in CFG_SOLR_INVALID_CHAR_REPLACEMENTS.iteritems():
        utext = utext.replace(char, replacement)
    return utext

with:
CFG_SOLR_INVALID_CHAR_REPLACEMENTS = {
    u'\uFFFF': u"",
    u'\uD835': u"",
}
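
A hypothetical usage right after the decode (send_to_solr() below is just a 
placeholder for our actual indexing call):

    utext = unicode(text, 'utf-8')
    clean = remove_invalid_solr_characters(utext)
    send_to_solr(clean)   # placeholder for the actual Solr submission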

This works well in our production environment.

Cheers, Patrick
________________________________________
From: Patrick Oliver Glauner [patrick.oliver.glau...@cern.ch]
Sent: Friday, September 28, 2012 10:36 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing in Solr: invalid UTF-8

Thank you. I will check our textification process and see how to improve it.

Patrick


________________________________________
From: Michael McCandless [luc...@mikemccandless.com]
Sent: Wednesday, September 26, 2012 5:45 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing in Solr: invalid UTF-8

Python's unicode() function takes an optional (keyword) "errors"
argument, telling it what to do when an invalid UTF-8 byte sequence is
seen.

The default (errors='strict') is to raise the exceptions you're
seeing.  But you can also pass errors='replace' or errors='ignore'.
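
For example (a small sketch; the byte string is made up -- 0xFF can never
appear in valid UTF-8):

    raw = 'abc\xff'
    unicode(raw, 'utf-8')                    # raises UnicodeDecodeError
    unicode(raw, 'utf-8', errors='replace')  # u'abc\ufffd'
    unicode(raw, 'utf-8', errors='ignore')   # u'abc'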

See http://docs.python.org/howto/unicode.html for details ...

However, I agree with Robert: you should dig into why whatever process
you used to extract the full text from your binary documents is
producing invalid UTF-8 ... something is wrong with that process.

Mike McCandless

http://blog.mikemccandless.com

On Tue, Sep 25, 2012 at 10:44 PM, Robert Muir <rcm...@gmail.com> wrote:
> On Tue, Sep 25, 2012 at 2:02 PM, Patrick Oliver Glauner
> <patrick.oliver.glau...@cern.ch> wrote:
>> Hi
>> Thanks. But I see that 0xd835 is missing in this list (see my exceptions).
>>
>> What's the best way to get rid of all of them in Python? I am new to unicode 
>> in Python but I am sure that this use case is quite frequent.
>>
>
> I don't really know python either, so I could be wrong here, but are
> you just taking these binary .PDF and .DOC files and treating them as
> UTF-8 text and sending them to Solr?
>
> If so, I don't think that will work very well. Maybe instead try
> parsing these binary files with something like Tika to get at the
> actual content and send that? (it seems some people have developed
> python integration for this, e.g.
> http://redmine.djity.net/projects/pythontika/wiki)
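
For reference, one way to drive Tika from Python without a dedicated wrapper 
is to shell out to the tika-app jar; a minimal sketch (the jar and file paths 
below are assumptions):

    # Extract plain text from a PDF via Apache Tika's command-line app.
    import subprocess
    out = subprocess.check_output(
        ['java', '-jar', 'tika-app.jar', '--text', 'document.pdf'])
    # Decode the output (UTF-8 assumed; tika-app has an --encoding flag).
    utext = out.decode('utf-8')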
