Re: Indexing in Solr: invalid UTF-8

2012-10-09 Thread Gora Mohanty
On 9 October 2012 17:42, Patrick Oliver Glauner wrote: > Hello everybody > > Meanwhile, I checked this issue in detail: we use pdftotext to extract text > from our PDFs (). Some generated text files contain > \u and \uD835. > > unicode(text, 'utf-8') does not throw any e

RE: Indexing in Solr: invalid UTF-8

2012-10-09 Thread Patrick Oliver Glauner
From: Patrick Oliver Glauner [patrick.oliver.glau...@cern.ch] Sent: Friday, September 28, 2012 10:36 AM To: solr-user@lucene.apache.org Subject: RE: Indexing in Solr: invalid UTF-8 Thank you. I will check our textification process and see how to improve it. Patrick

RE: Indexing in Solr: invalid UTF-8

2012-09-28 Thread Patrick Oliver Glauner
Thank you. I will check our textification process and see how to improve it. Patrick From: Michael McCandless [luc...@mikemccandless.com] Sent: Wednesday, September 26, 2012 5:45 PM To: solr-user@lucene.apache.org Subject: Re: Indexing in Solr: invalid

Re: Indexing in Solr: invalid UTF-8

2012-09-26 Thread Michael McCandless
Python's unicode function takes an optional (keyword) "errors" argument, telling it what to do when an invalid UTF8 byte sequence is seen. The default (errors='strict') is to throw the exceptions you're seeing. But you can also pass errors='replace' or errors='ignore'. See http://docs.python.org
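
A minimal sketch of the three error handlers described above. It is written for Python 3, where the same handlers are passed to bytes.decode(); on the Python 2.4 setup in this thread the corresponding call would be unicode(text, 'utf-8', 'replace'). The sample bytes and the helper name decode_fulltext are made up for illustration and are not from the original script.

```python
# Hypothetical input: valid UTF-8 ("café") followed by one invalid byte (0xFF).
raw = b"caf\xc3\xa9 \xff end"


def decode_fulltext(data, errors="strict"):
    """Decode bytes as UTF-8 using the given error handler."""
    return data.decode("utf-8", errors=errors)


# errors='strict' (the default) raises on the first invalid sequence.
try:
    decode_fulltext(raw)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='replace' substitutes U+FFFD for each invalid sequence.
replaced = decode_fulltext(raw, errors="replace")

# errors='ignore' silently drops the invalid bytes.
ignored = decode_fulltext(raw, errors="ignore")
```

Of the two lenient handlers, 'replace' leaves a visible marker where data was lost, while 'ignore' hides the damage, so 'replace' is usually easier to debug later.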

Re: Indexing in Solr: invalid UTF-8

2012-09-25 Thread Robert Muir
On Tue, Sep 25, 2012 at 2:02 PM, Patrick Oliver Glauner wrote: > Hi > Thanks. But I see that 0xd835 is missing in this list (see my exceptions). > > What's the best way to get rid of all of them in Python? I am new to unicode > in Python but I am sure that this use case is quite frequent. > I do

RE: Indexing in Solr: invalid UTF-8

2012-09-25 Thread Patrick Oliver Glauner
elsma [markus.jel...@openindex.io] Sent: Tuesday, September 25, 2012 7:24 PM To: solr-user@lucene.apache.org; Patrick Oliver Glauner Subject: RE: Indexing in Solr: invalid UTF-8 Hi - you need to get rid of all non-character code points. http://unicode.org/cldr/utility/list-unicodeset.
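
One possible way to drop such code points in Python, as a sketch only. It assumes Python 3, where a str is a sequence of code points, and it assumes that "invalid" means lone surrogates (U+D800 to U+DFFF, which covers the 0xD835 from the exceptions) plus the Unicode noncharacters (U+FDD0 to U+FDEF and the last two code points of every plane). The function names are hypothetical. On a narrow Python 2 build, valid supplementary characters are stored as surrogate pairs, so this per-character filter would remove them too and would need adjusting.

```python
def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and for the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_surrogate(cp):
    """True for code points in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def strip_invalid_code_points(text):
    """Return text without lone surrogates and noncharacters."""
    return "".join(
        ch for ch in text
        if not (is_surrogate(ord(ch)) or is_noncharacter(ord(ch)))
    )


# Hypothetical sample: a lone high surrogate, a noncharacter, and a valid
# supplementary-plane character (U+1D54A) that must survive.
sample = "a\ud835b\ufffec\U0001d54a"
cleaned = strip_invalid_code_points(sample)
```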

RE: Indexing in Solr: invalid UTF-8

2012-09-25 Thread Markus Jelsma
Indexing in Solr: invalid UTF-8 > > Hello > > We use Solr 3.1 and Jetty to index previously extracted fulltexts from PDFs, > DOC etc. Our indexing script is written in Python 2.4 using solrpy: > > [...] > text = remove_control_characters(text) # except \r, \

Indexing in Solr: invalid UTF-8

2012-09-25 Thread Patrick Oliver Glauner
Hello

We use Solr 3.1 and Jetty to index previously extracted fulltexts from PDFs, DOC etc. Our indexing script is written in Python 2.4 using solrpy:

    [...]
    text = remove_control_characters(text) # except \r, \t, \n
    utext = unicode(text, 'utf-8')
    SOLR_CONNECTION.add(id=recid, fulltext=utext)
    [..