Re: Searching With UTF-8

2017-08-29 Thread Diego Ceccarelli
Hello Lawrence, Which type did you use in the Solr schema for your fields? Cheers, Diego On Tue, Aug 29, 2017 at 5:34 PM, Elitzer, Lawrence < lelit...@lgsinnovations.com> wrote: > Hello! > > > > It seems I can correctly import (with DIH) UTF-8 characters such as J but

Searching With UTF-8

2017-08-29 Thread Elitzer, Lawrence
Hello! It seems I can correctly import (with DIH) UTF-8 characters such as J but I am unable to search on the fields containing the UTF-8 data. I have tried from the Solr admin backend to send just a J and even URL encode it in the q parameter I am specifying. How would I go about searching

RE: Invalid UTF-8 character 0xffff at char #17373581, byte #17539047

2017-02-28 Thread Markus Jelsma
y 28th February 2017 17:27 > To: solr-user@lucene.apache.org > Subject: Invalid UTF-8 character 0xffff at char #17373581, byte #17539047 > > Hello everyone, > > We use Solr (with Adobe Coldfusion) to index circa 60,000 pdfs, however the > daily refresh has been failing wit

Invalid UTF-8 character 0xffff at char #17373581, byte #17539047

2017-02-28 Thread Nick Way
Hello everyone, We use Solr (with Adobe Coldfusion) to index circa 60,000 pdfs, however the daily refresh has been failing with this error "Invalid UTF-8 character 0xffff at char #17373581, byte #17539047" [truncated - full error message is posted below] - - Can Solr be con

Re: Bad contentType for search handler :text/xml; charset=UTF-8

2015-04-23 Thread didier deshommes
On Wed, Apr 22, 2015 at 4:17 PM, Yonik Seeley wrote: > On Wed, Apr 22, 2015 at 11:00 AM, didier deshommes > wrote: > > curl " > > > http://localhost:8983/solr/gettingstarted/select?wt=json&indent=true&q=foundation > " > > -H "Content-type:application/json" > > You're telling Solr the body encodi

Re: Bad contentType for search handler :text/xml; charset=UTF-8

2015-04-22 Thread Pavel Hladik
Hi, our developers solved the problem. We are using Solarium and we had to learn Solarium to use selects with content-type: application/x-www-form-urlencoded Pavel -- View this message in context: http://lucene.472066.n3.nabble.com/Bad-contentType-for-search-handler-text-xml-charset-UTF-8

Re: Bad contentType for search handler :text/xml; charset=UTF-8

2015-04-22 Thread Yonik Seeley
On Wed, Apr 22, 2015 at 11:00 AM, didier deshommes wrote: > curl " > http://localhost:8983/solr/gettingstarted/select?wt=json&indent=true&q=foundation" > -H "Content-type:application/json" You're telling Solr the body encoding is JSON, but then you don't send any body. We could catch that error

Re: Bad contentType for search handler :text/xml; charset=UTF-8

2015-04-22 Thread didier deshommes
application/xml. > > wunder > Walter Underwood > wun...@wunderwood.org > http://observer.wunderwood.org/ (my blog) > > > On Apr 22, 2015, at 3:01 AM, bengates wrote: > > > Looks like Solarium hardcodes a default header "Content-Type: text/xml; > > charse

Re: Bad contentType for search handler :text/xml; charset=UTF-8

2015-04-22 Thread Walter Underwood
ult header "Content-Type: text/xml; > charset=utf-8" if none provided. > Removing it solves the problem. > > It seems that Solr 5.1 doesn't support this content-type. > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Bad-con

Re: Bad contentType for search handler :text/xml; charset=UTF-8

2015-04-22 Thread bengates
Looks like Solarium hardcodes a default header "Content-Type: text/xml; charset=utf-8" if none provided. Removing it solves the problem. It seems that Solr 5.1 doesn't support this content-type. -- View this message in context: http://lucene.472066.n3.nabble.com/Bad-content

Re: Bad contentType for search handler :text/xml; charset=UTF-8

2015-04-22 Thread bengates
e-for-search-handler-text-xml-charset-UTF-8-tp4200314p4201564.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: Bad contentType for search handler :text/xml; charset=UTF-8

2015-04-17 Thread Chris Hostetter
Off the cuff, it sounds like you are making a POST request to the SearchHandler (ie: /search or /query) and the Content-Type you are sending is "text/xml; charset=UTF-8" In the past SearchHandler might have ignored that Content-Type, but now that structured queries can be sent as

Re: Bad contentType for search handler :text/xml; charset=UTF-8

2015-04-17 Thread Erick Erickson
-- > View this message in context: > http://lucene.472066.n3.nabble.com/Bad-contentType-for-search-handler-text-xml-charset-UTF-8-tp4200314.html > Sent from the Solr - User mailing list archive at Nabble.com.

Bad contentType for search handler :text/xml; charset=UTF-8

2015-04-17 Thread Pavel Hladik
ontentType-for-search-handler-text-xml-charset-UTF-8-tp4200314.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: Update with non UTF-8 characters

2014-10-01 Thread Chris Hostetter
: I am indexing Solr 4.9.0 using the /update request handler and am getting : errors from Tika - Illegal IOException from : org.apache.tika.parser.xml.DcXMLParser@74ce3bea which is caused by : MalFormedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence. I FWIW: that error appears to

Update with non UTF-8 characters

2014-10-01 Thread Teague James
Hello! I am indexing Solr 4.9.0 using the /update request handler and am getting errors from Tika - Illegal IOException from org.apache.tika.parser.xml.DcXMLParser@74ce3bea which is caused by MalFormedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence. I believe that this is the

Re: UTF-8 encoding problems while replicating an index using SolrCloud

2014-02-05 Thread David Santamauro
node kicks off an indexing and tries to replicate all the updates using the UpdateHandler. What we get instead is an error around a wrong UTF-8 encoding from the leader trying to call the /update endpoint on the replica: request: http://10.40.0.25:9765/skus/update?update.chain=custom&_vers

UTF-8 encoding problems while replicating an index using SolrCloud

2014-02-05 Thread Ugo Matrangolo
Hi, we are having problems with an installation of SolrCloud where a leader node kicks off an indexing and tries to replicate all the updates using the UpdateHandler. What we get instead is an error around a wrong UTF-8 encoding from the leader trying to call the /update endpoint on the replica

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-06 Thread Federico Chiacchiaretta
2013/8/6 Raymond Wiker > Ok, let me rephrase that slightly: does your database extraction include > BLOBs or CLOBs that are actually complete documents, that might be UTF-8 > encoded text? > > It definitely does, each entry I have in PostgreSQL has a field of type "text

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Raymond Wiker
Ok, let me rephrase that slightly: does your database extraction include BLOBs or CLOBs that are actually complete documents, that might be UTF-8 encoded text? From the stack trace in your second post, it seems that the error occurs while parsing an XML file uploaded via the UpdateRequestHand

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Federico Chiacchiaretta
No, the content has no XML tags included (hope I understood what you were asking here). Federico 2013/8/5 Raymond Wiker > On Aug 5, 2013, at 20:12 , Federico Chiacchiaretta < > federico.c...@gmail.com> wrote: > > Hi Raymond, > > I agree with you, 0xfffe is a special character, that is why I wa

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Raymond Wiker
On Aug 5, 2013, at 20:12 , Federico Chiacchiaretta wrote: > Hi Raymond, > I agree with you, 0xfffe is a special character, that is why I was asking > how it's handled in solr. > In my document, 0xfffe does not appear at the beginning, it's in the > content. > > Just an update about testing I'm d

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Sundararaju, Shankar
The problem is that even though unicode point \uFFFF and \uFFFE are valid UTF-8 characters, they will not be parsed by standards conforming XML parsers. There is something called UTF-8 replacement character \uFFFD that can be used to replace such characters. While indexing docs, replace all such
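The replacement suggested above can be sketched in Python. This is only an illustration under stated assumptions: the thread shows no code, the function name `scrub_for_xml` is hypothetical, and the character set is taken from the XML 1.0 definition of a legal character (it rejects C0 controls other than tab, LF and CR, lone surrogates, U+FFFE and U+FFFF).

```python
import re

# Code points that XML 1.0 parsers reject: C0 controls other than
# tab/LF/CR, lone surrogates, and the noncharacters U+FFFE and U+FFFF.
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]")

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def scrub_for_xml(text: str) -> str:
    """Return text with every XML-illegal code point replaced by U+FFFD.

    Hypothetical helper: call it on field values before building the
    update document that is sent to the indexer.
    """
    return _XML_ILLEGAL.sub(REPLACEMENT, text)


if __name__ == "__main__":
    print(repr(scrub_for_xml("price\ufffe list\uffff")))
```

Replacing rather than deleting keeps character offsets stable and leaves a visible marker where the source data was damaged.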

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Robert Muir
On Mon, Aug 5, 2013 at 3:03 PM, Chris Hostetter wrote: > > : > 0xfffe is not a special character -- it is explicitly *not* a character in > : > Unicode at all, it is set aside as "not a character." specifically so > : > that the character 0xfeff can be used as a BOM, and if the BOM is read > : >

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Steve Rowe
d UTFs? A: Absolutely not. Noncharacters do not cause a Unicode string to be ill-formed in any UTF. This can be seen explicitly in the table above, where every noncharacter code point has a well- formed representation in UTF-32, in UTF-16, and in UTF

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Chris Hostetter
: > 0xfffe is not a special character -- it is explicitly *not* a character in : > Unicode at all, it is set aside as "not a character." specifically so : > that the character 0xfeff can be used as a BOM, and if the BOM is read : > incorrectly, it will cause an error. : : XML doesn't allow contro

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Shawn Heisey
On 8/5/2013 12:12 PM, Federico Chiacchiaretta wrote: Hi Raymond, I agree with you, 0xfffe is a special character, that is why I was asking how it's handled in solr. In my document, 0xfffe does not appear at the beginning, it's in the content. I believe that 0xfffe not a valid UTF-8

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Robert Muir
On Mon, Aug 5, 2013 at 11:42 AM, Chris Hostetter wrote: > > : I agree with you, 0xfffe is a special character, that is why I was asking > : how it's handled in solr. > : In my document, 0xfffe does not appear at the beginning, it's in the > : content. > > Unless i'm missunderstanding something (an

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Chris Hostetter
the content" of your database, then your database content (by definition) can not be UTF-8, because 0xfffe is not a character in Unicode. if you are able to index that content in a single node Solr+DIH+JDBC+postgres setup, then you are getting (un)lucky -- postgres isn't complaining that

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Federico Chiacchiaretta
wer. > > From the docs you linked i found: > > "This property is only relevent for server versions less than or equal to > > 7.2". > > > > I'm using version 9.1, I gave it a try but unfortunately I had no luck. > > Besides, I checked encoding sett

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Raymond Wiker
> From the docs you linked i found: > "This property is only relevent for server versions less than or equal to > 7.2". > > I'm using version 9.1, I gave it a try but unfortunately I had no luck. > Besides, I checked encoding settings on DB and it's UTF-8. &g

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Federico Chiacchiaretta
Hi Shawn, thanks for your answer. From the docs you linked I found: "This property is only relevent for server versions less than or equal to 7.2". I'm using version 9.1, I gave it a try but unfortunately I had no luck. Besides, I checked encoding settings on DB and it's U

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Shawn Heisey
83/solr/archive/:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: > Invalid > UTF-8 character 0xfffe at char #416, byte #127) It sounds like your database is not using the UTF-8 character set, but the JDBC driver (or the driver-server combination) is not aware that the character set is dif

Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Federico Chiacchiaretta
; org.apache.solr.common.SolrException; java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char #6755, byte #6143) at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18) at com.ctc.wstx.sr.StreamScanner.throwLazyError

Re: Solr 4.3.1 only accepts UTF-8 encoded queries?

2013-08-01 Thread Shawn Heisey
, but it doesnt in 4.3. I brought up the issue on the dev list. Allowing a user to change the default character set would cause problems for SolrCloud or distributed search, because the requests generated by the server are UTF-8. The responder did say that he can imagine all the code for a sol

Invalid UTF-8 character 0xfffe during shard update

2013-08-01 Thread Federico Chiacchiaretta
: Invalid UTF-8 character 0xfffe at char #416, byte #127) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:402) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180) at org.apache.solr.update.SolrCmdDistributor$1.call

Re: Solr 4.3.1 only accepts UTF-8 encoded queries?

2013-07-27 Thread Shawn Heisey
I brought up the issue on the dev list. Allowing a user to change the default character set would cause problems for SolrCloud or distributed search, because the requests generated by the server are UTF-8. The responder did say that he can imagine all the code for a solution that involves

Re: Solr 4.3.1 only accepts UTF-8 encoded queries?

2013-07-26 Thread Shawn Heisey
ersion 3.5, but > it doesnt in 4.3. Version 3.5 didn't force UTF-8, which led to a TON of problems with misconfigured containers, notably tomcat. SOLR-4265 (first available in 4.1.0) fixed this problem. https://issues.apache.org/jira/browse/SOLR-4265 In your case, because you are actually

Re: Solr 4.3.1 only accepts UTF-8 encoded queries?

2013-07-26 Thread Gustav
ge in context: http://lucene.472066.n3.nabble.com/Solr-4-3-1-only-accepts-UTF-8-encoded-queries-tp4080587p4080706.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr 4.3.1 only accepts UTF-8 encoded queries?

2013-07-26 Thread Shawn Heisey
E9" and "cão" is encoded to "c%E3o". > My URLencoding in tomcat is "iso-8859-1", but when i do a query like that to > solr(?q="caf%E9") It returns the error {msg=URLDecoder: Invalid character > encoding detected after position 2 of query string /
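The mismatch described in this thread can be reproduced with a short Python sketch. It is not Solr's code, and the helper name `decodes_as_utf8` is hypothetical; it only shows that the same word percent-encodes differently under ISO-8859-1 and UTF-8, and that the ISO-8859-1 form is not well-formed UTF-8 once unescaped.

```python
from urllib.parse import quote, unquote

word = "caf\u00e9"  # "cafe" with an acute accent on the final letter

# Percent-encoding depends on the charset chosen by the client.
latin1_query = quote(word, encoding="iso-8859-1")  # form sent by the client in the thread
utf8_query = quote(word, encoding="utf-8")         # form a UTF-8-only server can decode


def decodes_as_utf8(encoded: str) -> bool:
    """Return True if the percent-encoded string is well-formed UTF-8."""
    try:
        unquote(encoded, encoding="utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(latin1_query, decodes_as_utf8(latin1_query))
    print(utf8_query, decodes_as_utf8(utf8_query))
```

Under these assumptions the fix on the client side is to encode the query string as UTF-8 before escaping it, regardless of the container's own URL-encoding setting.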

Solr 4.3.1 only accepts UTF-8 encoded queries?

2013-07-26 Thread Gustav
". My URLencoding in tomcat is "iso-8859-1", but when i do a query like that to solr(?q="caf%E9") It returns the error {msg=URLDecoder: Invalid character encoding detected after position 2 of query string / form data (while parsing as UTF-8),code=400}. It works perfectly in my

java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #478803, byte #606190)

2013-04-11 Thread eakarsu
e to index next documents coming from nutch. Or even though I am new to SOLR, maybe, I can write update pre/post processor plugin to SOLR update job to ignore XML errors. Do we have solution for this problem? I appreciate your help class java.io.CharConversionException] Invalid UTF-8 character 0xff

Re: Antw: Re: How to retrieve field contents as UTF-8 from Solr-Index with SolrJ

2012-10-19 Thread Andreas Kahl
"Jack Krupansky" 18.10.2012 21:36 >>> Have you verified that the data was indexed properly (UTF-8 encoding)? Try a raw HTTP request using the browser or curl and see how that field looks in the resulting XML. -- Jack Krupansky -Original Message- From: Andreas Kahl Sent:

Re: Antw: Re: How to retrieve field contents as UTF-8 from Solr-Index with SolrJ

2012-10-18 Thread Jack Krupansky
Have you verified that the data was indexed properly (UTF-8 encoding)? Try a raw HTTP request using the browser or curl and see how that field looks in the resulting XML. -- Jack Krupansky -Original Message- From: Andreas Kahl Sent: Thursday, October 18, 2012 1:10 PM To: j

Antw: Re: How to retrieve field contents as UTF-8 from Solr-Index with SolrJ

2012-10-18 Thread Andreas Kahl
Jack, Thanks for the hint, but we have already set URIEncoding="UTF-8" on all our tomcats, too. Regards Andreas >>> "Jack Krupansky" 18.10.12 17.11 Uhr >>> It may be that your container does not have UTF-8 enabled. For example, with Tomcat you

Re: How to retrieve field contents as UTF-8 from Solr-Index with SolrJ

2012-10-18 Thread Jack Krupansky
It may be that your container does not have UTF-8 enabled. For example, with Tomcat you need something like: Make sure your "Connector" element has URIEncoding="UTF-8" (for Tomcat.) -- Jack Krupansky -Original Message- From: Andreas Kahl Sent: Thursday, Octob

How to retrieve field contents as UTF-8 from Solr-Index with SolrJ

2012-10-18 Thread Andreas Kahl
asian alphabets, so we need UTF-8. Now we have the problem that the string returned by marcXml = results.get(0).getFirstValue("marcxml").toString(); is not valid UTF-8, so the resulting XML-Document is not well formed. Here is what we do in Java: << ModifiableSolrPar

Re: Indexing in Solr: invalid UTF-8

2012-10-09 Thread Gora Mohanty
On 9 October 2012 17:42, Patrick Oliver Glauner wrote: > Hello everybody > > Meanwhile, I checked this issue in detail: we use pdftotext to extract text > from our PDFs (<http://cds.cern.ch/>). Some generated text files contain > \u and \uD835. > > unicode(text,

RE: Indexing in Solr: invalid UTF-8

2012-10-09 Thread Patrick Oliver Glauner
Hello everybody Meanwhile, I checked this issue in detail: we use pdftotext to extract text from our PDFs (<http://cds.cern.ch/>). Some generated text files contain \u and \uD835. unicode(text, 'utf-8') does not throw any exception for these texts. Subsequently, Solr thr

RE: Indexing in Solr: invalid UTF-8

2012-09-28 Thread Patrick Oliver Glauner
UTF-8 Python's unicode function takes an optional (keyword) "errors" argument, telling it what to do when an invalid UTF8 byte sequence is seen. The default (errors='strict') is to throw the exceptions you're seeing. But you can also pass errors='rep

Re: Indexing in Solr: invalid UTF-8

2012-09-26 Thread Michael McCandless
lace' or errors='ignore'. See http://docs.python.org/howto/unicode.html for details ... However, I agree with Robert: you should dig into why whatever process you used to extract the full text from your binary documents is producing invalid UTF-8 ... something is wrong with that process. Mike McCan
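The `errors` argument discussed in this thread behaves as follows. This sketch uses Python 3, where `bytes.decode` takes the place of the Python 2 `unicode()` call from the original script; the sample bytes and the helper name `decode_lenient` are illustrative assumptions.

```python
# ISO-8859-1 bytes for "cafe au lait" with an accented e: not valid UTF-8.
raw = b"caf\xe9 au lait"


def decode_lenient(data: bytes, errors: str = "replace") -> str:
    """Decode data as UTF-8 using the given error handler.

    "strict" raises UnicodeDecodeError on invalid input,
    "replace" substitutes U+FFFD, and "ignore" drops the bad bytes.
    """
    return data.decode("utf-8", errors=errors)


if __name__ == "__main__":
    print(repr(decode_lenient(raw, "replace")))
    print(repr(decode_lenient(raw, "ignore")))
```

As the replies point out, a lenient handler hides the symptom only; the extraction step that produced the invalid bytes still deserves a look.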

Re: Indexing in Solr: invalid UTF-8

2012-09-25 Thread Robert Muir
quite frequent. > I don't really know python either: so I could be wrong here but are you just taking these binary .PDF and .DOC files and treating them as UTF-8 text and sending them to Solr? If so, I don't think that will work very well. Maybe instead try parsing these binary files w

RE: Indexing in Solr: invalid UTF-8

2012-09-25 Thread Patrick Oliver Glauner
elsma [markus.jel...@openindex.io] Sent: Tuesday, September 25, 2012 7:24 PM To: solr-user@lucene.apache.org; Patrick Oliver Glauner Subject: RE: Indexing in Solr: invalid UTF-8 Hi - you need to get rid of all non-character code points. http://unicode.org/cldr/utility/list-unicodeset.

RE: Indexing in Solr: invalid UTF-8

2012-09-25 Thread Markus Jelsma
Indexing in Solr: invalid UTF-8 > > Hello > > We use Solr 3.1 and Jetty to index previously extracted fulltexts from PDFs, > DOC etc. Our indexing script is written in Python 2.4 using solrpy: > > [...] > text = remove_control_characters(text) # except \r, \

Indexing in Solr: invalid UTF-8

2012-09-25 Thread Patrick Oliver Glauner
Hello We use Solr 3.1 and Jetty to index previously extracted fulltexts from PDFs, DOC etc. Our indexing script is written in Python 2.4 using solrpy: [...] text = remove_control_characters(text) # except \r, \t, \n utext = unicode(text, 'utf-8') SOLR_CONNECTION.add(id=recid, full

UTF-8 without BOM French characters issue

2012-08-31 Thread binoybt
the format "UTF-8 without BOM"? Is there a way to get out of this issue. French character : étaient état Thanks Binoy -- View this message in context: http://lucene.472066.n3.nabble.com/UTF-8-without-BOM-French-characters-issue-tp4004751.html Sent from the Solr - User mailing list

Re: UTF-8

2012-07-20 Thread Mark Miller
It varies. Last I used Tomcat (some years ago) it defaulted to the system default encoding and you had to use -Dfile.encoding... to get UTF-8. Jetty currently defaults to UTF-8. On Jul 17, 2012, at 11:12 PM, William Bell wrote: > -Dfile.encoding=UTF-8... Is this usually recommended for S

Re: UTF-8

2012-07-17 Thread Paul Libbrecht
My experience is that this property has made a whole lot of a difference. At least till solr 3.1. The servlet container has not been the only bit. paul On 18 Jul 2012 at 05:12, William Bell wrote: > -Dfile.encoding=UTF-8... Is this usually recommended for SOLR indexes? > >

UTF-8

2012-07-17 Thread William Bell
-Dfile.encoding=UTF-8... Is this usually recommended for SOLR indexes? Or is the encoding usually just handled by the servlet container like Jetty? -- Bill Bell billnb...@gmail.com cell 720-256-8076

Re: UTF-8 encoding

2012-04-04 Thread Erik Hatcher
if it needs improving. Thanks, Erik On Apr 4, 2012, at 04:29 , henri wrote: > I have finally solved my problem!! > > Did the following: > > added two lines in the /browse requestHandler > velocity.properties > text/html;charset=UTF-8 > > Moved velocity

Re: UTF-8 encoding

2012-04-04 Thread henri
I have finally solved my problem!! Did the following: added two lines in the /browse requestHandler velocity.properties text/html;charset=UTF-8 Moved velocity.properties from solr/conf/velocity to solr/conf Not being an expert, I am not 100% sure this is the "best" sol

Re: UTF-8 encoding

2012-03-30 Thread henri.gour...@laposte.net
Paul, velocity.properties are set. One thing I am not 100% sure about is where this file should reside? I have placed in in the example/solr/conf/velocity folder (where the .vm files reside). Cheers, Henri -- View this message in context: http://lucene.472066.n3.nabble.com/UTF-8-encoding

Re: UTF-8 encoding

2012-03-29 Thread Paul Libbrecht
Henri, look at velocity.properties. I have there: > input.encoding = UTF-8 Do you also? This is the encoding of the .vm files. Of course also make sure you edit these files in UTF-8 (using jEdit made it trustable to me). paul On 30 March 2012 at 08:49, henri.gour...@laposte.net

Re: UTF-8 encoding

2012-03-29 Thread henri.gour...@laposte.net
OK, Ill try to provide more details: I am using solr-3.5.0 I am running the example provided in the package. Some of the modifications I have done in the various velocity/*.vm files have accents! It is those accents that show up garbled when I look at the results. The .vm files are utf-8 encoded

Re: UTF-8 encoding

2012-03-29 Thread Erick Erickson
I doubt that the pre-installed Jetty server has problems with UTF-8, although you haven't told us what version of Solr you're running on so it could be really old. And you also haven't told us why you think UTF-8 is a problem. How is this manifesting itself? Failed searches?

Re: UTF-8 encoding

2012-03-29 Thread henri.gour...@laposte.net
Thanks for the tips, but unfortunately, no progress so far. Reading through the Web, I guess that jetty has utf-8 problems! I guess that I will have to switch from the embedded (and pre installed -> easy) jetty server present in Solr in favor of Tomcat (for which I have to rediscover

Re: UTF-8 encoding

2012-03-29 Thread Paul Libbrecht
success or lack thereof, I'm interested and I am sure others are. paul On 29 March 2012 at 16:49, Bob Sandiford wrote: > Hi, Henri. > > Make sure that the container in which you are running Solr is also set for > UTF-8. > > For example, in Tomcat, in the serve

RE: UTF-8 encoding

2012-03-29 Thread Bob Sandiford
Hi, Henri. Make sure that the container in which you are running Solr is also set for UTF-8. For example, in Tomcat, in the server.xml file, your Connector definitions should include: URIEncoding="UTF-8" Bob Sandiford | Lead Software Engineer | SirsiDynix P: 800.288.

UTF-8 encoding

2012-03-29 Thread henri.gour...@laposte.net
I cant get utf-8 encoding to work!! I have text/html;charset=UTF-8 in my request handler, and input.encoding=UTF-8 output.encoding=UTF-8 in velocity.properties, in various locations (I may have the wrong ones! at least in the folder where the .vm files reside) What else should I be

Re: UTF-8 support during indexing content

2012-02-01 Thread Chris Hostetter
: Subject: UTF-8 support during indexing content : References: <8ce9f966c6f6769-19a0-9e...@webmail-m069.sysops.aol.com> : <1326447127.1952.10.camel@snape> : <8ceade0f7e0ecec-189c-c...@webmail-m069.sysops.aol.com> : <1328105200.2033.33.camel@snape> : In-Reply-To: <132

RE: UTF-8 support during indexing content

2012-02-01 Thread Van Tassell, Kristian
Travis and all, This is solved and was not directly a Solr issue. I'll note the solution here in case anyone makes the same mistake. The documents are UTF-8 and the source documents are converted via XSLT. They look good up to that point. First off, based off of some other recommenda

Re: UTF-8 support during indexing content

2012-02-01 Thread Travis Low
Are you sure the input document is in UTF-8? That looks like classic ISO-8859-1-treated-as-UTF-8. How did you confirm the document contains the right quote marks immediately prior to uploading? If you just visually inspected it, then use whatever tool you viewed it in to see what the character
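One common way this kind of garbling arises is UTF-8 bytes being read back under a single-byte charset. The Python sketch below shows that mechanism and its reversal; whether it matches what happened in this thread is an assumption, and the function names are hypothetical.

```python
# The word from the thread, wrapped in U+201D RIGHT DOUBLE QUOTATION MARK.
original = "\u201dstemming\u201d"


def misread_as_latin1(text: str) -> str:
    """Encode as UTF-8, then decode the bytes as ISO-8859-1 (the mistake)."""
    return text.encode("utf-8").decode("iso-8859-1")


def repair(garbled: str) -> str:
    """Undo the mistake: recover the bytes, then decode them as UTF-8."""
    return garbled.encode("iso-8859-1").decode("utf-8")


if __name__ == "__main__":
    garbled = misread_as_latin1(original)
    print(repr(garbled))          # each quote mark became three characters
    print(repair(garbled) == original)
```

Inspecting the raw bytes, as suggested in the reply, is the reliable way to tell which side of the pipeline introduced the wrong charset.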

UTF-8 support during indexing content

2012-02-01 Thread Van Tassell, Kristian
Hello everyone, I have a question that I imagine has been asked many times before, so I apologize for the repeat. I have a basic text field with the following text: the word ”stemming” in quotes Uploading the data yields no errors, however when it is indexed, the text looks like this:

Re: form-data post to ExtractingRequestHandler with utf-8 characters not handled

2011-11-02 Thread kgoess
I finally managed to answer my own question. UTF-8 data in the body is ok, but you need to specify charset=utf-8 in the Content-Type header in each part, to tell the receiver (Solr) that it's not the default ISO-8859-1 Content-Disposition: form-data; name=literal.bptitle Content-Type:
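A rough Python sketch of the fix described here: each text part of the multipart body declares its own charset. The snippet above is cut off before the header value, so the `text/plain` media type, the boundary string and the helper name `form_part` are all assumptions made for illustration, not the poster's exact request.

```python
def form_part(boundary: str, name: str, value: str) -> bytes:
    """Build one multipart/form-data text part with an explicit charset.

    Hypothetical helper; "text/plain" is an assumed media type.
    """
    header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{name}"\r\n'
        "Content-Type: text/plain; charset=utf-8\r\n"
        "\r\n"
    )
    # Headers are plain ASCII; only the value carries UTF-8 data.
    return header.encode("ascii") + value.encode("utf-8") + b"\r\n"


if __name__ == "__main__":
    part = form_part("XYZ", "literal.bptitle", "caf\u00e9")
    print(part)
```

Without the per-part charset the receiver falls back to its default, which the post identifies as ISO-8859-1.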

form-data post to ExtractingRequestHandler with utf-8 characters not handled

2011-10-28 Thread kgoess
I'm trying to post a PDF along with a whole bunch of metadata fields to the ExtractingRequestHandler as multipart/form-data. It works fine except for the utf-8 character handling. Here is what my post looks like (abridged): POST /solr/update/extract HTTP/1.1 TE: deflate,gzip;

Re: Solr messing up the UK GBP (pound) symbol in response, even though Java environment variabe has file encoding is set to UTF 8....

2011-09-28 Thread Ravish Bhagdev
Thanks Chris. Yes, changing connector settings not just in solr but also in all webapps that were sending queries into it solved the problem! Appreciate the help. R On Tue, Sep 13, 2011 at 6:11 PM, Chris Hostetter wrote: > > : Any idea why solr is unable to return the pound sign as-is? > : > :

Re: Solr messing up the UK GBP (pound) symbol in response, even though Java environment variabe has file encoding is set to UTF 8....

2011-09-13 Thread Chris Hostetter
: Any idea why solr is unable to return the pound sign as-is? : : I tried typing in £ 1 million in Solr admin GUI and got following response. ... : £ 1 million ... : Here is my Java Properties I got also from admin interface: ... : catalina.home = : /home/rbhagdev/SCCRepo

Solr messing up the UK GBP (pound) symbol in response, even though Java environment variabe has file encoding is set to UTF 8....

2011-09-11 Thread Ravish Bhagdev
classworlds.conf = /usr/share/maven2/bin/m2.conf sun.jnu.encoding = UTF-8 java.library.path = /usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/amd64/server:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/amd64:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/../lib/amd64:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Markus Jelsma
SolrJ 3.1? Anything else on the > >>> Nutch part i should have taken care off? > >>> > >>> Thanks! > >>> > >>> > >>> Jun 27, 2011 10:24:28 AM org.apache.solr.core.SolrCore execute > >>> INFO: [] webapp=/solr path=/updat

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Mike Sokolov
lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1142033, byte #1155068) at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18) at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731) at com.ctc.wstx.sr.

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Markus Jelsma
ath=/update params={wt=javabin&version=2} > > status=500 QTime=423 Jun 27, 2011 10:24:28 AM > > org.apache.solr.common.SolrException log > > SEVERE: java.lang.RuntimeException: [was class > > java.io.CharConversionException] Invalid UTF-8 character 0xffff at char >

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Markus Jelsma
core.SolrCore execute > INFO: [] webapp=/solr path=/update params={wt=javabin&version=2} status=500 > QTime=423 Jun 27, 2011 10:24:28 AM org.apache.solr.common.SolrException > log > SEVERE: java.lang.RuntimeException: [was class > java.io.CharConversionException]

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Mike Sokolov
en care off? Thanks! Jun 27, 2011 10:24:28 AM org.apache.solr.core.SolrCore execute INFO: [] webapp=/solr path=/update params={wt=javabin&version=2} status=500 QTime=423 Jun 27, 2011 10:24:28 AM org.apache.solr.common.SolrException log SEVERE: java.lang.RuntimeException: [was class java.io.Cha

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Markus Jelsma
O: [] webapp=/solr path=/update params={wt=javabin&version=2} > >> status=500 QTime=423 Jun 27, 2011 10:24:28 AM > >> org.apache.solr.common.SolrException log SEVERE: > >> java.lang.RuntimeException: [was class java.io.CharConversionException] > >> Invalid U

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread lee carroll
[] webapp=/solr path=/update params={wt=javabin&version=2} status=500 >> QTime=423 >> Jun 27, 2011 10:24:28 AM org.apache.solr.common.SolrException log >> SEVERE: java.lang.RuntimeException: [was class >> java.io.CharConversionException] Invalid UTF-8 character 0xffff at char >> #114203

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread ramires
Hi, it's the same error I mentioned here http://lucene.472066.n3.nabble.com/strange-utf-8-problem-td3094473.html. Also, if you use Solr 1.4.1 there is no problem like that. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-3-1-indexing-error-Invalid-UTF-8-character

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Thomas Fischer
atus=500 > QTime=423 > Jun 27, 2011 10:24:28 AM org.apache.solr.common.SolrException log > SEVERE: java.lang.RuntimeException: [was class > java.io.CharConversionException] Invalid UTF-8 character 0xffff at char > #1142033, byte #1155068) > at > com.ctc.wstx.util.Excep

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Mike Sokolov
OK - re-reading your message it seems maybe that is what you were trying to say too, Robert. FWIW I agree with you that XML is rigid, sometimes for purely arbitrary reasons. But nobody has really helped Markus here - unfortunately, there is no easy way out of this mess. What I do to handle i

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Mike Sokolov
Actually - you are both wrong! It is true that 0xffff is a valid UTF8 character, and not a valid UTF8 byte sequence. But the parser is reporting (or trying to) that 0xffff is an invalid XML character. And Robert - if the wording offends you, you might want to send a note to Tatu (http://ji

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Bernd Fehling
On 27.06.2011 14:48, Robert Muir wrote: On Mon, Jun 27, 2011 at 8:47 AM, Bernd Fehling wrote: correct!!! but what i said, is totally different than what you said. you are still wrong. http://www.unicode.org/faq//utf_bom.html see Q: What is a UTF?

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Robert Muir
On Mon, Jun 27, 2011 at 8:47 AM, Bernd Fehling wrote: > > correct!!! > but what i said, is totally different than what you said. you are still wrong.

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Bernd Fehling
On 27.06.2011 14:35, Robert Muir wrote: On Mon, Jun 27, 2011 at 8:30 AM, Bernd Fehling wrote: Unicode U+FFFF is UTF-8 byte sequence "ef bf bf" that is right. But I was saying that UTF-8 0xffff (which is byte sequence "ff ff") is illegal

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Robert Muir
On Mon, Jun 27, 2011 at 8:30 AM, Bernd Fehling wrote: > Unicode U+FFFF is UTF-8 byte sequence "ef bf bf" that is right. > > But I was saying that UTF-8 0xffff (which is byte sequence "ff ff") is > illegal > and that's what the java.io.CharConversionExcept

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Bernd Fehling
On 27.06.2011 14:02, Robert Muir wrote: On Mon, Jun 27, 2011 at 7:11 AM, Bernd Fehling wrote: So there is no UTF-8 0xffff. It is illegal. you are wrong: it is legally encoded as a three byte sequence: ef bf bf Unicode U+FFFF is UTF-8 byte sequence "ef bf bf" that is rig

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Robert Muir
On Mon, Jun 27, 2011 at 7:11 AM, Bernd Fehling wrote: > > So there is no UTF-8 0xffff. It is illegal. > you are wrong: it is legally encoded as a three byte sequence: ef bf bf
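Both sides of this exchange can be checked with a few lines of Python. The sketch is only a verification aid, and the helper names `utf8_hex` and `is_valid_utf8` are hypothetical: the code point U+FFFF has a well-formed three-byte UTF-8 encoding, while the raw byte pair ff ff is not valid UTF-8.

```python
def utf8_hex(code_point: int) -> str:
    """Return the UTF-8 encoding of a code point as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in chr(code_point).encode("utf-8"))


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for cp in (0x07FF, 0x0800, 0xFFFF):
        print(f"U+{cp:04X} -> {utf8_hex(cp)}")
    print(is_valid_utf8(b"\xff\xff"))
```

So the encoding itself is legal; the rejection reported in the stack traces comes from the XML layer, which does not accept U+FFFF as a character.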

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Bernd Fehling
I suggest avoiding illegal UTF-8 characters by pre-filtering your content stream before loading. Unicode / UTF-8 (hex): U+07FF = df bf; U+0800 = e0 a0 80. So there is no UTF-8 0xffff. It is illegal. Regards On 27.06.2011 12:40, Markus Jelsma wrote: Hi, I came across the indexing error below

Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Markus Jelsma
[was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1142033, byte #1155068) at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18) at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)

strange utf-8 problem

2011-06-22 Thread ramires
-1.4.0.jar for solr 1.4.1 becouse of javabin errors. here is problematic chars. "Sao Tom���nd Princip���STP" SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0x at char #681112, byte #700315)

utf-8 is not so much everywhere

2011-03-26 Thread Paul Libbrecht
Hello group, this message is a word of warning and plea to wiki writers. Reading the wiki and documentation in general, there seems to be an accepted consensus that most in SOLR is working in utf-8. In my opinion this is absolutely good. But this may be a remnant of the times. Several efforts

Re: Does solr supports indexing of files other than UTF-8

2011-01-28 Thread Yonik Seeley
On Thu, Jan 27, 2011 at 3:51 AM, prasad deshpande wrote: > The size of docs can be huge, like suppose there are 800MB pdf file to index > it I need to translate it in UTF-8 and then send this file to index. PDF is binary AFAIK... you shouldn't need to do any charset translation before
