To: solr-user@lucene.apache.org
Subject: Re: verifying that an index contains ONLY utf-8
Thanks for all the responses.
CharsetDetector does look promising. Unfortunately, we aren't allowed
to keep the original of much of our data, so the solr index is the
only place it exists (to us). I do have a java app that "reindexes",
i.e., reads all documents out of one index, does some transformations,
and writes the documents into a new index.
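A minimal sketch of that read-everything loop, assuming Lucene 3.x-era
APIs (IndexReader.open, isDeleted); the ReindexScan class name and the
per-field hook are placeholders, not part of the original app:

    import java.io.File;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Fieldable;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.FSDirectory;

    public class ReindexScan {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
            try {
                for (int i = 0; i < reader.maxDoc(); i++) {
                    if (reader.isDeleted(i)) continue;   // skip deleted docs
                    Document doc = reader.document(i);   // stored fields only
                    for (Fieldable f : doc.getFields()) {
                        String value = f.stringValue();
                        if (value != null) {
                            // inspect/transform the value here, then add it
                            // to a Document bound for the new index
                        }
                    }
                }
            } finally {
                reader.close();
            }
        }
    }

Note that only stored fields can be read back this way; indexed-but-not-
stored content is not recoverable from the index.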
On Thu, Jan 13, 2011 at 2:05 PM, Jonathan Rochkind wrote:
>
> There are various packages of such heuristic algorithms to guess char
> encoding; I wouldn't try to write my own. icu4j might include such an
> algorithm, not sure.
>
It does:
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html
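A minimal usage sketch of that class (the sample bytes are illustrative;
short or ASCII-only input matches many charsets, so treat the result as
a ranked guess rather than a proof):

    import com.ibm.icu.text.CharsetDetector;
    import com.ibm.icu.text.CharsetMatch;

    public class DetectExample {
        public static void main(String[] args) throws Exception {
            byte[] raw = "caf\u00e9".getBytes("ISO-8859-1"); // ends in 0xE9, not valid UTF-8

            CharsetDetector detector = new CharsetDetector();
            detector.setText(raw);
            CharsetMatch match = detector.detect();          // best guess, may be null
            if (match != null) {
                System.out.println(match.getName()
                    + " (confidence " + match.getConfidence() + "/100)");
            }
        }
    }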
The tokens that Lucene sees (pre-4.0) are char[] based (i.e., UTF-16), so
the first place where invalid UTF-8 is detected/corrected/etc. is
during your analysis process, which takes your raw content and
produces char[] based tokens.
Second, during indexing, Lucene ensures that the incoming char[]
tokens are valid UTF-16; invalid sequences such as unpaired surrogates
are replaced with the replacement character U+FFFD.
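If you want to catch bad bytes before they ever become char[] tokens,
the byte-to-char decode is the natural chokepoint. A sketch using the
JDK's CharsetDecoder (StandardCharsets needs Java 7+; StrictUtf8 is a
made-up name):

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public final class StrictUtf8 {
        // Decode raw bytes to the char[] form the analyzer will see,
        // failing loudly instead of silently substituting U+FFFD.
        public static char[] decodeOrThrow(byte[] raw) throws CharacterCodingException {
            CharBuffer chars = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(raw));  // throws on invalid UTF-8
            char[] out = new char[chars.remaining()];
            chars.get(out);
            return out;
        }
    }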
Scanning for only 'valid' UTF-8 is definitely not simple. You can
eliminate some obviously invalid UTF-8 by byte ranges, but you can't
confirm valid UTF-8 by byte ranges alone. There are some bytes that are
only valid UTF-8 when they come directly before or after certain other
bytes. There is no stateless byte-at-a-time test; you have to track the
multi-byte sequences as you go (see the sketch after this message).
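To make the "certain bytes only next to certain bytes" point concrete,
here is a deliberately simplified check (it ignores overlong encodings
and the surrogate range, so it is a sketch, not a complete validator;
Utf8Check is a made-up name):

    public final class Utf8Check {
        public static boolean looksLikeUtf8(byte[] bytes) {
            int expect = 0;                      // continuation bytes still owed
            for (byte b : bytes) {
                int v = b & 0xFF;
                if (expect > 0) {
                    if (v < 0x80 || v > 0xBF) return false;  // not a continuation byte
                    expect--;
                } else if (v < 0x80) {
                    // single-byte ASCII, always fine
                } else if (v >= 0xC2 && v <= 0xDF) {
                    expect = 1;                  // lead byte of a 2-byte sequence
                } else if (v >= 0xE0 && v <= 0xEF) {
                    expect = 2;                  // lead byte of a 3-byte sequence
                } else if (v >= 0xF0 && v <= 0xF4) {
                    expect = 3;                  // lead byte of a 4-byte sequence
                } else {
                    return false;                // 0x80-0xC1, 0xF5-0xFF never start a character
                }
            }
            return expect == 0;                  // must not end mid-sequence
        }
    }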
Take a look also at icu4j, which is one of the contrib projects ...
> Converting on the fly is not supported by Solr but should be relatively
> easy in Java.
> Also, scanning is relatively simple (accept only a range). Detection too:
> http://www.mozilla.org/projects/intl/chardet.html
>
>> We've created an index from a number of different documents that are
>> supplied by third parties. [...]
Converting on the fly is not supported by Solr but should be relatively
easy in Java.
Also, scanning is relatively simple (accept only a range). Detection too:
http://www.mozilla.org/projects/intl/chardet.html
> We've created an index from a number of different documents that are
> supplied by third parties. [...]
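The on-the-fly conversion above boils down to decode-then-re-encode. A
sketch (windows-1252 here is only an example source charset; in practice
the name would come from a detector like the ones linked):

    public class Transcode {
        // Decode bytes with the charset they are actually in,
        // then re-encode the text as UTF-8.
        public static byte[] toUtf8(byte[] raw, String sourceCharset)
                throws java.io.UnsupportedEncodingException {
            String text = new String(raw, sourceCharset);  // bytes -> chars
            return text.getBytes("UTF-8");                 // chars -> UTF-8 bytes
        }

        public static void main(String[] args) throws Exception {
            byte[] cp1252 = { (byte) 0xE9 };               // 'é' in windows-1252
            byte[] utf8 = toUtf8(cp1252, "windows-1252");  // -> 0xC3 0xA9
            System.out.println(utf8.length);               // prints 2
        }
    }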
This is supposed to be dealt with outside the index. All input must be UTF-8
encoded. Failing to do so will give unexpected results.
> We've created an index from a number of different documents that are
> supplied by third parties. We want the index to only contain UTF-8
> encoded characters. I have a couple of questions about this: [...]
We've created an index from a number of different documents that are
supplied by third parties. We want the index to only contain UTF-8
encoded characters. I have a couple of questions about this:
1) Is there any way to be sure during indexing (by setting something
in the solr configuration?) that the index will end up containing only
UTF-8 encoded text?