Re: Problem adding unicoded docs to Solr through SolrJ

ahmed baseet Wed, 29 Apr 2009 22:58:53 -0700

Thanks a lot for your quick and detailed response.
I got the point. But as I mentioned earlier, I have a string of raw text
[default encoding] that needs to be encoded in UTF-8, so I tried something
crude that nevertheless works. I first converted the whole string to a
byte array and then used that byte array to create a new UTF-8 encoded
string, like this:


    // Encode in Unicode UTF-8
    byte[] utfEncodeByteArray = textOnly.getBytes();
    String utfString = new String(utfEncodeByteArray, Charset.forName("UTF-8"));
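
For reference, here is a minimal sketch of what a round trip of this shape does, with the platform default charset passed in explicitly so the result does not depend on the machine. The class and method names are hypothetical, and ISO-8859-1 only stands in for "some non-UTF-8 default". `getBytes()` encodes the characters using the default charset, and the constructor then decodes those bytes as UTF-8. That repairs a string whose UTF-8 bytes were earlier decoded with the wrong charset, but it damages a string that was already correct.

```java
import java.nio.charset.Charset;

public class RoundTripSketch {
    static final Charset UTF8 = Charset.forName("UTF-8");
    static final Charset LATIN1 = Charset.forName("ISO-8859-1");

    // Same shape as the snippet above, but the "default" charset is a
    // parameter (an assumption made here for reproducibility).
    static String reinterpret(String text, Charset assumedDefault) {
        byte[] bytes = text.getBytes(assumedDefault); // encode with the "default"
        return new String(bytes, UTF8);               // decode those bytes as UTF-8
    }

    public static void main(String[] args) {
        String correct = "K\u00e4se"; // K, a-umlaut, s, e

        // A string whose UTF-8 bytes were wrongly decoded as ISO-8859-1.
        String misdecoded = new String(correct.getBytes(UTF8), LATIN1);

        // The round trip undoes the earlier wrong decoding.
        System.out.println(reinterpret(misdecoded, LATIN1).equals(correct)); // true

        // Applied to an already correct string, it corrupts the a-umlaut.
        System.out.println(reinterpret(correct, LATIN1).equals(correct));    // false
    }
}
```

If the default charset is already UTF-8, the round trip leaves the text unchanged.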

I then passed utfString to the function that posts to Solr, and it works
perfectly.
But is there a cleaner way of doing this, i.e. going straight from the
default-encoded string to a UTF-8 encoded string without going via a byte
array?
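
One point that bears on this question: a Java String has no encoding of its own (internally it is a sequence of UTF-16 code units), so a charset only matters where bytes are turned into characters or back. A possible approach, sketched below under the assumption that the page is available as a byte stream and its real charset is known, is to name that charset once when decoding. The class name, the `readAll` helper and the sample bytes are hypothetical, and the sketch assumes Java 8 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class DecodeAtSource {
    // Hypothetical helper: decode a byte stream into a String using the
    // charset the bytes were actually written in.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for a fetched page: the UTF-8 bytes of K, a-umlaut, s, e.
        byte[] page = { 'K', (byte) 195, (byte) 164, 's', 'e' };
        String text = readAll(new ByteArrayInputStream(page), Charset.forName("UTF-8"));
        System.out.println(text.equals("K\u00e4se")); // true
    }
}
```

A string decoded this way should not need any further conversion before being handed to `doc.addField`.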
Thank you very much.

--Ahmed.



On Wed, Apr 29, 2009 at 6:45 PM, Michael Ludwig <m...@as-guides.com> wrote:

> ahmed baseet wrote:
>
>> public void postToSolrUsingSolrj(String rawText, String pageId) {
>>
>>             doc.addField("features", rawText );
>>
>> In the above, the param rawText is just the HTML stripped of all
>> its tags, JS, CSS etc. and pageId is the URL for that page. When I'm
>> using this for English pages it's working perfectly fine, but the
>> problem comes up when I'm trying to index some non-English pages.
>>
>
> Maybe you're constructing a string without specifying the encoding, so
> Java uses your default platform encoding?
>
> String(byte[] bytes)
>  Constructs a new String by decoding the specified array of
>  bytes using the platform's default charset.
>
> String(byte[] bytes, Charset charset)
>  Constructs a new String by decoding the specified array of bytes using
>  the specified charset.
>
>> What I did next was extract the raw text from that HTML page and
>> manually create an XML page like this
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>> <add>
>>  <doc>
>>    <field name="id">UTF2TEST</field>
>>    <field name="name">Test with some UTF-8 encoded characters</field>
>>    <field name="features">*some tamil unicode text here*</field>
>>   </doc>
>> </add>
>>
>> and posted this from the command line using the post.jar file. Now searching
>> gives me the result, but unlike last time the browser shows the indexed text
>> in Tamil itself and not the raw Unicode.
>>
>
> Now that's perfect, isn't it?
>
>> I tried doing something like this also,
>>
>>  // Encode in Unicode UTF-8
>>  utfEncodedText = new String(rawText.getBytes("UTF-8"));
>>
>> but even this didn't help either.
>>
>
> No encoding is specified for the String constructor, so the default
> platform encoding is used for decoding, which is likely not what you
> want. Consider the following example:
>
> package milu;
> import java.nio.charset.Charset;
> public class StringAndCharset {
>  public static void main(String[] args) {
>    byte[] bytes = { 'K', (byte) 195, (byte) 164, 's', 'e' };
>    System.out.println(Charset.defaultCharset().displayName());
>    System.out.println(new String(bytes));
>    System.out.println(new String(bytes,  Charset.forName("UTF-8")));
>  }
> }
>
> Output:
>
> windows-1252
> KÃ¤se (bad)
> Käse (good)
>
> Michael Ludwig
>
