>2. I then wrote a small PHP script that pulls all the values from all the
>fields in MySQL and writes them to an XML file

You might find the PHP functions utf8_encode and utf8_decode useful
(they convert between ISO-8859-1 and UTF-8):
http://nz2.php.net/utf8_encode
http://nz2.php.net/utf8_decode

$utf8string = utf8_encode($row['column']);
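
One caveat: utf8_encode assumes its input is ISO-8859-1, so running it on
a value that is already UTF-8 double-encodes it. Below is a minimal sketch
of a guard against that, written in Python rather than PHP so it can be run
standalone. The helper name to_utf8 and the Latin-1 fallback are
assumptions for illustration, not something from this thread, and the check
is a heuristic: short Latin-1 strings can occasionally happen to be valid
UTF-8.

```python
def to_utf8(raw: bytes, fallback: str = "iso-8859-1") -> bytes:
    """Return raw as UTF-8 bytes, converting only if it is not already valid UTF-8."""
    try:
        raw.decode("utf-8")  # strict decoding raises on invalid byte sequences
        return raw
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is in the fallback charset.
        return raw.decode(fallback).encode("utf-8")


latin1_bytes = "café".encode("iso-8859-1")
utf8_bytes = "café".encode("utf-8")
print(to_utf8(latin1_bytes) == utf8_bytes)  # Latin-1 input gets converted
print(to_utf8(utf8_bytes) == utf8_bytes)    # UTF-8 input is left alone
```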

-Nick

On 6/10/07, Ken Krugler <[EMAIL PROTECTED]> wrote:
>This is what the whole process looks like:
>
>1. I have a web page that I want to index. I first copy the page, break
>it into sections, and store each section in a separate MySQL column.
>2. I then wrote a small PHP script that pulls all the values from all the
>fields in MySQL and writes them to an XML file.
>3. I then use Solr to index this XML file. The error that appears halfway
>through indexing is: "FATAL: Connection error (is Solr running at
>http://localhost/solr/update ?): java.io.IOException: Server returned
>HTTP Response code: 500 for URL: http://local/solr/update"
>4. Although the error doesn't say it is a UTF-8 encoding problem, I did a
>bit of research and looked at the XML file, and it is not valid UTF-8.
>
>I have been trying for a couple of hours, but to no avail. I would like
>to find out:
>1. How do I know what encoding the web page I copied into MySQL uses?

The charset can be in the response header, and/or the meta tags for
the page. See
http://krugle.com/kse/files/svn/svn.apache.org/lucene/nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
for code used by Nutch for this.

Or it could be missing from both. Or it could be wrong for either/both.

The issue of determining the right charset for an arbitrary web page
isn't an easy one. If you have some way of doing analysis in advance
such that you know for sure it's always X, that's going to simplify
things for you.
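
As a rough illustration of the "header and/or meta tag" lookup (a sketch
only, not the Nutch code linked above): the function below reports the
charset a page declares, preferring the HTTP Content-Type header over the
meta tag. The name guess_charset, the regular expressions and the 4096-byte
scan window are all assumptions made for this example. It only tells you
what the page claims, which, as noted, may be missing or wrong.

```python
import re
from typing import Optional

# Matches charset=... in both <meta charset="x"> and
# <meta http-equiv="Content-Type" content="text/html; charset=x">.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)
HEADER_CHARSET = re.compile(
    r"""charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def guess_charset(content_type: Optional[str], body: bytes) -> Optional[str]:
    """Return the declared charset, preferring the HTTP header over the meta tag."""
    if content_type:
        m = HEADER_CHARSET.search(content_type)
        if m:
            return m.group(1).lower()
    # Only scan the start of the page; meta tags belong in <head>.
    m = META_CHARSET.search(body[:4096])
    if m:
        return m.group(1).decode("ascii").lower()
    return None  # neither source declares a charset


print(guess_charset("text/html; charset=ISO-8859-1", b""))
print(guess_charset(None, b'<html><head><meta charset="UTF-8"></head>'))
print(guess_charset("text/html", b"<html></html>"))
```

A None result is the "missing from both" case, where you need a default or
some prior knowledge about the site.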

>2. At what point in this whole process should I convert it to UTF-8?

As soon as possible - which means right when you're processing the page.
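
In practice that means decoding the fetched bytes to text once, with
whatever charset you determined for the page, and keeping everything
downstream in UTF-8. A rough Python sketch of that flow follows. The
<add><doc><field name="..."> layout is Solr's XML update format, but the
function names, field names and sample values are made up for illustration.

```python
import xml.etree.ElementTree as ET


def decode_page(raw: bytes, charset: str) -> str:
    """Decode fetched bytes once, at ingestion; undecodable bytes become U+FFFD."""
    return raw.decode(charset, errors="replace")


def build_solr_add(docs: list) -> bytes:
    """Serialize dicts of field name -> text as a UTF-8 Solr <add> document."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = value  # ElementTree escapes &, < and > on output
    # xml_declaration requires Python 3.8 or later.
    return ET.tostring(add, encoding="utf-8", xml_declaration=True)


# A Latin-1 page fragment: 0xE9 is "é" in ISO-8859-1.
page = decode_page(b"caf\xe9 & bar", "iso-8859-1")
xml_bytes = build_solr_add([{"id": "1", "body": page}])
print(xml_bytes.decode("utf-8"))  # decodes cleanly because the output is UTF-8
```

Building the XML with a library instead of string concatenation also takes
care of escaping characters such as & and <, which is another common cause
of a 500 from the update handler.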

>I tried changing the collation in MySQL for all the columns from
>latin1-swedish to UTF-8, but it still doesn't work

Collation settings in the DB change how the DB interprets the data,
but they don't change the data itself.
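
To illustrate the difference between relabelling and converting: the bytes
stay the same no matter which charset you declare for them, and only an
explicit conversion rewrites them. A small Python sketch (the sample
character is arbitrary, and this models the general idea rather than what
MySQL does internally):

```python
# The single byte 0xE9, as a Latin-1 client would have written "é".
stored = "é".encode("iso-8859-1")

# Relabelling: the same byte, now read as if it were UTF-8. Nothing was
# converted, and 0xE9 on its own is not a valid UTF-8 sequence.
try:
    stored.decode("utf-8")
    relabel_ok = True
except UnicodeDecodeError:
    relabel_ok = False

# Converting: decode with the charset the data is really in, then re-encode.
converted = stored.decode("iso-8859-1").encode("utf-8")

print(relabel_ok)  # False
print(converted)   # b'\xc3\xa9'
```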

-- Ken


>On 6/9/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>>
>>>  Though this is not directly related to Solr: I have XML output from a
>>>  MySQL database, but indexing that XML output is not working. The
>>>  problem is that part of the XML output is not in UTF-8. How can I
>>>  convert it to UTF-8, and how do I know what encoding it uses in the
>>>  first place (the data I export from the MySQL database)? Thanks!
>>
>>How do you generate the XML output? "Output" itself is usually a raw
>>byte array; it uses a "Transport" and an "Encoding". If you save it to
>>the file system and forget about the "transport-layer encoding", you
>>will get some new problems...
>>
>>>  during indexing the XML output is not working
>>- what exactly happens, which kind of error messages?


--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
