It appears there is an encoding problem. In the screenshot I can see
the title is mangled, and if I open the URL in IE or Firefox, both
browsers think it is ISO-8859-1.
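
As a rough illustration of what that kind of misdetection does to the text
(a sketch only, not taken from the page or from Tika; the sample string and
the helper names are made up), UTF-8 bytes read as ISO-8859-1 turn each
Chinese character into three junk characters:

```python
# Sketch: what happens when UTF-8 bytes are decoded as ISO-8859-1.
# The sample string and function names are illustrative assumptions.

def misdecode(text, wrong_encoding="iso-8859-1"):
    """Encode text as UTF-8, then decode the bytes with the wrong charset."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(garbled, wrong_encoding="iso-8859-1"):
    """Undo misdecode(): recover the raw bytes, then decode them as UTF-8."""
    return garbled.encode(wrong_encoding).decode("utf-8")


original = "高通"               # two characters, three UTF-8 bytes each
garbled = misdecode(original)   # every byte becomes its own character

print(len(original), len(garbled))   # 2 6
print(repair(garbled) == original)   # True
```

A search for the original string cannot match the garbled form, since the
indexed characters are different ones entirely.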

I think this is why (from the W3C validator):

Character Encoding mismatch!

The character encoding specified in the HTTP header (iso-8859-1) is
different from the value in the <meta> element (utf-8). I will use the
value from the HTTP header (iso-8859-1) for this validation.
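
The comparison the validator reports can be sketched roughly like this. It
is only an illustration: the header value and the HTML snippet below are
made-up stand-ins for what the server returns, and the function names are
hypothetical.

```python
import re
from email.message import Message


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased by the email package


def charset_from_meta(html_bytes):
    """Return the charset named in the first <meta> tag that declares one."""
    match = re.search(rb'<meta[^>]+charset=["\']?([\w-]+)', html_bytes,
                      re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else None


# Assumed sample values, mirroring the mismatch described above.
header_value = "text/html; charset=ISO-8859-1"
page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=utf-8"></head></html>')

http_cs = charset_from_header(header_value)
meta_cs = charset_from_meta(page)

if http_cs != meta_cs:
    print("mismatch: header says", http_cs, "but <meta> says", meta_cs)
```

A client that trusts the header over the <meta> element will decode the page
as ISO-8859-1 even though the bytes are UTF-8.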

On Wed, Jul 29, 2009 at 6:02 PM, ashokc<ash...@qualcomm.com> wrote:
>
> Sure.
>
> The java command I use with TIKA to extract text from a URL is:
>
> java -jar tika-0.3-standalone.jar -t $url
>
> I have also attached the screenshots of the web page, post documents
> produced in the two different ways (Perl & Tika) for that web page, and the
> screenshots of the search result for a string contained in that web page.
> The index in each case contains just this one URL. To keep everything else
> identical, I used the same instance for creating the index in each case.
> First I posted the Tika document, checked for the results, emptied the
> index, posted the Perl document, and checked the results.
>
> Debug query for Tika:
>
> <str name="parsedquery">
> +DisjunctionMaxQuery((urltext:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡ 
> 的优质多媒体内容能^2.0
> | title:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡ çš„ä¼˜è´¨å¤šåª’ä½“å†…å®¹èƒ½^2.0 |
> content_china:"高通 通公 å…¬å ¸ å ¸å±• 展现 现了 了海 æµ·é‡
> é‡ çš„ 的优 优质 质多 多媒 媒体 体内 内容 容能")~0.01) ()
> </str>
>
> Debug query for Perl:
>
> <str name="parsedquery">
> +DisjunctionMaxQuery((urltext:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡ 
> 的优质多媒体内容能^2.0
> | title:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡ çš„ä¼˜è´¨å¤šåª’ä½“å†…å®¹èƒ½^2.0 |
> content_china:"高通 通公 å…¬å ¸ å ¸å±• 展现 现了 了海 æµ·é‡
> é‡ çš„ 的优 优质 质多 多媒 媒体 体内 内容 容能")~0.01) ()
> </str>
>
> The screenshots
> http://www.nabble.com/file/p24728917/Tika%2BIssue.docx Tika+Issue.docx
>
> Perl extracted doc
> http://www.nabble.com/file/p24728917/china.perl.xml china.perl.xml
>
> Tika extracted doc
> http://www.nabble.com/file/p24728917/china.tika.xml china.tika.xml
>
>
> Grant Ingersoll-6 wrote:
>>
>> Hmm, looks very much like an encoding problem.  Can you post a sample
>> showing it, along with the commands you invoked?
>>
>> Thanks,
>> Grant
>>
>> On Jul 28, 2009, at 6:14 PM, ashokc wrote:
>>
>>>
>>> I am finding that the search results based on indexing Tika-extracted
>>> text are very different from results based on indexing the text
>>> extracted via other means. This shows up, for example, with a Chinese
>>> web site that I am trying to index.
>>>
>>> I created the documents (for posting to SOLR) in two ways. The source
>>> text of the web pages is full of HTML entities like &#12345;, with some
>>> English characters mixed in.
>>>
>>> (a) Simple text extraction from the page source by a Perl script. The
>>> resulting content field looks like
>>>
>>> <field name="content_china">Who We Are
>>> &#20844;&#21496;&#21382;&#21490;
>>> &#24744;&#30340;&#25104;&#21151;&#26696;&#20363;
>>> &#39046;&#23548;&#22242;&#38431; &#19994;&#21153;&#37096;&#38376;
>>> Innovation
>>> &#21019; etc...     </field>
>>>
>>> I posted these documents to a SOLR instance
>>>
>>> (b) Used Tika (command line). The resulting content field looks like
>>>
>>> <field name="content_china">Who We Are Ã¥ ¬å ¸Ã
>>> ¥ÂŽÂ†Ã¥Â ²
>>> 您的戠功æ¡
>>> ˆä¾‹ 领导团队
>>> 业务部门 Â Innovation Ã
>>> ¥Â
>>> etc... </field>
>>>
>>> I posted these documents to a different instance
>>>
>>> When I search the first instance for a string (that I copied and pasted
>>> from the web site), I find a number of hits, including the page from
>>> which I copied the string. But when I do the same on the instance with
>>> Tika-extracted text, I get nothing.
>>>
>>> Has anyone seen this? I believe it may have to do with encoding. In both
>>> cases the posted documents were UTF-8 compliant.
>>>
>>> Thanks for your insights.
>>>
>>> - ashok
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24708854.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>> using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
>>
>
> --
> View this message in context: 
> http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24728917.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



-- 
Robert Muir
rcm...@gmail.com
