Re: Indexing TIKA extracted text. Are there some issues?

Robert Muir Wed, 29 Jul 2009 15:23:00 -0700

It appears there is an encoding problem. In the screenshot I can see
the title is mangled, and if I open the URL in IE or Firefox, both
browsers think it is ISO-8859-1.

I think this is why (from the W3C validator):

Character Encoding mismatch!

The character encoding specified in the HTTP header (iso-8859-1) is
different from the value in the <meta> element (utf-8). I will use the
value from the HTTP header (iso-8859-1) for this validation.
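
To illustrate what such a mismatch does to the text, here is a minimal
Python sketch. It assumes the page bytes really are UTF-8 (as the <meta>
element claims) and that a client trusts the HTTP header instead. The
title string is a made-up stand-in, not the actual page title.

```python
# Hypothetical page fragment; the bytes on the wire are UTF-8.
page_bytes = "<title>公司历史</title>".encode("utf-8")

# Decoding with the charset named in the <meta> element recovers the text.
as_meta_says = page_bytes.decode("utf-8")

# Decoding with the charset named in the HTTP header never fails, because
# every byte is a valid ISO-8859-1 character, but the result is mojibake.
as_header_says = page_bytes.decode("iso-8859-1")

print(as_meta_says)
print(repr(as_header_says))

# The damage is reversible as long as no bytes were dropped along the way.
repaired = as_header_says.encode("iso-8859-1").decode("utf-8")
print(repaired == as_meta_says)
```

Each Chinese character is three bytes in UTF-8, so the mis-decoded title
has three Latin-1 characters where the original had one.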

On Wed, Jul 29, 2009 at 6:02 PM, ashokc<ash...@qualcomm.com> wrote:
>
> Sure.
>
> The java command I use with TIKA to extract text from a URL is:
>
> java -jar tika-0.3-standalone.jar -t $url
>
> I have also attached the screenshots of the web page, post documents
> produced in the two different ways (Perl & Tika) for that web page, and the
> screenshots of the search result for a string contained in that web page.
> The index in each case contains just this one URL. To keep everything else
> identical, I used the same instance for creating the index in each case.
> First I posted the Tika document, checked for the results, emptied the
> index, posted the Perl document, and checked the results.
>
> Debug query for Tika:
>
> <str name="parsedquery">
> +DisjunctionMaxQuery((urltext:高通公司展现了海量的优质多媒体内容能^2.0
> | title:高通公司展现了海量的优质多媒体内容能^2.0 |
> content_china:"高通 通公 公司 司展 展现 现了 了海 海量
> 量的 的优 优质 质多 多媒 媒体 体内 内容 容能")~0.01) ()
> </str>
>
> Debug query for Perl:
>
> <str name="parsedquery">
> +DisjunctionMaxQuery((urltext:高通公司展现了海量的优质多媒体内容能^2.0
> | title:高通公司展现了海量的优质多媒体内容能^2.0 |
> content_china:"高通 通公 公司 司展 展现 现了 了海 海量
> 量的 的优 优质 质多 多媒 媒体 体内 内容 容能")~0.01) ()
> </str>
>
> The screenshots
> http://www.nabble.com/file/p24728917/Tika%2BIssue.docx Tika+Issue.docx
>
> Perl extracted doc
> http://www.nabble.com/file/p24728917/china.perl.xml china.perl.xml
>
> Tika extracted doc
> http://www.nabble.com/file/p24728917/china.tika.xml china.tika.xml
>
>
> Grant Ingersoll-6 wrote:
>>
>> Hmm, looks very much like an encoding problem.  Can you post a sample
>> showing it, along with the commands you invoked?
>>
>> Thanks,
>> Grant
>>
>> On Jul 28, 2009, at 6:14 PM, ashokc wrote:
>>
>>>
>>> I am finding that the search results based on indexing Tika-extracted
>>> text are very different from results based on indexing text extracted
>>> by other means. This shows up, for example, with a Chinese web site
>>> that I am trying to index.
>>>
>>> I created the documents (for posting to SOLR) in two ways. The source
>>> text of the web pages is full of HTML entities like &#12345;, with
>>> some English characters mixed in.
>>>
>>> (a) Simple text extraction from the page source by a Perl script. The
>>> resulting content field looks like
>>>
>>> <field name="content_china">Who We Are
>>> &#20844;&#21496;&#21382;&#21490;
>>> &#24744;&#30340;&#25104;&#21151;&#26696;&#20363;
>>> &#39046;&#23548;&#22242;&#38431; &#19994;&#21153;&#37096;&#38376;
>>> Innovation
>>> &#21019; etc...     </field>
>>>
>>> I posted these documents to a SOLR instance
>>>
>>> (b) Used Tika (command line). The resulting content field looks like
>>>
>>> <field name="content_china">Who We Are Ã¥ Â¬Ã¥Â Â¸Ã
>>> ¥ÂŽÂ†Ã¥Â Â²
>>> Ã¦Â‚Â¨Ã§ÂšÂ„Ã¦ÂˆÂ Ã¥ÂŠÂŸÃ¦Â¡
>>> ÂˆÃ¤Â¾Â‹ Ã©Â¢Â†Ã¥Â¯Â¼Ã¥Â›Â¢Ã©Â˜ÂŸ
>>> Ã¤Â¸ÂšÃ¥ÂŠÂ¡Ã©ÂƒÂ¨Ã©Â—Â¨ Ã‚ Innovation Ã
>>> ¥Â
>>> etc... </field>
>>>
>>> I posted these documents to a different instance
>>>
>>> When I search the first instance for a string (that I copied and
>>> pasted from the web site), I find a number of hits, including the page
>>> from which I copied the string. But when I do the same on the instance
>>> with Tika-extracted text, I get nothing.
>>>
>>> Has anyone seen this? I believe it may have to do with encoding. In
>>> both cases the posted documents were UTF-8 compliant.
>>>
>>> Thanks for your insights.
>>>
>>> - ashok
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24708854.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>> using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
>>
>
> --
> View this message in context: 
> http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24728917.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>
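
On the point that both posted documents were UTF-8 compliant: a document
can be valid UTF-8 and still carry the wrong characters. The sketch below
is only an illustration under an assumption, namely that the extractor
decoded the page's UTF-8 bytes as ISO-8859-1 before the text was written
out again. The entity string is taken from the quoted Perl output; the
query string is a two-character example, not the one actually searched.

```python
import html

# Numeric character references as they appear in the quoted Perl output.
perl_field = "&#20844;&#21496;&#21382;&#21490;"

# An XML parser resolves these to the intended characters.
expected = html.unescape(perl_field)

# Assumption: the extractor read the page's UTF-8 bytes as ISO-8859-1.
misdecoded = expected.encode("utf-8").decode("iso-8859-1")

# Writing the mis-decoded text out as UTF-8 gives a perfectly valid UTF-8
# document, so a validity check passes even though the content is wrong.
posted_bytes = misdecoded.encode("utf-8")
indexed_text = posted_bytes.decode("utf-8")  # decodes without error

query = "公司"
print(query in expected)      # True: the entity-based document matches
print(query in indexed_text)  # False: the mis-decoded document does not
```

If that is what happened here, the fix belongs at extraction time (decode
the page with the right charset), not at indexing or query time.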



-- 
Robert Muir
rcm...@gmail.com
