Re: Indexing TIKA extracted text. Are there some issues?

ashokc Wed, 29 Jul 2009 15:55:45 -0700

Could very well be... I will rectify it and try again. Thanks

- ashok




Robert Muir wrote:
> 
> It appears there is an encoding problem. In the screenshot I can see
> the title is mangled, and if I open the URL in IE or Firefox, both
> browsers think it is iso-8859-1.
> 
> I think this is why (from w3c validator):
> 
> Character Encoding mismatch!
> 
> The character encoding specified in the HTTP header (iso-8859-1) is
> different from the value in the <meta> element (utf-8). I will use the
> value from the HTTP header (iso-8859-1) for this validation.
> 
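A minimal sketch of the mismatch the validator reports, in Python (used here only for illustration; nothing in this thread ran it). The same UTF-8 bytes are decoded once with the charset named in the meta element and once with the charset named in the HTTP header. The sample string is an assumption: the two characters behind the entities &#20844;&#21496;.

```python
# Hypothetical sample text: the characters for &#20844;&#21496;.
text = "\u516c\u53f8"

# Bytes as a server would send them for a page that really is UTF-8.
raw = text.encode("utf-8")

# Decoding with the <meta> charset recovers the original characters.
as_meta = raw.decode("utf-8")

# Decoding with the HTTP header charset maps every byte to one Latin-1
# character, so each CJK character becomes three unrelated characters.
as_header = raw.decode("iso-8859-1")

print(len(as_meta), len(as_header))
```

A client that trusts the HTTP header over the meta element, as the validator does, would end up with the second result.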
> On Wed, Jul 29, 2009 at 6:02 PM, ashokc<[email protected]> wrote:
>>
>> Sure.
>>
>> The java command I use with TIKA to extract text from a URL is:
>>
>> java -jar tika-0.3-standalone.jar -t $url
>>
>> I have also attached screenshots of the web page, the post documents
>> produced in the two different ways (Perl & Tika) for that web page, and
>> screenshots of the search result for a string contained in that web page.
>> The index in each case contains just this one URL. To keep everything else
>> identical, I used the same instance for creating the index in each case.
>> First I posted the Tika document, checked the results, emptied the
>> index, posted the Perl document, and checked the results.
>>
>> Debug query for Tika:
>>
>> <str name="parsedquery">
>> +DisjunctionMaxQuery((urltext:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡
>> çš„ä¼˜è´¨å¤šåª’ä½“å†…å®¹èƒ½^2.0
>> | title:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡ çš„ä¼˜è´¨å¤šåª’ä½“å†…å®¹èƒ½^2.0 |
>> content_china:"é«˜é€š é€šå…¬ å…¬å ¸ å ¸å±• å±•çŽ° çŽ°äº† äº†æµ· æµ·é‡
>> é‡ çš„ çš„ä¼˜ ä¼˜è´¨ è´¨å¤š å¤šåª’ åª’ä½“ ä½“å†… å†…å®¹ å®¹èƒ½")~0.01) ()
>> </str>
>>
>> Debug query for Perl:
>>
>> <str name="parsedquery">
>> +DisjunctionMaxQuery((urltext:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡
>> çš„ä¼˜è´¨å¤šåª’ä½“å†…å®¹èƒ½^2.0
>> | title:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡ çš„ä¼˜è´¨å¤šåª’ä½“å†…å®¹èƒ½^2.0 |
>> content_china:"é«˜é€š é€šå…¬ å…¬å ¸ å ¸å±• å±•çŽ° çŽ°äº† äº†æµ· æµ·é‡
>> é‡ çš„ çš„ä¼˜ ä¼˜è´¨ è´¨å¤š å¤šåª’ åª’ä½“ ä½“å†… å†…å®¹ å®¹èƒ½")~0.01) ()
>> </str>
>>
>> The screenshots
>> http://www.nabble.com/file/p24728917/Tika%2BIssue.docx Tika+Issue.docx
>>
>> Perl extracted doc
>> http://www.nabble.com/file/p24728917/china.perl.xml china.perl.xml
>>
>> Tika extracted doc
>> http://www.nabble.com/file/p24728917/china.tika.xml china.tika.xml
>>
>>
>> Grant Ingersoll-6 wrote:
>>>
>>> Hmm, looks very much like an encoding problem.  Can you post a sample
>>> showing it, along with the commands you invoked?
>>>
>>> Thanks,
>>> Grant
>>>
>>> On Jul 28, 2009, at 6:14 PM, ashokc wrote:
>>>
>>>>
>>>> I am finding that the search results based on indexing Tika extracted
>>>> text are very different from results based on indexing the text
>>>> extracted via other means. This shows up, for example, with a Chinese
>>>> web site that I am trying to index.
>>>>
>>>> I created the documents (for posting to SOLR) in two ways. The source
>>>> text of the web pages is full of HTML entities like &#12345; with some
>>>> English characters mixed in.
>>>>
>>>> (a) Simple text extraction from the page source by a Perl script. The
>>>> resulting content field looks like
>>>>
>>>> <field name="content_china">Who We Are
>>>> &#20844;&#21496;&#21382;&#21490;
>>>> &#24744;&#30340;&#25104;&#21151;&#26696;&#20363;
>>>> &#39046;&#23548;&#22242;&#38431; &#19994;&#21153;&#37096;&#38376;
>>>> Innovation
>>>> &#21019; etc...     </field>
>>>>
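The numeric entities in (a) are plain ASCII, so they pass through unchanged whichever charset the page is read with. A short sketch of what the first four decode to, assuming Python's standard `html` module (the script in this thread was Perl, and its source is not shown):

```python
import html

# First four entities from the content field quoted above.
entities = "&#20844;&#21496;&#21382;&#21490;"

# html.unescape turns each decimal character reference into its code point.
decoded = html.unescape(entities)

# Show the resulting code points without relying on the terminal encoding.
print([hex(ord(ch)) for ch in decoded])
```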
>>>> I posted these documents to a SOLR instance
>>>>
>>>> (b) Used Tika (command line). The resulting content field looks like
>>>>
>>>> <field name="content_china">Who We Are Ã¥ Â¬Ã¥Â Â¸Ã
>>>> ¥ÂŽÂ†Ã¥Â Â²
>>>> Ã¦Â‚Â¨Ã§ÂšÂ„Ã¦ÂˆÂ Ã¥ÂŠÂŸÃ¦Â¡
>>>> ÂˆÃ¤Â¾Â‹ Ã©Â¢Â†Ã¥Â¯Â¼Ã¥Â›Â¢Ã©Â˜ÂŸ
>>>> Ã¤Â¸ÂšÃ¥ÂŠÂ¡Ã©ÂƒÂ¨Ã©Â—Â¨ Ã‚ Innovation Ã
>>>> ¥Â
>>>> etc... </field>
>>>>
>>>> I posted these documents to a different instance
>>>>
>>>> When I search the first instance for a string (that I copied & pasted
>>>> from the web site) I find a number of hits, including the page from
>>>> which I copied the string. But when I do the same on the instance with
>>>> Tika extracted text, I get nothing.
>>>>
>>>> Has anyone seen this? I believe it may have to do with encoding. In
>>>> both cases the posted documents were utf-8 compliant.
>>>>
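A document can be valid UTF-8 and still carry the wrong characters. The sketch below, in Python and under the assumption that the fault is UTF-8 bytes misread as ISO-8859-1 and then written out again as UTF-8, shows how that yields output that passes a validity check. The sample characters are hypothetical.

```python
# Hypothetical sample text: the characters for &#20844;&#21496;.
original = "\u516c\u53f8"

# Step 1: UTF-8 bytes misread as ISO-8859-1 (the suspected fault).
misread = original.encode("utf-8").decode("iso-8859-1")

# Step 2: the misread text is written out as UTF-8. The bytes are valid
# UTF-8, but they no longer spell the original characters.
posted = misread.encode("utf-8")
posted.decode("utf-8")  # succeeds, so a validity check passes

# Undoing both steps recovers the original, which is one way to confirm
# that a document was double-encoded.
repaired = posted.decode("utf-8").encode("iso-8859-1").decode("utf-8")

print(len(original.encode("utf-8")), len(posted))
```

If the round trip in the last step fails or produces nonsense for a given document, the double-encoding assumption does not hold for it.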
>>>> Thanks for your insights.
>>>>
>>>> - ashok
>>>>
>>>> --
>>>> View this message in context:
>>>> http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24708854.html
>>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>>
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>>> using Solr/Lucene:
>>> http://www.lucidimagination.com/search
>>>
>>>
>>>
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24728917.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
> 
> 
> 
> -- 
> Robert Muir
> [email protected]
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24729595.html
Sent from the Solr - User mailing list archive at Nabble.com.
