Re: Indexing TIKA extracted text. Are there some issues?

ashokc Wed, 29 Jul 2009 15:03:12 -0700

Sure.

The Java command I use with Tika to extract text from a URL is:


java -jar tika-0.3-standalone.jar -t $url

I have also attached screenshots of the web page, the post documents
produced in the two different ways (Perl and Tika) for that web page, and
screenshots of the search results for a string contained in that web page.
The index in each case contains just this one URL. To keep everything else
identical, I used the same Solr instance for creating the index in each case:
first I posted the Tika document and checked the results, then emptied the
index, posted the Perl document, and checked the results again.

Debug query for Tika:

<str name="parsedquery">
+DisjunctionMaxQuery((urltext:高通公司展现了海量的优质多媒体内容能^2.0
| title:高通公司展现了海量的优质多媒体内容能^2.0 |
content_china:"高通 通公 公司 司展 展现 现了 了海 海量
量的 的优 优质 质多 多媒 媒体 体内 内容 容能")~0.01) ()
</str>

Debug query for Perl:

<str name="parsedquery">
+DisjunctionMaxQuery((urltext:高通公司展现了海量的优质多媒体内容能^2.0
| title:高通公司展现了海量的优质多媒体内容能^2.0 |
content_china:"高通 通公 公司 司展 展现 现了 了海 海量
量的 的优 优质 质多 多媒 媒体 体内 内容 容能")~0.01) ()
</str>
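
The two parsed queries are identical, which suggests the difference lies in
the indexed content rather than in query parsing. As a minimal sketch of one
way such a mismatch can arise (an assumption, not a confirmed diagnosis of
what Tika did here), the Python below decodes UTF-8 bytes as Latin-1 and
re-encodes them, which is the classic double-encoding pattern. The variable
names are purely illustrative.

```python
# Sample text: the characters behind &#20844;&#21496;&#21382;&#21490;
original = "公司历史"

# Correct UTF-8 bytes for the text (3 bytes per character here).
utf8_bytes = original.encode("utf-8")

# Assumed failure mode: some step reads those bytes as Latin-1,
# so every byte becomes its own (wrong) character.
garbled = utf8_bytes.decode("latin-1")

# Writing the garbled text back out as UTF-8 yields a document that is
# still valid UTF-8 but no longer contains the Chinese characters.
reencoded = garbled.encode("utf-8")

# The damage is reversible as long as no bytes were dropped on the way.
repaired = garbled.encode("latin-1").decode("utf-8")

print(len(original), len(garbled), repaired == original)
```

A document damaged this way remains valid UTF-8, so a UTF-8 check on the
posted file would not catch it, yet a query typed in real Chinese characters
cannot match the indexed tokens.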

The screenshots
http://www.nabble.com/file/p24728917/Tika%2BIssue.docx Tika+Issue.docx 

Perl extracted doc
http://www.nabble.com/file/p24728917/china.perl.xml china.perl.xml 

Tika extracted doc
http://www.nabble.com/file/p24728917/china.tika.xml china.tika.xml 
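
A quick way to compare the content fields of two such post documents is
sketched below in Python. It is only an illustration: the helper names
`decode_entities` and `looks_double_encoded` are hypothetical, the sample
string is taken from the quoted Perl field, and the Tika-like string is
simulated rather than read from the attached file.

```python
import html


def decode_entities(field_text):
    """Turn numeric character references such as &#20844; into characters."""
    return html.unescape(field_text)


def looks_double_encoded(field_text):
    """Heuristic: True if the text round-trips Latin-1 -> UTF-8 into
    something different, which hints at UTF-8 bytes read as Latin-1."""
    try:
        repaired = field_text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Real CJK text cannot be encoded as Latin-1, and ordinary
        # Latin-1 text is rarely valid UTF-8, so neither is flagged.
        return False
    return repaired != field_text


# Sample from the Perl-extracted field (entities as found in the page source).
perl_text = decode_entities("Who We Are &#20844;&#21496;&#21382;&#21490;")

# Simulated stand-in for a mis-decoded field (assumption, not the real file).
tika_like = perl_text.encode("utf-8").decode("latin-1")

print(looks_double_encoded(perl_text), looks_double_encoded(tika_like))
```

If the check flags the content of one document but not the other, the
extraction or posting step for the flagged one is the place to look for a
wrong character set.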


Grant Ingersoll-6 wrote:
> 
> Hmm, looks very much like an encoding problem.  Can you post a sample  
> showing it, along with the commands you invoked?
> 
> Thanks,
> Grant
> 
> On Jul 28, 2009, at 6:14 PM, ashokc wrote:
> 
>>
>> I am finding that the search results based on indexing Tika extracted text
>> are very different from results based on indexing the text extracted via
>> other means. This shows up, for example, with a Chinese web site that I am
>> trying to index.
>>
>> I created the documents (for posting to SOLR) in two ways. The source text
>> of the web pages is full of HTML entities like &#12345;, with some English
>> characters mixed in.
>>
>> (a) Simple text extraction from the page source by a Perl script. The
>> resulting content field looks like
>>
>> <field name="content_china">Who We Are  
>> &#20844;&#21496;&#21382;&#21490;
>> &#24744;&#30340;&#25104;&#21151;&#26696;&#20363;
>> &#39046;&#23548;&#22242;&#38431; &#19994;&#21153;&#37096;&#38376;  
>> Innovation
>> &#21019; etc...     </field>
>>
>> I posted these documents to a SOLR instance
>>
>> (b) Used Tika (command line). The resulting content field looks like
>>
>> <field name="content_china">Who We Are Ã¥ Â¬Ã¥ÂÂ¸Ã 
>> ¥ÂŽÂ†Ã¥ÂÂ²
>> Ã¦Â‚Â¨Ã§ÂšÂ„Ã¦ÂˆÂÃ¥ÂŠÂŸÃ¦Â¡
>> ÂˆÃ¤Â¾Â‹ Ã©Â¢Â†Ã¥Â¯Â¼Ã¥Â›Â¢Ã©Â˜ÂŸ  
>> Ã¤Â¸ÂšÃ¥ÂŠÂ¡Ã©ÂƒÂ¨Ã©Â—Â¨ Ã‚ Innovation Ã 
>> ¥Â
>> etc... </field>
>>
>> I posted these documents to a different instance
>>
>> When I search the first instance for a string (that I copied & pasted from
>> the web site) I find a number of hits, including the page from which I
>> copied the string. But when I do the same on the instance with Tika
>> extracted text, I get nothing.
>>
>> Has anyone seen this? I believe it may have to do with encoding. In both
>> cases the posted documents were utf-8 compliant.
>>
>> Thanks for your insights.
>>
>> - ashok
>>
>> -- 
>> View this message in context:
>> http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24708854.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
> 
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
> 
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
> using Solr/Lucene:
> http://www.lucidimagination.com/search
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24728917.html
Sent from the Solr - User mailing list archive at Nabble.com.
