Re: Indexing TIKA extracted text. Are there some issues?

Grant Ingersoll Tue, 28 Jul 2009 17:52:59 -0700

Hmm, looks very much like an encoding problem. Can you post a sample showing it, along with the commands you invoked?


Thanks,
Grant


On Jul 28, 2009, at 6:14 PM, ashokc wrote:

I am finding that the search results based on indexing Tika-extracted text are very different from results based on indexing the text extracted via other means. This shows up, for example, with a Chinese web site that I am trying to index.
I created the documents (for posting to SOLR) in two ways. The source text of the web pages is full of HTML entities like &#12345; with some English characters mixed in.

(a) Simple text extraction from the page source by a Perl script. The
resulting content field looks like
<field name="content_china">Who We Are公司历史
&#24744;&#30340;&#25104;&#21151;&#26696;&#20363;
领导团队 业务部门Innovation
&#21019; etc...     </field>

I posted these documents to a SOLR instance.
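
One likely reason the entity-laden text in (a) is still searchable: an XML parser resolves numeric character references such as &#24744; into the characters they name, so the indexed field holds real Chinese text. A minimal sketch in Python, using the standard library's xml.etree.ElementTree as a stand-in for the XML parsing done on the Solr side (that substitution is an assumption for illustration; the field fragment is modeled on the sample above):

```python
import xml.etree.ElementTree as ET

# Field fragment modeled on sample (a); the entity values come from the message.
doc = (
    '<field name="content_china">Who We Are '
    "&#24744;&#30340;&#25104;&#21151;&#26696;&#20363;</field>"
)

# Parsing resolves each numeric character reference to its code point.
field = ET.fromstring(doc)
text = field.text

# Show the last six characters as code points (ASCII-safe output).
print([f"U+{ord(c):04X}" for c in text[-6:]])
```

After parsing, the field text contains no "&#" sequences, only the characters themselves.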

(b) Used Tika (command line). The resulting content field looks like
<field name="content_china">Who We Are Ã¥ Â¬Ã¥ÂÂ¸Ã¥ÂŽÂ†Ã¥ÂÂ²
Ã¦Â‚Â¨Ã§ÂšÂ„Ã¦ÂˆÂÃ¥ÂŠÂŸÃ¦Â¡
ÂˆÃ¤Â¾Â‹ Ã©Â¢Â†Ã¥Â¯Â¼Ã¥Â›Â¢Ã©Â˜ÂŸÃ¤Â¸ÂšÃ¥ÂŠÂ¡Ã©ÂƒÂ¨Ã©Â—Â¨ Ã‚ Innovation Ã¥Â
etc... </field>

I posted these documents to a different instance.
When I search the first instance for a string (that I copied and pasted from the web site), I find a number of hits, including the page from which I copied the string. But when I do the same on the instance with Tika-extracted text, I get nothing.
Has anyone seen this? I believe it may have to do with encoding. In both cases the posted documents were UTF-8 compliant.
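
Text of the shape seen in (b) is what typically results when UTF-8 bytes are decoded with a single-byte charset and then written out as UTF-8 again. The output is still valid UTF-8, which would explain why both documents pass as compliant while only one matches the query. A minimal sketch in Python, assuming Latin-1 as the wrong charset; the helper names misdecode and repair are hypothetical, and the charset and number of rounds involved in the actual Tika run are not known from this thread:

```python
def misdecode(text: str, wrong_charset: str = "latin-1") -> str:
    """Encode text as UTF-8, then decode those bytes with the wrong charset."""
    return text.encode("utf-8").decode(wrong_charset)


def repair(garbled: str, wrong_charset: str = "latin-1") -> str:
    """Undo one round of misdecode(); raises if the text was not garbled this way."""
    return garbled.encode(wrong_charset).decode("utf-8")


original = "您的成功案例"  # the phrase encoded as entities in sample (a)
garbled = misdecode(original)

# The garbled string still encodes to UTF-8 without error.
garbled.encode("utf-8")

print(ascii(garbled))               # escaped view of the garbled characters
print(original in garbled)          # False: the query string no longer matches
print(repair(garbled) == original)  # True: one round can be reversed
```

If repair() raises UnicodeEncodeError or UnicodeDecodeError on real data, the text was probably garbled by a different charset or by more than one round.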

Thanks for your insights.

- ashok

--
View this message in context: 
http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24708854.html
Sent from the Solr - User mailing list archive at Nabble.com.


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:

http://www.lucidimagination.com/search
