Indexing TIKA extracted text. Are there some issues?

ashokc Tue, 28 Jul 2009 15:15:32 -0700

I am finding that search results based on indexing Tika-extracted text
are very different from results based on indexing text extracted by
other means. This shows up, for example, with a Chinese web site that I am
trying to index.


I created the documents (for posting to SOLR) in two ways. The source text
of the web pages is full of HTML entities like &#12345;, with some English
characters mixed in.

(a) Simple text extraction from the page source by a Perl script. The
resulting content field looks like

<field name="content_china">Who We Are &#20844;&#21496;&#21382;&#21490;
&#24744;&#30340;&#25104;&#21151;&#26696;&#20363;
&#39046;&#23548;&#22242;&#38431; &#19994;&#21153;&#37096;&#38376; Innovation
&#21019; etc...     </field>

I posted these documents to a SOLR instance.
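
As a quick check on what an XML parser hands over for the content in (a),
here is a minimal Python sketch. It assumes only that the field is parsed as
ordinary XML; the sample string is a shortened copy of the field above, and
nothing in it is SOLR-specific.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the field from (a). The numeric character references
# are resolved by the XML parser itself, before any application code runs.
snippet = '<field name="content_china">Who We Are &#20844;&#21496;&#21382;&#21490;</field>'

field = ET.fromstring(snippet)

# The parsed text holds real Chinese characters, not "&#...;" strings.
print(field.text)
print([hex(ord(ch)) for ch in field.text[-4:]])
```

So in case (a) the text that reaches the indexer should be the actual
characters, which would explain why a string pasted from the web site matches.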

(b) Used Tika (command line). The resulting content field looks like

<field name="content_china">Who We Are Ã¥ Â¬Ã¥ÂÂ¸Ã¥ÂŽÂ†Ã¥ÂÂ²
Ã¦Â‚Â¨Ã§ÂšÂ„Ã¦ÂˆÂÃ¥ÂŠÂŸÃ¦Â¡
ÂˆÃ¤Â¾Â‹ Ã©Â¢Â†Ã¥Â¯Â¼Ã¥Â›Â¢Ã©Â˜ÂŸ Ã¤Â¸ÂšÃ¥ÂŠÂ¡Ã©ÂƒÂ¨Ã©Â—Â¨ Ã‚ Innovation Ã¥Â
etc... </field>

I posted these documents to a different instance.

When I search the first instance for a string (copied and pasted from
the web site), I find a number of hits, including the page from which I
copied the string. But when I do the same on the instance with
Tika-extracted text, I get nothing.

Has anyone seen this? I believe it may have to do with encoding. In both
cases the posted documents were UTF-8 compliant.
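
One way to test the encoding suspicion: the runs of Ã and Â in (b) are what
UTF-8 bytes typically look like after being decoded with a single-byte charset
such as Latin-1. Below is a minimal Python sketch under that assumption. The
helper names garble and repair are hypothetical, and whether this is what
actually happened in the Tika run is not established here.

```python
def garble(text: str) -> str:
    """Simulate the suspected fault: UTF-8 bytes misread as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(text: str) -> str:
    """Undo one round of that misreading; raises if the text does not fit."""
    return text.encode("latin-1").decode("utf-8")


# The first four Chinese characters from the field in (a).
original = "\u516c\u53f8\u5386\u53f2"

once = garble(original)   # one round of misreading
twice = garble(once)      # a second round, which yields the "Ã¥" pattern

print(repr(once))
print(repr(twice))

# Each round can be reversed as long as no bytes were lost on the way.
print(repair(twice) == once)
print(repair(once) == original)
```

If repair raises UnicodeEncodeError or UnicodeDecodeError on the real data,
the text was not produced by this particular misreading (or some bytes were
altered afterwards). Text garbled this way can still be valid UTF-8, so a
UTF-8 check alone would not catch it, yet a query for the original characters
would have nothing to match.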

Thanks for your insights.

- ashok

-- 
View this message in context: 
http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24708854.html
Sent from the Solr - User mailing list archive at Nabble.com.
