Sure. The java command I use with TIKA to extract text from a URL is:

java -jar tika-0.3-standalone.jar -t $url

I have also attached the screenshots of the web page, post documents produced in the two different ways (Perl & Tika) for that web page, and the screenshots of the search result for a string contained in that web page. The index in each case contains just this one URL. To keep everything else identical, I used the same instance for creating the index in each case. First I posted the Tika document, checked for the results, emptied the index, posted the Perl document, and checked the results.

Debug query for Tika:

<str name="parsedquery">
+DisjunctionMaxQuery((urltext:é«éå ¬å¸å±ç°äºæµ·éçä¼è´¨å¤åªä½å 容è½^2.0 | title:é«éå ¬å¸å±ç°äºæµ·éçä¼è´¨å¤åªä½å 容è½^2.0 | content_china:"é«é éå ¬ å ¬å¸ å¸å± å±ç° ç°äº äºæµ· æµ·é éç çä¼ ä¼è´¨ è´¨å¤ å¤åª åªä½ ä½å å 容 容è½")~0.01) ()
</str>

Debug query for Perl:

<str name="parsedquery">
+DisjunctionMaxQuery((urltext:é«éå ¬å¸å±ç°äºæµ·éçä¼è´¨å¤åªä½å 容è½^2.0 | title:é«éå ¬å¸å±ç°äºæµ·éçä¼è´¨å¤åªä½å 容è½^2.0 | content_china:"é«é éå ¬ å ¬å¸ å¸å± å±ç° ç°äº äºæµ· æµ·é éç çä¼ ä¼è´¨ è´¨å¤ å¤åª åªä½ ä½å å 容 容è½")~0.01) ()
</str>

The screenshots
http://www.nabble.com/file/p24728917/Tika%2BIssue.docx Tika+Issue.docx

Perl extracted doc
http://www.nabble.com/file/p24728917/china.perl.xml china.perl.xml

Tika extracted doc
http://www.nabble.com/file/p24728917/china.tika.xml china.tika.xml

Grant Ingersoll-6 wrote:
>
> Hmm, looks very much like an encoding problem. Can you post a sample
> showing it, along with the commands you invoked?
>
> Thanks,
> Grant
>
> On Jul 28, 2009, at 6:14 PM, ashokc wrote:
>
>>
>> I am finding that the search results based on indexing Tika extracted text
>> are very different from results based on indexing the text extracted via
>> other means. This shows up for example with a chinese web site that I am
>> trying to index.
>>
>> I created the documents (for posting to SOLR) in two ways.
>> The source text of the web pages is full of html entities like 〹 and
>> some English characters mixed in.
>>
>> (a) Simple text extraction from the page source by a Perl script. The
>> resulting content field looks like
>>
>> <field name="content_china">Who We Are
>> 公司历史
>> 您的成功案例
>> 领导团队 业务部门
>> Innovation
>> 创 etc... </field>
>>
>> I posted these documents to a SOLR instance
>>
>> (b) Used Tika (command line). The resulting content field looks like
>>
>> <field name="content_china">Who We Are Ã¥ ŒÂ¸à
>> ¥ÂŽÂ†Ã¥Â²
>> 您的æˆÂ功æ¡
>> ˆä¾‹ 领导团队
>> 业务部门  Innovation à
>> ¥Â
>> etc... </field>
>>
>> I posted these documents to a different instance
>>
>> When I search the first instance for a string (that I copied & pasted from
>> the web site) I find a number of hits, including the page from which I
>> copied the string. But when I do the same on the instance with Tika
>> extracted text, I get nothing.
>>
>> Has anyone seen this? I believe it may have to do with encoding. In both
>> cases the posted documents were utf-8 compliant.
>>
>> Thanks for your insights.
>>
>> - ashok
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24708854.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>

--
View this message in context: http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24728917.html
Sent from the Solr - User mailing list archive at Nabble.com.
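
A note on the symptom reported in this thread: a document can be valid utf-8 and still fail to match a query pasted from the web site. One common way this happens is that UTF-8 bytes are decoded with a single-byte charset somewhere in the pipeline and then written out again as UTF-8. The Python sketch below only illustrates that mechanism; it is not taken from the thread, the helper names `garble` and `repair` are hypothetical, the choice of Latin-1 as the wrong charset is an assumption, and nothing here establishes that this is what Tika 0.3 actually did.

```python
def garble(text: str, wrong_charset: str = "latin-1") -> str:
    """Encode text as UTF-8, then decode those bytes with the wrong charset."""
    return text.encode("utf-8").decode(wrong_charset)


def repair(text: str, wrong_charset: str = "latin-1") -> str:
    """Undo garble(): recover the raw bytes and decode them as UTF-8."""
    return text.encode(wrong_charset).decode("utf-8")


# A string taken from the Perl-extracted field quoted in this thread.
original = "公司历史"
garbled = garble(original)

# The garbled text still encodes to perfectly valid UTF-8 ...
garbled_bytes = garbled.encode("utf-8")

# ... but it no longer contains the characters a user would search for.
found = original in garbled

# If exactly one wrong decode happened, the damage is reversible.
recovered = repair(garbled)
```

In this sketch `found` is false even though `garbled_bytes` is well-formed UTF-8, which matches the "utf-8 compliant but no hits" observation. Text that has been through more than one such round, or through a charset that drops bytes, may not be recoverable this way.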