Could very well be... I will rectify it and try again. Thanks - ashok
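For anyone who lands on this thread later: the header/meta disagreement that Robert quotes from the W3C validator below can also be checked offline. What follows is only a rough sketch in Python. The helper names (header_charset, meta_charset, charsets_disagree) are made up for illustration and are not part of Tika or Solr, the regular expressions only cover the common charset=... forms, and the sample header and page are stand-ins that mimic the values reported in this thread.

```python
import re
from typing import Optional

# Hypothetical helpers for illustration only; not part of Tika or Solr.
_HEADER_CHARSET = re.compile(r"charset\s*=\s*\"?([\w.:-]+)", re.IGNORECASE)
_META_CHARSET = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def header_charset(content_type: str) -> Optional[str]:
    """Return the charset named in a Content-Type header value, if any."""
    match = _HEADER_CHARSET.search(content_type)
    return match.group(1).lower() if match else None


def meta_charset(html: bytes) -> Optional[str]:
    """Return the charset declared in the first <meta> tag that names one."""
    match = _META_CHARSET.search(html[:4096])  # only look near the top of the page
    return match.group(1).decode("ascii").lower() if match else None


def charsets_disagree(content_type: str, html: bytes) -> bool:
    """True when both declarations exist and name different charsets."""
    header, meta = header_charset(content_type), meta_charset(html)
    return header is not None and meta is not None and header != meta


if __name__ == "__main__":
    # Sample values chosen to mirror the validator message, not fetched from the site.
    sample_header = "text/html; charset=iso-8859-1"
    sample_page = (
        b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=utf-8"></head><body></body></html>'
    )
    print(charsets_disagree(sample_header, sample_page))
```

When the two declarations disagree, a client that trusts the HTTP header will read the UTF-8 bytes as iso-8859-1, which would be consistent with the garbled text further down in the quoted messages.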
Robert Muir wrote:
>
> It appears there is an encoding problem: in the screenshot I can see
> the title is mangled, and if I open up the URL in IE or Firefox, both
> browsers think it is iso-8859-1.
>
> I think this is why (from the W3C validator):
>
> Character Encoding mismatch!
>
> The character encoding specified in the HTTP header (iso-8859-1) is
> different from the value in the <meta> element (utf-8). I will use the
> value from the HTTP header (iso-8859-1) for this validation.
>
> On Wed, Jul 29, 2009 at 6:02 PM, ashokc<ash...@qualcomm.com> wrote:
>>
>> Sure.
>>
>> The Java command I use with TIKA to extract text from a URL is:
>>
>> java -jar tika-0.3-standalone.jar -t $url
>>
>> I have also attached the screenshots of the web page, post documents
>> produced in the two different ways (Perl & Tika) for that web page, and
>> the screenshots of the search result for a string contained in that web
>> page. The index in each case contains just this one URL. To keep
>> everything else identical, I used the same instance for creating the
>> index in each case. First I posted the Tika document, checked for the
>> results, emptied the index, posted the Perl document, and checked the
>> results.
>>
>> Debug query for Tika:
>>
>> <str name="parsedquery">
>> +DisjunctionMaxQuery((urltext:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡
>> 的优质多媒体内容能^2.0
>> | title:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡ çš„ä¼˜è´¨å¤šåª’ä½“å†…å®¹èƒ½^2.0 |
>> content_china:"高通 通公 å…¬å ¸ å ¸å±• 展现 现了 了海 æµ·é‡
>> é‡ çš„ 的优 优质 质多 多媒 媒体 体内 内容 容能")~0.01) ()
>> </str>
>>
>> Debug query for Perl:
>>
>> <str name="parsedquery">
>> +DisjunctionMaxQuery((urltext:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡
>> 的优质多媒体内容能^2.0
>> | title:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡ çš„ä¼˜è´¨å¤šåª’ä½“å†…å®¹èƒ½^2.0 |
>> content_china:"高通 通公 å…¬å ¸ å ¸å±• 展现 现了 了海 æµ·é‡
>> é‡ çš„ 的优 优质 质多 多媒 媒体 体内 内容 容能")~0.01) ()
>> </str>
>>
>> The screenshots
>> http://www.nabble.com/file/p24728917/Tika%2BIssue.docx Tika+Issue.docx
>>
>> Perl extracted doc
>> http://www.nabble.com/file/p24728917/china.perl.xml china.perl.xml
>>
>> Tika extracted doc
>> http://www.nabble.com/file/p24728917/china.tika.xml china.tika.xml
>>
>>
>> Grant Ingersoll-6 wrote:
>>>
>>> Hmm, looks very much like an encoding problem. Can you post a sample
>>> showing it, along with the commands you invoked?
>>>
>>> Thanks,
>>> Grant
>>>
>>> On Jul 28, 2009, at 6:14 PM, ashokc wrote:
>>>
>>>>
>>>> I am finding that the search results based on indexing Tika
>>>> extracted text are very different from results based on indexing the
>>>> text extracted via other means. This shows up for example with a
>>>> Chinese web site that I am trying to index.
>>>>
>>>> I created the documents (for posting to SOLR) in two ways. The
>>>> source text of the web pages is full of HTML entities like &#12345;
>>>> with some English characters mixed in.
>>>>
>>>> (a) Simple text extraction from the page source by a Perl script. The
>>>> resulting content field looks like
>>>>
>>>> <field name="content_china">Who We Are
>>>> 公司历史
>>>> 您的成功案例
>>>> 领导团队 业务部门
>>>> Innovation
>>>> 创 etc... </field>
>>>>
>>>> I posted these documents to a SOLR instance
>>>>
>>>> (b) Used Tika (command line).
>>>> The resulting content field looks like
>>>>
>>>> <field name="content_china">Who We Are Ã¥ ¬å ¸Ã
>>>> ¥ÂŽÂ†Ã¥Â ²
>>>> 您的戠功æ¡
>>>> ˆä¾‹ 领导团队
>>>> 业务部门  Innovation Ã
>>>> ¥Â
>>>> etc... </field>
>>>>
>>>> I posted these documents to a different instance
>>>>
>>>> When I search the first instance for a string (that I copied &
>>>> pasted from the web site) I find a number of hits, including the page
>>>> from which I copied the string. But when I do the same on the
>>>> instance with Tika extracted text - I get nothing.
>>>>
>>>> Has anyone seen this? I believe it may have to do with encoding. In
>>>> both cases the posted documents were utf-8 compliant.
>>>>
>>>> Thanks for your insights.
>>>>
>>>> - ashok
>>>>
>>>> --
>>>> View this message in context:
>>>> http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24708854.html
>>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>>
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>>> using Solr/Lucene:
>>> http://www.lucidimagination.com/search
>>>
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24728917.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>
> --
> Robert Muir
> rcm...@gmail.com
>

--
View this message in context:
http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24729595.html
Sent from the Solr - User mailing list archive at Nabble.com.
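
As a footnote to the garbled Tika field quoted above: text of that shape is what you would expect if UTF-8 bytes were decoded as iso-8859-1 somewhere along the way, possibly more than once. That is an assumption about this case, not something confirmed in the thread. The Python sketch below reproduces the effect on a sample string and shows that it can be reversed; misdecode and repair are hypothetical names used only for illustration.

```python
def misdecode(text: str, wrong: str = "iso-8859-1") -> str:
    """Encode text as UTF-8, then decode those bytes with the wrong charset."""
    return text.encode("utf-8").decode(wrong)


def repair(garbled: str, wrong: str = "iso-8859-1") -> str:
    """Undo one round of misdecode by reversing the two steps."""
    return garbled.encode(wrong).decode("utf-8")


original = "公司历史"        # sample text taken from the Perl-extracted field
once = misdecode(original)   # one wrong decode
twice = misdecode(once)      # the same mistake applied a second time

# Each round inflates the character count. The damage is reversible here
# only because iso-8859-1 maps every byte value to some character.
print(len(original), len(once), len(twice))
print(repair(once) == original)
print(repair(repair(twice)) == original)
```

Repairing after the fact is a workaround at best. Making the HTTP header and the <meta> element agree at the source, as discussed at the top of this thread, avoids the problem in the first place.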