It appears there is an encoding problem. In the screenshot I can see the title is mangled, and if I open the URL in IE or Firefox, both browsers think it is iso-8859-1.
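As a rough illustration of what that does to the text (a sketch only, not what Tika or the browsers literally run): reading UTF-8 bytes as a single-byte Western encoding turns 高通 into the same garbage that starts the parsedquery output quoted below. The function names `misdecode` and `repair` are made up for this example, and I am assuming cp1252, which browsers commonly substitute for a page labeled iso-8859-1.

```python
# Sketch: reproduce the mojibake by decoding UTF-8 bytes with the wrong charset.
# Assumption: the wrong charset is cp1252 (a common stand-in for iso-8859-1).

def misdecode(text: str, wrong: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong charset."""
    return text.encode("utf-8").decode(wrong)


def repair(garbled: str, wrong: str = "cp1252") -> str:
    """Undo a single round of mis-decoding by reversing the two steps."""
    return garbled.encode(wrong).decode("utf-8")


original = "高通"
garbled = misdecode(original)

print(garbled)                      # the mangled form of the two characters
print(repair(garbled) == original)  # whether the round trip restores the text
```

Some UTF-8 byte values have no mapping in cp1252, so `misdecode` can raise `UnicodeDecodeError` for other characters; that may also be why parts of the garbled text show gaps rather than letters.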
I think this is why (from the W3C validator):

Character Encoding mismatch! The character encoding specified in the HTTP header (iso-8859-1) is different from the value in the <meta> element (utf-8). I will use the value from the HTTP header (iso-8859-1) for this validation.

On Wed, Jul 29, 2009 at 6:02 PM, ashokc<ash...@qualcomm.com> wrote:
>
> Sure.
>
> The java command I use with TIKA to extract text from a URL is:
>
> java -jar tika-0.3-standalone.jar -t $url
>
> I have also attached the screenshots of the web page, post documents
> produced in the two different ways (Perl & Tika) for that web page, and the
> screenshots of the search result for a string contained in that web page.
> The index in each case contains just this one URL. To keep everything else
> identical, I used the same instance for creating the index in each case.
> First I posted the Tika document, checked for the results, emptied the
> index, posted the Perl document, and checked the results.
>
> Debug query for Tika:
>
> <str name="parsedquery">
> +DisjunctionMaxQuery((urltext:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡
> 的优质多媒体内容能^2.0
> | title:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡ çš„ä¼˜è´¨å¤šåª’ä½“å†…å®¹èƒ½^2.0 |
> content_china:"高通 通公 å…¬å ¸ å ¸å±• 展现 现了 了海 æµ·é‡
> é‡ çš„ 的优 优质 质多 多媒 媒体 体内 内容 容能")~0.01) ()
> </str>
>
> Debug query for Perl:
>
> <str name="parsedquery">
> +DisjunctionMaxQuery((urltext:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡
> 的优质多媒体内容能^2.0
> | title:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡ çš„ä¼˜è´¨å¤šåª’ä½“å†…å®¹èƒ½^2.0 |
> content_china:"高通 通公 å…¬å ¸ å ¸å±• 展现 现了 了海 æµ·é‡
> é‡ çš„ 的优 优质 质多 多媒 媒体 体内 内容 容能")~0.01) ()
> </str>
>
> The screenshots
> http://www.nabble.com/file/p24728917/Tika%2BIssue.docx Tika+Issue.docx
>
> Perl extracted doc
> http://www.nabble.com/file/p24728917/china.perl.xml china.perl.xml
>
> Tika extracted doc
> http://www.nabble.com/file/p24728917/china.tika.xml china.tika.xml
>
>
> Grant Ingersoll-6 wrote:
>>
>> Hmm, looks very much like an encoding problem. Can you post a sample
>> showing it, along with the commands you invoked?
>>
>> Thanks,
>> Grant
>>
>> On Jul 28, 2009, at 6:14 PM, ashokc wrote:
>>
>>>
>>> I am finding that the search results based on indexing Tika
>>> extracted text are very different from results based on indexing
>>> the text extracted via other means. This shows up for example with
>>> a Chinese web site that I am trying to index.
>>>
>>> I created the documents (for posting to SOLR) in two ways. The
>>> source text of the web pages are full of html entities like 〹
>>> and some English characters mixed in.
>>>
>>> (a) Simple text extraction from the page source by a Perl script. The
>>> resulting content field looks like
>>>
>>> <field name="content_china">Who We Are
>>> 公司历史
>>> 您的成功案例
>>> 领导团队 业务部门
>>> Innovation
>>> 创 etc... </field>
>>>
>>> I posted these documents to a SOLR instance
>>>
>>> (b) Used Tika (command line). The resulting content field looks like
>>>
>>> <field name="content_china">Who We Are Ã¥ ¬å ¸à
>>> ¥ÂŽÂ†Ã¥Â ²
>>> 您的戠功æ¡
>>> ˆä¾‹ 领导团队
>>> 业务部门  Innovation à
>>> ¥Â
>>> etc... </field>
>>>
>>> I posted these documents to a different instance
>>>
>>> When I search the first instance for a string (that I copied &
>>> pasted from the web site) I find a number of hits, including the
>>> page from which I copied the string. But when I do the same on the
>>> instance with Tika extracted text - I get nothing.
>>>
>>> Has anyone seen this? I believe it may have to do with encoding. In
>>> both cases the posted documents were utf-8 compliant.
>>>
>>> Thanks for your insights.
>>>
>>> - ashok
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24708854.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>> using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>
> --
> View this message in context:
> http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24728917.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

--
Robert Muir
rcm...@gmail.com