Thank you for your reply. I had assumed Tika could also extract text content from various document types, not only metadata. I'll use the CLI tools from http://www.foolabs.com/xpdf/ to extract the text manually.
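For anyone following along, the manual extraction could be scripted roughly like this. It is only a sketch: it assumes the pdftotext binary from xpdf is installed and on the PATH, and the function names build_pdftotext_command and extract_text are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, encoding="UTF-8"):
    """Return the argument list for xpdf's pdftotext, writing text to stdout."""
    # "-enc" selects the output text encoding; the trailing "-" tells
    # pdftotext to write the extracted text to stdout instead of a file.
    return ["pdftotext", "-enc", encoding, str(Path(pdf_path)), "-"]


def extract_text(pdf_path):
    """Run pdftotext on one file and return the extracted text.

    Raises subprocess.CalledProcessError if pdftotext exits with an error,
    and FileNotFoundError if the binary is not installed.
    """
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        check=True,
    )
    # Undecodable bytes are replaced rather than aborting the whole run.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("whitepaper.pdf") (a hypothetical file name) would return the plain text, which can then be sent to Solr as an ordinary text field instead of posting the PDF itself.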
-
Markus Jelsma          Buyways B.V.
Technisch Architect    Friesestraatweg 215c
http://www.buyways.nl  9743 AD Groningen

Alg. 050-853 6600      KvK  01074105
Tel. 050-853 6620      Fax. 050-3118124
Mob. 06-5025 8350      In: http://www.linkedin.com/in/markus17

On Mon, 2009-11-16 at 12:06 +0100, Antonio Calò wrote:
> What I can say is that if you want to index a PDF, you should use a PDF
> extractor. A PDF extractor is able to extract the text content and the
> metadata of the files. I suppose you have just opened and indexed the
> PDF as is, so you stored binary data and nothing more. For my application
> I've used PdfExtractor, but the PDFBox project could also be used.
>
> Antonio
>
> 2009/11/16 Markus Jelsma - Buyways B.V. <mar...@buyways.nl>
>
> > Does anyone have a clue?
> >
> > > List,
> > >
> > > I somehow fail to index certain PDF files using the
> > > ExtractingRequestHandler in Solr 1.4 with the default solrconfig.xml
> > > but a modified schema. I have a very simple schema for this case,
> > > using only an ID field, a timestamp field and two dynamic fields,
> > > ignored_* and attr_*, both indexed, stored and multivalued strings.
> > > They are multivalued simply because some HTML files otherwise fail
> > > when storing multiple hyperlinks.
> > >
> > > I have posted multiple files to
> > > http://.../update/extract?literal.id=doc1 including:
> > > 1. the whitepaper at
> > >    http://www.lucidimagination.com/whitepaper/whats-new-in-lucene-2-9?sc=AP
> > > 2. the HTML file of the front page of http://nu.nl/
> > > 3. another PDF at
> > >    http://www.google.nl/url?sa=t&source=web&ct=res&cd=1&ved=0CAcQFjAA&url=http%3A%2F%2Fcsl.stanford.edu%2F~christos%2Fpublications%2F2007.cmp_mapreduce.hpca.pdf&rct=j&q=2007.cmp_mapreduce.hpca.pdf&ei=PPz7SpiiOM6l4QbZjKjRAw&usg=AFQjCNHs-olxbUQrGCXpNMHfcZvY8aMk8A
> > >
> > > For each document I have the corresponding select/?q=*:* result:
> > >
> > > 1. No text? Should I see something?
> > >
> > > <doc><str name="id">doc1</str>
> > >   <arr name="ignored_content_type">
> > >     <str>application/octet-stream</str>
> > >   </arr>
> > >   <arr name="ignored_stream_content_type">
> > >     <str>
> > >       text/xml; charset=UTF-8;
> > >       boundary=----------------------------cf57b4ad644d
> > >     </str>
> > >   </arr>
> > >   <arr name="ignored_stream_size">
> > >     <str>491238</str>
> > >   </arr>
> > >   <arr name="ignored_text">
> > >     <str> </str>
> > >   </arr>
> > >   <date name="timestamp">2009-11-12T12:17:23.016Z</date>
> > > </doc>
> > >
> > > 2. Plenty of data, this seems to be OK
> > >
> > > <doc>
> > >   <str name="id">doc1</str>
> > >   <arr name="ignored_content_type">
> > >     <str>application/xhtml+xml</str>
> > >   </arr>
> > >   <arr name="ignored_links">
> > >     <str>http://www.nu.nl/</str>
> > >     <str>http://www.nu.nl/</str>
> > >     <str>http://www.nu.nl/algemeen/</str>
> > >     <str>http://www.nu.nl/economie/</str>
> > >     ....
> > >   <arr name="ignored_stream_content_type">
> > >     <str>
> > >       text/xml; charset=UTF-8;
> > >       boundary=----------------------------b6e44d087bdd
> > >     </str>
> > >   </arr>
> > >   <arr name="ignored_stream_size">
> > >     <str>36991</str>
> > >   </arr>
> > >   <arr name="ignored_text">
> > >     <str>
> > >       A LOT OF TEXT HERE
> > >     </str>
> > >   </arr>
> > >   <date name="timestamp">2009-11-12T12:19:15.415Z</date>
> > > </doc>
> > >
> > > 3. A lot of garbage
> > >
> > > <doc>
> > >   <str name="id">doc1</str>
> > >   <arr name="ignored_content_encoding">
> > >     <str>windows-1252</str>
> > >   </arr>
> > >   <arr name="ignored_content_language">
> > >     <str>fr</str>
> > >   </arr>
> > >   <arr name="ignored_content_type">
> > >     <str>text/plain</str>
> > >   </arr>
> > >   <arr name="ignored_language">
> > >     <str>fr</str>
> > >   </arr>
> > >   <arr name="ignored_stream_content_type">
> > >     <str>
> > >       text/xml; charset=UTF-8;
> > >       boundary=----------------------------83df0fd4d358
> > >     </str>
> > >   </arr>
> > >   <arr name="ignored_stream_size">
> > >     <str>361458</str>
> > >   </arr>
> > >   <arr name="ignored_text">
> > >     <str>
> > >       A LOT OF GARBAGE HERE including
> > >
> > >       ....
> > >       endstream
> > >       endobj
> > >       137 0 obj<</Type/Encoding/BaseEncoding/WinAnsiEncoding/Differences[1/W/o/r/d/C/u/n/t/M/a/i/x/l/S/g/c/h/K/m/e/s/R/v/I/P/A/H/L/space/p]>>
> > >       endobj
> > >       138 0 obj<</Type/FontDescriptor/FontFile2 136 0 R/FontBBox[0 -210 942 728]/FontName/WQHWKD+TTE31911E0t00/Flags 4/MissingWidth 750/StemV 141/CapHeight 728/Ascent 728/Descent -210/ItalicAngle 0>>
> > >       endobj
> > >       139 0 obj<</Count 12/Kids[140 0 R 141 0 R]/Type/Pages>>
> > >       endobj
> > >       140 0 obj<</Count 6/Kids[147 0 R 1 0 R 4 0 R 7 0 R 22 0 R 25 0 R]/Type/Pages/Parent 139 0 R>>
> > >       endobj
> > >       141 0 obj<</Count 6/Kids[39 0 R 42 0 R 45 0 R 82 0 R 92 0 R 122 0 R]/Type/Pages/Parent
> > >
> > >       ....
> > >
> > >     </str>
> > >   </arr>
> > >   <date name="timestamp">2009-11-12T12:21:28.306Z</date>
> > > </doc>
> > >
> > > Any ideas? Why doesn't the whitepaper produce any results, and why is
> > > the other PDF full of garbage? At least I'm happy that HTML works
> > > fine.
> > >
> > > Regards,
> > >
> > > Markus Jelsma
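
The third result in the quoted message contains PDF file syntax (endstream, endobj, "137 0 obj<<...") inside ignored_text, which is a sign that the raw file was indexed rather than extracted text. A rough check for this after indexing might look like the following. It is a heuristic sketch only; the function name looks_like_raw_pdf and the threshold of three hits are arbitrary choices for illustration, not anything Solr or Tika provides.

```python
import re

# Tokens that belong to PDF file syntax rather than to readable text:
# object terminators, object headers such as "137 0 obj<<", and /Type entries.
PDF_SYNTAX = re.compile(
    r"\b(?:endobj|endstream)\b"   # end of an object or stream
    r"|\d+ \d+ obj\s*<<"          # start of an indirect object dictionary
    r"|/Type\s*/\w+"              # dictionary type entry
)


def looks_like_raw_pdf(text, min_hits=3):
    """Return True if the text contains at least min_hits PDF syntax tokens."""
    return len(PDF_SYNTAX.findall(text)) >= min_hits
```

Running this over the stored ignored_text value of each document would flag case 3 above, while case 1 (empty text) needs a separate check for blank content.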