Hi, I'm new to Apache Nutch and Solr. Does anyone on the list have 
experience indexing Nutch crawls into Solr? The main problem is that PDF 
documents crawled by Nutch (along with the other content from the crawled site) 
aren't queryable after indexing into Solr. For example:

Query in Nutch:

bin/nutch org.apache.nutch.searcher.NutchBean Skript.pdf

Total hits: 2
 0 20100406141339/http://NAME_OF_WEBSITE/Betriebssysteme/Skript.pdf
Hochschule für Technik, Wirtschaft
und Kultur Leipzig (FH)
Fachbereich Informatik, Mathematik
und Naturwissenschaften
Vorlesung
Betriebssysteme
Wintersemester 2008/09
Prof.  ... 
 1 20100406141255/http://NAME_OF_WEBSITE/Betriebssysteme/
 ... modified Permissions « Parent Directory Directory - - - # Skript.pdf PDF 
File 270.37 KB 07 ... 

Highlighted query in Solr (after indexing the Nutch segments, crawldb and 
linkdb):

http://NAME_OF_WEBSITE/Medienrecht/ 
Index of /Medienrecht/ Index of / Medienrecht/   Name Type Size Last modified 
Permissions « Parent Directory Directory - - - # <em>Skript.pdf</em> PDF File 
105.14 KB 19-Jan-2009 18:03 -rw-r--r-- ? 0 directories 1 files   105.14 KB     

http://NAME_OF_WEBSITE/Betriebssysteme/ 
Index of /Betriebssysteme/ Index of / Betriebssysteme/   Name Type Size Last 
modified Permissions « Parent Directory Directory - - - # <em>Skript.pdf</em> 
PDF File 270.37 KB 07-Jan-2009 23:03 -rw-r--r-- # U1.zip ZIP File 286 Byte 
24-Nov-2008 19:29 -rw-r--r-- # U10.zip ZIP File 821 Byte 18-Dec-2008 19:27 
-rw-r--r-- # U11.zip ZIP File 1.03 KB 22-Jan-2009 23:03 -rw-r--r-- # U12.zip 
ZIP File 1.40 KB 22-Jan-2009 23:03 -rw-r--r-- # U13.zip ZIP File 1.92 KB 
03-Apr-2009 18:24 -rw-r--r-- # U2.zip ZIP File 354 Byte 24-Nov-2008 19:29 
-rw-r--r-- # U3.zip ZIP File 378 Byte 24-Nov-2008 19:29 -rw-r--r-- # U4.zip ZIP

But the list of results doesn't contain the PDF documents themselves :( ... [id 
(the URL is copied to the unique key id) and the fragsize] ... On the Nutch 
mailing list I got no answer to a similar question :/
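
To check whether a PDF reached the Solr index at all, a direct lookup on the unique key might help. The following is only a minimal sketch: the Solr base URL (`http://localhost:8983/solr/select`), the text field name `content`, the fragment size of 100, and the helper name `build_id_lookup` are all assumptions for illustration and are not taken from my setup.

```python
from urllib.parse import urlencode

# Assumed base URL of the Solr select handler (not given in the post).
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_id_lookup(doc_url, fragsize=100):
    """Build a Solr query URL that looks a document up by its unique key."""
    params = {
        # Quote the URL so ':' and '/' are not parsed as query syntax.
        "q": 'id:"%s"' % doc_url,
        "fl": "id",               # return only the unique key
        "hl": "true",             # highlighting on, as in the Solr query above
        "hl.fl": "content",       # assumed name of the text field
        "hl.fragsize": fragsize,  # placeholder fragment size
    }
    return SOLR_SELECT + "?" + urlencode(params)


# Hypothetical usage: print the lookup URL for the PDF that is missing.
print(build_id_lookup("http://NAME_OF_WEBSITE/Betriebssysteme/Skript.pdf"))
```

If such a lookup finds nothing, the PDF probably never made it into the Solr index; if it does find the document, the problem would more likely be in how the query matches the indexed fields.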

Everything works perfectly if I index the PDF documents with Solr itself ...

Best regards, Marcel :)

