Hi, I'm new to Apache Nutch and Solr. Does anyone on the list have experience with indexing Nutch crawls into Solr? The main problem is that PDF documents crawled by Nutch (along with the other content from the crawled site) aren't queryable after indexing into Solr. For example:
Query in Nutch:

    bin/nutch org.apache.nutch.searcher.NutchBean Skript.pdf

    Total hits: 2
    0 20100406141339/http://NAME_OF_WEBSITE/Betriebssysteme/Skript.pdf
      Hochschule für Technik, Wirtschaft und Kultur Leipzig (FH) Fachbereich Informatik, Mathematik und Naturwissenschaften Vorlesung Betriebssysteme Wintersemester 2008/09 Prof. ...
    1 20100406141255/http://NAME_OF_WEBSITE/Betriebssysteme/
      ... modified Permissions « Parent Directory Directory - - - # Skript.pdf PDF File 270.37 KB 07 ...

Highlighted query in Solr (after indexing the Nutch segments, crawldb and linkdb):

    http://NAME_OF_WEBSITE/Medienrecht/
      Index of /Medienrecht/ Index of / Medienrecht/ Name Type Size Last modified Permissions « Parent Directory Directory - - - # <em>Skript.pdf</em> PDF File 105.14 KB 19-Jan-2009 18:03 -rw-r--r-- ? 0 directories 1 files 105.14 KB

    http://NAME_OF_WEBSITE/Betriebssysteme/
      Index of /Betriebssysteme/ Index of / Betriebssysteme/ Name Type Size Last modified Permissions « Parent Directory Directory - - - # <em>Skript.pdf</em> PDF File 270.37 KB 07-Jan-2009 23:03 -rw-r--r-- # U1.zip ZIP File 286 Byte 24-Nov-2008 19:29 -rw-r--r-- # U10.zip ZIP File 821 Byte 18-Dec-2008 19:27 -rw-r--r-- # U11.zip ZIP File 1.03 KB 22-Jan-2009 23:03 -rw-r--r-- # U12.zip ZIP File 1.40 KB 22-Jan-2009 23:03 -rw-r--r-- # U13.zip ZIP File 1.92 KB 03-Apr-2009 18:24 -rw-r--r-- # U2.zip ZIP File 354 Byte 24-Nov-2008 19:29 -rw-r--r-- # U3.zip ZIP File 378 Byte 24-Nov-2008 19:29 -rw-r--r-- # U4.zip ZIP

So Solr only returns the directory listing pages that mention Skript.pdf; the list of results doesn't contain the PDF documents themselves :( ... [id (the URL is copied to the unique key id) and the fragsize] ...

I got no answer to a similar question on the Nutch mailing list :/ Everything works perfectly if I index PDF documents with Solr directly ...

Best regards,
Marcel :)