Now everything works! But I have another problem if I use a connector with my Solr-Nutch setup. This is the error:
Grave: java.lang.RuntimeException: org.apache.lucene.index.CorruptIndexException: Unknown format version: -11
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1068)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:579)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:428)
    at org.apache.solr.core.CoreContainer.load(CoreContainer.java:278)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:117)
    at connector.SolrConnector.<init>(SolrConnector.java:33)
    at connector.SolrConnector.getInstance(SolrConnector.java:69)
    at connector.SolrConnector.getSolrServer(SolrConnector.java:77)
    at connector.QueryServlet.doGet(QueryServlet.java:117)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:621)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:722)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:305)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:309)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.lucene.index.CorruptIndexException: Unknown format version: -11
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:247)
    at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:72)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:683)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:476)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:403)
    at org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:38)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1057)
    ... 26 more

SUGGESTIONS?

thanks,
alessio

On 25 February 2012 10:52, alessio crisantemi <alessio.crisant...@gmail.com> wrote:

> this is the problem! Because in my root there is a URL!
>
> I write you my step-by-step configuration of Nutch
> (I use Cygwin because I work on Windows):
>
> *1. Extract the Nutch package*
>
> *2. Configure Solr*
>
> *a. Copy the provided Nutch schema from directory apache-nutch-1.0/conf to
> directory apache-solr-1.3.0/example/solr/conf (override the existing file).*
>
> To allow Solr to create the snippets for search results we need to store
> the content in addition to indexing it:
>
> *b. Change schema.xml so that the stored attribute of field "content" is
> true.*
>
> <field name="content" type="text" stored="true" indexed="true"/>
>
> We want to be able to tweak the relevancy of queries easily, so we'll
> create a new dismax request handler configuration for our use case:
>
> *d.*
> *Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste the
> following fragment into it:*
>
> <requestHandler name="/nutch" class="solr.SearchHandler">
>   <lst name="defaults">
>     <str name="defType">dismax</str>
>     <str name="echoParams">explicit</str>
>     <float name="tie">0.01</float>
>     <str name="qf">
>       content^0.5 anchor^1.0 title^1.2
>     </str>
>     <str name="pf">
>       content^0.5 anchor^1.5 title^1.2 site^1.5
>     </str>
>     <str name="fl">url</str>
>     <str name="mm">2<-1 5<-2 6<90%</str>
>     <int name="ps">100</int>
>     <bool name="hl">true</bool>
>     <str name="q.alt">*:*</str>
>     <str name="hl.fl">title url content</str>
>     <str name="f.title.hl.fragsize">0</str>
>     <str name="f.title.hl.alternateField">title</str>
>     <str name="f.url.hl.fragsize">0</str>
>     <str name="f.url.hl.alternateField">url</str>
>     <str name="f.content.hl.fragmenter">regex</str>
>   </lst>
> </requestHandler>
>
> *3. Start Solr*
>
> cd apache-solr-1.3.0/example
> java -jar start.jar
>
> *4. Configure Nutch*
>
> *a. Open nutch-site.xml in directory apache-nutch-1.0/conf and replace its
> contents with the following (we specify our crawler name and active plugins,
> and limit the maximum URL count per host per run to 100):*
>
> <?xml version="1.0"?>
> <configuration>
>   <property>
>     <name>http.agent.name</name>
>     <value>nutch-solr-integration</value>
>   </property>
>   <property>
>     <name>generate.max.per.host</name>
>     <value>100</value>
>   </property>
>   <property>
>     <name>plugin.includes</name>
>     <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   </property>
> </configuration>
>
> *b.*
> *Open regex-urlfilter.txt in directory apache-nutch-1.0/conf and replace
> its contents with the following:*
>
> -^(https|telnet|file|ftp|mailto):
>
> # skip some suffixes
> -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # allow urls in the google.it domain
> +^http:*//([a-z0-9\-A-Z]*\.)*google.it/*
>
> # deny anything else
> -.
>
> *5. Create a seed list (the initial URLs to fetch)*
>
> mkdir urls (creates a folder 'urls')
> echo "http://www.google.it/" > urls/seed.txt
>
> *6. Inject the seed URL(s) into the Nutch crawldb (execute in the Nutch
> directory)*
>
> bin/nutch inject crawl/crawldb urls
>
> AND HERE, THE ERROR MESSAGE about the empty path. Why, in your opinion?
>
> thank you
> alessio
>
> On 24 February 2012 17:51, tamanjit.bin...@yahoo.co.in <
> tamanjit.bin...@yahoo.co.in> wrote:
>
>> The empty path message is because Nutch is unable to find a URL in the
>> URL location that you provide.
>>
>> Kindly ensure there is a URL there.
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/nutch-and-solr-tp3765166p3773089.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
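A quick way to sanity-check the regex-urlfilter.txt rules from step 4b is to replay them offline before crawling. The sketch below is a hypothetical Python approximation (not part of Nutch) of the RegexURLFilter semantics: rules are applied top to bottom, the first pattern found anywhere in the URL decides, '+' accepts and '-' rejects. The rule set mirrors the file above, with `re.IGNORECASE` standing in for the explicit upper/lower-case suffix pairs:

```python
import re

# Approximation of the regex-urlfilter.txt rules from step 4b.
# Nutch's RegexURLFilter applies rules in order; the first rule whose
# pattern is found in the URL decides whether it is accepted.
RULES = [
    # skip non-http protocols
    ("-", re.compile(r"^(https|telnet|file|ftp|mailto):")),
    # skip some suffixes (re.IGNORECASE replaces the explicit case pairs)
    ("-", re.compile(
        r"\.(swf|doc|mp3|wmv|txt|rtf|avi|m3u|flv|wav|mp4|rss|xml|pdf|js|"
        r"gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|"
        r"exe|jpeg|bmp)$", re.IGNORECASE)),
    # skip URLs containing probable query characters
    ("-", re.compile(r"[?*!@=]")),
    # allow urls in the google.it domain
    ("+", re.compile(r"^http://([a-zA-Z0-9-]*\.)*google\.it/")),
    # deny anything else
    ("-", re.compile(r".")),
]

def accepts(url):
    """True if the filter would let Nutch fetch this URL."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: the URL is dropped

if __name__ == "__main__":
    for url in ("http://www.google.it/",
                "https://www.google.it/",
                "http://www.google.it/logo.jpg",
                "http://www.example.com/"):
        print(url, "->", accepts(url))
```

Running this should confirm that the seed http://www.google.it/ is the only URL of the four that passes, which is worth checking before blaming the inject step: a filter that rejects the seed produces an empty fetch list.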