Now everything works! But I have another problem if I use a connector with my Solr-Nutch setup. This is the error:
Grave: java.lang.RuntimeException: org.apache.lucene.index.CorruptIndexException: Unknown format version: -11
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1068)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:579)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:428)
    at org.apache.solr.core.CoreContainer.load(CoreContainer.java:278)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:117)
    at connector.SolrConnector.<init>(SolrConnector.java:33)
    at connector.SolrConnector.getInstance(SolrConnector.java:69)
    at connector.SolrConnector.getSolrServer(SolrConnector.java:77)
    at connector.QueryServlet.doGet(QueryServlet.java:117)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:621)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:722)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:305)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:309)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.lucene.index.CorruptIndexException: Unknown format version: -11
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:247)
    at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:72)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:683)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:476)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:403)
    at org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:38)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1057)
    ... 26 more

SUGGESTIONS?

thanks,
alessio

On 25 February 2012 10:52, alessio crisantemi <alessio.crisant...@gmail.com> wrote:

> this is the problem! Because in my root there is a URL!
>
> I write you my step-by-step configuration of Nutch
> (I use Cygwin because I work on Windows):
>
> *1. Extract the Nutch package*
>
> *2. Configure Solr*
>
> *a. Copy the provided Nutch schema from directory apache-nutch-1.0/conf to
> directory apache-solr-1.3.0/example/solr/conf (override the existing file).*
>
> To allow Solr to create the snippets for search results we need to store
> the content in addition to indexing it:
>
> *b. Change schema.xml so that the stored attribute of field "content" is
> true.*
>
> <field name="content" type="text" stored="true" indexed="true"/>
>
> We want to be able to tweak the relevancy of queries easily, so we'll
> create a new dismax request handler configuration for our use case:
>
> *d.*
> *Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste the
> following fragment into it:*
>
> <requestHandler name="/nutch" class="solr.SearchHandler">
>   <lst name="defaults">
>     <str name="defType">dismax</str>
>     <str name="echoParams">explicit</str>
>     <float name="tie">0.01</float>
>     <str name="qf">
>       content^0.5 anchor^1.0 title^1.2
>     </str>
>     <str name="pf">
>       content^0.5 anchor^1.5 title^1.2 site^1.5
>     </str>
>     <str name="fl">url</str>
>     <str name="mm">2<-1 5<-2 6<90%</str>
>     <int name="ps">100</int>
>     <bool name="hl">true</bool>
>     <str name="q.alt">*:*</str>
>     <str name="hl.fl">title url content</str>
>     <str name="f.title.hl.fragsize">0</str>
>     <str name="f.title.hl.alternateField">title</str>
>     <str name="f.url.hl.fragsize">0</str>
>     <str name="f.url.hl.alternateField">url</str>
>     <str name="f.content.hl.fragmenter">regex</str>
>   </lst>
> </requestHandler>
>
> *3. Start Solr*
>
> cd apache-solr-1.3.0/example
> java -jar start.jar
>
> *4. Configure Nutch*
>
> *a. Open nutch-site.xml in directory apache-nutch-1.0/conf and replace its
> contents with the following (we specify our crawler name and active plugins,
> and limit the maximum URL count per host per run to 100):*
>
> <?xml version="1.0"?>
> <configuration>
>   <property>
>     <name>http.agent.name</name>
>     <value>nutch-solr-integration</value>
>   </property>
>   <property>
>     <name>generate.max.per.host</name>
>     <value>100</value>
>   </property>
>   <property>
>     <name>plugin.includes</name>
>     <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   </property>
> </configuration>
>
> *b.*
> *Open regex-urlfilter.txt in directory apache-nutch-1.0/conf and replace
> its contents with the following:*
>
> -^(https|telnet|file|ftp|mailto):
>
> # skip some suffixes
> -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # allow urls in the google.it domain
> +^http:*//([a-z0-9\-A-Z]*\.)*google.it/*
>
> # deny anything else
> -.
>
> *5. Create a seed list (the initial URLs to fetch)*
>
> mkdir urls (creates a folder 'urls')
> echo "http://www.google.it/" > urls/seed.txt
>
> *6. Inject the seed URL(s) into the Nutch crawldb (execute in the Nutch
> directory)*
>
> bin/nutch inject crawl/crawldb urls
>
> AND HERE, THE ERROR MESSAGE about the empty path. Why, in your opinion?
>
> thank you
> alessio
>
> On 24 February 2012 17:51, tamanjit.bin...@yahoo.co.in <
> tamanjit.bin...@yahoo.co.in> wrote:
>
>> The empty path message is because Nutch is unable to find a URL in the
>> URL location that you provide.
>>
>> Kindly ensure there is a URL there.
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/nutch-and-solr-tp3765166p3773089.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
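A quick way to sanity-check the regex-urlfilter.txt rules from step 4b is to replay them offline before crawling. The sketch below is a hypothetical Python approximation (not part of Nutch) of the RegexURLFilter semantics: rules are applied top to bottom, the first pattern found anywhere in the URL decides, '+' accepts and '-' rejects. The rule set mirrors the file above, with `re.IGNORECASE` standing in for the explicit upper/lower-case suffix pairs:

```python
import re

# Approximation of the regex-urlfilter.txt rules from step 4b.
# Nutch's RegexURLFilter applies rules in order; the first rule whose
# pattern is found in the URL decides whether it is accepted.
RULES = [
    # skip non-http protocols
    ("-", re.compile(r"^(https|telnet|file|ftp|mailto):")),
    # skip some suffixes (re.IGNORECASE replaces the explicit case pairs)
    ("-", re.compile(
        r"\.(swf|doc|mp3|wmv|txt|rtf|avi|m3u|flv|wav|mp4|rss|xml|pdf|js|"
        r"gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|"
        r"exe|jpeg|bmp)$", re.IGNORECASE)),
    # skip URLs containing probable query characters
    ("-", re.compile(r"[?*!@=]")),
    # allow urls in the google.it domain
    ("+", re.compile(r"^http://([a-zA-Z0-9-]*\.)*google\.it/")),
    # deny anything else
    ("-", re.compile(r".")),
]

def accepts(url):
    """True if the filter would let Nutch fetch this URL."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: the URL is dropped

if __name__ == "__main__":
    for url in ("http://www.google.it/",
                "https://www.google.it/",
                "http://www.google.it/logo.jpg",
                "http://www.example.com/"):
        print(url, "->", accepts(url))
```

Running this should confirm that the seed http://www.google.it/ is the only URL of the four that passes, which is worth checking before blaming the inject step: a filter that rejects the seed produces an empty fetch list.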