solrj is the Solr Java client library. There seem to be two versions here, 1.4.1 and 3.4.0, which are incompatible, so you can do the following:
refer to https://github.com/geek4377/nutch/commit/c66bf35ff4f86393413621b3b889b1c78281df4d to see how to upgrade the Solr version in Nutch. The above example replaces Solr 1.4.0 with 3.1.0.

On Sun, Feb 5, 2012 at 11:02 PM, alessio crisantemi
<alessio.crisant...@gmail.com> wrote:
> if I look in the Solr and Nutch libs I find:
> apache-solr-solrj-1.4.1.jar in Solr
> and
> solr-solrj-3.4.0.jar
>
> these are the only jar files with the word 'solrj'....
> is that the problem?!
>
> 2012/2/5 Geek Gamer <geek4...@gmail.com>
>
>> looks like the solrj version in the Nutch classpath is different from the
>> Solr version on the server.
>> can you post the versions for both Nutch and Solr?
>>
>>
>> On Sun, Feb 5, 2012 at 10:24 PM, alessio crisantemi
>> <alessio.crisant...@gmail.com> wrote:
>> > no, all run on port 8983.
>> > ..
>> >
>> > 2012/2/5 Matthew Parker <mpar...@apogeeintegration.com>
>> >
>> >> Doesn't Tomcat run on port 8080, and not port 8983? Or did you change
>> >> Tomcat's default port to 8983?
>> >> On Feb 5, 2012 5:17 AM, "alessio crisantemi"
>> >> <alessio.crisant...@gmail.com> wrote:
>> >>
>> >> > Hi All,
>> >> > I have some problems with the integration of Nutch with Solr and
>> >> > Tomcat.
>> >> >
>> >> > I followed the Nutch tutorial for the integration and now I can crawl
>> >> > a website: all works right.
>> >> > But if I try the Solr integration, I can't index into Solr.
>> >> >
>> >> > Below is the Nutch output after the command:
>> >> > bin/nutch crawl urls -solr http://127.0.0.1:8983/solr/ -depth 3 -topN 5
>> >> >
>> >> > I read "java.lang.RuntimeException: Invalid version (expected 2, but 1)
>> >> > or the data in not in 'javabin' format"
>> >> > Maybe there is a problem between the Nutch 1.4 version and Solr 1.4.1?
>> >> > Maybe it requires a 3.x Solr version?
>> >> >
>> >> > thanks,
>> >> > a.
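To see why the two jars named above clash: the javabin wire format changed between the 1.x and 3.x lines, so the client and server major versions must match. The following is only an illustrative sketch (not Nutch or Solr API; the jar names are the ones quoted in this thread) of parsing the solrj version out of a jar name and comparing majors:

```java
// Illustrative sketch only -- not code from Nutch or Solr.
// Parses the version out of a solrj jar filename and checks whether
// two versions share a major version (javabin changed between 1.x and 3.x).
public class SolrjJarCheck {

    // "apache-solr-solrj-1.4.1.jar" -> "1.4.1"; null if no version found
    static String parseVersion(String jarName) {
        java.util.regex.Matcher m = java.util.regex.Pattern
                .compile("solrj-(\\d+(?:\\.\\d+)*)\\.jar$")
                .matcher(jarName);
        return m.find() ? m.group(1) : null;
    }

    // Compatible here means: same major version on both sides of the wire.
    static boolean compatible(String clientVersion, String serverVersion) {
        return clientVersion.split("\\.")[0]
                .equals(serverVersion.split("\\.")[0]);
    }

    public static void main(String[] args) {
        String client = parseVersion("apache-solr-solrj-1.4.1.jar"); // Nutch side
        String server = parseVersion("solr-solrj-3.4.0.jar");        // Solr side
        // prints: client=1.4.1 server=3.4.0 compatible=false
        System.out.println("client=" + client + " server=" + server
                + " compatible=" + compatible(client, server));
    }
}
```

With 1.4.1 on one side and 3.4.0 on the other the majors differ, which is exactly the mismatch the thread is chasing; the real fix is swapping the jar as in the linked commit, not running a checker like this.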
>> >> >
>> >> > crawl started in: crawl-20120203151719
>> >> > rootUrlDir = urls
>> >> > threads = 10
>> >> > depth = 3
>> >> > solrUrl=http://127.0.0.1:8983/solr/
>> >> > topN = 5
>> >> > Injector: starting at 2012-02-03 15:17:20
>> >> > Injector: crawlDb: crawl-20120203151719/crawldb
>> >> > Injector: urlDir: urls
>> >> > Injector: Converting injected urls to crawl db entries.
>> >> > Injector: Merging injected urls into crawl db.
>> >> > Injector: finished at 2012-02-03 15:17:31, elapsed: 00:00:10
>> >> > Generator: starting at 2012-02-03 15:17:31
>> >> > Generator: Selecting best-scoring urls due for fetch.
>> >> > Generator: filtering: true
>> >> > Generator: normalizing: true
>> >> > Generator: topN: 5
>> >> > Generator: jobtracker is 'local', generating exactly one partition.
>> >> > Generator: Partitioning selected urls for politeness.
>> >> > Generator: segment: crawl-20120203151719/segments/20120203151735
>> >> > Generator: finished at 2012-02-03 15:17:39, elapsed: 00:00:07
>> >> > Fetcher: Your 'http.agent.name' value should be listed first in
>> >> > 'http.robots.agents' property.
>> >> > Fetcher: starting at 2012-02-03 15:17:39
>> >> > Fetcher: segment: crawl-20120203151719/segments/20120203151735
>> >> > Using queue mode : byHost
>> >> > Fetcher: threads: 10
>> >> > Fetcher: time-out divisor: 2
>> >> > QueueFeeder finished: total 1 records + hit by time limit :0
>> >> > Using queue mode : byHost
>> >> > Using queue mode : byHost
>> >> > Using queue mode : byHost
>> >> > fetching http://www.gioconews.it/
>> >> > Using queue mode : byHost
>> >> > -finishing thread FetcherThread, activeThreads=3
>> >> > -finishing thread FetcherThread, activeThreads=2
>> >> > -finishing thread FetcherThread, activeThreads=1
>> >> > Using queue mode : byHost
>> >> > Using queue mode : byHost
>> >> > -finishing thread FetcherThread, activeThreads=1
>> >> > -finishing thread FetcherThread, activeThreads=1
>> >> > Using queue mode : byHost
>> >> > Using queue mode : byHost
>> >> > Using queue mode : byHost
>> >> > Using queue mode : byHost
>> >> > -finishing thread FetcherThread, activeThreads=1
>> >> > Fetcher: throughput threshold: -1
>> >> > -finishing thread FetcherThread, activeThreads=1
>> >> > Fetcher: throughput threshold retries: 5
>> >> > -finishing thread FetcherThread, activeThreads=1
>> >> > -finishing thread FetcherThread, activeThreads=1
>> >> > fetch of http://www.gioconews.it/ failed with:
>> >> > java.net.UnknownHostException: www.gioconews.it
>> >> > -finishing thread FetcherThread, activeThreads=0
>> >> > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>> >> > -activeThreads=0
>> >> > Fetcher: finished at 2012-02-03 15:17:44, elapsed: 00:00:05
>> >> > ParseSegment: starting at 2012-02-03 15:17:44
>> >> > ParseSegment: segment: crawl-20120203151719/segments/20120203151735
>> >> > ParseSegment: finished at 2012-02-03 15:17:48, elapsed: 00:00:04
>> >> > CrawlDb update: starting at 2012-02-03 15:17:48
>> >> > CrawlDb update: db: crawl-20120203151719/crawldb
>> >> > CrawlDb update: segments: [crawl-20120203151719/segments/20120203151735]
>> >> > CrawlDb update: additions allowed: true
>> >> > CrawlDb update: URL normalizing: true
>> >> > CrawlDb update: URL filtering: true
>> >> > CrawlDb update: 404 purging: false
>> >> > CrawlDb update: Merging segment data into db.
>> >> > CrawlDb update: finished at 2012-02-03 15:17:53, elapsed: 00:00:05
>> >> > Generator: starting at 2012-02-03 15:17:53
>> >> > Generator: Selecting best-scoring urls due for fetch.
>> >> > Generator: filtering: true
>> >> > Generator: normalizing: true
>> >> > Generator: topN: 5
>> >> > Generator: jobtracker is 'local', generating exactly one partition.
>> >> > Generator: 0 records selected for fetching, exiting ...
>> >> > Stopping at depth=1 - no more URLs to fetch.
>> >> > LinkDb: starting at 2012-02-03 15:17:57
>> >> > LinkDb: linkdb: crawl-20120203151719/linkdb
>> >> > LinkDb: URL normalize: true
>> >> > LinkDb: URL filter: true
>> >> > LinkDb: adding segment:
>> >> > file:/C:/temp/apache-nutch-1.4-bin/runtime/local/crawl-20120203151719/segments/20120203151735
>> >> > LinkDb: finished at 2012-02-03 15:18:01, elapsed: 00:00:04
>> >> > SolrIndexer: starting at 2012-02-03 15:18:01
>> >> > java.lang.RuntimeException: Invalid version (expected 2, but 1) or the
>> >> > data in not in 'javabin' format
>> >> > SolrDeleteDuplicates: starting at 2012-02-03 15:18:09
>> >> > SolrDeleteDuplicates: Solr url: http://127.0.0.1:8983/solr/
>> >> > Exception in thread "main" java.io.IOException:
>> >> > org.apache.solr.client.solrj.SolrServerException: Error executing query
>> >> >         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:200)
>> >> >         at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>> >> >         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>> >> >         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>> >> >         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>> >> >         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
>> >> >         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
>> >> >         at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
>> >> >         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> >> >         at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>> >> > Caused by: org.apache.solr.client.solrj.SolrServerException: Error
>> >> > executing query
>> >> >         at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
>> >> >         at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
>> >> >         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:198)
>> >> >         ... 9 more
>> >> > Caused by: java.lang.RuntimeException: Invalid version (expected 2, but
>> >> > 1) or the data in not in 'javabin' format
>> >> >         at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
>> >> >         at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:41)
>> >> >         at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:472)
>> >> >         at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
>> >> >         at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
>> >> >         ... 11 more
>> >> > Alessio@PC-Alessio /cygdrive/c/temp/apache-nutch-1.4-bin/runtime/local
>> >> > $ bin/nutch crawl urls -solr http://127.0.0.1:8983/solr/ -depth 3 -topN 5
>> >> > crawl started in: crawl-20120203162510
>> >> > rootUrlDir = urls
>> >> > threads = 10
>> >> > depth = 3
>> >> > solrUrl=http://127.0.0.1:8983/solr/
>> >> > topN = 5
>> >> > Injector: starting at 2012-02-03 16:25:11
>> >> > Injector: crawlDb: crawl-20120203162510/crawldb
>> >> > Injector: urlDir: urls
>> >> > Injector: Converting injected urls to crawl db entries.
>> >> > Injector: Merging injected urls into crawl db.
>> >> > Injector: finished at 2012-02-03 16:25:20, elapsed: 00:00:09
>> >> > Generator: starting at 2012-02-03 16:25:20
>> >> > Generator: Selecting best-scoring urls due for fetch.
>> >> > Generator: filtering: true
>> >> > Generator: normalizing: true
>> >> > Generator: topN: 5
>> >> > Generator: jobtracker is 'local', generating exactly one partition.
>> >> > Generator: Partitioning selected urls for politeness.
>> >> > Generator: segment: crawl-20120203162510/segments/20120203162525
>> >> > Generator: finished at 2012-02-03 16:25:28, elapsed: 00:00:08
>> >> > Fetcher: Your 'http.agent.name' value should be listed first in
>> >> > 'http.robots.agents' property.
>> >> > Fetcher: starting at 2012-02-03 16:25:28
>> >> > Fetcher: segment: crawl-20120203162510/segments/20120203162525
>> >> > Using queue mode : byHost
>> >> > Fetcher: threads: 10
>> >> > Fetcher: time-out divisor: 2
>> >> > QueueFeeder finished: total 1 records + hit by time limit :0
>> >> > Using queue mode : byHost
>> >> > Using queue mode : byHost
>> >> > fetching http://www.gioconews.it/
>> >> > Using queue mode : byHost
>> >> > Using queue mode : byHost
>> >> > Using queue mode : byHost
>> >> > Using queue mode : byHost
>> >> > Using queue mode : byHost
>> >> > Using queue mode : byHost
>> >> > Using queue mode : byHost
>> >> > Using queue mode : byHost
>> >> > Fetcher: throughput threshold: -1
>> >> > Fetcher: throughput threshold retries: 5
>> >> > -finishing thread FetcherThread, activeThreads=2
>> >> > -finishing thread FetcherThread, activeThreads=3
>> >> > -finishing thread FetcherThread, activeThreads=6
>> >> > -finishing thread FetcherThread, activeThreads=5
>> >> > -finishing thread FetcherThread, activeThreads=5
>> >> > -finishing thread FetcherThread, activeThreads=4
>> >> > -finishing thread FetcherThread, activeThreads=3
>> >> > -finishing thread FetcherThread, activeThreads=2
>> >> > -finishing thread FetcherThread, activeThreads=1
>> >> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>> >> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>> >> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>> >> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>> >> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>> >> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>> >> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>> >> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>> >> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>> >> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>> >> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>> >> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>> >> > fetch of http://www.gioconews.it/ failed with:
>> >> > java.net.UnknownHostException: www.gioconews.it
>> >> > -finishing thread FetcherThread, activeThreads=0
>> >> > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>> >> > -activeThreads=0
>> >> > Fetcher: finished at 2012-02-03 16:25:47, elapsed: 00:00:18
>> >> > ParseSegment: starting at 2012-02-03 16:25:47
>> >> > ParseSegment: segment: crawl-20120203162510/segments/20120203162525
>> >> > ParseSegment: finished at 2012-02-03 16:25:51, elapsed: 00:00:04
>> >> > CrawlDb update: starting at 2012-02-03 16:25:52
>> >> > CrawlDb update: db: crawl-20120203162510/crawldb
>> >> > CrawlDb update: segments: [crawl-20120203162510/segments/20120203162525]
>> >> > CrawlDb update: additions allowed: true
>> >> > CrawlDb update: URL normalizing: true
>> >> > CrawlDb update: URL filtering: true
>> >> > CrawlDb update: 404 purging: false
>> >> > CrawlDb update: Merging segment data into db.
>> >> > CrawlDb update: finished at 2012-02-03 16:25:57, elapsed: 00:00:05
>> >> > Generator: starting at 2012-02-03 16:25:58
>> >> > Generator: Selecting best-scoring urls due for fetch.
>> >> > Generator: filtering: true
>> >> > Generator: normalizing: true
>> >> > Generator: topN: 5
>> >> > Generator: jobtracker is 'local', generating exactly one partition.
>> >> > Generator: 0 records selected for fetching, exiting ...
>> >> > Stopping at depth=1 - no more URLs to fetch.
>> >> > LinkDb: starting at 2012-02-03 16:26:01
>> >> > LinkDb: linkdb: crawl-20120203162510/linkdb
>> >> > LinkDb: URL normalize: true
>> >> > LinkDb: URL filter: true
>> >> > LinkDb: adding segment:
>> >> > file:/C:/temp/apache-nutch-1.4-bin/runtime/local/crawl-20120203162510/segments/20120203162525
>> >> > LinkDb: finished at 2012-02-03 16:26:05, elapsed: 00:00:04
>> >> > SolrIndexer: starting at 2012-02-03 16:26:06
>> >> > java.lang.RuntimeException: Invalid version (expected 2, but 1) or the
>> >> > data in not in 'javabin' format
>> >> > SolrDeleteDuplicates: starting at 2012-02-03 16:26:13
>> >> > SolrDeleteDuplicates: Solr url: http://127.0.0.1:8983/solr/
>> >> > Exception in thread "main" java.io.IOException:
>> >> > org.apache.solr.client.solrj.SolrServerException: Error executing query
>> >> >         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:200)
>> >> >         at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>> >> >         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>> >> >         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>> >> >         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>> >> >         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
>> >> >         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
>> >> >         at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
>> >> >         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> >> >         at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>> >> > Caused by: org.apache.solr.client.solrj.SolrServerException: Error
>> >> > executing query
>> >> >         at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
>> >> >         at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
>> >> >         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:198)
>> >> >         ... 9 more
>> >> > Caused by: java.lang.RuntimeException: Invalid version (expected 2, but
>> >> > 1) or the data in not in 'javabin' format
>> >> >         at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
>> >> >         at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:41)
>> >> >         at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:472)
>> >> >         at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
>> >> >         at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
>> >> >         ... 11 more
>> >> >
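The mechanism behind the repeated error in these logs can be sketched: the first byte of a javabin response names the codec version, a SolrJ 3.x client expects 2, and a Solr 1.4 server answers with 1, so the client bails out before parsing anything. The following is only an illustration of that check (not the actual JavaBinCodec code); the error string deliberately reproduces the message from the logs, including its "in not in" typo:

```java
// Illustration of the javabin version check behind the logged error --
// not actual Solr code. A SolrJ 3.x client expects the response stream
// to begin with codec version byte 2; a Solr 1.4 server sends 1.
public class JavabinVersionCheck {
    static final int EXPECTED_VERSION = 2; // what a 3.x client expects

    // Reads the leading version byte of a (simulated) javabin response.
    static int readVersion(byte[] response) {
        int version = response[0];
        if (version != EXPECTED_VERSION) {
            // Mirrors the message seen in the logs, typo and all.
            throw new RuntimeException(
                "Invalid version (expected " + EXPECTED_VERSION + ", but "
                + version + ") or the data in not in 'javabin' format");
        }
        return version;
    }

    public static void main(String[] args) {
        byte[] oldServerReply = {1}; // a Solr 1.4 server's version byte
        try {
            readVersion(oldServerReply);
        } catch (RuntimeException e) {
            // prints: Invalid version (expected 2, but 1) or the data in not in 'javabin' format
            System.out.println(e.getMessage());
        }
    }
}
```

This is why a DNS failure or port mix-up produces a different error: here the server answered, just in an older dialect, which again points at matching the solrj jar on the Nutch side to the Solr server's version.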