Re: nutch in solr

Matthew Parker Sun, 05 Feb 2012 07:58:16 -0800

Doesn't Tomcat run on port 8080 by default, not port 8983? Or did you change
Tomcat's default port to 8983?
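
One quick way to answer the port question is to probe both ports from the machine that runs Nutch. This is only a minimal sketch, assuming Python is available there. The helper name `is_port_open` is hypothetical. 8080 is Tomcat's stock default and 8983 is the port used by Solr's bundled Jetty example. An open port only shows that something is listening, not that it is Solr.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumed candidates: Tomcat's default port and the Solr example Jetty port.
    for port in (8080, 8983):
        state = "open" if is_port_open("127.0.0.1", port) else "closed"
        print(f"127.0.0.1:{port} is {state}")
```

If only 8080 turns out to be open, the `-solr` URL passed to `bin/nutch crawl` would presumably need to point at that port instead.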
On Feb 5, 2012 5:17 AM, "alessio crisantemi" <[email protected]>
wrote:


> Hi All,
> I have some problems integrating Nutch with Solr and Tomcat.
>
> I followed the Nutch tutorial for the integration, and now I can crawl a
> website: all works fine.
> But if I try the Solr integration, I can't index into Solr.
>
> The Nutch output after this command follows below:
> bin/nutch crawl urls -solr http://127.0.0.1:8983/solr/ -depth 3 -topN 5
>
> I see "java.lang.RuntimeException: Invalid version (expected 2, but 1) or
> the data in not in 'javabin' format".
> Maybe there is a compatibility problem between Nutch 1.4 and Solr 1.4.1?
> Maybe it requires a Solr 3.x version?
>
> thanks,
> a.
>
> crawl started in: crawl-20120203151719
> rootUrlDir = urls
> threads = 10
> depth = 3
> solrUrl=http://127.0.0.1:8983/solr/
> topN = 5
> Injector: starting at 2012-02-03 15:17:20
> Injector: crawlDb: crawl-20120203151719/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2012-02-03 15:17:31, elapsed: 00:00:10
> Generator: starting at 2012-02-03 15:17:31
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 5
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl-20120203151719/segments/20120203151735
> Generator: finished at 2012-02-03 15:17:39, elapsed: 00:00:07
> Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> Fetcher: starting at 2012-02-03 15:17:39
> Fetcher: segment: crawl-20120203151719/segments/20120203151735
> Using queue mode : byHost
> Fetcher: threads: 10
> Fetcher: time-out divisor: 2
> QueueFeeder finished: total 1 records + hit by time limit :0
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> fetching http://www.gioconews.it/
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=3
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Fetcher: throughput threshold: -1
> -finishing thread FetcherThread, activeThreads=1
> Fetcher: throughput threshold retries: 5
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> fetch of http://www.gioconews.it/ failed with:
> java.net.UnknownHostException: www.gioconews.it
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2012-02-03 15:17:44, elapsed: 00:00:05
> ParseSegment: starting at 2012-02-03 15:17:44
> ParseSegment: segment: crawl-20120203151719/segments/20120203151735
> ParseSegment: finished at 2012-02-03 15:17:48, elapsed: 00:00:04
> CrawlDb update: starting at 2012-02-03 15:17:48
> CrawlDb update: db: crawl-20120203151719/crawldb
> CrawlDb update: segments: [crawl-20120203151719/segments/20120203151735]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: 404 purging: false
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2012-02-03 15:17:53, elapsed: 00:00:05
> Generator: starting at 2012-02-03 15:17:53
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 5
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=1 - no more URLs to fetch.
> LinkDb: starting at 2012-02-03 15:17:57
> LinkDb: linkdb: crawl-20120203151719/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: file:/C:/temp/apache-nutch-1.4-bin/runtime/local/crawl-20120203151719/segments/20120203151735
> LinkDb: finished at 2012-02-03 15:18:01, elapsed: 00:00:04
> SolrIndexer: starting at 2012-02-03 15:18:01
> java.lang.RuntimeException: Invalid version (expected 2, but 1) or the data in not in 'javabin' format
> SolrDeleteDuplicates: starting at 2012-02-03 15:18:09
> SolrDeleteDuplicates: Solr url: http://127.0.0.1:8983/solr/
> Exception in thread "main" java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error executing query
>        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:200)
>        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
>        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
>        at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
> Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing query
>        at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
>        at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
>        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:198)
>        ... 9 more
> Caused by: java.lang.RuntimeException: Invalid version (expected 2, but 1) or the data in not in 'javabin' format
>        at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
>        at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:41)
>        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:472)
>        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
>        at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
>        ... 11 more
> Alessio@PC-Alessio /cygdrive/c/temp/apache-nutch-1.4-bin/runtime/local
> $ bin/nutch crawl urls -solr http://127.0.0.1:8983/solr/ -depth 3 -topN 5
> crawl started in: crawl-20120203162510
> rootUrlDir = urls
> threads = 10
> depth = 3
> solrUrl=http://127.0.0.1:8983/solr/
> topN = 5
> Injector: starting at 2012-02-03 16:25:11
> Injector: crawlDb: crawl-20120203162510/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2012-02-03 16:25:20, elapsed: 00:00:09
> Generator: starting at 2012-02-03 16:25:20
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 5
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl-20120203162510/segments/20120203162525
> Generator: finished at 2012-02-03 16:25:28, elapsed: 00:00:08
> Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> Fetcher: starting at 2012-02-03 16:25:28
> Fetcher: segment: crawl-20120203162510/segments/20120203162525
> Using queue mode : byHost
> Fetcher: threads: 10
> Fetcher: time-out divisor: 2
> QueueFeeder finished: total 1 records + hit by time limit :0
> Using queue mode : byHost
> Using queue mode : byHost
> fetching http://www.gioconews.it/
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold retries: 5
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=3
> -finishing thread FetcherThread, activeThreads=6
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=1
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> fetch of http://www.gioconews.it/ failed with:
> java.net.UnknownHostException: www.gioconews.it
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2012-02-03 16:25:47, elapsed: 00:00:18
> ParseSegment: starting at 2012-02-03 16:25:47
> ParseSegment: segment: crawl-20120203162510/segments/20120203162525
> ParseSegment: finished at 2012-02-03 16:25:51, elapsed: 00:00:04
> CrawlDb update: starting at 2012-02-03 16:25:52
> CrawlDb update: db: crawl-20120203162510/crawldb
> CrawlDb update: segments: [crawl-20120203162510/segments/20120203162525]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: 404 purging: false
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2012-02-03 16:25:57, elapsed: 00:00:05
> Generator: starting at 2012-02-03 16:25:58
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 5
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=1 - no more URLs to fetch.
> LinkDb: starting at 2012-02-03 16:26:01
> LinkDb: linkdb: crawl-20120203162510/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: file:/C:/temp/apache-nutch-1.4-bin/runtime/local/crawl-20120203162510/segments/20120203162525
> LinkDb: finished at 2012-02-03 16:26:05, elapsed: 00:00:04
> SolrIndexer: starting at 2012-02-03 16:26:06
> java.lang.RuntimeException: Invalid version (expected 2, but 1) or the data in not in 'javabin' format
> SolrDeleteDuplicates: starting at 2012-02-03 16:26:13
> SolrDeleteDuplicates: Solr url: http://127.0.0.1:8983/solr/
> Exception in thread "main" java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error executing query
>        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:200)
>        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
>        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
>        at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
> Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing query
>        at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
>        at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
>        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:198)
>        ... 9 more
> Caused by: java.lang.RuntimeException: Invalid version (expected 2, but 1) or the data in not in 'javabin' format
>        at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
>        at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:41)
>        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:472)
>        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
>        at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
>        ... 11 more
>

