Hi all, I am having some trouble integrating Nutch with Solr and Tomcat. I followed the Nutch tutorial for the integration, and crawling a website now works correctly. But when I try the Solr integration, I cannot index into Solr.
Below is the Nutch output after running:

bin/nutch crawl urls -solr http://127.0.0.1:8983/solr/ -depth 3 -topN 5

In it I see: "java.lang.RuntimeException: Invalid version (expected 2, but 1) or the data in not in 'javabin' format". Could this be an incompatibility between Nutch 1.4 and Solr 1.4.1? Does Nutch 1.4 require a Solr 3.x version?

Thanks,
a.

crawl started in: crawl-20120203151719
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=http://127.0.0.1:8983/solr/
topN = 5
Injector: starting at 2012-02-03 15:17:20
Injector: crawlDb: crawl-20120203151719/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-02-03 15:17:31, elapsed: 00:00:10
Generator: starting at 2012-02-03 15:17:31
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl-20120203151719/segments/20120203151735
Generator: finished at 2012-02-03 15:17:39, elapsed: 00:00:07
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2012-02-03 15:17:39
Fetcher: segment: crawl-20120203151719/segments/20120203151735
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.gioconews.it/
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
fetch of http://www.gioconews.it/ failed with: java.net.UnknownHostException: www.gioconews.it
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-02-03 15:17:44, elapsed: 00:00:05
ParseSegment: starting at 2012-02-03 15:17:44
ParseSegment: segment: crawl-20120203151719/segments/20120203151735
ParseSegment: finished at 2012-02-03 15:17:48, elapsed: 00:00:04
CrawlDb update: starting at 2012-02-03 15:17:48
CrawlDb update: db: crawl-20120203151719/crawldb
CrawlDb update: segments: [crawl-20120203151719/segments/20120203151735]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-02-03 15:17:53, elapsed: 00:00:05
Generator: starting at 2012-02-03 15:17:53
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2012-02-03 15:17:57
LinkDb: linkdb: crawl-20120203151719/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/C:/temp/apache-nutch-1.4-bin/runtime/local/crawl-20120203151719/segments/20120203151735
LinkDb: finished at 2012-02-03 15:18:01, elapsed: 00:00:04
SolrIndexer: starting at 2012-02-03 15:18:01
java.lang.RuntimeException: Invalid version (expected 2, but 1) or the data in not in 'javabin' format
SolrDeleteDuplicates: starting at 2012-02-03 15:18:09
SolrDeleteDuplicates: Solr url: http://127.0.0.1:8983/solr/
Exception in thread "main" java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error executing query
	at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:200)
	at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
	at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
	at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
	at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing query
	at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
	at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
	at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:198)
	... 9 more
Caused by: java.lang.RuntimeException: Invalid version (expected 2, but 1) or the data in not in 'javabin' format
	at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
	at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:41)
	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:472)
	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
	at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
	... 11 more

Alessio@PC-Alessio /cygdrive/c/temp/apache-nutch-1.4-bin/runtime/local
$ bin/nutch crawl urls -solr http://127.0.0.1:8983/solr/ -depth 3 -topN 5
crawl started in: crawl-20120203162510
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=http://127.0.0.1:8983/solr/
topN = 5
Injector: starting at 2012-02-03 16:25:11
Injector: crawlDb: crawl-20120203162510/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-02-03 16:25:20, elapsed: 00:00:09
Generator: starting at 2012-02-03 16:25:20
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl-20120203162510/segments/20120203162525
Generator: finished at 2012-02-03 16:25:28, elapsed: 00:00:08
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2012-02-03 16:25:28
Fetcher: segment: crawl-20120203162510/segments/20120203162525
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.gioconews.it/
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
fetch of http://www.gioconews.it/ failed with: java.net.UnknownHostException: www.gioconews.it
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-02-03 16:25:47, elapsed: 00:00:18
ParseSegment: starting at 2012-02-03 16:25:47
ParseSegment: segment: crawl-20120203162510/segments/20120203162525
ParseSegment: finished at 2012-02-03 16:25:51, elapsed: 00:00:04
CrawlDb update: starting at 2012-02-03 16:25:52
CrawlDb update: db: crawl-20120203162510/crawldb
CrawlDb update: segments: [crawl-20120203162510/segments/20120203162525]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-02-03 16:25:57, elapsed: 00:00:05
Generator: starting at 2012-02-03 16:25:58
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2012-02-03 16:26:01
LinkDb: linkdb: crawl-20120203162510/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/C:/temp/apache-nutch-1.4-bin/runtime/local/crawl-20120203162510/segments/20120203162525
LinkDb: finished at 2012-02-03 16:26:05, elapsed: 00:00:04
SolrIndexer: starting at 2012-02-03 16:26:06
java.lang.RuntimeException: Invalid version (expected 2, but 1) or the data in not in 'javabin' format
SolrDeleteDuplicates: starting at 2012-02-03 16:26:13
SolrDeleteDuplicates: Solr url: http://127.0.0.1:8983/solr/
Exception in thread "main" java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error executing query
	at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:200)
	at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
	at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
	at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
	at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing query
	at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
	at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
	at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:198)
	... 9 more
Caused by: java.lang.RuntimeException: Invalid version (expected 2, but 1) or the data in not in 'javabin' format
	at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
	at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:41)
	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:472)
	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
	at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
	... 11 more
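Unrelated to the javabin error, both runs also warn that 'http.agent.name' should be listed first in 'http.robots.agents'. A minimal pair of entries in conf/nutch-site.xml along these lines should silence that warning (the agent name "MyCrawler" is just a placeholder, substitute your own):

```xml
<!-- conf/nutch-site.xml: "MyCrawler" is a placeholder agent name -->
<property>
  <name>http.agent.name</name>
  <value>MyCrawler</value>
</property>
<property>
  <name>http.robots.agents</name>
  <!-- your agent name must come first, with '*' as the fallback -->
  <value>MyCrawler,*</value>
</property>
```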
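Editor's note on the error itself: the stack trace points at SolrJ's JavaBinCodec.unmarshal, which reads the first byte of a javabin response as a protocol version number. My understanding (an assumption, not verified against the Solr source) is that the SolrJ shipped with Nutch 1.4 expects javabin version 2, while Solr 1.4.1 still answers with version 1, which would explain "expected 2, but 1" and point at upgrading Solr to a 3.x release. A toy Python sketch of that version check, with a hypothetical unmarshal helper just to illustrate the handshake (this is not Solr's actual decoder):

```python
def unmarshal(payload: bytes, expected_version: int = 2) -> bytes:
    """Toy model of a javabin-style decoder: the first byte of the
    response is the protocol version; a mismatch raises the same
    kind of error seen in the Nutch log above."""
    version = payload[0]
    if version != expected_version:
        raise RuntimeError(
            f"Invalid version (expected {expected_version}, but {version}) "
            "or the data in not in 'javabin' format"
        )
    return payload[1:]  # remaining bytes would be the encoded response

# A version-2 payload decodes fine...
body = unmarshal(bytes([2, 42]))
# ...while a version-1 payload reproduces the error from the log.
try:
    unmarshal(bytes([1, 42]))
except RuntimeError as e:
    print(e)
```

The practical takeaway is that the client and server must agree on the javabin version, so either the Solr side or the SolrJ jar on the Nutch side has to change.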