Hi all, I am having some trouble integrating Nutch with Solr and Tomcat. I followed the Nutch tutorial for the integration, and crawling a website now works correctly. But when I try the Solr integration, I cannot index into Solr.
Below is the Nutch output after running:

bin/nutch crawl urls -solr http://127.0.0.1:8983/solr/ -depth 3 -topN 5

In it I see: "java.lang.RuntimeException: Invalid version (expected 2, but 1) or the data in not in 'javabin' format". Could this be an incompatibility between Nutch 1.4 and Solr 1.4.1? Does Nutch 1.4 require a Solr 3.x version?

Thanks,
a.

crawl started in: crawl-20120203151719
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=http://127.0.0.1:8983/solr/
topN = 5
Injector: starting at 2012-02-03 15:17:20
Injector: crawlDb: crawl-20120203151719/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-02-03 15:17:31, elapsed: 00:00:10
Generator: starting at 2012-02-03 15:17:31
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl-20120203151719/segments/20120203151735
Generator: finished at 2012-02-03 15:17:39, elapsed: 00:00:07
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2012-02-03 15:17:39
Fetcher: segment: crawl-20120203151719/segments/20120203151735
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.gioconews.it/
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
fetch of http://www.gioconews.it/ failed with: java.net.UnknownHostException: www.gioconews.it
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-02-03 15:17:44, elapsed: 00:00:05
ParseSegment: starting at 2012-02-03 15:17:44
ParseSegment: segment: crawl-20120203151719/segments/20120203151735
ParseSegment: finished at 2012-02-03 15:17:48, elapsed: 00:00:04
CrawlDb update: starting at 2012-02-03 15:17:48
CrawlDb update: db: crawl-20120203151719/crawldb
CrawlDb update: segments: [crawl-20120203151719/segments/20120203151735]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-02-03 15:17:53, elapsed: 00:00:05
Generator: starting at 2012-02-03 15:17:53
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2012-02-03 15:17:57
LinkDb: linkdb: crawl-20120203151719/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/C:/temp/apache-nutch-1.4-bin/runtime/local/crawl-20120203151719/segments/20120203151735
LinkDb: finished at 2012-02-03 15:18:01, elapsed: 00:00:04
SolrIndexer: starting at 2012-02-03 15:18:01
java.lang.RuntimeException: Invalid version (expected 2, but 1) or the data in not in 'javabin' format
SolrDeleteDuplicates: starting at 2012-02-03 15:18:09
SolrDeleteDuplicates: Solr url: http://127.0.0.1:8983/solr/
Exception in thread "main" java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error executing query
	at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:200)
	at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
	at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
	at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
	at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing query
	at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
	at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
	at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:198)
	... 9 more
Caused by: java.lang.RuntimeException: Invalid version (expected 2, but 1) or the data in not in 'javabin' format
	at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
	at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:41)
	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:472)
	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
	at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
	... 11 more

Alessio@PC-Alessio /cygdrive/c/temp/apache-nutch-1.4-bin/runtime/local
$ bin/nutch crawl urls -solr http://127.0.0.1:8983/solr/ -depth 3 -topN 5
crawl started in: crawl-20120203162510
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=http://127.0.0.1:8983/solr/
topN = 5
Injector: starting at 2012-02-03 16:25:11
Injector: crawlDb: crawl-20120203162510/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-02-03 16:25:20, elapsed: 00:00:09
Generator: starting at 2012-02-03 16:25:20
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl-20120203162510/segments/20120203162525
Generator: finished at 2012-02-03 16:25:28, elapsed: 00:00:08
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2012-02-03 16:25:28
Fetcher: segment: crawl-20120203162510/segments/20120203162525
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.gioconews.it/
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
fetch of http://www.gioconews.it/ failed with: java.net.UnknownHostException: www.gioconews.it
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-02-03 16:25:47, elapsed: 00:00:18
ParseSegment: starting at 2012-02-03 16:25:47
ParseSegment: segment: crawl-20120203162510/segments/20120203162525
ParseSegment: finished at 2012-02-03 16:25:51, elapsed: 00:00:04
CrawlDb update: starting at 2012-02-03 16:25:52
CrawlDb update: db: crawl-20120203162510/crawldb
CrawlDb update: segments: [crawl-20120203162510/segments/20120203162525]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-02-03 16:25:57, elapsed: 00:00:05
Generator: starting at 2012-02-03 16:25:58
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2012-02-03 16:26:01
LinkDb: linkdb: crawl-20120203162510/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/C:/temp/apache-nutch-1.4-bin/runtime/local/crawl-20120203162510/segments/20120203162525
LinkDb: finished at 2012-02-03 16:26:05, elapsed: 00:00:04
SolrIndexer: starting at 2012-02-03 16:26:06
java.lang.RuntimeException: Invalid version (expected 2, but 1) or the data in not in 'javabin' format
SolrDeleteDuplicates: starting at 2012-02-03 16:26:13
SolrDeleteDuplicates: Solr url: http://127.0.0.1:8983/solr/
Exception in thread "main" java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error executing query
	at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:200)
	at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
	at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
	at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
	at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing query
	at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
	at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
	at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:198)
	... 9 more
Caused by: java.lang.RuntimeException: Invalid version (expected 2, but 1) or the data in not in 'javabin' format
	at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
	at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:41)
	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:472)
	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
	at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
	... 11 more
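Unrelated to the javabin error, both runs also warn that 'http.agent.name' should be listed first in 'http.robots.agents'. A minimal pair of entries in conf/nutch-site.xml along these lines should silence that warning (the agent name "MyCrawler" is just a placeholder, substitute your own):

```xml
<!-- conf/nutch-site.xml: "MyCrawler" is a placeholder agent name -->
<property>
  <name>http.agent.name</name>
  <value>MyCrawler</value>
</property>
<property>
  <name>http.robots.agents</name>
  <!-- your agent name must come first, with '*' as the fallback -->
  <value>MyCrawler,*</value>
</property>
```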
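Editor's note on the error itself: the stack trace points at SolrJ's JavaBinCodec.unmarshal, which reads the first byte of a javabin response as a protocol version number. My understanding (an assumption, not verified against the Solr source) is that the SolrJ shipped with Nutch 1.4 expects javabin version 2, while Solr 1.4.1 still answers with version 1, which would explain "expected 2, but 1" and point at upgrading Solr to a 3.x release. A toy Python sketch of that version check, with a hypothetical unmarshal helper just to illustrate the handshake (this is not Solr's actual decoder):

```python
def unmarshal(payload: bytes, expected_version: int = 2) -> bytes:
    """Toy model of a javabin-style decoder: the first byte of the
    response is the protocol version; a mismatch raises the same
    kind of error seen in the Nutch log above."""
    version = payload[0]
    if version != expected_version:
        raise RuntimeError(
            f"Invalid version (expected {expected_version}, but {version}) "
            "or the data in not in 'javabin' format"
        )
    return payload[1:]  # remaining bytes would be the encoded response

# A version-2 payload decodes fine...
body = unmarshal(bytes([2, 42]))
# ...while a version-1 payload reproduces the error from the log.
try:
    unmarshal(bytes([1, 42]))
except RuntimeError as e:
    print(e)
```

The practical takeaway is that the client and server must agree on the javabin version, so either the Solr side or the SolrJ jar on the Nutch side has to change.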