No, they don't all run on 8983.

Tomcat's default port is 8080.
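
You can confirm what port Tomcat is actually configured to listen on: the
HTTP connector is defined in conf/server.xml. A quick check (assuming a
standard Tomcat layout, with CATALINA_HOME pointing at your install):

    # show the HTTP connector(s) Tomcat is configured with
    grep -n '<Connector' "$CATALINA_HOME/conf/server.xml"
    # a stock install prints something like:
    #   <Connector port="8080" protocol="HTTP/1.1" ... />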

If you're using the embedded server in Solr, you are using Jetty, which
runs on port 8983 by default.
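
If you are not sure which container is answering on a given port, the
Server response header usually tells you (the version strings below are
only examples and will vary with your setup):

    curl -sI http://localhost:8080/ | grep -i '^Server'
    #   e.g.  Server: Apache-Coyote/1.1   <- Tomcat
    curl -sI http://localhost:8983/solr/ | grep -i '^Server'
    #   e.g.  Server: Jetty(6.1.26)       <- Solr's embedded Jetty

The embedded Jetty is the one you get when you start Solr from its
example directory, e.g.:

    cd apache-solr-3.5.0/example
    java -jar start.jar    # listens on 8983 by default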

On Sun, Feb 5, 2012 at 11:54 AM, alessio crisantemi <alessio.crisant...@gmail.com> wrote:

> no, they all run on port 8983.
>
> 2012/2/5 Matthew Parker <mpar...@apogeeintegration.com>
>
> > Doesn't Tomcat run on port 8080, and not port 8983? Or did you change
> > Tomcat's default port to 8983?
> > On Feb 5, 2012 5:17 AM, "alessio crisantemi" <alessio.crisant...@gmail.com> wrote:
> >
> > > Hi All,
> > > I have some problems with integration of Nutch in Solr and Tomcat.
> > >
> > > I followed the Nutch tutorial for integration and now I can crawl a
> > > website: everything works right.
> > > But when I try the Solr integration, I can't index into Solr.
> > >
> > > Below is the Nutch output after running the command:
> > > bin/nutch crawl urls -solr http://127.0.0.1:8983/solr/ -depth 3 -topN 5
> > >
> > > I read "java.lang.RuntimeException: Invalid version (expected 2, but 1)
> > > or the data in not in 'javabin' format".
> > > Maybe there is a problem between the Nutch 1.4 version and Solr 1.4.1?
> > > Does it require a 3.x Solr version?
> > >
> > > thanks,
> > > a.
> > >
> > > crawl started in: crawl-20120203151719
> > > rootUrlDir = urls
> > > threads = 10
> > > depth = 3
> > > solrUrl=http://127.0.0.1:8983/solr/
> > > topN = 5
> > > Injector: starting at 2012-02-03 15:17:20
> > > Injector: crawlDb: crawl-20120203151719/crawldb
> > > Injector: urlDir: urls
> > > Injector: Converting injected urls to crawl db entries.
> > > Injector: Merging injected urls into crawl db.
> > > Injector: finished at 2012-02-03 15:17:31, elapsed: 00:00:10
> > > Generator: starting at 2012-02-03 15:17:31
> > > Generator: Selecting best-scoring urls due for fetch.
> > > Generator: filtering: true
> > > Generator: normalizing: true
> > > Generator: topN: 5
> > > Generator: jobtracker is 'local', generating exactly one partition.
> > > Generator: Partitioning selected urls for politeness.
> > > Generator: segment: crawl-20120203151719/segments/20120203151735
> > > Generator: finished at 2012-02-03 15:17:39, elapsed: 00:00:07
> > > Fetcher: Your 'http.agent.name' value should be listed first in
> > > 'http.robots.agents' property.
> > > Fetcher: starting at 2012-02-03 15:17:39
> > > Fetcher: segment: crawl-20120203151719/segments/20120203151735
> > > Using queue mode : byHost
> > > Fetcher: threads: 10
> > > Fetcher: time-out divisor: 2
> > > QueueFeeder finished: total 1 records + hit by time limit :0
> > > Using queue mode : byHost
> > > Using queue mode : byHost
> > > Using queue mode : byHost
> > > fetching http://www.gioconews.it/
> > > Using queue mode : byHost
> > > -finishing thread FetcherThread, activeThreads=3
> > > -finishing thread FetcherThread, activeThreads=2
> > > -finishing thread FetcherThread, activeThreads=1
> > > Using queue mode : byHost
> > > Using queue mode : byHost
> > > -finishing thread FetcherThread, activeThreads=1
> > > -finishing thread FetcherThread, activeThreads=1
> > > Using queue mode : byHost
> > > Using queue mode : byHost
> > > Using queue mode : byHost
> > > Using queue mode : byHost
> > > -finishing thread FetcherThread, activeThreads=1
> > > Fetcher: throughput threshold: -1
> > > -finishing thread FetcherThread, activeThreads=1
> > > Fetcher: throughput threshold retries: 5
> > > -finishing thread FetcherThread, activeThreads=1
> > > -finishing thread FetcherThread, activeThreads=1
> > > fetch of http://www.gioconews.it/ failed with:
> > > java.net.UnknownHostException: www.gioconews.it
> > > -finishing thread FetcherThread, activeThreads=0
> > > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > > -activeThreads=0
> > > Fetcher: finished at 2012-02-03 15:17:44, elapsed: 00:00:05
> > > ParseSegment: starting at 2012-02-03 15:17:44
> > > ParseSegment: segment: crawl-20120203151719/segments/20120203151735
> > > ParseSegment: finished at 2012-02-03 15:17:48, elapsed: 00:00:04
> > > CrawlDb update: starting at 2012-02-03 15:17:48
> > > CrawlDb update: db: crawl-20120203151719/crawldb
> > > CrawlDb update: segments: [crawl-20120203151719/segments/20120203151735]
> > > CrawlDb update: additions allowed: true
> > > CrawlDb update: URL normalizing: true
> > > CrawlDb update: URL filtering: true
> > > CrawlDb update: 404 purging: false
> > > CrawlDb update: Merging segment data into db.
> > > CrawlDb update: finished at 2012-02-03 15:17:53, elapsed: 00:00:05
> > > Generator: starting at 2012-02-03 15:17:53
> > > Generator: Selecting best-scoring urls due for fetch.
> > > Generator: filtering: true
> > > Generator: normalizing: true
> > > Generator: topN: 5
> > > Generator: jobtracker is 'local', generating exactly one partition.
> > > Generator: 0 records selected for fetching, exiting ...
> > > Stopping at depth=1 - no more URLs to fetch.
> > > LinkDb: starting at 2012-02-03 15:17:57
> > > LinkDb: linkdb: crawl-20120203151719/linkdb
> > > LinkDb: URL normalize: true
> > > LinkDb: URL filter: true
> > > LinkDb: adding segment: file:/C:/temp/apache-nutch-1.4-bin/runtime/local/crawl-20120203151719/segments/20120203151735
> > > LinkDb: finished at 2012-02-03 15:18:01, elapsed: 00:00:04
> > > SolrIndexer: starting at 2012-02-03 15:18:01
> > > java.lang.RuntimeException: Invalid version (expected 2, but 1) or the data in not in 'javabin' format
> > > SolrDeleteDuplicates: starting at 2012-02-03 15:18:09
> > > SolrDeleteDuplicates: Solr url: http://127.0.0.1:8983/solr/
> > > Exception in thread "main" java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error executing query
> > >        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:200)
> > >        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> > >        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> > >        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> > >        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> > >        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
> > >        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
> > >        at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
> > >        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
> > > Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing query
> > >        at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
> > >        at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
> > >        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:198)
> > >        ... 9 more
> > > Caused by: java.lang.RuntimeException: Invalid version (expected 2, but 1) or the data in not in 'javabin' format
> > >        at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
> > >        at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:41)
> > >        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:472)
> > >        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
> > >        at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
> > >        ... 11 more
> > > Alessio@PC-Alessio /cygdrive/c/temp/apache-nutch-1.4-bin/runtime/local
> > > $ bin/nutch crawl urls -solr http://127.0.0.1:8983/solr/ -depth 3 -topN 5
> > > crawl started in: crawl-20120203162510
> > > rootUrlDir = urls
> > > threads = 10
> > > depth = 3
> > > solrUrl=http://127.0.0.1:8983/solr/
> > > topN = 5
> > > Injector: starting at 2012-02-03 16:25:11
> > > Injector: crawlDb: crawl-20120203162510/crawldb
> > > Injector: urlDir: urls
> > > Injector: Converting injected urls to crawl db entries.
> > > Injector: Merging injected urls into crawl db.
> > > Injector: finished at 2012-02-03 16:25:20, elapsed: 00:00:09
> > > Generator: starting at 2012-02-03 16:25:20
> > > Generator: Selecting best-scoring urls due for fetch.
> > > Generator: filtering: true
> > > Generator: normalizing: true
> > > Generator: topN: 5
> > > Generator: jobtracker is 'local', generating exactly one partition.
> > > Generator: Partitioning selected urls for politeness.
> > > Generator: segment: crawl-20120203162510/segments/20120203162525
> > > Generator: finished at 2012-02-03 16:25:28, elapsed: 00:00:08
> > > Fetcher: Your 'http.agent.name' value should be listed first in
> > > 'http.robots.agents' property.
> > > Fetcher: starting at 2012-02-03 16:25:28
> > > Fetcher: segment: crawl-20120203162510/segments/20120203162525
> > > Using queue mode : byHost
> > > Fetcher: threads: 10
> > > Fetcher: time-out divisor: 2
> > > QueueFeeder finished: total 1 records + hit by time limit :0
> > > Using queue mode : byHost
> > > Using queue mode : byHost
> > > fetching http://www.gioconews.it/
> > > Using queue mode : byHost
> > > Using queue mode : byHost
> > > Using queue mode : byHost
> > > Using queue mode : byHost
> > > Using queue mode : byHost
> > > Using queue mode : byHost
> > > Using queue mode : byHost
> > > Using queue mode : byHost
> > > Fetcher: throughput threshold: -1
> > > Fetcher: throughput threshold retries: 5
> > > -finishing thread FetcherThread, activeThreads=2
> > > -finishing thread FetcherThread, activeThreads=3
> > > -finishing thread FetcherThread, activeThreads=6
> > > -finishing thread FetcherThread, activeThreads=5
> > > -finishing thread FetcherThread, activeThreads=5
> > > -finishing thread FetcherThread, activeThreads=4
> > > -finishing thread FetcherThread, activeThreads=3
> > > -finishing thread FetcherThread, activeThreads=2
> > > -finishing thread FetcherThread, activeThreads=1
> > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > fetch of http://www.gioconews.it/ failed with:
> > > java.net.UnknownHostException: www.gioconews.it
> > > -finishing thread FetcherThread, activeThreads=0
> > > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > > -activeThreads=0
> > > Fetcher: finished at 2012-02-03 16:25:47, elapsed: 00:00:18
> > > ParseSegment: starting at 2012-02-03 16:25:47
> > > ParseSegment: segment: crawl-20120203162510/segments/20120203162525
> > > ParseSegment: finished at 2012-02-03 16:25:51, elapsed: 00:00:04
> > > CrawlDb update: starting at 2012-02-03 16:25:52
> > > CrawlDb update: db: crawl-20120203162510/crawldb
> > > CrawlDb update: segments: [crawl-20120203162510/segments/20120203162525]
> > > CrawlDb update: additions allowed: true
> > > CrawlDb update: URL normalizing: true
> > > CrawlDb update: URL filtering: true
> > > CrawlDb update: 404 purging: false
> > > CrawlDb update: Merging segment data into db.
> > > CrawlDb update: finished at 2012-02-03 16:25:57, elapsed: 00:00:05
> > > Generator: starting at 2012-02-03 16:25:58
> > > Generator: Selecting best-scoring urls due for fetch.
> > > Generator: filtering: true
> > > Generator: normalizing: true
> > > Generator: topN: 5
> > > Generator: jobtracker is 'local', generating exactly one partition.
> > > Generator: 0 records selected for fetching, exiting ...
> > > Stopping at depth=1 - no more URLs to fetch.
> > > LinkDb: starting at 2012-02-03 16:26:01
> > > LinkDb: linkdb: crawl-20120203162510/linkdb
> > > LinkDb: URL normalize: true
> > > LinkDb: URL filter: true
> > > LinkDb: adding segment: file:/C:/temp/apache-nutch-1.4-bin/runtime/local/crawl-20120203162510/segments/20120203162525
> > > LinkDb: finished at 2012-02-03 16:26:05, elapsed: 00:00:04
> > > SolrIndexer: starting at 2012-02-03 16:26:06
> > > java.lang.RuntimeException: Invalid version (expected 2, but 1) or the data in not in 'javabin' format
> > > SolrDeleteDuplicates: starting at 2012-02-03 16:26:13
> > > SolrDeleteDuplicates: Solr url: http://127.0.0.1:8983/solr/
> > > Exception in thread "main" java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error executing query
> > >        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:200)
> > >        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> > >        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> > >        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> > >        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> > >        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
> > >        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
> > >        at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
> > >        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
> > > Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing query
> > >        at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
> > >        at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
> > >        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:198)
> > >        ... 9 more
> > > Caused by: java.lang.RuntimeException: Invalid version (expected 2, but 1) or the data in not in 'javabin' format
> > >        at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
> > >        at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:41)
> > >        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:472)
> > >        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
> > >        at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
> > >        ... 11 more
> > >
> >
>
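
For what it's worth, the quoted log shows two separate problems. The
"Invalid version (expected 2, but 1)" error means the SolrJ client
bundled with Nutch 1.4 speaks version 2 of the javabin wire format,
while Solr 1.4.1 still answers with version 1, so yes: indexing from
Nutch 1.4 needs a 3.x Solr server (or a client configured to use XML
instead of javabin). Separately, the crawl fetched nothing:
java.net.UnknownHostException: www.gioconews.it means the host never
resolved, so there was nothing to index even if the Solr call had
worked. Two quick checks (run from runtime/local; the jar name below is
what I'd expect to see, not something I've verified on your box):

    # 1) which SolrJ client does Nutch bundle?
    ls lib | grep -i solrj
    #    something like apache-solr-solrj-3.4.0.jar speaks javabin v2
    #    and cannot read a javabin v1 response from a Solr 1.4.x server

    # 2) can this machine resolve the host you are crawling?
    nslookup www.gioconews.it
    ping -c 1 www.gioconews.it    # with plain Windows ping, use -n 1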



-- 
Regards,

Matt Parker (CTR)
Senior Software Architect
Apogee Integration, LLC
5180 Parkstone Drive, Suite #160
Chantilly, Virginia 20151
703.272.4797 (site)
703.474.1918 (cell)
www.apogeeintegration.com

