I did a manual commit, and we are still missing docs, so it doesn't look
like the search race condition you mention.

My boss wasn't happy when i mentioned that I wanted to try out unreleased
code. Ill get him won over though and return with my findings. It will
probably be some time next week.

Thanks for your help.


-- 
Med venlig hilsen / Best regards

*John Nielsen*
Programmer



*MCB A/S*
Enghaven 15
DK-7500 Holstebro

Kundeservice: +45 9610 2824
p...@mcb.dk
www.mcb.dk



On Thu, Dec 13, 2012 at 4:10 PM, Mark Miller <markrmil...@gmail.com> wrote:

> Couple things to start:
>
> By default SolrCloud distributes updates a doc at a time. So if you have 1
> shard, whatever node you index too, it will send updates to the other.
> Replication is only used for recovery, not distributing data. So for some
> reason, there is an IOException when it tries to forward.
>
> The other issue is not something that Ive seen reported. Can/did you try
> and do another hard commit to make sure you had the latest search open when
> checking the # of docs on each node? There was previously a race around
> commit that could cause some issues around expected visibility.
>
> If you are able to, you might try out a nightly build - 4.1 will be ready
> very soon and has numerous bug fixes for SolrCloud.
>
> - Mark
>
> On Dec 13, 2012, at 9:53 AM, John Nielsen <j...@mcb.dk> wrote:
>
> > Hi all,
> >
> > We are seeing a strange problem on our 2-node solr4 cluster. This problem
> > has resultet in data loss.
> >
> > We have two servers, varnish01 and varnish02. Zookeeper is running on
> > varnish02, but in a separate jvm.
> >
> > We index directly to varnish02 and we read from varnish01. Data is thus
> > replicated from varnish02 to varnish01.
> >
> > I found this in the varnish01 log:
> >
> > *INFO: [default1_Norwegian] webapp=/solr path=/update
> params={distrib.from=
> >
> http://varnish02.lynero.net:8000/solr/default1_Norwegian/&update.distrib=TOLEADER&wt=javabin&version=2
> }
> > status=0 QTime=42
> > Dec 13, 2012 12:23:36 PM org.apache.solr.core.SolrCore execute
> > INFO: [default1_Norwegian] webapp=/solr path=/update
> params={distrib.from=
> >
> http://varnish02.lynero.net:8000/solr/default1_Norwegian/&update.distrib=TOLEADER&wt=javabin&version=2
> }
> > status=0 QTime=41
> > Dec 13, 2012 12:23:36 PM org.apache.solr.core.SolrCore execute
> > INFO: [default1_Norwegian] webapp=/solr path=/update
> params={distrib.from=
> >
> http://varnish02.lynero.net:8000/solr/default1_Norwegian/&update.distrib=TOLEADER&wt=javabin&version=2
> }
> > status=0 QTime=33
> > Dec 13, 2012 12:23:36 PM org.apache.solr.core.SolrCore execute
> > INFO: [default1_Norwegian] webapp=/solr path=/update
> params={distrib.from=
> >
> http://varnish02.lynero.net:8000/solr/default1_Norwegian/&update.distrib=TOLEADER&wt=javabin&version=2
> }
> > status=0 QTime=33
> > Dec 13, 2012 12:23:39 PM org.apache.solr.common.SolrException log
> > SEVERE: shard update error StdNode:
> >
> http://varnish02.lynero.net:8000/solr/default1_Norwegian/:org.apache.solr.client.solrj.SolrServerException
> :
> > IOException occured when talking to server at:
> > http://varnish02.lynero.net:8000/solr/default1_Norwegian
> >    at
> >
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
> >    at
> >
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >    at
> >
> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:335)
> >    at
> >
> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:309)
> >    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> >    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> >    at
> > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> >    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> >    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> >    at
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> >    at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> >    at java.lang.Thread.run(Thread.java:636)
> > Caused by: org.apache.http.NoHttpResponseException: The target server
> > failed to respond
> >    at
> >
> org.apache.http.impl.conn.DefaultResponseParser.parseHead(DefaultResponseParser.java:101)
> >    at
> >
> org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:252)
> >    at
> >
> org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:282)
> >    at
> >
> org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:247)
> >    at
> >
> org.apache.http.impl.conn.AbstractClientConnAdapter.receiveResponseHeader(AbstractClientConnAdapter.java:216)
> >    at
> >
> org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:298)
> >    at
> >
> org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
> >    at
> >
> org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:647)
> >    at
> >
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:464)
> >    at
> >
> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
> >    at
> >
> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
> >    at
> >
> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:732)
> >    at
> >
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:352)
> >    ... 11 more
> >
> > Dec 13, 2012 12:23:39 PM
> > org.apache.solr.update.processor.DistributedUpdateProcessor doFinish
> > INFO: try and ask http://varnish02.lynero.net:8000/solr to recover*
> >
> > It looks like it is sending updates from varnish01 to varnish02. I am not
> > sure for what since we only index on varnish02. Updates should never be
> > going from varnish01 to varnish02.
> >
> > Meanwhile on varnish02:
> >
> > *INFO: [default1_Norwegian] webapp=/solr path=/update
> params={distrib.from=
> >
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2
> }
> > status=0 QTime=16
> > Dec 13, 2012 12:23:36 PM org.apache.solr.core.SolrCore execute
> > INFO: [default1_Norwegian] webapp=/solr path=/update
> params={distrib.from=
> >
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2
> }
> > status=0 QTime=15
> > Dec 13, 2012 12:23:36 PM org.apache.solr.core.SolrCore execute
> > INFO: [default1_Norwegian] webapp=/solr path=/update
> params={distrib.from=
> >
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2
> }
> > status=0 QTime=16
> > Dec 13, 2012 12:23:42 PM org.apache.solr.handler.admin.CoreAdminHandler
> > handleRequestRecoveryAction
> > INFO: It has been requested that we recover*
> > *Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> > INFO: [default1_Danish] webapp=/solr path=/select
> >
> params={facet=false&sort=item_group_59700_name_int+asc,+variant_of_item_guid+asc&group.distributed.first=true&facet.limit=1000&q.alt=*:*&q.alt=*:*&distrib=false&facet.method=enum&version=2&df=text&fl=docid&shard.url=
> >
> varnish02.lynero.net:8000/solr/default1_Danish/|varnish01.lynero.net:8000/solr/default1_Danish/&NOW=1355397822111&group.field=groupby_variant_of_item_guid&fq=site_guid:(11440)&fq=item_type:(PRODUCT)&fq=language_guid:(1)&fq=item_group_59700_combination:(*)&fq=item_group_45879_combination:(*)&fq=is_searchable:(True)&querytype=Technical&mm=100%25&facet.missing=on&group.ngroups=true&facet.mincount=1&qf=%0a++++++++++text
> ^0.5+name^1.2+searchable_text^0.8+typeahead_text^1.0+keywords^1.1+item_no^5.0%0a++++++++++ranking1_text^1.0+ranking2_text^2.0+ranking3_text^3.0%0a+++++++&wt=javabin&group.facet=true&defType=edismax&rows=0&facet.sort=lex&start=0&group=true&group.sort=name+asc&isShard=true}
> > status=0 QTime=1
> > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> > INFO: [default1_Danish] webapp=/solr path=/select/
> > params={fq=site_guid:(2810678)&q=win} hits=0 status=0 QTime=17
> > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> > INFO: [default1_Danish] webapp=/solr path=/select
> >
> params={facet=on&sort=item_group_59700_name_int+asc,+variant_of_item_guid+asc&q.alt=*:*&q.alt=*:*&distrib=false&facet.method=enum&group.distributed.second=true&version=2&df=text&fl=docid&shard.url=
> >
> varnish02.lynero.net:8000/solr/default1_Danish/|varnish01.lynero.net:8000/solr/default1_Danish/&NOW=1355397822111&group.field=groupby_variant_of_item_guid&fq=site_guid:(11440)&fq=item_type:(PRODUCT)&fq=language_guid:(1)&fq=item_group_59700_combination:(*)&fq=item_group_45879_combination:(*)&fq=is_searchable:(True)&querytype=Technical&mm=100%25&facet.missing=on&group.ngroups=true&qf=%0a++++++++++text
> ^0.5+name^1.2+searchable_text^0.8+typeahead_text^1.0+keywords^1.1+item_no^5.0%0a++++++++++ranking1_text^1.0+ranking2_text^2.0+ranking3_text^3.0%0a+++++++&wt=javabin&group.facet=true&defType=edismax&rows=0&facet.sort=lex&start=0&group=true&group.sort=name+asc&isShard=true}
> > status=0 QTime=1
> > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> > INFO: [default1_Danish] webapp=/solr path=/select
> >
> params={facet=false&sort=item_group_59700_name_int+asc,+variant_of_item_guid+asc&group.distributed.first=true&facet.limit=1000&q.alt=*:*&q.alt=*:*&distrib=false&facet.method=enum&version=2&df=text&fl=docid&shard.url=
> >
> varnish02.lynero.net:8000/solr/default1_Danish/|varnish01.lynero.net:8000/solr/default1_Danish/&NOW=1355397822138&group.field=groupby_variant_of_item_guid&fq=site_guid:(11440)&fq=item_type:(PRODUCT)&fq=language_guid:(1)&fq=item_group_59700_combination:(*)&fq=item_group_45879_combination:(*)&fq=is_searchable:(True)&querytype=Technical&mm=100%25&facet.missing=on&group.ngroups=true&facet.mincount=1&qf=%0a++++++++++text
> ^0.5+name^1.2+searchable_text^0.8+typeahead_text^1.0+keywords^1.1+item_no^5.0%0a++++++++++ranking1_text^1.0+ranking2_text^2.0+ranking3_text^3.0%0a+++++++&wt=javabin&group.facet=true&defType=edismax&rows=40&facet.sort=lex&start=0&group=true&group.sort=name+asc&isShard=true}
> > status=0 QTime=1
> > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> > INFO: [default1_Danish] webapp=/solr path=/select
> >
> params={facet=on&sort=item_group_59700_name_int+asc,+variant_of_item_guid+asc&q.alt=*:*&q.alt=*:*&distrib=false&facet.method=enum&group.distributed.second=true&version=2&df=text&fl=docid&shard.url=
> >
> varnish02.lynero.net:8000/solr/default1_Danish/|varnish01.lynero.net:8000/solr/default1_Danish/&NOW=1355397822138&group.field=groupby_variant_of_item_guid&fq=site_guid:(11440)&fq=item_type:(PRODUCT)&fq=language_guid:(1)&fq=item_group_59700_combination:(*)&fq=item_group_45879_combination:(*)&fq=is_searchable:(True)&querytype=Technical&mm=100%25&facet.missing=on&group.ngroups=true&group.topgroups.groupby_variant_of_item_guid=2963217&group.topgroups.groupby_variant_of_item_guid=2963223&group.topgroups.groupby_variant_of_item_guid=2963219&group.topgroups.groupby_variant_of_item_guid=2963220&group.topgroups.groupby_variant_of_item_guid=2963221&group.topgroups.groupby_variant_of_item_guid=2963222&group.topgroups.groupby_variant_of_item_guid=2963224&group.topgroups.groupby_variant_of_item_guid=2963218&qf=%0a++++++++++text
> ^0.5+name^1.2+searchable_text^0.8+typeahead_text^1.0+keywords^1.1+item_no^5.0%0a++++++++++ranking1_text^1.0+ranking2_text^2.0+ranking3_text^3.0%0a+++++++&wt=javabin&group.facet=true&defType=edismax&rows=40&facet.sort=lex&start=0&group=true&group.sort=name+asc&isShard=true}
> > status=0 QTime=1
> > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> > INFO: [default1_Norwegian] webapp=/solr path=/update
> params={distrib.from=
> >
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2
> }
> > status=0 QTime=26
> > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> > INFO: [default1_Norwegian] webapp=/solr path=/update
> params={distrib.from=
> >
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2
> }
> > status=0 QTime=22
> > Dec 13, 2012 12:23:42 PM org.apache.solr.update.DefaultSolrCoreState
> > doRecovery
> > Dec 13, 2012 12:23:42 PM org.apache.solr.update.DefaultSolrCoreState
> > doRecovery
> > INFO: Running recovery - first canceling any ongoing recovery
> > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> > INFO: [default1_Norwegian] webapp=/solr path=/update
> params={distrib.from=
> >
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2
> }
> > status=0 QTime=25
> > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> > INFO: [default1_Norwegian] webapp=/solr path=/update
> params={distrib.from=
> >
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2
> }
> > status=0 QTime=24
> > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> > INFO: [default1_Norwegian] webapp=/solr path=/update
> params={distrib.from=
> >
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2
> }
> > status=0 QTime=20
> > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> > INFO: [default1_Norwegian] webapp=/solr path=/update
> params={distrib.from=
> >
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2
> }
> > status=0 QTime=25
> > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> > INFO: [default1_Norwegian] webapp=/solr path=/update
> params={distrib.from=
> >
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2
> }
> > status=0 QTime=23
> > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> > INFO: [default1_Norwegian] webapp=/solr path=/update
> params={distrib.from=
> >
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2
> }
> > status=0 QTime=21
> > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> > INFO: [default1_Norwegian] webapp=/solr path=/update
> params={distrib.from=
> >
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2
> }
> > status=0 QTime=23
> > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> > INFO: [default1_Norwegian] webapp=/solr path=/update
> params={distrib.from=
> >
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2
> }
> > status=0 QTime=16
> > Dec 13, 2012 12:23:42 PM org.apache.solr.cloud.RecoveryStrategy run
> > INFO: Starting recovery process.  core=default1_Norwegian
> > recoveringAfterStartup=false
> > Dec 13, 2012 12:23:42 PM org.apache.solr.common.cloud.ZkStateReader
> > updateClusterState
> > INFO: Updating cloud state from ZooKeeper...
> > Dec 13, 2012 12:23:42 PM
> > org.apache.solr.update.processor.LogUpdateProcessor finish*
> >
> > And less than a second later:
> >
> > *Dec 13, 2012 12:23:42 PM org.apache.solr.cloud.RecoveryStrategy
> doRecovery
> > INFO: Attempting to PeerSync from
> >
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/core=default1_Norwegian
> > - recoveringAfterStartup=false
> > Dec 13, 2012 12:23:42 PM org.apache.solr.update.PeerSync sync
> > INFO: PeerSync: core=default1_Norwegian url=
> > http://varnish02.lynero.net:8000/solr START replicas=[
> > http://varnish01.lynero.net:8000/solr/default1_Norwegian/] nUpdates=100
> > Dec 13, 2012 12:23:42 PM org.apache.solr.update.PeerSync sync
> > WARNING: PeerSync: core=default1_Norwegian url=
> > http://varnish02.lynero.net:8000/solr too many updates received since
> start
> > - startingUpdates no longer overlaps with our currentUpdates
> > Dec 13, 2012 12:23:42 PM org.apache.solr.cloud.RecoveryStrategy
> doRecovery
> > INFO: PeerSync Recovery was not successful - trying replication.
> > core=default1_Norwegian
> > Dec 13, 2012 12:23:42 PM org.apache.solr.cloud.RecoveryStrategy
> doRecovery
> > INFO: Starting Replication Recovery. core=default1_Norwegian
> > Dec 13, 2012 12:23:42 PM org.apache.solr.client.solrj.impl.HttpClientUtil
> > createClient
> > INFO: Creating new http client,
> > config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > Dec 13, 2012 12:23:42 PM org.apache.solr.common.cloud.ZkStateReader$2
> > process
> > INFO: A cluster state change has occurred - updating...*
> >
> > State change on varnish01 at the same time:
> >
> > *Dec 13, 2012 12:23:42 PM org.apache.solr.common.cloud.ZkStateReader$2
> > process
> > INFO: A cluster state change has occurred - updating...*
> > *
> > *And a few seconds later on varnish02, the recovery finishes:
> > *
> > Dec 13, 2012 12:23:48 PM org.apache.solr.cloud.RecoveryStrategy
> doRecovery
> > INFO: Replication Recovery was successful - registering as Active.
> > core=default1_Norwegian
> > Dec 13, 2012 12:23:48 PM org.apache.solr.cloud.RecoveryStrategy
> doRecovery
> > INFO: Finished recovery process. core=default1_Norwegian
> > Dec 13, 2012 12:23:48 PM org.apache.solr.core.SolrCore execute
> > INFO: [default1_Danish] webapp=/solr path=/select
> >
> params={facet=false&sort=item_group_56823_name_int+asc,+variant_of_item_guid+asc&group.distributed.first=true&facet.limit=1000&q.alt=*:*&q.alt=*:*&distrib=false&facet.method=enum&version=2&df=text&fl=docid&shard.url=
> >
> varnish02.lynero.net:8000/solr/default1_Danish/|varnish01.lynero.net:8000/solr/default1_Danish/&NOW=1355397828395&group.field=groupby_variant_of_item_guid&facet.field=itemgroups_int_mv&fq=site_guid:(11440)&fq=item_type:(PRODUCT)&fq=language_guid:(1)&fq=item_group_56823_combination:(*)&fq=item_group_45879_combination:(*)&fq=is_searchable:(True)&querytype=Technical&mm=100%25&facet.missing=on&group.ngroups=true&facet.mincount=1&qf=%0a++++++++++text
> ^0.5+name^1.2+searchable_text^0.8+typeahead_text^1.0+keywords^1.1+item_no^5.0%0a++++++++++ranking1_text^1.0+ranking2_text^2.0+ranking3_text^3.0%0a+++++++&wt=javabin&group.facet=true&defType=edismax&rows=0&facet.sort=lex&start=0&group=true&group.sort=name+asc&isShard=true}
> > status=0 QTime=8
> > Dec 13, 2012 12:23:48 PM org.apache.solr.common.cloud.ZkStateReader
> > updateClusterState
> > INFO: Updating cloud state from ZooKeeper... *
> >
> > Which is picked up on varnish01:
> >
> > *Dec 13, 2012 12:23:48 PM org.apache.solr.common.cloud.ZkStateReader$2
> > process
> > INFO: A cluster state change has occurred - updating...*
> >
> > It looks like it replicated successfully, only it didnt. The
> > default1_Norwegian core on varnish01 now has 55.071 docs and the same
> core
> > on varnish02 has 35.088 docs.
> >
> > I checked the log files for both JVM's and no stop-the-world GC were
> taking
> > place.
> >
> > There is also nothing in the zookeeper log of interest that I can see.
> >
> >
> > --
> > Med venlig hilsen / Best regards
> >
> > *John Nielsen*
> > Programmer
> >
> >
> >
> > *MCB A/S*
> > Enghaven 15
> > DK-7500 Holstebro
> >
> > Kundeservice: +45 9610 2824
> > p...@mcb.dk
> > www.mcb.dk
>
>

Reply via email to