I did a manual commit, and we are still missing docs, so it doesn't look like the search race condition you mention.
My boss wasn't happy when i mentioned that I wanted to try out unreleased code. Ill get him won over though and return with my findings. It will probably be some time next week. Thanks for your help. -- Med venlig hilsen / Best regards *John Nielsen* Programmer *MCB A/S* Enghaven 15 DK-7500 Holstebro Kundeservice: +45 9610 2824 p...@mcb.dk www.mcb.dk On Thu, Dec 13, 2012 at 4:10 PM, Mark Miller <markrmil...@gmail.com> wrote: > Couple things to start: > > By default SolrCloud distributes updates a doc at a time. So if you have 1 > shard, whatever node you index too, it will send updates to the other. > Replication is only used for recovery, not distributing data. So for some > reason, there is an IOException when it tries to forward. > > The other issue is not something that Ive seen reported. Can/did you try > and do another hard commit to make sure you had the latest search open when > checking the # of docs on each node? There was previously a race around > commit that could cause some issues around expected visibility. > > If you are able to, you might try out a nightly build - 4.1 will be ready > very soon and has numerous bug fixes for SolrCloud. > > - Mark > > On Dec 13, 2012, at 9:53 AM, John Nielsen <j...@mcb.dk> wrote: > > > Hi all, > > > > We are seeing a strange problem on our 2-node solr4 cluster. This problem > > has resultet in data loss. > > > > We have two servers, varnish01 and varnish02. Zookeeper is running on > > varnish02, but in a separate jvm. > > > > We index directly to varnish02 and we read from varnish01. Data is thus > > replicated from varnish02 to varnish01. > > > > I found this in the varnish01 log: > > > > *INFO: [default1_Norwegian] webapp=/solr path=/update > params={distrib.from= > > > http://varnish02.lynero.net:8000/solr/default1_Norwegian/&update.distrib=TOLEADER&wt=javabin&version=2 > } > > status=0 QTime=42 > > Dec 13, 2012 12:23:36 PM org.apache.solr.core.SolrCore execute > > INFO: [default1_Norwegian] webapp=/solr path=/update > params={distrib.from= > > > http://varnish02.lynero.net:8000/solr/default1_Norwegian/&update.distrib=TOLEADER&wt=javabin&version=2 > } > > status=0 QTime=41 > > Dec 13, 2012 12:23:36 PM org.apache.solr.core.SolrCore execute > > INFO: [default1_Norwegian] webapp=/solr path=/update > params={distrib.from= > > > http://varnish02.lynero.net:8000/solr/default1_Norwegian/&update.distrib=TOLEADER&wt=javabin&version=2 > } > > status=0 QTime=33 > > Dec 13, 2012 12:23:36 PM org.apache.solr.core.SolrCore execute > > INFO: [default1_Norwegian] webapp=/solr path=/update > params={distrib.from= > > > http://varnish02.lynero.net:8000/solr/default1_Norwegian/&update.distrib=TOLEADER&wt=javabin&version=2 > } > > status=0 QTime=33 > > Dec 13, 2012 12:23:39 PM org.apache.solr.common.SolrException log > > SEVERE: shard update error StdNode: > > > http://varnish02.lynero.net:8000/solr/default1_Norwegian/:org.apache.solr.client.solrj.SolrServerException > : > > IOException occured when talking to server at: > > http://varnish02.lynero.net:8000/solr/default1_Norwegian > > at > > > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413) > > at > > > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181) > > at > > > org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:335) > > at > > > org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:309) > > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) > > at java.util.concurrent.FutureTask.run(FutureTask.java:166) > > at > > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) > > at java.util.concurrent.FutureTask.run(FutureTask.java:166) > > at > > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > > at > > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > > at java.lang.Thread.run(Thread.java:636) > > Caused by: org.apache.http.NoHttpResponseException: The target server > > failed to respond > > at > > > org.apache.http.impl.conn.DefaultResponseParser.parseHead(DefaultResponseParser.java:101) > > at > > > org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:252) > > at > > > org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:282) > > at > > > org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:247) > > at > > > org.apache.http.impl.conn.AbstractClientConnAdapter.receiveResponseHeader(AbstractClientConnAdapter.java:216) > > at > > > org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:298) > > at > > > org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) > > at > > > org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:647) > > at > > > org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:464) > > at > > > org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820) > > at > > > org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754) > > at > > > org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:732) > > at > > > org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:352) > > ... 11 more > > > > Dec 13, 2012 12:23:39 PM > > org.apache.solr.update.processor.DistributedUpdateProcessor doFinish > > INFO: try and ask http://varnish02.lynero.net:8000/solr to recover* > > > > It looks like it is sending updates from varnish01 to varnish02. I am not > > sure for what since we only index on varnish02. Updates should never be > > going from varnish01 to varnish02. > > > > Meanwhile on varnish02: > > > > *INFO: [default1_Norwegian] webapp=/solr path=/update > params={distrib.from= > > > http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2 > } > > status=0 QTime=16 > > Dec 13, 2012 12:23:36 PM org.apache.solr.core.SolrCore execute > > INFO: [default1_Norwegian] webapp=/solr path=/update > params={distrib.from= > > > http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2 > } > > status=0 QTime=15 > > Dec 13, 2012 12:23:36 PM org.apache.solr.core.SolrCore execute > > INFO: [default1_Norwegian] webapp=/solr path=/update > params={distrib.from= > > > http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2 > } > > status=0 QTime=16 > > Dec 13, 2012 12:23:42 PM org.apache.solr.handler.admin.CoreAdminHandler > > handleRequestRecoveryAction > > INFO: It has been requested that we recover* > > *Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute > > INFO: [default1_Danish] webapp=/solr path=/select > > > params={facet=false&sort=item_group_59700_name_int+asc,+variant_of_item_guid+asc&group.distributed.first=true&facet.limit=1000&q.alt=*:*&q.alt=*:*&distrib=false&facet.method=enum&version=2&df=text&fl=docid&shard.url= > > > varnish02.lynero.net:8000/solr/default1_Danish/|varnish01.lynero.net:8000/solr/default1_Danish/&NOW=1355397822111&group.field=groupby_variant_of_item_guid&fq=site_guid:(11440)&fq=item_type:(PRODUCT)&fq=language_guid:(1)&fq=item_group_59700_combination:(*)&fq=item_group_45879_combination:(*)&fq=is_searchable:(True)&querytype=Technical&mm=100%25&facet.missing=on&group.ngroups=true&facet.mincount=1&qf=%0a++++++++++text > ^0.5+name^1.2+searchable_text^0.8+typeahead_text^1.0+keywords^1.1+item_no^5.0%0a++++++++++ranking1_text^1.0+ranking2_text^2.0+ranking3_text^3.0%0a+++++++&wt=javabin&group.facet=true&defType=edismax&rows=0&facet.sort=lex&start=0&group=true&group.sort=name+asc&isShard=true} > > status=0 QTime=1 > > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute > > INFO: [default1_Danish] webapp=/solr path=/select/ > > params={fq=site_guid:(2810678)&q=win} hits=0 status=0 QTime=17 > > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute > > INFO: [default1_Danish] webapp=/solr path=/select > > > params={facet=on&sort=item_group_59700_name_int+asc,+variant_of_item_guid+asc&q.alt=*:*&q.alt=*:*&distrib=false&facet.method=enum&group.distributed.second=true&version=2&df=text&fl=docid&shard.url= > > > varnish02.lynero.net:8000/solr/default1_Danish/|varnish01.lynero.net:8000/solr/default1_Danish/&NOW=1355397822111&group.field=groupby_variant_of_item_guid&fq=site_guid:(11440)&fq=item_type:(PRODUCT)&fq=language_guid:(1)&fq=item_group_59700_combination:(*)&fq=item_group_45879_combination:(*)&fq=is_searchable:(True)&querytype=Technical&mm=100%25&facet.missing=on&group.ngroups=true&qf=%0a++++++++++text > ^0.5+name^1.2+searchable_text^0.8+typeahead_text^1.0+keywords^1.1+item_no^5.0%0a++++++++++ranking1_text^1.0+ranking2_text^2.0+ranking3_text^3.0%0a+++++++&wt=javabin&group.facet=true&defType=edismax&rows=0&facet.sort=lex&start=0&group=true&group.sort=name+asc&isShard=true} > > status=0 QTime=1 > > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute > > INFO: [default1_Danish] webapp=/solr path=/select > > > params={facet=false&sort=item_group_59700_name_int+asc,+variant_of_item_guid+asc&group.distributed.first=true&facet.limit=1000&q.alt=*:*&q.alt=*:*&distrib=false&facet.method=enum&version=2&df=text&fl=docid&shard.url= > > > varnish02.lynero.net:8000/solr/default1_Danish/|varnish01.lynero.net:8000/solr/default1_Danish/&NOW=1355397822138&group.field=groupby_variant_of_item_guid&fq=site_guid:(11440)&fq=item_type:(PRODUCT)&fq=language_guid:(1)&fq=item_group_59700_combination:(*)&fq=item_group_45879_combination:(*)&fq=is_searchable:(True)&querytype=Technical&mm=100%25&facet.missing=on&group.ngroups=true&facet.mincount=1&qf=%0a++++++++++text > ^0.5+name^1.2+searchable_text^0.8+typeahead_text^1.0+keywords^1.1+item_no^5.0%0a++++++++++ranking1_text^1.0+ranking2_text^2.0+ranking3_text^3.0%0a+++++++&wt=javabin&group.facet=true&defType=edismax&rows=40&facet.sort=lex&start=0&group=true&group.sort=name+asc&isShard=true} > > status=0 QTime=1 > > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute > > INFO: [default1_Danish] webapp=/solr path=/select > > > params={facet=on&sort=item_group_59700_name_int+asc,+variant_of_item_guid+asc&q.alt=*:*&q.alt=*:*&distrib=false&facet.method=enum&group.distributed.second=true&version=2&df=text&fl=docid&shard.url= > > > varnish02.lynero.net:8000/solr/default1_Danish/|varnish01.lynero.net:8000/solr/default1_Danish/&NOW=1355397822138&group.field=groupby_variant_of_item_guid&fq=site_guid:(11440)&fq=item_type:(PRODUCT)&fq=language_guid:(1)&fq=item_group_59700_combination:(*)&fq=item_group_45879_combination:(*)&fq=is_searchable:(True)&querytype=Technical&mm=100%25&facet.missing=on&group.ngroups=true&group.topgroups.groupby_variant_of_item_guid=2963217&group.topgroups.groupby_variant_of_item_guid=2963223&group.topgroups.groupby_variant_of_item_guid=2963219&group.topgroups.groupby_variant_of_item_guid=2963220&group.topgroups.groupby_variant_of_item_guid=2963221&group.topgroups.groupby_variant_of_item_guid=2963222&group.topgroups.groupby_variant_of_item_guid=2963224&group.topgroups.groupby_variant_of_item_guid=2963218&qf=%0a++++++++++text > ^0.5+name^1.2+searchable_text^0.8+typeahead_text^1.0+keywords^1.1+item_no^5.0%0a++++++++++ranking1_text^1.0+ranking2_text^2.0+ranking3_text^3.0%0a+++++++&wt=javabin&group.facet=true&defType=edismax&rows=40&facet.sort=lex&start=0&group=true&group.sort=name+asc&isShard=true} > > status=0 QTime=1 > > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute > > INFO: [default1_Norwegian] webapp=/solr path=/update > params={distrib.from= > > > http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2 > } > > status=0 QTime=26 > > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute > > INFO: [default1_Norwegian] webapp=/solr path=/update > params={distrib.from= > > > http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2 > } > > status=0 QTime=22 > > Dec 13, 2012 12:23:42 PM org.apache.solr.update.DefaultSolrCoreState > > doRecovery > > Dec 13, 2012 12:23:42 PM org.apache.solr.update.DefaultSolrCoreState > > doRecovery > > INFO: Running recovery - first canceling any ongoing recovery > > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute > > INFO: [default1_Norwegian] webapp=/solr path=/update > params={distrib.from= > > > http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2 > } > > status=0 QTime=25 > > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute > > INFO: [default1_Norwegian] webapp=/solr path=/update > params={distrib.from= > > > http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2 > } > > status=0 QTime=24 > > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute > > INFO: [default1_Norwegian] webapp=/solr path=/update > params={distrib.from= > > > http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2 > } > > status=0 QTime=20 > > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute > > INFO: [default1_Norwegian] webapp=/solr path=/update > params={distrib.from= > > > http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2 > } > > status=0 QTime=25 > > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute > > INFO: [default1_Norwegian] webapp=/solr path=/update > params={distrib.from= > > > http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2 > } > > status=0 QTime=23 > > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute > > INFO: [default1_Norwegian] webapp=/solr path=/update > params={distrib.from= > > > http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2 > } > > status=0 QTime=21 > > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute > > INFO: [default1_Norwegian] webapp=/solr path=/update > params={distrib.from= > > > http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2 > } > > status=0 QTime=23 > > Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute > > INFO: [default1_Norwegian] webapp=/solr path=/update > params={distrib.from= > > > http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2 > } > > status=0 QTime=16 > > Dec 13, 2012 12:23:42 PM org.apache.solr.cloud.RecoveryStrategy run > > INFO: Starting recovery process. core=default1_Norwegian > > recoveringAfterStartup=false > > Dec 13, 2012 12:23:42 PM org.apache.solr.common.cloud.ZkStateReader > > updateClusterState > > INFO: Updating cloud state from ZooKeeper... > > Dec 13, 2012 12:23:42 PM > > org.apache.solr.update.processor.LogUpdateProcessor finish* > > > > And less than a second later: > > > > *Dec 13, 2012 12:23:42 PM org.apache.solr.cloud.RecoveryStrategy > doRecovery > > INFO: Attempting to PeerSync from > > > http://varnish01.lynero.net:8000/solr/default1_Norwegian/core=default1_Norwegian > > - recoveringAfterStartup=false > > Dec 13, 2012 12:23:42 PM org.apache.solr.update.PeerSync sync > > INFO: PeerSync: core=default1_Norwegian url= > > http://varnish02.lynero.net:8000/solr START replicas=[ > > http://varnish01.lynero.net:8000/solr/default1_Norwegian/] nUpdates=100 > > Dec 13, 2012 12:23:42 PM org.apache.solr.update.PeerSync sync > > WARNING: PeerSync: core=default1_Norwegian url= > > http://varnish02.lynero.net:8000/solr too many updates received since > start > > - startingUpdates no longer overlaps with our currentUpdates > > Dec 13, 2012 12:23:42 PM org.apache.solr.cloud.RecoveryStrategy > doRecovery > > INFO: PeerSync Recovery was not successful - trying replication. > > core=default1_Norwegian > > Dec 13, 2012 12:23:42 PM org.apache.solr.cloud.RecoveryStrategy > doRecovery > > INFO: Starting Replication Recovery. core=default1_Norwegian > > Dec 13, 2012 12:23:42 PM org.apache.solr.client.solrj.impl.HttpClientUtil > > createClient > > INFO: Creating new http client, > > config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false > > Dec 13, 2012 12:23:42 PM org.apache.solr.common.cloud.ZkStateReader$2 > > process > > INFO: A cluster state change has occurred - updating...* > > > > State change on varnish01 at the same time: > > > > *Dec 13, 2012 12:23:42 PM org.apache.solr.common.cloud.ZkStateReader$2 > > process > > INFO: A cluster state change has occurred - updating...* > > * > > *And a few seconds later on varnish02, the recovery finishes: > > * > > Dec 13, 2012 12:23:48 PM org.apache.solr.cloud.RecoveryStrategy > doRecovery > > INFO: Replication Recovery was successful - registering as Active. > > core=default1_Norwegian > > Dec 13, 2012 12:23:48 PM org.apache.solr.cloud.RecoveryStrategy > doRecovery > > INFO: Finished recovery process. core=default1_Norwegian > > Dec 13, 2012 12:23:48 PM org.apache.solr.core.SolrCore execute > > INFO: [default1_Danish] webapp=/solr path=/select > > > params={facet=false&sort=item_group_56823_name_int+asc,+variant_of_item_guid+asc&group.distributed.first=true&facet.limit=1000&q.alt=*:*&q.alt=*:*&distrib=false&facet.method=enum&version=2&df=text&fl=docid&shard.url= > > > varnish02.lynero.net:8000/solr/default1_Danish/|varnish01.lynero.net:8000/solr/default1_Danish/&NOW=1355397828395&group.field=groupby_variant_of_item_guid&facet.field=itemgroups_int_mv&fq=site_guid:(11440)&fq=item_type:(PRODUCT)&fq=language_guid:(1)&fq=item_group_56823_combination:(*)&fq=item_group_45879_combination:(*)&fq=is_searchable:(True)&querytype=Technical&mm=100%25&facet.missing=on&group.ngroups=true&facet.mincount=1&qf=%0a++++++++++text > ^0.5+name^1.2+searchable_text^0.8+typeahead_text^1.0+keywords^1.1+item_no^5.0%0a++++++++++ranking1_text^1.0+ranking2_text^2.0+ranking3_text^3.0%0a+++++++&wt=javabin&group.facet=true&defType=edismax&rows=0&facet.sort=lex&start=0&group=true&group.sort=name+asc&isShard=true} > > status=0 QTime=8 > > Dec 13, 2012 12:23:48 PM org.apache.solr.common.cloud.ZkStateReader > > updateClusterState > > INFO: Updating cloud state from ZooKeeper... * > > > > Which is picked up on varnish01: > > > > *Dec 13, 2012 12:23:48 PM org.apache.solr.common.cloud.ZkStateReader$2 > > process > > INFO: A cluster state change has occurred - updating...* > > > > It looks like it replicated successfully, only it didnt. The > > default1_Norwegian core on varnish01 now has 55.071 docs and the same > core > > on varnish02 has 35.088 docs. > > > > I checked the log files for both JVM's and no stop-the-world GC were > taking > > place. > > > > There is also nothing in the zookeeper log of interest that I can see. > > > > > > -- > > Med venlig hilsen / Best regards > > > > *John Nielsen* > > Programmer > > > > > > > > *MCB A/S* > > Enghaven 15 > > DK-7500 Holstebro > > > > Kundeservice: +45 9610 2824 > > p...@mcb.dk > > www.mcb.dk > >