Dag Wanvik <[email protected]> writes:

> On 11.06.2013 18:50, benrahman wrote:
>
>     /Master derby.log/
>     
>     ----  BEGIN REPLICATION ERROR MESSAGE (6/5/13 3:35 PM) ----
>     Exception occurred during log shipping.
>     java.net.SocketException: Connection reset by peer: socket write error
>             at java.net.SocketOutputStream.socketWrite0(Native Method)
>             at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
>             at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>
> Looks like the socket the master uses to ship log records to the slave
> stopped working; hard to say what the issue is here. Do you see anything
> in the slave's log file from around that time?
>
> Later replication error messages in the master's log file show that the 
> buffer grows full (since it can't send):
>
>> ----  BEGIN REPLICATION ERROR MESSAGE (6/6/13 5:46 PM) ----
>> Exception occurred during log shipping.
>> org.apache.derby.impl.store.replication.buffer.LogBufferFullException
>>       at org.apache.derby.impl.store.replication.buffer.ReplicationLogBuffer.switchDirtyBuffer(Unknown
>
> Not sure why the slave doesn't fail over; maybe the master process needs to 
> be stopped (or crash) before that will happen.
> It is probably right that it doesn't happen when you first see the socket 
> write error; that could be due to an intermittent network error.

That's right. The master is supposed to keep trying to reconnect until
there's no more space in the replication log buffers, according to
http://db.apache.org/derby/docs/10.10/adminguide/cadminreplicfailures.html.

> But I believe the slave and master have a keep-alive protocol to enable the 
> slave to fail over when the master is no longer seen to be
> alive.

I think the slave never fails over automatically, even if it detects
that it has lost contact with the master. It has to be told to do so.
See http://db.apache.org/derby/docs/10.10/adminguide/cadminreplicfailover.html,
which says:

  There is no automatic failover or restart of replication after one of
  the instances has failed.
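
In practice that means someone (or some monitoring script) has to request
the failover explicitly, which is done by connecting with the failover=true
connection attribute. A minimal sketch, assuming an embedded database whose
name ("myDB" here) is only a placeholder; the class and method names
(FailoverSketch, failoverUrl, requestFailover) are made up for this example,
and you may need to add user/password attributes depending on your setup:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class FailoverSketch {

    // Builds an embedded-Derby connection URL carrying the failover=true
    // attribute. The database name is supplied by the caller.
    static String failoverUrl(String dbName) {
        return "jdbc:derby:" + dbName + ";failover=true";
    }

    // Requests the failover by attempting a connection with the URL above.
    // Replication commands may report their outcome through an SQLException,
    // so the SQLState is returned for the caller to check against the
    // Derby documentation rather than being interpreted here.
    static String requestFailover(String dbName) {
        try (Connection c = DriverManager.getConnection(failoverUrl(dbName))) {
            return "connected";
        } catch (SQLException e) {
            return "SQLState " + e.getSQLState() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // "myDB" is a placeholder database name (assumption).
        String dbName = args.length > 0 ? args[0] : "myDB";
        System.out.println(failoverUrl(dbName));
        // Uncomment to actually request the failover on a real database:
        // System.out.println(requestFailover(dbName));
    }
}
```

The actual connection attempt is left commented out in main() so the
sketch can be run without touching a live replicated database.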


-- 
Knut Anders
