RE: WAN replication issue in cloud native environments

2020-01-27 Thread Alberto Bustamante Reyes
Hi again,

Status update: the simplification of the maps suggested by Jacob made the new 
proposed class containing the ServerLocation and the member id unnecessary. With 
this refactoring, replication is working in the scenario we have been 
discussing in this conversation. That's great, and I think the code can be 
merged into develop if there are no further comments on the PR.

But this does not mean we can say that Geode is able to work properly when 
using gw receivers with the same ip + port. We have seen that, with this 
configuration, there is a problem with the pings sent from gw senders 
(which act as clients) to the gw receivers (servers). The pings reach 
just one of the receivers, so the sender-receiver connection is eventually 
closed by the ClientHealthMonitor.

Do you have any suggestion about how to handle this issue? My first idea was to 
identify where the connection is created, to check whether the sender could 
somehow be made aware that there is more than one server to which the ping 
should be sent, but I'm not sure whether that is possible. The alternative 
could be to change the ClientHealthMonitor to be "clever" enough not to close 
connections in this case. Any comment is welcome 🙂
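For anyone not familiar with the failure mode, a minimal, self-contained sketch 
(hypothetical names, not the actual ClientHealthMonitor code) shows why routing 
every ping to a single backend gets the other receiver's connection closed:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a health monitor closes any client connection whose
// last ping is older than the timeout. If the shared host:port routes every
// ping to receiver A, receiver B never sees one and times the sender out.
public class PingTimeoutSketch {
    static final long TIMEOUT_MS = 60_000;

    // memberId -> timestamp of the last ping seen by that receiver
    final Map<String, Long> lastPing = new HashMap<>();

    void pingReceived(String memberId, long now) {
        lastPing.put(memberId, now);
    }

    boolean wouldClose(String memberId, long now) {
        Long last = lastPing.get(memberId);
        return last == null || now - last > TIMEOUT_MS;
    }

    public static void main(String[] args) {
        PingTimeoutSketch monitor = new PingTimeoutSketch();
        // The sender has connections to both receivers, but the shared
        // service address delivers every ping to receiver A only.
        monitor.pingReceived("receiver-A", 0);
        monitor.pingReceived("receiver-A", 30_000);
        System.out.println(monitor.wouldClose("receiver-A", 61_000)); // false
        System.out.println(monitor.wouldClose("receiver-B", 61_000)); // true
    }
}
```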

Thanks,

Alberto B.


From: Jacob Barrett 
Sent: Wednesday, January 22, 2020 19:01
To: Alberto Bustamante Reyes 
Cc: dev@geode.apache.org ; Anilkumar Gingade 
; Charlie Black 
Subject: Re: WAN replication issue in cloud native environments



On Jan 22, 2020, at 9:51 AM, Alberto Bustamante Reyes 
<alberto.bustamante.re...@est.tech> wrote:

Thanks Naba & Jacob for your comments!



@Naba: I have been implementing a solution as you suggested, and I think it 
would be convenient if the client knows the memberId of the server it is 
connected to.

(current code is here: https://github.com/apache/geode/pull/4616 )

For example, in:

LocatorLoadSnapshot::getReplacementServerForConnection(ServerLocation 
currentServer, String group, Set excludedServers)

In this method, the client has sent the ServerLocation, but if that object does 
not contain the memberId, I don't see how to guarantee that the replacement that 
will be returned is not the same server the client is currently connected to.
Inside that method, this other method is called:


Given that your setup is masquerading multiple members behind the same host and 
port (ServerLocation) it doesn’t matter. When the pool opens a new socket to 
the replacement server it will be to the shared hostname and port, and the 
Kubernetes service at that host and port will just pick a backend host. In the 
solution we suggested we preserved that behavior, since the k8s service can’t 
determine which backend member to route the connection to based on the member 
id.


LocatorLoadSnapshot::isCurrentServerMostLoaded(currentServer, groupServers)

where groupServers is a "Map" object. If 
the keys of that map have the same host and port, they differ only in 
the memberId. But as you don't know it (you just have currentServer, which 
contains host and port), you cannot get the correct LoadHolder value, so you 
cannot know if your server is the most loaded.
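Alberto's point can be illustrated with a self-contained sketch (a minimal 
stand-in for Geode's ServerLocation, whose equality covers only host and port; 
the hostname and load values are made up): two distinct members exposed behind 
the same host:port collapse to a single map key, so per-member load is lost.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Minimal stand-in for a location type whose equality is host + port only.
public class LoadKeyCollision {
    static final class ServerLocation {
        final String host;
        final int port;
        ServerLocation(String host, int port) { this.host = host; this.port = port; }
        @Override public boolean equals(Object o) {
            return o instanceof ServerLocation
                && ((ServerLocation) o).host.equals(host)
                && ((ServerLocation) o).port == port;
        }
        @Override public int hashCode() { return Objects.hash(host, port); }
    }

    public static void main(String[] args) {
        Map<ServerLocation, Float> load = new HashMap<>();
        // Two distinct members, both exposed as gw-receiver.example:2324.
        load.put(new ServerLocation("gw-receiver.example", 2324), 0.25f);
        load.put(new ServerLocation("gw-receiver.example", 2324), 0.75f);
        // The second put overwrote the first: one key, one load value left.
        System.out.println(load.size()); // prints 1
    }
}
```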

Again, given your use case, the behavior of this method is lost when a new 
connection is established by the pool through the shared hostname anyway.

@Jacob: I think the solution finally implies that the client has to know the 
memberId; I think we could simplify the maps.

The client isn’t keeping these load maps; the locator is, and the locator knows 
all the member ids. The client end only needs to know the host/port 
combination. In your example, the wan replication (a client to the remote 
cluster) connects to the shared host/port service and gets randomly routed to 
one of the backend servers in that service.

All of this locator balancing code is unnecessary in this model, where 
something else is choosing the final destination. The goal of our proposed 
changes was to recognize that all we need is to make sure the locator keeps the 
shared ServerLocation alive in its responses to clients, by tracking the 
associated members and reducing that set to the set of unique ServerLocations. 
In your case that will always reduce to 1 ServerLocation for N members, as 
long as 1 member is still up.
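The reduction Jacob describes can be sketched as follows (names are 
illustrative, not the actual LocatorLoadSnapshot code): the locator tracks 
members individually, but client responses carry only the distinct host:port 
locations, so the shared location survives as long as any member backs it.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the locator-side reduction: N members masquerading behind one
// service address reduce to 1 ServerLocation in responses to clients.
public class UniqueLocationReduction {
    static Set<String> uniqueLocations(Map<String, String> memberToLocation) {
        // Deduplicate the tracked members' locations by host:port.
        return new LinkedHashSet<>(memberToLocation.values());
    }

    public static void main(String[] args) {
        Map<String, String> members = new LinkedHashMap<>();
        // Three members behind the same (hypothetical) k8s service address.
        members.put("member-1", "gw-receiver.example:2324");
        members.put("member-2", "gw-receiver.example:2324");
        members.put("member-3", "gw-receiver.example:2324");
        System.out.println(uniqueLocations(members).size()); // prints 1

        members.remove("member-1");
        members.remove("member-2");
        // Still reduces to the same single location while one member is up.
        System.out.println(uniqueLocations(members).size()); // prints 1
    }
}
```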

-Jake




Re: WAN replication issue in cloud native environments

2020-01-27 Thread Jacob Barrett
My initial guess without looking is that the client pool is sending a ping to 
each ServerLocation on only one of the available Connections. This logic should 
be changed to send to each unique member, since ServerLocation is not unique 
anymore.
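A rough sketch of that change (illustrative names only, not the actual pool 
ping code): derive the ping targets from the set of unique member ids behind 
the pool's connections, rather than from the set of unique ServerLocations.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Sketch: compute ping targets per unique member id, so every receiver
// behind a shared host:port still gets its own ping.
public class PingTargets {
    // connection name -> member id of the server it actually landed on
    static Set<String> pingTargetsByMember(Map<String, String> connectionToMember) {
        return new LinkedHashSet<>(connectionToMember.values());
    }

    public static void main(String[] args) {
        Map<String, String> connections = new LinkedHashMap<>();
        // Two pool connections to the same ServerLocation, routed by the
        // service to two different backend members.
        connections.put("conn-1", "receiver-A");
        connections.put("conn-2", "receiver-B");
        // Keyed by ServerLocation these would collapse to one ping target;
        // keyed by member id both receivers are pinged.
        System.out.println(pingTargetsByMember(connections).size()); // prints 2
    }
}
```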

-Jake


> On Jan 27, 2020, at 8:55 AM, Alberto Bustamante Reyes 
>  wrote:


Re: Odg: ParallelGatewaySenderQueue implementation

2020-01-27 Thread Dan Smith
Hi Mario,

That bug number is from an old version of GemFire, before it was open
sourced as Geode.

Looking at some of the old bug info, it looks like the bug had to do with
the fact that calling stop on the region was causing unexpected
RegionDestroyedExceptions to be thrown when the queue was stopped *on one
member*. Now that we have "gfsh stop" to stop the queue everywhere, it's
not clear to me that closing the region would be a problem - it seems like
the right thing to do if that will make the behavior more consistent with
serial senders.

-Dan

On Fri, Jan 24, 2020 at 2:39 AM Mario Ivanac  wrote:

> Hi geode dev,
>
> Do you have any more info regarding bug 49060? I think it is the cause of
> https://issues.apache.org/jira/browse/GEODE-7441.
>
> When closing of the region is brought back (at stopping of the parallel GW
> sender), the persistent parallel GW sender queue is restored after restart.
>
> BR,
> Mario
> 
> From: Mario Ivanac
> Sent: 11 November 2019, 13:29
> To: dev@geode.apache.org 
> Subject: ParallelGatewaySenderQueue implementation
>
> Hi geode dev,
>
> I am investigating the SerialGatewaySenderQueue and ParallelGatewaySenderQueue
> implementations,
>
> and I found that in the ParallelGatewaySenderQueue.close() function,
> the code has been deleted and this comment left:
>
> // Because of bug 49060 do not close the regions of a parallel queue
>
> My question is: where can I find more info regarding this bug?
>
> BR,
> Mario
>


Re: Odg: ParallelGatewaySenderQueue implementation

2020-01-27 Thread Jason Huynh
Some additional info/context from the PR that is blocked by this issue:

Although we have GFSH stop, it can still be used on an individual node.  We
just publish a caution, but it looks like we still allow it because some
users are using it:

CAUTION: Use caution with the stop gateway-sender command (or the equivalent
GatewaySender.stop() API) on parallel gateway senders. Instead of stopping
an individual parallel gateway sender on a member, we recommend shutting
down the entire member, to ensure proper failover of partition region
events to other gateway sender members. Using this command on an individual
parallel gateway sender can result in event loss. See Stopping Gateway
Senders for more details.

There were some issues with the PR (https://github.com/apache/geode/pull/4387)
when close was implemented.  It doesn't allow a single sender to be shut
down on a node.  I do know of some users that rely on this behavior;
whether they should be able to or not, they have used it in the past
(which is why we added the test
shuttingOneSenderInAVMShouldNotAffectOthersBatchRemovalThread).

The close, in combination with stopping gateway senders, can cause odd
issues, like PartitionOfflineExceptions, RegionDestroyedExceptions, or
behavior like this test is exhibiting. We have some internal applications
that are running into these types of issues with this diff as well.



On Mon, Jan 27, 2020 at 10:09 AM Dan Smith  wrote:
