[jira] [Updated] (GEODE-8687) Durable client is continuously re-registering CQs on all servers when event de-serialization fails causing resource exhaustion on servers

Jakov Varenina (Jira) Wed, 04 Nov 2020 23:33:25 -0800


     [ 
https://issues.apache.org/jira/browse/GEODE-8687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jakov Varenina updated GEODE-8687:
----------------------------------
    Description: 
When ReflectionBasedAutoSerializer is missing or misconfigured, the client 
throws a serialization exception when it receives CQ events. The exception is 
not logged, which is misleading and makes it hard to discover that 
ReflectionBasedAutoSerializer is the actual cause. The only visible log message 
says that the client/server subscription connections were closed due to EOF. 
This happens because the client destroys the subscription connections 
intentionally but does not log the reason (PdxSerializationException) that led 
to it. Serialization exceptions should be logged at error or warn level.

The client destroys the subscription connection and fails over to another 
server whenever a serialization issue occurs. Additionally, when the 
subscription connection to a particular server fails multiple times, that 
server is put on the deny list for 10 seconds (this is configurable with 
{{ping-interval}}). After the 10s expire, the server is removed from the list 
and is again available for a subscription connection, which will be destroyed 
again due to the serialization issue. This repeats indefinitely, so 
approximately every 10s in this case the client subscribes to each server at 
least once. Because of the serialization issue, events are never delivered to 
the client and remain in the subscription queues.
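
The retry cycle described above can be sketched with a toy model. This is not 
Geode code: the class DenyListModel and its methods are hypothetical, and the 
model simplifies the behaviour by deny-listing a server after a single failed 
attempt instead of after multiple failures. It only shows that a time-based 
deny list turns a permanent failure into a periodic reconnect to every server.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a time-based server deny list; hypothetical, not Geode's implementation. */
final class DenyListModel {
    private final long denyMillis;
    private final Map<String, Long> deniedUntil = new HashMap<>();

    DenyListModel(long denyMillis) {
        this.denyMillis = denyMillis;
    }

    /** Puts the server on the deny list until nowMillis + denyMillis. */
    void deny(String server, long nowMillis) {
        deniedUntil.put(server, nowMillis + denyMillis);
    }

    /** A server is usable again as soon as its deny period has elapsed. */
    boolean isDenied(String server, long nowMillis) {
        Long until = deniedUntil.get(server);
        return until != null && nowMillis < until;
    }

    /**
     * Counts subscription attempts per server over the given duration, assuming
     * every attempt ends in a deserialization failure (simplification: one
     * failure is enough to deny-list the server).
     */
    static Map<String, Integer> countAttempts(
            String[] servers, long denyMillis, long durationMillis, long stepMillis) {
        DenyListModel denyList = new DenyListModel(denyMillis);
        Map<String, Integer> attempts = new HashMap<>();
        for (long now = 0; now < durationMillis; now += stepMillis) {
            for (String server : servers) {
                if (denyList.isDenied(server, now)) {
                    continue; // still on the deny list, skip this server
                }
                attempts.merge(server, 1, Integer::sum);
                // The connection is destroyed again, so the server is denied again.
                denyList.deny(server, now);
            }
        }
        return attempts;
    }
}
```

With two servers, a 10s deny period and a 60s window, 
DenyListModel.countAttempts(new String[] {"vm1", "vm2"}, 10_000L, 60_000L, 1_000L) 
counts one attempt per server for every 10s interval, and the cycle never ends 
on its own.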

Whenever the connection fails due to a serialization issue and the client is 
not durable, the subscription queue is closed and the events are lost.

The biggest problem arises when the client is durable. In that case the 
subscription queue remains on the server for a configurable period of time 
(e.g. 300s), waiting for the client to reconnect. When the client fails over to 
another server, it creates a new subscription queue using an initial image from 
the old queue, which is currently paused. This means that all events from the 
old queue are transferred to the new subscription queue hosted by the current 
primary server. This happens on all servers, so all of them end up with a copy 
of the queue even if subscription redundancy isn't configured. The problem is 
that the client periodically (every 10s in this case) establishes a connection 
to each server, so the configured timeout (e.g. 300s) never expires; it is 
renewed each time the client registers. This can cause serious problems, since 
memory and disk usage (if overflow on the queue is configured) will keep 
increasing on all servers.
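
The timeout arithmetic above can be checked with a small model. It is a sketch 
under stated assumptions, not Geode code: the class DurableQueueModel, its 
methods and the constant event rate are hypothetical, time advances in whole 
seconds, and each re-registration is assumed to restart the timeout.

```java
/** Toy model of a durable subscription queue timeout; hypothetical, not Geode's implementation. */
final class DurableQueueModel {

    /**
     * Returns the second at which the server drops the queue, or -1 if the
     * queue is still alive at the end of the horizon. A reconnect period of 0
     * means the client never comes back.
     */
    static long expirySecond(long reconnectEverySec, long durableTimeoutSec, long horizonSec) {
        long lastRegistration = 0;
        for (long now = 1; now <= horizonSec; now++) {
            if (now - lastRegistration >= durableTimeoutSec) {
                return now; // timeout elapsed without a new registration
            }
            if (reconnectEverySec > 0 && now % reconnectEverySec == 0) {
                lastRegistration = now; // re-registration renews the timeout
            }
        }
        return -1;
    }

    /** Events held in the queue at the end of the horizon, assuming a constant arrival rate. */
    static long queuedEvents(
            long eventsPerSec, long reconnectEverySec, long durableTimeoutSec, long horizonSec) {
        long expiry = expirySecond(reconnectEverySec, durableTimeoutSec, horizonSec);
        long lifetimeSec = expiry < 0 ? horizonSec : expiry;
        return eventsPerSec * lifetimeSec;
    }
}
```

With a 300s timeout and a reconnect every 10s, expirySecond(10, 300, 3600) 
returns -1: the queue survives the whole hour. With a client that never 
returns, expirySecond(0, 300, 3600) returns 300, so growth is bounded by the 
timeout. In the first case the queue keeps accumulating events for as long as 
the reconnect loop runs.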

The attached logs show the problematic case with a durable client:

vm0      -> locator
vm1, vm2 -> servers
vm3      -> durable client with subscription enabled, handling CQ events
vm4      -> client generating traffic that should trigger the registered CQ
 

  was:
When ReflectionBasedAutoSerializer is wrongly/not set it results with 
serialization exception on client at the reception of the CQ events. 
Serialization exception isn't logged which is misleading, and is hard to find 
that actually ReflectionBasedAutoSerializer isn't set correctly. Only log that 
can be seen is that client/servers subscription connections are closed due to 
EOF. This is because client destroys subscriptions connections intentionally, 
but doesn't log reason (PdxSerializationException) that led to this. It would 
be good that serialization exceptions are logged as error or warn.

Client destroys subscription connection and perform server fail-over whenever 
serialization issue occurs. Additionally when subscription connection for 
particular server fails multiple times then this server is put in deny list for 
10 seconds (this is configurable with {{ping-interval}}). After 10s expire the 
server is removed from list and it is available for subscription connection 
which again fail. This will go indefinitely and approx. every 10s in this case 
the client subscribes to each servers at least once. Due to serialization issue 
events aren't sent to client and remain in subscription queues.

Whenever connection fails due to serialization issue and client is not durable 
then subscription queue is closed and events are lost.

The biggest problem arises when client is durable. This is because subscription 
queue remains on server for configurable period of time (e.g. 300s) waiting for 
client to reconnect. When client perform fail-over to another server it will 
create new subscription queue using initial image from old queue that is 
currently paused. This means that all events from old queue will be transferred 
to new subscription queue hosted by the current primary server. This will 
happen on all servers and all of them will have copy of the queue even 
subscription redundancy isn't configured. The problem here is that client will 
periodically (every 10s in this case) establish connection to each servers, so 
configured timeout (e.g. 300s) will never expire, but it will be renewed each 
time client is registered. This could cause a lots of problems since memory and 
disk usage (if overflow on queue is configured) will increase on all servers.

You can find in attached logs for the problematic case with durable client :

vm0              -> locator
vm1, vm2   -> servers
vm3              -> durable client with enabled subscription handling CQ events
vm4              -> client generating traffic that should trigger registered CQ
 


> Durable client is continuously re-registering CQs on all servers when event 
> de-serialization fails causing resource exhaustion on servers 
> ------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-8687
>                 URL: https://issues.apache.org/jira/browse/GEODE-8687
>             Project: Geode
>          Issue Type: Bug
>          Components: client/server
>    Affects Versions: 1.13.0
>            Reporter: Jakov Varenina
>            Priority: Major
>         Attachments: deserialzationFault.log
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
