JIRA and Wiki permissions

2019-04-11 Thread Alberto Gomez
Hi,


Could you please grant me permissions to edit the wiki and assign JIRA tickets?


My id is alberto.gomez


Thanks!


Alberto




Geode self-protection about overload

2019-05-10 Thread Alberto Gomez
Hi Geode community!

I'd like to know if Geode implements any kind of self-protection against 
overload. What I mean by this is some mechanism that allows Geode servers (and 
possibly locators) to reject incoming operations before processing them when 
they detect that they are not able to handle the amount of operations received 
in a reasonable way (with reasonable latency and without processes crashing).

The goal would be to make sure that Geode (or some parts of it) does not crash 
under too heavy a load and that latency always stays under control, at least 
for the amount of traffic the Geode cluster is supposed to support.

If Geode does not offer such a mechanism, I would also like to get your opinion 
on this possible feature (if you find it interesting) and on how it could be 
implemented. One possible approach could be to use some measure of current CPU 
consumption to decide whether a given operation should be processed, by 
comparing that value against an overload threshold.

Thanks in advance for your answers,

-Alberto


Re: Geode self-protection about overload

2019-05-13 Thread Alberto Gomez
Hi Anthony!

Thanks a lot for your prompt answer.

I think it is great that Geode can preserve the availability and predictable 
low latency of the cluster when some members are unresponsive by means of the 
GMS.

My question was more targeted at situations in which the load received by the 
cluster is so high that all members struggle to offer low latency. Under such 
circumstances, does Geode take any action to back off some of the incoming load?

Thanks in advance,

Alberto


On 10/5/19 17:52, Anthony Baker wrote:

Hi Alberto!

Great questions.  One of the fundamental characteristics of Geode is its Group 
Membership System (GMS).  You can read more about it here [1].  The membership 
system ensures that failures due to unresponsive members and/or network 
partitions are detected quickly.  Given that we use synchronous replication for 
consistent updates, the GMS algorithms fence off unresponsive members to 
preserve the availability (and predictable low latency) of the cluster as a 
whole.

Another factor of resilience is memory load.  Regions can be configured to 
automatically evict data to disk based on heap usage.  In addition, when a 
Region exceeds a critical memory usage threshold, further updates are blocked 
until the overload is resolved.

Geode clients route operations to cluster members based on connection load.  
This helps balance cpu load across the entire cluster.  Cluster members can set 
connection maximums to prevent overrunning the available capacity of an 
individual server.

I hope this helps and feel free to keep asking questions :-)

Anthony

[1] 
https://cwiki.apache.org/confluence/display/GEODE/Core+Distributed+System+Concepts




On May 10, 2019, at 3:22 AM, Alberto Gomez  wrote:

Hi Geode community!

I'd like to know if Geode implements any kind of self-protection against 
overload. What I mean by this is some mechanism that allows Geode servers (and 
possibly locators) to reject incoming operations before processing them when 
they detect that they are not able to handle the amount of operations received 
in a reasonable way (with reasonable latency and without processes crashing).

The goal would be to make sure that Geode (or some parts of it) does not crash 
under too heavy a load and that latency always stays under control, at least 
for the amount of traffic the Geode cluster is supposed to support.

If Geode does not offer such a mechanism, I would also like to get your opinion 
on this possible feature (if you find it interesting) and on how it could be 
implemented. One possible approach could be to use some measure of current CPU 
consumption to decide whether a given operation should be processed, by 
comparing that value against an overload threshold.

Thanks in advance for your answers,

-Alberto








Re: Geode self-protection about overload

2019-05-22 Thread Alberto Gomez
Hi Anthony,

Thanks again for the information.

I have played a bit with the client timeouts and retries and have seen 
operations being rejected when the load is high due to get or put operations. 
Nevertheless, I have not seen that happen when the load on the server is high 
due to functions invoked. Is there a reason for not seeing errors with 
functions, or was my test just not good enough to hit the limits? What if 
queries are sent with OQL? Do the timeout and retries apply? Is there similar 
protection in the native C++ API?

I'd be willing to contribute to the improvements you mention. Do you already 
have ideas? Anything written down?

/Alberto


On 14/5/19 17:01, Anthony Baker wrote:

The primary load limiter between the client tier and the Geode servers is via 
the max connections limit as noted in this writeup:

https://cwiki.apache.org/confluence/display/GEODE/Resource+Management+in+Geode

When the load is sufficiently high, operations may time out and a Geode client 
will fail over to less loaded servers.  You can limit the number of retries the 
client will attempt (each gated by a read timeout) and thus slow down incoming 
operations.
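The retry behavior described above bounds how long a client thread can block on one operation. A rough back-of-the-envelope sketch (the numbers and the 1 + N attempt count are assumptions for illustration, not Geode defaults):

```java
// Hypothetical sketch, not Geode code: worst-case time a client can spend on
// one operation, assuming each of the 1 + retryAttempts attempts is gated by
// the pool's read timeout before the client gives up.
public class WorstCaseWait {
    static long worstCaseMillis(long readTimeoutMillis, int retryAttempts) {
        return (1L + retryAttempts) * readTimeoutMillis;
    }

    public static void main(String[] args) {
        // Example values only; real numbers depend on the pool configuration.
        System.out.println(worstCaseMillis(10_000, 2)); // prints 30000
    }
}
```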

We’re looking into some improvements in the client connection pool to improve 
both performance and behaviors at the ragged edge when resources are saturated. 
 Contributions welcome!

Anthony




On May 13, 2019, at 9:02 AM, Alberto Gomez  wrote:

Hi Anthony!

Thanks a lot for your prompt answer.

I think it is great that Geode can preserve the availability and predictable 
low latency of the cluster when some members are unresponsive by means of the 
GMS.

My question was more targeted at situations in which the load received by the 
cluster is so high that all members struggle to offer low latency. Under such 
circumstances, does Geode take any action to back off some of the incoming load?

Thanks in advance,

Alberto


On 10/5/19 17:52, Anthony Baker wrote:

Hi Alberto!

Great questions.  One of the fundamental characteristics of Geode is its Group 
Membership System (GMS).  You can read more about it here [1].  The membership 
system ensures that failures due to unresponsive members and/or network 
partitions are detected quickly.  Given that we use synchronous replication for 
consistent updates, the GMS algorithms fence off unresponsive members to 
preserve the availability (and predictable low latency) of the cluster as a 
whole.

Another factor of resilience is memory load.  Regions can be configured to 
automatically evict data to disk based on heap usage.  In addition, when a 
Region exceeds a critical memory usage threshold, further updates are blocked 
until the overload is resolved.

Geode clients route operations to cluster members based on connection load.  
This helps balance cpu load across the entire cluster.  Cluster members can set 
connection maximums to prevent overrunning the available capacity of an 
individual server.

I hope this helps and feel free to keep asking questions :-)

Anthony

[1] 
https://cwiki.apache.org/confluence/display/GEODE/Core+Distributed+System+Concepts




On May 10, 2019, at 3:22 AM, Alberto Gomez  wrote:

Hi Geode community!

I'd like to know if Geode implements any kind of self-protection against 
overload. What I mean by this is some mechanism that allows Geode servers (and 
possibly locators) to reject incoming operations before processing them when 
they detect that they are not able to handle the amount of operations received 
in a reasonable way (with reasonable latency and without processes crashing).

The goal would be to make sure that Geode (or some parts of it) does not crash 
under too heavy a load and that latency always stays under control, at least 
for the amount of traffic the Geode cluster is supposed to support.

If Geode does not offer such a mechanism, I would also like to get your opinion 
on this possible feature (if you find it interesting) and on how it could be 
implemented. One possible approach could be to use some measure of current CPU 
consumption to decide whether a given operation should be processed, by 
comparing that value against an overload threshold.

Thanks in advance for your answers,

-Alberto








Re: Geode self-protection about overload

2019-06-05 Thread Alberto Gomez
Hi again,

I finally figured out why I was not getting the 
"ServerConnectivityException" when executing a large number of functions 
in Geode while I did get the exception when running lots of 
gets/puts/queries.

The reason is that ConnectionImpl::execute(Op op) does not use the 
timeout set by PoolFactory::setReadTimeout(int timeout) when the 
operation is a function. Instead, it uses the timeout set by the 
following system property: gemfire.CLIENT_FUNCTION_TIMEOUT.
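The split described above can be illustrated with a tiny runnable sketch. This is an assumption about the mechanism, not the actual Geode code: functions resolve their timeout from the system property, while other operations use the pool's read timeout.

```java
// Sketch (assumption, not Geode internals): functions take their timeout from
// the gemfire.CLIENT_FUNCTION_TIMEOUT system property instead of the pool's
// read timeout, which is what makes their behavior differ under load.
public class TimeoutResolution {
    static int functionTimeoutMillis() {
        // Integer.getInteger reads a system property; the fallback of 0 here
        // is illustrative, standing in for whatever default Geode ships.
        return Integer.getInteger("gemfire.CLIENT_FUNCTION_TIMEOUT", 0);
    }

    public static void main(String[] args) {
        System.setProperty("gemfire.CLIENT_FUNCTION_TIMEOUT", "5000");
        System.out.println(functionTimeoutMillis()); // prints 5000
    }
}
```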

Do you see value in adding a method to the PoolFactory as well as to the 
ClientCacheFactory to set this timeout for functions?

How about being able to override this timeout on each function 
invocation by adding a setReadTimeout method to the FunctionService 
interface?

/Alberto


On 22/5/19 18:03, Alberto Gomez wrote:
> Hi Anthony,
>
> Thanks again for the information.
>
> I have played a bit with the client timeouts and retries and have 
> seen operations being rejected when the load is high due to get or put 
> operations. Nevertheless, I have not seen that happen when the load on 
> the server is high due to functions invoked. Is there a reason for not 
> seeing errors with functions, or was my test just not good enough 
> to hit the limits? What if queries are sent with OQL? Do the timeout 
> and retries apply? Is there similar protection in the native C++ API?
>
> I'd be willing to contribute to the improvements you mention. Do you 
> already have ideas? Anything written down?
>
> /Alberto
>
>
> On 14/5/19 17:01, Anthony Baker wrote:
>> The primary load limiter between the client tier and the Geode servers is 
>> via the max connections limit as noted in this writeup:
>>
>> https://cwiki.apache.org/confluence/display/GEODE/Resource+Management+in+Geode
>>
>> When the load is sufficiently high, operations may time out and a Geode 
>> client will fail over to less loaded servers.  You can limit the number of 
>> retries the client will attempt (each gated by a read timeout) and thus slow 
>> down incoming operations.
>>
>> We’re looking into some improvements in the client connection pool to 
>> improve both performance and behaviors at the ragged edge when resources are 
>> saturated.  Contributions welcome!
>>
>> Anthony
>>
>>
>>> On May 13, 2019, at 9:02 AM, Alberto Gomez  wrote:
>>>
>>> Hi Anthony!
>>>
>>> Thanks a lot for your prompt answer.
>>>
>>> I think it is great that Geode can preserve the availability and 
>>> predictable low latency of the cluster when some members are unresponsive 
>>> by means of the GMS.
>>>
>>> My question was more targeted at situations in which the load received by 
>>> the cluster is so high that all members struggle to offer low latency. 
>>> Under such circumstances, does Geode take any action to back off some of 
>>> the incoming load?
>>>
>>> Thanks in advance,
>>>
>>> Alberto
>>>
>>>
>>> On 10/5/19 17:52, Anthony Baker wrote:
>>>
>>> Hi Alberto!
>>>
>>> Great questions.  One of the fundamental characteristics of Geode is its 
>>> Group Membership System (GMS).  You can read more about it here [1].  The 
>>> membership system ensures that failures due to unresponsive members and/or 
>>> network partitions are detected quickly.  Given that we use synchronous 
>>> replication for consistent updates, the GMS algorithms fence off 
>>> unresponsive members to preserve the availability (and predictable low 
>>> latency) of the cluster as a whole.
>>>
>>> Another factor of resilience is memory load.  Regions can be configured to 
>>> automatically evict data to disk based on heap usage.  In addition, when a 
>>> Region exceeds a critical memory usage threshold, further updates are 
>>> blocked until the overload is resolved.
>>>
>>> Geode clients route operations to cluster members based on connection load. 
>>>  This helps balance cpu load across the entire cluster.  Cluster members 
>>> can set connection maximums to prevent overrunning the available capacity 
>>> of an individual server.
>>>
>>> I hope this helps and feel free to keep asking questions :-)
>>>
>>> Anthony
>>>
>>> [1] 
>>> https://cwiki.apache.org/confluence/display/GEODE/Core+Distributed+System+Concepts

Review of pull request associated with GEODE-6798

2019-06-18 Thread Alberto Gomez
Hi,

Could someone review the following PR: 
https://github.com/apache/geode/pull/3710?

Thanks in advance,

Alberto



Problem with LGTM on geode-native pull request

2019-07-30 Thread Alberto Gomez
Hi,

I am getting a failure in the C/C++ LGTM analysis on a recently 
created pull request on geode-native: 
https://github.com/apache/geode-native/pull/504

I have noticed that this is the first PR on geode-native with LGTM 
analysis.

There is a .lgtm.yml in the repo that does not seem up to date, as it 
references Apache Geode 1.7.0. This could be the source of the problem.

Any ideas?

Thanks in advance,

-Alberto G.



Re: Problem with LGTM on geode-native pull request

2019-07-31 Thread Alberto Gomez
Hi,

I updated the .lgtm.yml file so that the right version of Apache Geode 
is downloaded and now I get errors in the linking process about some 
symbols not found.

Could anybody help with this?

Thanks,

Alberto

On 30/7/19 15:50, Jacob Barrett wrote:
> Yes, that seems to be the issue. Can you update the lgtm.yml and push to your 
> branch.
>
> Thanks,
> Jake
>
>
>> On Jul 30, 2019, at 5:09 AM, Alberto Gomez  wrote:
>>
>> Hi,
>>
>> I am getting a failure on the C/C++ LGTM analysis over a recently
>> created pull request on geode-native:
>> https://github.com/apache/geode-native/pull/504
>>
>> I have noticed that this is the first PR on geode-native with LGTM
>> analysis.
>>
>> There is a .lgtm.yml on the repo that does not seem up to date as it
>> references Apache 1.7.0. This could be the source of the problem.
>>
>> Any ideas?
>>
>> Thanks in advance,
>>
>> -Alberto G.
>>


Re: Problem with LGTM on geode-native pull request

2019-08-01 Thread Alberto Gomez
Hi,

I would not back out the LGTM changes added in the PR as they are necessary.

- Alberto

On 31/7/19 23:46, Jacob Barrett wrote:
> I would say for this PR, back out the LGTM changes and just move forward 
> ignoring the LGTM results.


Re: Problem with LGTM on geode-native pull request

2019-08-01 Thread Alberto Gomez
Put that way, I agree with you guys ;-)

Thanks,

-Alberto

On 1/8/19 18:02, Blake Bender wrote:
> I agree with Jake on this one.  From a bookkeeping perspective, what I'd
> like to see in the history is a single commit that fixes all the LGTM
> issues, and your fix for this bug in a separate commit.  I have a copy of
> your .yml changes on my "fix LGTM" branch already, please back that change
> out and we can merge your PR without LGTM passing.
>
> Thanks,
>
> Blake
>
>
> On Thu, Aug 1, 2019 at 12:55 AM Alberto Gomez 
> wrote:
>
>> Hi,
>>
>> I would not back out the LGTM changes added in the PR as they are
>> necessary.
>>
>> - Alberto
>>
>> On 31/7/19 23:46, Jacob Barrett wrote:
>>> I would say for this PR, back out the LGTM changes and just move forward
>> ignoring the LGTM results.
>>


Reviewers wanted for GEODE-7019 and GEODE-7049

2019-08-09 Thread Alberto Gomez
Hi,

I would need some extra reviewers for the following PRs:

GEODE-7019 Idle connections are never closed by native C++ client library

GEODE-7049 Add timeout parameter to Java native client 
Execution::execute() methods

Any volunteers?

Thanks,

Alberto



Rat check reporting error in travis-ci for geode-native

2019-08-20 Thread Alberto Gomez
Hi,

I sent a PR for the geode-native repo and the travis-ci is reporting a 
Rat error on the following file:

./packer/windows/install-doxygen.ps1

The reason is that the copyright header is not included in it. I could 
add it in my PR, but I guess it would be more convenient if this were 
fixed in an independent commit by someone working on it.

Could anyone take care of it?

Thanks,

- Alberto G.




[DISCUSS] Improvements on client function execution API

2019-08-21 Thread Alberto Gomez
Hi,

I have just added the following proposal in the wiki for discussion and would 
like to get feedback from the community.

https://cwiki.apache.org/confluence/display/GEODE/%5BDiscussion%5D+Improvements+on+client+Function+execution+API

Problem

The client API for function execution is inconsistent in the following aspects:

1. Read timeout in function execution client API

The client API for Function execution allows setting a timeout to wait for the 
execution of a function.

Setting this timeout has the effect of setting a read timeout on the socket. If 
the timeout expires during the execution of a function before data has been 
received, the connection is closed and, if the number of retries is reached, 
the execute method throws an exception.

Nevertheless, how this timeout is set is not uniform across the different 
clients:

  *   In the native C++ and C# clients, the timeout can be set on a per 
function execution basis by passing the timeout to the Execution::execute() 
method.
  *   In the Java API, the timeout can only be set globally by a system 
property when starting the client process.

2. Timeout in ResultCollector::getResult()

Apart from the timeout on the function execution, the client API offers the 
possibility of setting a timeout on the getResult() method of the 
ResultCollector object returned by the Execution::execute() method.

Given that this method blocks when invoked from a client (until all results 
have been received), setting this timeout has no effect at all. In fact, the 
DefaultResultCollector in the Java API simply ignores the timeout value.

Note that this timeout in the ResultCollector::getResult() method is useful 
when used inside a peer as function invocations are not blocking.

3. Blocking  Execution::execute()

As mentioned above, the behavior of the Execution::execute() method differs 
between clients and peers. When invoked from a client it blocks until all 
results are received, whereas when invoked from a peer it is non-blocking and 
the ResultCollector::getResult() method is used to wait for all the results.

This is not explicit in the documentation of the interface and it has already 
been captured in the following ticket: 
https://issues.apache.org/jira/browse/GEODE-3817

Anti-Goals

-

Solution

In order to make the API more coherent two actions are proposed:

1. Read timeout in function execution Java client API

Add two new Execution::execute() methods in the Java client to offer the 
possibility to set the timeout for the socket just as it is done in the C++ and 
C# clients.

This action is simple to implement and there are no adverse effects attached to 
it.

2. Timeout in ResultCollector::getResult() and Execution::execute() blocking

Regarding the timeout in the ResultCollector::getResult() method problem and 
the blocking/non-blocking confusion for Execution::execute() two alternatives 
are considered:

a) Remove the possibility of setting a timeout on the 
ResultCollector::getResult() method on the client side, as it is useless with 
the current client implementation. This could be done by removing the method 
with the timeout parameter from the public API.

It would be advisable to make explicit in the documentation that the 
getResult() method does not wait for results to arrive, as the waiting will 
already have been done in the Execution::execute() invocation.

This alternative is very simple and would keep things pretty much as they are 
today.

b) Transform the Execution::execute() method on the client side into a 
non-blocking method.

This alternative is more complex and requires changes in all the clients. Apart 
from that, it has implications for the public client API: it requires moving 
the exceptions currently thrown by the Execution::execute() method to 
ResultCollector::getResult(), and new threads will have to be managed.

An outline of a possible implementation for option b) would be:

  *   Instead of invoking the ServerRegionProxy::executeFunction() directly as 
it is done today, create a Future that invokes this method and returns the 
resultCollector passed as a parameter.
  *   Create a new class (ProxyResultCollector) that holds a Future for a 
ResultCollector and whose getResult() implementation would be something like:

return this.future.get().getResult();

  *   After creating the future that invokes 
ServerRegionProxy::executeFunction(), create an instance of 
ProxyResultCollector, call its setFuture() method passing the just-created 
future, and return it.
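A minimal, runnable sketch of this ProxyResultCollector idea follows. The types are simplified stand-ins (the real Geode ResultCollector interface is generic and richer), and the executor stands in for whatever thread pool would run the server call:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Simplified stand-in for Geode's ResultCollector.
interface SimpleResultCollector {
    List<String> getResult() throws Exception;
}

// Wraps a Future for the real collector; getResult() blocks on the future,
// then delegates, as outlined in the proposal above.
class ProxyResultCollector implements SimpleResultCollector {
    private Future<SimpleResultCollector> future;

    void setFuture(Future<SimpleResultCollector> future) { this.future = future; }

    @Override
    public List<String> getResult() throws Exception {
        return future.get().getResult();
    }
}

public class ProxyResultCollectorDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // Stand-in for the asynchronous ServerRegionProxy::executeFunction() call.
        Future<SimpleResultCollector> f =
            pool.submit(() -> (SimpleResultCollector) () -> List.of("r1", "r2"));
        ProxyResultCollector collector = new ProxyResultCollector();
        collector.setFuture(f);
        System.out.println(collector.getResult()); // prints [r1, r2]
        pool.shutdown();
    }
}
```

With this shape, execute() can return immediately with the proxy collector, and the blocking moves into getResult(), which is what option b) proposes.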

Changes and Additions to Public Interfaces

Regarding the first action, the Java client API would grow by adding the 
following two methods:

  *   ResultCollector execute(String functionId, long timeout, TimeUnit 
unit) throws FunctionException;

  *   ResultCollector execute(Function function, long timeout, TimeUnit 
unit) throws FunctionException;
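The semantics these two methods would have (block for at most the given timeout, then fail) can be emulated with a plain Future. This is a hedged illustration only; the method name mirrors the proposal and is not real Geode code:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ExecuteWithTimeoutSketch {
    // Illustrative analogue of the proposed execute(function, timeout, unit):
    // wait up to the timeout for a result, otherwise throw.
    static String execute(Callable<String> function, long timeout, TimeUnit unit)
            throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            return pool.submit(function).get(timeout, unit);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(execute(() -> "done", 1, TimeUnit.SECONDS)); // prints done
        try {
            execute(() -> { Thread.sleep(5_000); return "late"; },
                    100, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The analogue of the client-side timeout/retry exception.
            System.out.println("timed out");
        }
    }
}
```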

Depending on the option selected on the second action:

  *   F

Re: [DISCUSS] Improvements on client function execution API

2019-08-22 Thread Alberto Gomez
Hi,

Please, see my answers below.

- Alberto

On 21/8/19 20:30, Anilkumar Gingade wrote:
> Just to be clear between java and native-client api:
>
> - Read timeout in function execution Java client API - This is to change
> the java client behavior
Yes
>
> And following are the native client problems and solution applies to
> native-client?
> - Timeout in ResultCollector::getResult() and Execution::execute() blocking
This applies to all the clients, both the Java and the native ones 
(C++ and C# actually).
>
> -Anil.
>
>
> On Wed, Aug 21, 2019 at 8:49 AM Alberto Gomez 
> wrote:
>
>> Hi,
>>
>> I have just added the following proposal in the wiki for discussion and
>> would like to get feedback from the community.
>>
>>
>> https://cwiki.apache.org/confluence/display/GEODE/%5BDiscussion%5D+Improvements+on+client+Function+execution+API
>>
>> Problem
>>
>> The client API for function execution is inconsistent in the following
>> aspects:
>>
>> 1.Read timeout in function execution client API
>>
>> The client API for Function execution allows setting a timeout to wait for
>> the execution of a function.
>>
>> Setting this timeout has the effect of setting a read timeout in the
>> socket. If the timeout expires during the execution of a function before
>> data has been received, the connection is closed and if the number of
>> retries is reached, the execute method throws an exception.
>>
>> Nevertheless, how this timeout is set is not uniform across the different
>> clients:
>>
>>*   In the native C++ and C# clients, the timeout can be set on a per
>> function execution basis by passing the timeout to the Execution::execute()
>> method.
>>*   In the Java API, the timeout can only be set globally by a system
>> property when starting the client process.
>>
>> 2. Timeout in ResultCollector::getResult()
>>
>> Apart from the timeout on the function execution, the client API offers
>> the possibility of setting a timeout on the getResult() method of the
>> ResultCollector object returned by the Execution::execute() method.
>>
>> Given that this method is blocking when invoked from a client (until all
>> results have been received) then the setting of this timeout has no effect
>> at all. In fact, the DefaultResultCollector in the Java API just ignores
>> the value of the timeout.
>>
>> Note that this timeout in the ResultCollector::getResult() method is
>> useful when used inside a peer as function invocations are not blocking.
>>
>> 3. Blocking  Execution::execute()
>>
>> As mentioned above the Execution::execute() method behavior is different
>> when invoked from clients than from peers. When invoked from clients it is
>> blocking until all results are received while when invoked from peers it is
>> non-blocking and the ResultCollector::getResult() method is used to wait to
>> get all the results.
>>
>> This is not explicit in the documentation of the interface and it has
>> already been captured in the following ticket:
>> https://issues.apache.org/jira/browse/GEODE-3817
>>
>> Anti-Goals
>>
>> -
>>
>> Solution
>>
>> In order to make the API more coherent two actions are proposed:
>>
>> 1. Read timeout in function execution Java client API
>>
>> Add two new Execution::execute() methods in the Java client to offer the
>> possibility to set the timeout for the socket just as it is done in the C++
>> and C# clients.
>>
>> This action is simple to implement and there are no adverse effects
>> attached to it.
>>
>> 2. Timeout in ResultCollector::getResult() and Execution::execute()
>> blocking
>>
>> Regarding the timeout in the ResultCollector::getResult() method problem
>> and the blocking/non-blocking confusion for Execution::execute() two
>> alternatives are considered:
>>
>> a) Remove the possibility of setting a timeout on the
>> ResultCollector::getResult() method on the client side as with the current
>> client implementation it is useless. This could be done by removing the
>> method with the timeout parameter from the public API.
>>
>> It would be advisable to make explicit in the documentation that the
>> getResult() method does not wait for results to arrive as that should have
>> already been done in the Execution::execute() invocation.
>>
>> This alternative is very simple and would keep things pretty much as they
>> are today.
>>
>> b) Transform the Execution::execute() method on the clie

Re: Rat check reporting error in travis-ci for geode-native

2019-08-22 Thread Alberto Gomez
I created the following PR to fix the Rat error.

https://github.com/apache/geode-native/pull/512

Could someone review it and merge it if correct?

Thanks,

-Alberto G.

On 21/8/19 23:59, Dan Smith wrote:
> Hi Helena,
>
> I think the problem is that rat is currently broken on geode-native
> develop.
>
> -Dan
>
> On Wed, Aug 21, 2019 at 2:55 PM Helena Bales  wrote:
>
>> Hi Alberto,
>>
>> If you have added a file as part of your PR, please add the header as a
>> commit to your working branch and add that as part of the same PR. Your
>> branch should be able to pass a build, spotlessCheck, and unit tests as a
>> prerequisite for making a pull request.
>>
>> Please follow up here if you have more questions.
>> Thank you,
>> Helena Bales
>>
>> On Tue, Aug 20, 2019 at 7:40 AM Alberto Gomez 
>> wrote:
>>
>>> Hi,
>>>
>>> I sent a PR for the geode-native repo and the travis-ci is reporting a
>>> Rat error on the following file:
>>>
>>> ./packer/windows/install-doxygen.ps1
>>>
>>> The reason is that the copyright header is not included in it. I could
>>> add it in my PR, but I guess it would be more convenient if this were fixed
>>> in an independent commit by someone working on it.
>>>
>>> Could anyone take care of it?
>>>
>>> Thanks,
>>>
>>> - Alberto G.
>>>
>>>
>>>


Re: Fwd: [DISCUSS] Improvements on client function execution API

2019-08-23 Thread Alberto Gomez
Hi Dan,

Please, see my answers below.

On 22/8/19 20:29, Dan Smith wrote:
> For the two new execute() methods you are proposing, are you also
> suggesting that they be implemented on the server?
I was only thinking about the client.
>
> I took a look at the C++ API. The C++ API also has a
> ResultCollector.getResult(timeout) method. Does that timeout do anything,
> given that execute(timeout) already waited for the given timeout?
The DefaultResultCollector in the C++ API handles the timeout with a 
condition variable in the getResult method (unlike the Java API) but, as 
Jacob Barrett pointed out in another e-mail, given that the execute 
method is blocking, the timeout is pointless here too.
>
> Long term, are you thinking that we should remove execute() with no
> parameters and getResult(timeout) and just have these two blocking
> execute(timeout) methods on the server and clients? I think you are right
> that we shouldn't change the behavior of the existing execute() to avoid
> breaking our users. But I also want to understand what our long term plan
> is for a consistent and sensible API.

It depends on the option that gets selected.

If we go for option a), the idea would be to keep execute() with no 
parameters on clients, add execute() with a timeout, and remove the 
getResult(timeout) method, keeping just getResult() (which would be 
non-blocking, as it is today).

On the servers, I would leave things as they are: execute() is not 
blocking. I am not sure it makes sense to keep these differences between 
clients and servers and, if it does, how we could express them in the 
API. The fact is that execute() is currently very different on clients 
and on servers.

If we go for option b), the idea would be to keep execute() with no 
parameters on clients and also offer execute() with a timeout, in order 
not to lose the feature of closing connections when the read timeout 
expires. The socket timeout exception would have to be thrown by the 
getResult() method instead of the execute method. The execute() methods 
would be non-blocking, and getResult() would be the blocking one, both 
on servers and clients.

For option b), in order to deprecate the blocking behavior of execute(), 
we could keep both implementations (blocking and non-blocking) and add a 
system property to select the behavior, which could default to blocking 
in the release where this is introduced. But in the long term, the aim 
would be to have just the non-blocking implementation of execute().
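The compatibility toggle suggested above could look like the following sketch. The property name is hypothetical, invented here for illustration; the point is only that an unset boolean property keeps the old blocking behavior as the default:

```java
// Sketch of a transition-release toggle (property name hypothetical, not a
// real Geode property): an unset property defaults to the old blocking
// execute(), so existing clients are unaffected until they opt in.
public class ExecuteModeToggle {
    static boolean nonBlockingExecute() {
        // Boolean.getBoolean returns false when the property is unset or not
        // "true", which makes blocking the default behavior.
        return Boolean.getBoolean("geode.function.nonBlockingExecute");
    }

    public static void main(String[] args) {
        System.out.println(nonBlockingExecute()); // prints false
        System.setProperty("geode.function.nonBlockingExecute", "true");
        System.out.println(nonBlockingExecute()); // prints true
    }
}
```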

>
> -Dan
- Alberto


Re: [DISCUSS] Improvements on client function execution API

2019-08-23 Thread Alberto Gomez
Hi Jake,

Please, see my answers below.

On 22/8/19 21:16, Jacob Barrett wrote:
>
>> On Aug 21, 2019, at 8:49 AM, Alberto Gomez  wrote:
>> 2. Timeout in ResultCollector::getResult() and Execution::execute() blocking
>>
>> Regarding the timeout in the ResultCollector::getResult() method problem and 
>> the blocking/non-blocking confusion for Execution::execute() two 
>> alternatives are considered:
>>
>> a) Remove the possibility of setting a timeout on the 
>> ResultCollector::getResult() method on the client side as with the current 
>> client implementation it is useless. This could be done by removing the 
>> method with the timeout parameter from the public API.
>>
>> It would be advisable to make explicit in the documentation that the 
>> getResult() method does not wait for results to arrive as that should have 
>> already been done in the Execution::execute() invocation.
>>
>> This alternative is very simple and would keep things pretty much as they 
>> are today.
> To be honest I think this approach would go against what a user “thinks” is 
> going on. Given that there hasn’t been a timeout on execute, I assumed it was 
> asynchronous and that getResult blocked until timeout or results arrived. 
> Typically these two calls were done back to back.

You are right if you look at the Java client. But if you look at the 
native clients, the timeout is there in both the C++ and the C# cases, 
which would indicate that it is a blocking call.

See 
https://geode.apache.org/releases/latest/cppdocs/a00725.html#aa918a5e193745950e12ca4feb9c5d776
 
and 
https://geode.apache.org/releases/latest/dotnetdocs/a00882.html#ae0a814049482ca424f89c13ab1099c3d
>
>> b) Transform the Execution::execute() method on the client side into a 
>> non-blocking method.
>>
>> This alternative is more complex and requires changes in all the clients. 
>> Apart from that it has implications on the public client API it requires 
>> moving the exceptions offered currently by the Execution::execute() method 
>> to the ResultCollector::getResult() and new threads will have to be managed.
> I think this is more in line with what users expect is going on based on the 
current API, I know I have. If we are going to make any change I think this 
> is the one. I don’t think the behavior change is a problem since it's what is 
> expected to be happening anyway.
>
>> An outline of a possible implementation for option b) would be:
>>
>>   *   Instead of invoking the ServerRegionProxy::executeFunction() directly 
>> as it is done today, create a Future that invokes this method and returns 
>> the resultCollector passed as parameter.
> Do you really think we need to introduce Futures here into the API? I feel 
> like the ResultCollector acts as the Future. I don’t think any change needs 
> to be made to the API in this regard. The ResultCollector implementation 
> would just simply block as implied by the api for the timeout period. I would 
> change the method to have units though and deprecate the method without units.

I did not mean to introduce Futures in the API. My idea was to use Java 
Futures internally so that the ResultCollector returned by the 
getResult() method would wrap the Java Future with the ResultCollector 
that would actually hold the result.

An alternative would be to leave the logic of blocking to each 
implementation ResultCollector. In the case of the 
DefaultResultCollector, we could use a CountDownLatch that would be 
decremented when endResults() is called and that would make getResult() 
block by using CountDown.await(...).

The advantage of using Futures internally is that the blocking logic 
would not have to be implemented on every ResultCollector implementation.
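A minimal, standalone sketch of the CountDownLatch alternative described above (this only mirrors the shape of Geode's ResultCollector; it is not the real interface):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Simplified stand-in for a ResultCollector whose getResult() blocks
// until endResults() is called, using a CountDownLatch as suggested.
class LatchingCollector<T> {
    private final List<T> results = new ArrayList<>();
    private final CountDownLatch done = new CountDownLatch(1);

    public synchronized void addResult(T result) {
        results.add(result);
    }

    // Called when the last result has been received; releases waiters.
    public void endResults() {
        done.countDown();
    }

    // Blocks until endResults() fires or the timeout expires.
    public List<T> getResult(long timeout, TimeUnit unit)
            throws InterruptedException, TimeoutException {
        if (!done.await(timeout, unit)) {
            throw new TimeoutException("results not complete within timeout");
        }
        synchronized (this) {
            return new ArrayList<>(results);
        }
    }
}
```

Note that await() is deliberately done outside the synchronized block so that a waiting getResult() does not lock out addResult(); the copy of the result list is the only part that needs the lock.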

>
> -Jake
>
>
- Alberto


PR reviewers needed

2019-08-27 Thread Alberto Gomez
Hi community,

Any volunteers to review the following PR?

PR in github: https://github.com/apache/geode-native/pull/511

JIRA: https://issues.apache.org/jira/browse/GEODE-7061

Thanks in advance,

Alberto



Re: [DISCUSS] Improvements on client function execution API

2019-08-29 Thread Alberto Gomez
Hi Dan,

Discussing these matters by e-mail is getting tricky.

Let's see if I understand you correctly and also if I am being clear enough.

Please, see my comments inline.

On 29/8/19 0:49, Dan Smith wrote:
> Sorry for the slow response, I've been trying to decide what I think is the
> right approach here.
>
> For (1) - conceptually, I don't have a problem with having both blocking
> and non blocking methods on Execution. So adding blocking versions of
> execute() with a timeout seems ok. But I do think if we add them we need to
> implement them on both the client and the server to behave the same way.
> That shouldn't be too hard on the server since execute(timeout) can just
> call getResult(timeout) internally.

We have to take into account that, currently, the timeout in execute() 
is not the same thing as the timeout in getResult().

On the one hand, the timeout set in execute() (via System property in 
the Java client, and with a parameter in the native client) sets a 
readtimeout on the socket which just means that if nothing is read from 
the socket after sending the request to the server for the given 
timeout, the corresponding exception will be thrown. It looks to me more 
like a protection against possible communication failures rather than a 
mechanism to decide if results took too long to be provided. So I would 
not link the presence of the timeout parameter in the method to the 
nature of the method (blocking or non-blocking). I think we could have 
this read timeout set and at the same time keep that method as non-blocking.

On the other hand, the timeout set in getResult() is a timeout to wait 
for all the results to be received from the moment the method is invoked.
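To make the distinction concrete, here is a small self-contained Java demo of a socket read timeout, the kind of timeout execute() currently sets via the system property: it bounds each individual blocking read(), not the overall operation.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// The client connects to a local server socket that never writes anything,
// so the blocking read() fails with SocketTimeoutException after the
// configured read timeout elapses.
public class ReadTimeoutDemo {
    public static String readWithTimeout() throws IOException {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("localhost", server.getLocalPort())) {
            client.setSoTimeout(100); // per-read timeout, in milliseconds
            try {
                client.getInputStream().read(); // no data ever arrives
                return "read returned";
            } catch (SocketTimeoutException e) {
                return "socket read timed out";
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readWithTimeout());
    }
}
```

As long as bytes keep trickling in, a read timeout like this never fires, which is why it protects against communication failures but says nothing about how long the whole function execution takes.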

Therefore, I would not implement the blocking version of execute() on 
the server by calling getResult() at the end.

Apart from that, I doubt if it would make sense to set this readTimeout 
in the execute() methods from servers given that the communication is 
very different to the one done with clients. I also doubt that anyone 
would be interested in the blocking version of execute() on the server.

My proposal was to add the read timeout to the execute() methods in the 
Java client in order to align the Java client with the native clients. 
This change would be independent of the decision we make regarding 
changing execute() to non-blocking. Alternatively, to achieve this 
alignment we could remove the timeout parameter from execute() in the 
native clients and have it as a global client property, set by whatever 
mechanism is available, as is done in the Java client today.

Were you proposing that the execute() methods with timeout were blocking 
and the ones without timeout non-blocking? Not sure if this is something 
you meant.

>
> For (2) - Although I think the original authors of this API probably did
> intend for execute() to be non-blocking, the fact is that it does block on
> the client and most users are probably calling execute from a client. So I
> do agree we probably shouldn't change the behavior at this point. Perhaps
> we can just clearly document the current behavior of execute() as part of
> adding these new methods. Going forward we can add new methods to Execution
> that are clearly non-blocking (submit?, invoke?) and implement them
> consistently on *both* the client in the server, but that doesn't have to
> be in the scope of this proposal.

The problem I see with adding new non-blocking methods (new/submit...) 
is that it only solves backwards compatibility for current users of the 
client. On the server side, we would have to move the current 
non-blocking logic of execute() to the new methods and change the 
current execute() behavior to blocking. We would not impact users of 
the client, but we would impact users of the server.

Again, I would propose to aim at:

a) either leave execute() on the client as blocking

b) or change execute() on the client to be non-blocking but, in the 
Geode release where this is introduced, have it configurable. The 
default behavior would be blocking (deprecated) but could be set to 
non-blocking with a system property. In the next release, the blocking 
behavior would be removed.

- Alberto G.

> -Dan
>
> On Fri, Aug 23, 2019 at 4:28 AM Alberto Gomez 
> wrote:
>
>> Hi Jake,
>>
>> Please, see my answers below.
>>
>> On 22/8/19 21:16, Jacob Barrett wrote:
>>>> On Aug 21, 2019, at 8:49 AM, Alberto Gomez 
>> wrote:
>>>> 2. Timeout in ResultCollector::getResult() and Execution::execute()
>> blocking
>>>> Regarding the timeout in the ResultCollector::getResult() method
>> problem and the blocking/non-blocking confusion for Execution::execute()
>> two alternatives are considered:
>>>> a) Remove the possibi

Re: [DISCUSS] Improvements on client function execution API

2019-08-29 Thread Alberto Gomez
behavior of execute() as 
>> part of
>> adding these new methods. Going forward we can add new methods to 
>> Execution
>> that are clearly non-blocking (submit?, invoke?) and implement them
>> consistently on *both* the client in the server, but that doesn't 
>> have to
>> be in the scope of this proposal.
>
> The problem I see with adding new non-blocking methods (new/submit...) 
> is that it would be a solution for the current users of the client 
> regarding backwards compatibility. But, on the server side, we would 
> have to move the current logic of execute() which is non-blocking to 
> the new methods and change the current execute() behavior to blocking. 
> We would not impact the users of the client but we would impact the 
> users of the server.
>
> Again, I would propose to aim at:
>
> a) either leave execute() on the client as blocking
>
> b) or change execute() on the client to be non-blocking but on the 
> Geode release this is introduced, have it configurable. The default 
> behavior would be blocking (deprecated behavior) but could be set to 
> non-blocking with a system property. On the next release, the blocking 
> behavior would be removed.
>
> - Alberto G.
>
>> -Dan
>>
>> On Fri, Aug 23, 2019 at 4:28 AM Alberto Gomez 
>> wrote:
>>
>>> Hi Jake,
>>>
>>> Please, see my answers below.
>>>
>>> On 22/8/19 21:16, Jacob Barrett wrote:
>>>>> On Aug 21, 2019, at 8:49 AM, Alberto Gomez 
>>> wrote:
>>>>> 2. Timeout in ResultCollector::getResult() and Execution::execute()
>>> blocking
>>>>> Regarding the timeout in the ResultCollector::getResult() method
>>> problem and the blocking/non-blocking confusion for 
>>> Execution::execute()
>>> two alternatives are considered:
>>>>> a) Remove the possibility of setting a timeout on the
>>> ResultCollector::getResult() method on the client side as with the 
>>> current
>>> client implementation it is useless. This could be done by removing the
>>> method with the timeout parameter from the public API.
>>>>> It would be advisable to make explicit in the documentation that the
>>> getResult() method does not wait for results to arrive as that 
>>> should have
>>> already been done in the Execution::execute() invocation.
>>>>> This alternative is very simple and would keep things pretty much as
>>> they are today.
>>>> To be honest I think this approach would go against what a user “thinks” is
>>> going on. Given that there hasn’t been a timeout on execute I 
>>> assumed it
>>> was asynchronous and that the getResult blocked until timeout or 
>>> results
>>> arrived. Typically these two calls were done back to back.
>>>
>>> You are right if you look at the Java client. But if you look at the
>>> native clients, the timeout is there, both in the C++ and the C# cases
>>> which would indicate that it is a blocking call.
>>>
>>> See
>>>
>>> https://geode.apache.org/releases/latest/cppdocs/a00725.html#aa918a5e193745950e12ca4feb9c5d776
>>>  
>>>
>>> and
>>>
>>> https://geode.apache.org/releases/latest/dotnetdocs/a00882.html#ae0a814049482ca424f89c13ab1099c3d
>>>  
>>>
>>>>> b) Transform the Execution::execute() method on the client side 
>>>>> into a
>>> non-blocking method.
>>>>> This alternative is more complex and requires changes in all the
>>> clients. Apart from that it has implications on the public client 
>>> API it
>>> requires moving the exceptions offered currently by the
>>> Execution::execute() method to the ResultCollector::getResult() and new
>>> threads will have to be managed.
>>>> I think this is more in line with what users expect is going on 
>>>> based on
>>> the current API, I know I have. If we are going to make any change I
>>> think this is the one. I don’t think the behavior change is a 
>>> problem since
>>> it's what is expected to be happening anyway.
>>>>> An outline of a possible implementation for option b) would be:
>>>>>
>>>>>    *   Instead of invoking the ServerRegionProxy::executeFunction()
>>> directly as it is done today, create a Future that invokes this 
>>> method and
>>> returns the resultCollector passed as parameter.
>>>> Do you really think we need to introduce Futures here into the API? I
>>> feel like the ResultCollector acts as the Future. I don’t think any 
>>> change
>>> needs to be made to the API in this regard. The ResultCollector
>>> implementation would just simply block as implied by the api for the
>>> timeout period. I would change the method to have units though and
>>> deprecate the method without units.
>>>
>>> I did not mean to introduce Futures in the API. My idea was to use Java
>>> Futures internally so that the ResultCollector returned by the
>>> getResult() method would wrap the Java Future with the ResultCollector
>>> that would actually hold the result.
>>>
>>> An alternative would be to leave the logic of blocking to each
>>> implementation ResultCollector. In the case of the
>>> DefaultResultCollector, we could use a CountDownLatch that would be
>>> decremented when endResults() is called and that would make getResult()
>>> block by using CountDown.await(...).
>>>
>>> The advantage of using Futures internally is that the blocking logic
>>> would not have to be implemented on every ResultCollector 
>>> implementation.
>>>
>>>> -Jake
>>>>
>>>>
>>> - Alberto
>>>


Reason for gemfire.CLIENT_FUNCTION_TIMEOUT

2019-08-30 Thread Alberto Gomez
Hi,

The gemfire.CLIENT_FUNCTION_TIMEOUT system property makes it possible 
to set a specific read timeout on the socket when the operation sent 
over a client connection is a function execution. I have not seen this 
property documented anywhere, but I have seen how it works in the code.

Does anybody know why it would be interesting to have a read timeout on 
the socket for function executions different from the read timeout set 
for the rest of the operations?

Thanks in advance,

-Alberto G.



Re: [DISCUSS] Improvements on client function execution API

2019-09-05 Thread Alberto Gomez
Hi all,

First of all, thanks a lot Dan and Jacob for your feedback.

As we are getting close to the deadline I am adding here some conclusions and a 
refined proposal in order to get some more feedback and if possible some voting 
on the two alternatives proposed (or any other in between if you feel any of 
them is lacking something).

I also add some draft code to try to clarify a bit the more complex of the 
alternatives.


Proposal summary (needs a decision on which option to implement):
---

In order to make the API more coherent two alternatives are proposed:

a) Remove the timeout from the ResultCollector::getResult() / document that the 
timeout has no effect, taking into account that Execution::execute() is always 
blocking.
Additionally we could add the timeout parameter to the Execution::execute() 
method of the Java API in order to align it with the native client APIs. This 
timeout would not be the read timeout on the socket but a timeout for the 
execution of the operation.

b) Change the implementation of the Execution::execute() method without timeout 
to be non-blocking on both the Java and native APIs. This change has backward 
compatibility implications, would probably bring some performance decrease and 
could pose some difficulties in the implementation on the C++ side (in the 
handling of timed-out operations that hold resources).
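For option a), a rough sketch of how an operation-level timeout (as opposed to a socket read timeout) could be implemented internally, by bounding the whole blocking call with Future.get(timeout); the names here are invented for illustration and this is not the proposed Geode implementation:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ExecuteWithTimeoutSketch {
    // Stand-in for a remote function execution that takes a while.
    static String slowOperation(long millis) throws InterruptedException {
        Thread.sleep(millis);
        return "done";
    }

    // Bound the whole operation with Future.get(timeout): this is an
    // operation timeout, unlike a socket read timeout, which only bounds
    // each individual blocking read.
    static String executeBlocking(long opMillis, long timeout, TimeUnit unit)
            throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> f = pool.submit(() -> slowOperation(opMillis));
            return f.get(timeout, unit); // TimeoutException if exceeded
        } finally {
            pool.shutdownNow(); // interrupt the task if still running
        }
    }
}
```

The timeout here caps the total time until all results are back, regardless of whether bytes keep arriving on the connection, which matches what the proposed execute(timeout, TimeUnit) parameter is meant to express.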


The first option (a) is less risky and does not have impacts regarding backward 
compatibility and performance.

The second one (b) is the preferred alternative in terms of the expected 
behavior from the users of the API. This option is more complex to implement 
and as mentioned above has performance and backward compatibility issues not 
easy to be solved.

Following is a draft version of the implementation of b) on the Java client:
https://github.com/Nordix/geode/commit/507a795e34c6083c129bda7e976b9223d1a893da

Following is a draft version of the implementation of b) on the C++ native 
client:
https://github.com/apache/geode-native/commit/a03a56f229bb8d75ee71044cf6196df07f43150d

Note that the above implementation of b) in the C++ client implies that the 
Execution object returned by the FunctionService cannot be destroyed until the 
thread executing the function asynchronously has finished. If the function 
times out, the Execution object must be kept until the thread finishes.


Other considerations
-

* Currently, the function execution Java client offers no way to set a timeout 
for the execution of functions. The closest thing is the read timeout that may 
be set globally for function executions, but that is not really a timeout on 
operations.

* Even though the API for function execution is the same on clients and 
servers, the implementation differs between them. On the clients, the execute() 
methods are blocking, while on the servers they are non-blocking and the 
invoker of the function blocks on the getResult() method of the ResultCollector 
returned by the execute() method.
Even if having both blocking and non-blocking implementations of execute() on 
both clients and servers sounds desirable from the point of view of 
orthogonality, this could bring complications in terms of backward 
compatibility. Besides, no need for a blocking version of function execution 
has been found.

-Alberto G.

On 29/8/19 16:48, Alberto Gomez wrote:

Sorry, some corrections on my comments after revisiting the native
client code.

When I said that the timeout used in the execute() method (by means of
a system property) was to set a read timeout on the socket, I was only
talking about the Java client. In the native clients, the timeout set
in the execute() method is not translated into a socket timeout; it is
the time to wait for the operation to complete, i.e., to get all the
results back.

Things being so, I would change my proposal to:

- Change the implementation of execute() on both Java and native clients
to be non-blocking (having the blocking/non-blocking behavior
configurable in the release this is introduced and leaving only the
non-blocking behavior in the next release).

- Either remove the execute() with timeout methods in the native clients
(with a deprecation release) or implement the execute(timeout) method in
the Java client to be blocking (to work as the native client does
today). In case the method times out, the connection will not be closed.
If the operation times out due to the socket timeout (system property),
then the connection will be closed as it is now done in the Java client.

- Do not implement the blocking execute(timeout) method on the server
and leave the current execute() implementation on the server as it is
(non-blocking)

Does this make sense?

-Alberto

On 29/8/19 12:56, Alberto Gómez wrote:


Hi Dan,

Discussing these matters by e-mail is getting tricky.

Let's see

Re: [DISCUSS] Improvements on client function execution API

2019-09-16 Thread Alberto Gomez
Thanks for the feedback. I also give a +1 to option a) including Dan's 
comments.

I'll move the RFC to the Development state and will open a ticket to 
follow up on the implementation.

-Alberto G.

On 12/9/19 8:15, Jacob Barrett wrote:
> +1
>
> I echo Dan’s comments as well.
>
> Thanks for tackling this.
>
> -jake
>
>
>> On Sep 11, 2019, at 2:36 PM, Dan Smith  wrote:
>>
>> +1 - Ok, I think I've come around to option (a). We can go ahead and add a
>> new execute(timeout, TimeUnit) method to the java API that is blocking. We
>> can leave the existing execute() method alone, except for documenting what
>> it is doing.
>>
>> I would like to implement execute(timeout, TimeUnit) on the server side as
>> well. Since this Execution class is shared between both client and server
>> APIs, it would be unfortunate to have a method on Execution that simply
>> doesn't work on the server side.
>>
>> -Dan
>>
>>
>>> On Thu, Sep 5, 2019 at 9:25 AM Alberto Gomez  wrote:
>>>
>>> Hi all,
>>>
>>> First of all, thanks a lot Dan and Jacob for your feedback.
>>>
>>> As we are getting close to the deadline I am adding here some conclusions
>>> and a refined proposal in order to get some more feedback and if possible
>>> some voting on the two alternatives proposed (or any other in between if
>>> you feel any of them is lacking something).
>>>
>>> I also add some draft code to try to clarify a bit the more complex of the
>>> alternatives.
>>>
>>>
>>> Proposal summary (needs a decision on which option to implement):
>>>
>>> ---
>>>
>>> In order to make the API more coherent two alternatives are proposed:
>>>
>>> a) Remove the timeout from the ResultCollector::getResult() / document
>>> that the timeout has no effect, taking into account that
>>> Execution::execute() is always blocking.
>>> Additionally we could add the timeout parameter to the
>>> Execution::execute() method of the Java API in order to align it with the
>>> native client APIs. This timeout would not be the read timeout on the
>>> socket but a timeout for the execution of the operation.
>>>
>>> b) Change the implementation of the Execution::execute() method without
>>> timeout to be non-blocking on both the Java and native APIs. This change
>>> has backward compatibility implications, would probably bring some
>>> performance decrease and could pose some difficulties in the implementation
>>> on the C++ side (in the  handling of timed out operations that hold
>>> resources).
>>>
>>>
>>> The first option (a) is less risky and does not have impacts regarding
>>> backward compatibility and performance.
>>>
>>> The second one (b) is the preferred alternative in terms of the expected
>>> behavior from the users of the API. This option is more complex to
>>> implement and as mentioned above has performance and backward compatibility
>>> issues not easy to be solved.
>>>
>>> Following is a draft version of the implementation of b) on the Java
>>> client:
>>>
>>> https://github.com/Nordix/geode/commit/507a795e34c6083c129bda7e976b9223d1a893da
>>>
>>> Following is a draft version of the implementation of b) on the C++ native
>>> client:
>>>
>>> https://github.com/apache/geode-native/commit/a03a56f229bb8d75ee71044cf6196df07f43150d
>>>
>>> Note that the above implementation of b) in the C++ client implies that
>>> the Execution object returned by the FunctionService cannot be destroyed
>>> until the thread executing the function asynchronously has finished. If the
>>> function times out, the Execution object must be kept until the thread
>>> finishes.
>>>
>>>
>>> Other considerations
>>> -
>>>
>>> * Currently, in the function execution Java client there is not a
>>> possibility to set a timeout for the execution of functions. The closest to
>>> this is the read timeout that may be set globally for function executions
>>> but this is not really a timeout for operations.
>>>
>>> * Even if the API for function execution is the same on clients and
>>> servers, the implementation is different between them. On the clients, the
>>> execute() methods are blocking while on the servers it is non-

Re: [DISCUSS] Improvements on client function execution API

2019-09-17 Thread Alberto Gomez
Thanks Anil. This is how this started, trying to pop the timeout up from 
a system property to a parameter in the Execution methods.

-Alberto G.

On 16/9/19 21:37, Anilkumar Gingade wrote:
> Alberto,
>
> Sorry for the late response. Currently geode (java client) does provide the
> ability to set a function timeout, but it's through the internal system property
> "gemfire.CLIENT_FUNCTION_TIMEOUT". Some of the tests using this property
> are tests extending "FunctionRetryTestBase".
>
> Since this is through internal system property; we need a cleaner api to
> achieve the timeout behavior.
>
> +1 for the option a) proposed.
>
> -Anil
>
>
>
>
> On Mon, Sep 16, 2019 at 9:03 AM Dan Smith  wrote:
>
>> Thanks for following up on this!
>>
>> -Dan
>>
>> On Mon, Sep 16, 2019 at 3:07 AM Alberto Gomez 
>> wrote:
>>
>>> Thanks for the feedback. I also give a +1 to option a) including Dan's
>>> comments.
>>>
>>> I'll move the RFC to the Development state and will open a ticket to
>>> follow up on the implementation.
>>>
>>> -Alberto G.
>>>
>>> On 12/9/19 8:15, Jacob Barrett wrote:
>>>> +1
>>>>
>>>> I echo Dan’s comments as well.
>>>>
>>>> Thanks for tackling this.
>>>>
>>>> -jake
>>>>
>>>>
>>>>> On Sep 11, 2019, at 2:36 PM, Dan Smith  wrote:
>>>>>
>>>>> +1 - Ok, I think I've come around to option (a). We can go ahead and
>>> add a
>>>>> new execute(timeout, TimeUnit) method to the java API that is
>> blocking.
>>> We
>>>>> can leave the existing execute() method alone, except for documenting
>>> what
>>>>> it is doing.
>>>>>
>>>>> I would like to implement execute(timeout, TimeUnit) on the server side
>> as
>>>>> well. Since this Execution class is shared between both client and
>>> server
>>>>> APIs, it would be unfortunate to have a method on Execution that
>> simply
>>>>> doesn't work on the server side.
>>>>>
>>>>> -Dan
>>>>>
>>>>>
>>>>>> On Thu, Sep 5, 2019 at 9:25 AM Alberto Gomez >> wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> First of all, thanks a lot Dan and Jacob for your feedback.
>>>>>>
>>>>>> As we are getting close to the deadline I am adding here some
>>> conclusions
>>>>>> and a refined proposal in order to get some more feedback and if
>>> possible
>>>>>> some voting on the two alternatives proposed (or any other in between
>>> if
>>>>>> you feel any of them is lacking something).
>>>>>>
>>>>>> I also add some draft code to try to clarify a bit the more complex
>> of
>>> the
>>>>>> alternatives.
>>>>>>
>>>>>>
>>>>>> Proposal summary (needs a decision on which option to implement):
>>>>>>
>>>>>>
>> ---
>>>>>> In order to make the API more coherent two alternatives are proposed:
>>>>>>
>>>>>> a) Remove the timeout from the ResultCollector::getResult() /
>> document
>>>>>> that the timeout has no effect, taking into account that
>>>>>> Execution::execute() is always blocking.
>>>>>> Additionally we could add the timeout parameter to the
>>>>>> Execution::execute() method of the Java API in order to align it with
>>> the
>>>>>> native client APIs. This timeout would not be the read timeout on the
>>>>>> socket but a timeout for the execution of the operation.
>>>>>>
>>>>>> b) Change the implementation of the Execution::execute() method
>> without
>>>>>> timeout to be non-blocking on both the Java and native APIs. This
>>> change
>>>>>> has backward compatibility implications, would probably bring some
>>>>>> performance decrease and could pose some difficulties in the
>>> implementation
>>>>>> on the C++ side (in the  handling of timed out operations that hold
>>>>>> resources).
>>>>>>
>>>>>>
>>>>>> The first o

Question about rolling back a Geode upgrade

2019-09-23 Thread Alberto Gomez
Hi,

Looking at the Geode documentation I have not found any reference to 
rolling back a Geode upgrade.

Running some tests, I have observed that once a Geode system has been 
upgraded to a later version, it is not possible to roll back the upgrade 
even if no data modifications have been made after the upgrade.

The system protects itself in several places: gfsh does not allow you to 
connect to a newer version of Geode, the Oplog files store the version 
of the server which prevents an older server to start from a file from a 
newer server, the cluster also does not allow older members to join a 
cluster with newer members and there are probably other protections I 
did not hit.

Even if you tamper with some of those protections, you can run into 
trouble due to compatibility issues. I ran into one when I lifted the 
requirement to have the same gfsh version, using versions 1.8 and 1.10: 
it seems there is some configuration exchanged in JSON format whose 
format has changed between those two versions.

My question is whether it has ever been considered to support rollback 
of Geode upgrades (preferably in rolling mode), at least between systems 
under the same major version. In our experience, customers often require 
the ability to roll back upgrades.

Thanks in advance for your help,

-Alberto G.



Reviewers for PR #3888

2019-09-23 Thread Alberto Gomez
Hi,

Could I please get some extra reviewers for 
https://github.com/apache/geode/pull/3888?

This PR is about GEODE-7049: Add timeout parameter to Java client 
Execution::execute() methods 
(https://jira.apache.org/jira/browse/GEODE-7049).

Thanks,

-Alberto G.



Re: Question about rolling back a Geode upgrade

2019-09-23 Thread Alberto Gomez
Hi Anthony,

That's an option but, as you say, the cost in infrastructure is high and 
there are also other problems to solve, such as how to switch between 
systems and how to ensure data consistency between them.

I was thinking that in many cases it might be possible to support a 
rolling downgrade similar to the rolling upgrade given that the rolling 
upgrade already allows the coexistence of old and new members in a cluster.

-Alberto

On 23/9/19 15:55, Anthony Baker wrote:
> Have you considered using a blue / green deployment approach?  It provides 
> more flexibility for these scenarios though the infrastructure cost is high.
>
> Anthony
>
>
>> On Sep 23, 2019, at 5:59 AM, Alberto Gomez  wrote:
>>
>> Hi,
>>
>> Looking at the Geode documentation I have not found any reference to
>> rolling back a Geode upgrade.
>>
>> Running some tests, I have observed that once a Geode System has been
>> upgraded to a later version, it is not possible to rollback the upgrade
>> even if no data modifications have been done after the upgrade.
>>
>> The system protects itself in several places: gfsh does not allow you to
>> connect to a newer version of Geode, the Oplog files store the version
>> of the server which prevents an older server to start from a file from a
>> newer server, the cluster also does not allow older members to join a
>> cluster with newer members and there are probably other protections I
>> did not hit.
>>
>> Even if you tamper with some of those protections, you can run into
>> trouble due to compatibility issues. I ran into one when I lifted up the
>> requirement to have the same gfsh versions using versions 1.8 and 1.10
>> because it seems there is some configuration exchanged in Json format
>> whose format has changed between those two versions.
>>
>> My question is that if it has ever been considered to support rollback
>> of Geode upgrades (preferably in rolling mode), at least between systems
>> under the same major version. In our experience customers often require
>> the rollback of upgrades.
>>
>> Thanks in advance for your help,
>>
>> -Alberto G.
>>


Re: Question about rolling back a Geode upgrade

2019-09-26 Thread Alberto Gomez
Hi again,

I have been investigating a bit more the possibility of supporting 
"rolling downgrades" in Geode, similar to rolling upgrades, and I would 
like to share my findings and also ask for some help.

My tests were done upgrading from Geode 1.10 to a recent version in the 
develop branch and rolling back (downgrading) to 1.10. I was using one 
locator and two servers. I am sure my findings would have been different 
if I used other Geode versions or another configuration.

By doing some changes in code, I managed to rollback the servers but I 
got into trouble when starting the old locator.

The changes I made were the following:

- I removed the check for equality for the local and remote versions of 
Geode in ConnectCommand::connect() so that it was allowed to connect to 
Geode with a newer or older version of gfsh.
- I started the locators and servers with the 
gemfire.allow_old_members_to_join_for_testing property to allow old 
members to join a newer Geode system.
- I changed the Version::fromOrdinal method to return CURRENT instead of 
throwing an exception when the ordinal passed corresponds to an 
unsupported version. I had to make this change so that old servers could 
make progress when reading oplogs generated by newer servers.
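As a rough, simplified sketch of that last change (the real org.apache.geode Version class is far more involved, and these ordinal values and version strings are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the fromOrdinal fallback: instead of throwing for an
// unknown (newer) ordinal, fall back to CURRENT so a downgraded server
// can still make progress reading data written by a newer one.
class VersionSketch {
    static final int CURRENT_ORDINAL = 110; // invented value
    static final Map<Integer, String> KNOWN = new HashMap<>();
    static {
        KNOWN.put(100, "1.9");
        KNOWN.put(105, "1.10");
        KNOWN.put(110, "1.11");
    }

    static String fromOrdinal(int ordinal) {
        // Unknown ordinals map to CURRENT instead of raising an error.
        return KNOWN.getOrDefault(ordinal, KNOWN.get(CURRENT_ORDINAL));
    }
}
```

The obvious caveat, as the findings below show, is that mapping an unknown ordinal to CURRENT only helps when the on-disk format is actually readable by the older code; it does nothing for genuine format changes such as the one in the locator's view file.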

After downgrading the servers successfully, I stopped the new locator, 
started the old one (with the old gfsh) and got an exception in the 
locator when reading from the view file:

The Locator process terminated unexpectedly with exit status 1. Please 
refer to the log file in 
/home/alberto/geode/geode-releases/apache-geode-1.0.0/locator1 for full 
details.

Exception in thread "main" org.apache.geode.InternalGemFireException: 
Unable to recover previous membership view from 
/home/alberto/geode/geode-releases/apache-geode-1.10.0/locator1/locator10334view.dat

     at 
org.apache.geode.distributed.internal.membership.gms.locator.GMSLocator.recoverFromFile(GMSLocator.java:492)
     ...
     Caused by: java.io.StreamCorruptedException: invalid type code: 02

     at 
java.io.ObjectInputStream$BlockDataInputStream.readBlockHeader(ObjectInputStream.java:2871)
     ...

I think the problem is in the deserialization due to the fact that the 
format of the locator's view file has changed between both Geode 
versions after GEODE-7090.

This leads me to think that I might have been successful in the "rolling 
downgrade" if I had selected other versions of Geode or I might have run 
into a different set of problems.

After this research I would like to get some feedback from the community 
on the following questions:

- Would it be reasonable to restrict future changes in Geode between 
minor versions so that the rolling downgrade is supported? This would 
imply that changes such as the one done in GEODE-7090 would not be 
allowed for a minor version change.

- Could the changes in code and configuration I have done in my tests to 
support the "rolling downgrade" have any negative secondary effects 
which should dissuade us from using them?

- Are there any other things I have not taken into account that would 
require changes in order to support rolling downgrades?

- Is it even feasible to implement "rolling downgrades" of Geode with 
some restrictions or there are always possible incompatibilities between 
versions that make it impossible or unreasonably hard to support this 
kind of feature?

Thanks in advance for your help,

-Alberto G.

On 23/9/19 17:04, Alberto Gomez wrote:
> Hi Anthony,
>
> That's an option but, as you say, the cost in infrastructure is high and
> there are also other problems to solve like how to do the switch between
> systems and how to assure the data consistency among them.
>
> I was thinking that in many cases it might be possible to support a
> rolling downgrade similar to the rolling upgrade given that the rolling
> upgrade already allows the coexistence of old and new members in a cluster.
>
> -Alberto
>
> On 23/9/19 15:55, Anthony Baker wrote:
>> Have you considered using a blue / green deployment approach?  It provides 
>> more flexibility for these scenarios though the infrastructure cost is high.
>>
>> Anthony
>>
>>
>>> On Sep 23, 2019, at 5:59 AM, Alberto Gomez  wrote:
>>>
>>> Hi,
>>>
>>> Looking at the Geode documentation I have not found any reference to
>>> rolling back a Geode upgrade.
>>>
>>> Running some tests, I have observed that once a Geode System has been
>>> upgraded to a later version, it is not possible to rollback the upgrade
>>> even if no data modifications have been done after the upgrade.
>>>
>>> The system protects itself in several places: gfsh does not allow you to
>>> connect to a newer version of Geode, the Oplog files store the version
>>> of the server

Re: Question about rolling back a Geode upgrade

2019-10-16 Thread Alberto Gomez
Hi,

Thanks for your suggestion, Anthony. I gave it a try and the rolling downgrade 
worked because peers are prepared to exchange messages in different versions by 
means of the toDataPre_GEODE_X_X_X_X()/fromDataPre_GEODE_X_X_X_X() mechanism.

After all my investigation on this topic I have come to the conclusion that the 
main obstacle to rolling downgrades is persistent file compatibility. If the 
persistent file format does not change between versions (which is not likely) 
then the rolling downgrade should be straightforward. But, given that this 
cannot be guaranteed, supporting rolling downgrades would require (among other 
things) providing a tool to convert files from the new version to the old one 
because, in general, older members will not be able to read files written by 
newer members. Newer members are normally prepared to read older version files, 
as noted in the reference you sent about Geode backward compatibility, but not 
the other way around (as they would have to know the future :-)).

In my tests, I ran into a problem when trying to perform a rolling downgrade 
from version 1.11 to 1.10 because the format of the view files for locators had 
changed due to GEODE-7090. Nevertheless, I managed to do a rolling downgrade 
from version 1.10 to 1.9 because there were no changes to the format of 
persistent files.

If anybody could share any other insight on this subject, it would be 
appreciated.

Best regards,

/Alberto G.

On 9/10/19 21:50, Anthony Baker wrote:

Hi Alberto!

Another experiment that might be useful to try is changing a p2p message 
following [1].  If you follow the steps in the wiki, a rolling upgrade should 
work ok.  But if you then try to do a rolling downgrade, what happens?


Anthony

[1] 
https://cwiki.apache.org/confluence/display/GEODE/Managing+Backward+Compatibility




On Sep 26, 2019, at 9:37 AM, Alberto Gomez 
<mailto:alberto.go...@est.tech> wrote:

Hi again,

I have been investigating a bit more the possibility of supporting
"rolling downgrades" in Geode similar to rolling upgrades and I would
like to share my findings and also ask for some help.

My tests were done upgrading from Geode 1.10 to a recent version in the
develop branch and rolling back (downgrading) to 1.10. I was using one
locator and two servers. I am sure my findings would have been different
if I used other Geode versions or another configuration.

By doing some changes in code, I managed to rollback the servers but I
got into trouble when starting the old locator.

The changes I made were the following:

- I removed the check for equality for the local and remote versions of
Geode in ConnectCommand::connect() so that it was allowed to connect to
Geode with a newer or older version of gfsh.
- I started the locators and servers with the
gemfire.allow_old_members_to_join_for_testing property to allow old
members to join a newer Geode system.
- I changed Version::fromOrdinal method to return CURRENT instead of
throwing an exception when the ordinal passed corresponds to a version
not supported. I had to do this change in order for old servers to be
able to progress when reading oplogs generated by newer servers.

After downgrading the servers successfully, I stopped the new locator,
started the old one (with the old gfsh) and got an exception in the
locator when reading from the view file:

The Locator process terminated unexpectedly with exit status 1. Please
refer to the log file in
/home/alberto/geode/geode-releases/apache-geode-1.0.0/locator1 for full
details.

Exception in thread "main" org.apache.geode.InternalGemFireException:
Unable to recover previous membership view from
/home/alberto/geode/geode-releases/apache-geode-1.10.0/locator1/locator10334view.dat

at
org.apache.geode.distributed.internal.membership.gms.locator.GMSLocator.recoverFromFile(GMSLocator.java:492)
...
Caused by: java.io.StreamCorruptedException: invalid type code: 02

at
java.io.ObjectInputStream$BlockDataInputStream.readBlockHeader(ObjectInputStream.java:2871)
...

I think the problem is in the deserialization due to the fact that the
format of the locator's view file has changed between both Geode
versions after GEODE-7090.

This leads me to think that I might have been successful in the "rolling
downgrade" if I had selected other versions of Geode or I might have run
into a different set of problems.

After this research I would like to get some feedback from the community
on the following questions:

- Would it be reasonable to restrict future changes in Geode between
minor versions so that the rolling downgrade is supported? This would
imply that changes such as the one done in GEODE-7090 would not be
allowed for a minor version change.

- Could the changes in code and configuration I have done in my tests to
support the "rolling downgrade" have any negative secondary effects
which should dissuade us from using them?

Review for #4204

2019-10-28 Thread Alberto Gomez
Hi,

Could I ask for a review on https://github.com/apache/geode/pull/4204?

This PR is about GEODE-7157: 
(https://jira.apache.org/jira/browse/GEODE-7157).

Thanks,

/Alberto G.





Re: Review for #4204

2019-11-05 Thread Alberto Gomez
Hi,

Any volunteer to merge this PR that has already been approved?

Thanks,

/Alberto G.

On 28/10/19 16:05, Alberto Gómez wrote:
> Hi,
>
> Could I ask for a review on https://github.com/apache/geode/pull/4204?
>
> This PR is about GEODE-7157: 
> (https://jira.apache.org/jira/browse/GEODE-7157).
>
> Thanks,
>
> /Alberto G.
>
>
>


Reviewers wanted for GEODE-7509: Fix memory leaks in C++ client

2019-11-29 Thread Alberto Gomez
Hi,

Could I get some reviewers for https://github.com/apache/geode-native/pull/556?
GEODE-7509: Fix memory leaks in C++ client by albertogpz · Pull Request #556 · 
apache/geode-native
This commit solves some more memory leaks and uninitialized memory accesses 
found in the C++ client when running the integration tests that were not solved 
by GEODE-7476


Thanks in advance,

/Alberto G.


Re: upgrade of geode version

2019-12-12 Thread Alberto Gomez
Hi Yossi,

Version 1.4 of the cpp native client should work with Geode version 1.10. 
Client/server backward compatibility is a requirement in Geode as described 
here:

https://cwiki.apache.org/confluence/display/GEODE/Managing+Backward+Compatibility

-
Client/server backward compatibility

Client/server backward compatibility must work for all releases. All previous 
versions of clients must be supported because they may be embedded in 
applications that users would find difficult, if not impossible, to track down 
and upgrade.

-


Nevertheless, it is always advisable to use the client in the latest release 
available, so if it is possible for you to upgrade the clients, the 
recommendation would be to upgrade to cpp native client version 1.10.

Best regards,

/Alberto


From: Yossi Reginiano 
Sent: Thursday, December 12, 2019 9:24 AM
To: dev@geode.apache.org 
Cc: u...@geode.apache.org 
Subject: upgrade of geode version


Hi team,



We are currently using geode version 1.4 and wish to upgrade to version 1.10

We are still using the pre-modernization cpp client geode-native (that was 
released on the 8th of May 2017); can we still work with this client after the 
upgrade to 1.10?

If not – which cpp client version is suitable?



Thanks,

Yossi Reginiano



This email and the information contained herein is proprietary and confidential 
and subject to the Amdocs Email Terms of Service, which you may review at 
https://www.amdocs.com/about/email-terms-of-service


Issues with TransactionIds managed by CacheTransactionManager in C++ native client

2019-12-13 Thread Alberto Gomez
Hi,

I have created a ticket with some issues I have found related to TransactionIds 
managed by CacheTransactionManager in the C++ native client.

https://issues.apache.org/jira/browse/GEODE-7573

In it, I also propose some solutions to the issues found.

I'd appreciate if someone could review the analysis to see if I am in the right 
track or if I am missing something.

Thanks in advance,

Alberto


Reviewer for GEODE-7534: Add example for query with bind params (documentation)

2019-12-13 Thread Alberto Gomez
Hi,

I'd appreciate an extra reviewer (I already had one, thanks @Dave Barnes) and, 
if everything is OK, someone to merge the following pull request:

https://github.com/apache/geode/pull/4452

Thanks,

Alberto



Re: Issues with TransactionIds managed by CacheTransactionManager in C++ native client

2019-12-20 Thread Alberto Gomez
Hi Blake,

Are you sure that the TransactionId is always set to -1?

We have used the C++ native client transactions and they apparently worked.

/Alberto G.


From: Blake Bender 
Sent: Friday, December 13, 2019 6:17 PM
To: dev@geode.apache.org 
Subject: Re: Issues with TransactionIds managed by CacheTransactionManager in 
C++ native client

Transactions are an area of the codebase I've never dealt with, so I'm sure
most/all of what you mention is true.  In particular, the only thing I know
about TransactionId is that it's always set to -1, so functionally it's
essentially useless.  I'm not certain what all of the implications will be
if suddenly we ascribe meaning to it, so I'll let some folks with more
native client history than I chime in.

Thanks,

Blake


On Fri, Dec 13, 2019 at 3:15 AM Alberto Gomez 
wrote:

> Hi,
>
> I have created a ticket with some issues I have found related to
> TransactionIds managed by CacheTransactionManager in the C++ native client.
>
> https://issues.apache.org/jira/browse/GEODE-7573
>
> In it, I also propose some solutions to the issues found.
>
> I'd appreciate if someone could review the analysis to see if I am in the
> right track or if I am missing something.
>
> Thanks in advance,
>
> Alberto
>


Re: [DISCUSSION] - ClassLoader Isolation

2020-02-26 Thread Alberto Gomez
Great proposal!

We have run into the problems mentioned in the RFC so it will be a very good 
addition to Geode.

+1

Get Outlook for Android

From: Udo Kohlmeyer 
Sent: Wednesday, February 26, 2020 7:10:42 PM
To: dev@geode.apache.org 
Subject: [DISCUSSION] - ClassLoader Isolation

Hi there Geode Dev's.

There is a new RFC proposal on ClassLoader Isolation. The review end
date is 13 Feb 2020.

https://cwiki.apache.org/confluence/display/GEODE/ClassLoader+Isolation


Please review and discuss in this thread.

--Udo



RFC - Gateway sender to deliver transaction events atomically to receivers

2020-03-25 Thread Alberto Gomez
Hi,

Could you please review the RFC for "Gateway sender to deliver transaction 
events atomically to receivers"?

https://cwiki.apache.org/confluence/display/GEODE/Gw+sender+to+deliver+transaction+events+atomically+to+receivers

Deadline for comments is Wednesday, April 1st, 2020,

Thanks,

Alberto G.


Re: RFC - Gateway sender to deliver transaction events atomically to receivers

2020-04-02 Thread Alberto Gomez
Hi,

Yesterday was the end date for comments for this RFC.

I tried to answer the questions that were sent and also address the concerns 
about the proposal.

The main concern was related to the reordering of events that could happen in 
the gateway sender in order to group events of the same transaction in the same 
batch. My conclusion was that even if some reordering could happen, it would 
not be incorrect, given that it would affect only events very close in time, 
and that some reordering of events can already happen between the time they are 
generated and the time they reach the sender's queue.

There was also a concern about adding a new field to each EntryEvent, which 
would increase the over-the-wire size for everyone. The need for a new 
attribute in EntryEvent has been removed; in the new version of the proposal 
it is only necessary to add the isLastTransactionEvent flag to the 
GatewaySenderEvent class.
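As a rough, self-contained sketch of the batch-completion rule implied by that flag: a batch is only ready for dispatch once every transaction that has contributed events to it has also contributed the event marked as its last one. The SenderEvent type and the grouping logic below are invented stand-ins for illustration, not Geode's actual GatewaySenderEvent or batching code.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TxBatchDemo {
    // Stand-in for a queued gateway sender event; txId is null for
    // non-transactional events.
    record SenderEvent(String txId, String key, boolean isLastTransactionEvent) {}

    // A batch is complete when no transaction in it is still "open", i.e.
    // every txId that appeared has also produced its last-event marker.
    static boolean isComplete(List<SenderEvent> batch) {
        Set<String> openTransactions = new HashSet<>();
        for (SenderEvent e : batch) {
            if (e.txId() == null) continue; // non-transactional: never blocks dispatch
            if (e.isLastTransactionEvent()) {
                openTransactions.remove(e.txId());
            } else {
                openTransactions.add(e.txId());
            }
        }
        return openTransactions.isEmpty();
    }

    public static void main(String[] args) {
        List<SenderEvent> partial = List.of(
            new SenderEvent("tx1", "k1", false),
            new SenderEvent("tx1", "k2", false));
        List<SenderEvent> full = List.of(
            new SenderEvent("tx1", "k1", false),
            new SenderEvent("tx1", "k2", true),
            new SenderEvent(null, "k3", false));
        System.out.println(isComplete(partial)); // false: tx1 still open
        System.out.println(isComplete(full));    // true: tx1 saw its last event
    }
}
```

This is also where the reordering concern comes from: holding a batch open until a transaction's last event arrives can move events relative to the order in which they were queued.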

Udo also raised some other, more general concerns which I do not know whether 
they have been resolved.

I would appreciate some more feedback so that I can either go ahead with the 
pull request, if the feedback is positive, or keep the discussion alive.

Thanks in advance,

Alberto G.



From: Barry Oglesby 
Sent: Thursday, March 26, 2020 7:34 PM
To: dev@geode.apache.org 
Subject: Re: RFC - Gateway sender to deliver transaction events atomically to 
receivers

I added some comments to the proposal. There a few concerns, but I like the
idea in general.

Dan said: I remember someone trying to accomplish this same thing on top of
geode
with TransactionListener that dumped into a separate region or something
like that.

I think both Charlie and I have implemented this idea a few times.

Here is the basic idea:

The data region defines a TransactionListener with an afterCommit that:

- creates a UnitOfWork object
- creates an Event for each CacheEvent in the TransactionEvent event that
contains:
  - regionName
  - operation
  - key
  - value
  - potentially other things like EventID, VersionTag, TXId
- puts the UnitOfWork into a transaction region that has a gateway sender
attached to it. It also has a CacheWriter attached to it.

On the remote site, the CacheWriter attached to the transaction region:

- begins a transaction
- iterates the UnitOfWork's Events and executes each one
- commits the transaction
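The listener/writer pair above can be sketched with plain-Java stand-ins. The Event and UnitOfWork types below are illustrative only; a real implementation would use Geode's TransactionListener, TransactionEvent, and CacheWriter interfaces, and the "replay" here just collects the operations instead of executing them inside a real transaction.

```java
import java.util.ArrayList;
import java.util.List;

public class UnitOfWorkDemo {
    // Stand-ins for the per-event data Barry lists (regionName, operation,
    // key, value; EventID/VersionTag/TXId omitted for brevity).
    record Event(String regionName, String operation, String key, String value) {}
    record UnitOfWork(String txId, List<Event> events) {}

    // Sending side: what the afterCommit callback would build from a
    // TransactionEvent before putting it into the transaction region.
    static UnitOfWork fromCommit(String txId, List<Event> cacheEvents) {
        return new UnitOfWork(txId, List.copyOf(cacheEvents));
    }

    // Receiving side: what the CacheWriter would do -- begin a transaction,
    // apply each event, commit. Here we only record the applied operations.
    static List<String> replay(UnitOfWork uow) {
        List<String> applied = new ArrayList<>();
        // begin transaction (stand-in)
        for (Event e : uow.events()) {
            applied.add(e.operation() + " " + e.regionName() + ":" + e.key());
        }
        // commit transaction (stand-in)
        return applied;
    }

    public static void main(String[] args) {
        UnitOfWork uow = fromCommit("tx-1", List.of(
            new Event("orders", "PUT", "o1", "v1"),
            new Event("orders", "PUT", "o2", "v2")));
        replay(uow).forEach(System.out::println); // PUT orders:o1, PUT orders:o2
    }
}
```

Note that the race condition Barry describes sits outside this sketch: the gap between the data-region commit and fromCommit running in the listener is where a crash loses the UnitOfWork.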

There are definitely some caveats to this:

- There is a race condition between the commit in the data region and the
TransactionListener afterCommit invocation doing the put into the
transaction region. If the server crashes after the put into the data
region but before the afterCommit callback, there will be data loss. In
that case, the transaction in question will not have been stored in the
transaction region and not be sent to the remote site.
- Ideally, the data and transaction regions should be colocated, but that
is tricky.
- What happens if a transaction fails in the remote site?
- The transaction region has to be cleared periodically.
- Knowing when to process the transaction in the CacheWriter is a bit
tricky. It only needs to happen for transactions that originated remotely.
Adding distributed system id to the UnitOfWork is one way to address this.

Thanks,
Barry Oglesby



On Thu, Mar 26, 2020 at 7:34 AM Jacob Barrett  wrote:

> Great idea. I called out some similar areas of concerns and spit balled
> some solutions to get the conversations flowing.
>
> -Jake
>
>
> > On Mar 25, 2020, at 8:04 AM, Alberto Gomez 
> wrote:
> >
> > Hi,
> >
> > Could you please review the RFC for "Gateway sender to deliver
> transaction events atomically to receivers"?
> >
> >
> https://cwiki.apache.org/confluence/display/GEODE/Gw+sender+to+deliver+transaction+events+atomically+to+receivers
> >
> > Deadline for comments is Wednesday, April 1st, 2020,
> >
> > Thanks,
> >
> > Alberto G.
>
>


Help with Concourse-CI redis integration test failing

2020-04-11 Thread Alberto Gomez
Hi,

I am seeing the following integration test case fail in the Concourse-CI when 
executed on my pull request (https://github.com/apache/geode/pull/4928):

org.apache.geode.redis.SetsIntegrationTest > 
testSRem_should_notRemoveMembersOfSetNotSpecified FAILED

Interestingly, if I run the test locally it passes. Also, the line of code 
where the failure occurs does not point to a valid line in the code I see:

at 
org.apache.geode.redis.SetsIntegrationTest.testSRem_should_notRemoveMembersOfSetNotSpecified(SetsIntegrationTest.java:784

Any idea on what could be happening?

Thanks in advance,

-Alberto G.


Re: Help with Concourse-CI redis integration test failing

2020-04-11 Thread Alberto Gomez
Thanks Owen.

It seems you already initiated the revert.

Best regards,

-Alberto

From: Owen Nichols 
Sent: Saturday, April 11, 2020 8:04 PM
To: dev@geode.apache.org 
Subject: Re: Help with Concourse-CI redis integration test failing

Hi Alberto, it looks like Integration tests have been broken since PR#4937 was 
merged at the end of the day Friday.  However this PR had passed all its PR 
checks (as of 5 hours before it was merged) so I suspect this was just an 
unexpected interaction with the PR#4912 that came in just 3 hours before it).

In general there is an expectation in the Geode community that when a commit 
breaks the pipeline, it should be reverted first, then fixed.

Alberto, since I don’t think you’re a committer, how about you initiate the 
revert and I’ll approve it?  GitHub makes this pretty easy: go to 
https://github.com/apache/geode/pull/4937 and click Revert near the 
bottom-right of the page then click Create pull request.

> On Apr 11, 2020, at 10:01 AM, Alberto Gomez  wrote:
>
> Hi,
>
> I am seeing the following integration test case fail in the Concourse-CI when 
> executed on my pull request (https://github.com/apache/geode/pull/4928):
>
> org.apache.geode.redis.SetsIntegrationTest > 
> testSRem_should_notRemoveMembersOfSetNotSpecified FAILED
>
> Interestingly, if I run the test locally it passes. Also, the line of code 
> where the failure occurs does not point to a valid line in the code I see:
>
> at 
> org.apache.geode.redis.SetsIntegrationTest.testSRem_should_notRemoveMembersOfSetNotSpecified(SetsIntegrationTest.java:784
>
> Any idea on what could be happening?
>
> Thanks in advance,
>
> -Alberto G.



Re: Help with Concourse-CI redis integration test failing

2020-04-11 Thread Alberto Gomez
I clicked on the Create Pull Request button but got the following error:

Pull request creation failed. Validation failed: You can't perform that action 
at this time.

-Alberto


From: Owen Nichols 
Sent: Saturday, April 11, 2020 8:35 PM
To: dev@geode.apache.org 
Subject: Re: Help with Concourse-CI redis integration test failing

Looks like GitHub creates a branch for the revert as soon as someone clicks 
Revert (I didn’t mean to, I just wanted to see what the next screen looks like, 
I didn’t click Create Pull Request).

If I create the pull request, then we need to find a 3rd person to approve it, 
so hoping you can still do it...

Alberto, will it let you click "Create pull request" from here:
https://github.com/apache/geode/compare/revert-4937-feature_parity_redis_srem_command

> On Apr 11, 2020, at 11:09 AM, Alberto Gomez  wrote:
>
> Thanks Owen.
>
> It seems you already initiated the revert.
>
> Best regards,
>
> -Alberto
> 
> From: Owen Nichols 
> Sent: Saturday, April 11, 2020 8:04 PM
> To: dev@geode.apache.org 
> Subject: Re: Help with Concourse-CI redis integration test failing
>
> Hi Alberto, it looks like Integration tests have been broken since PR#4937 
> was merged at the end of the day Friday.  However this PR had passed all its 
> PR checks (as of 5 hours before it was merged) so I suspect this was just an 
> unexpected interaction with the PR#4912 that came in just 3 hours before it).
>
> In general there is an expectation in the Geode community that when a commit 
> breaks the pipeline, it should be reverted first, then fixed.
>
> Alberto, since I don’t think you’re a committer, how about you initiate the 
> revert and I’ll approve it?  GitHub makes this pretty easy: go to 
> https://github.com/apache/geode/pull/4937 and click Revert near the 
> bottom-right of the page then click Create pull request.
>
>> On Apr 11, 2020, at 10:01 AM, Alberto Gomez  wrote:
>>
>> Hi,
>>
>> I am seeing the following integration test case fail in the Concourse-CI 
>> when executed on my pull request (https://github.com/apache/geode/pull/4928):
>>
>> org.apache.geode.redis.SetsIntegrationTest > 
>> testSRem_should_notRemoveMembersOfSetNotSpecified FAILED
>>
>> Interestingly, if I run the test locally it passes. Also, the line of code 
>> where the failure occurs does not point to a valid line in the code I see:
>>
>> at 
>> org.apache.geode.redis.SetsIntegrationTest.testSRem_should_notRemoveMembersOfSetNotSpecified(SetsIntegrationTest.java:784
>>
>> Any idea on what could be happening?
>>
>> Thanks in advance,
>>
>> -Alberto G.
>



Re: Help with Concourse-CI redis integration test failing

2020-04-11 Thread Alberto Gomez
Owen, I managed to create the revert pull request from the original pull request page.

https://github.com/apache/geode/pull/4947
Revert "GEODE-7978: Improve tests for Redis Module SREM Command" by albertogpz 
· Pull Request #4947 · apache/geode (reverts #4937)


-Alberto

________
From: Alberto Gomez 
Sent: Saturday, April 11, 2020 8:49 PM
To: dev@geode.apache.org 
Subject: Re: Help with Concourse-CI redis integration test failing

I clicked on the Create Pull Request button but got the following error:

Pull request creation failed. Validation failed: You can't perform that action 
at this time.

-Alberto


From: Owen Nichols 
Sent: Saturday, April 11, 2020 8:35 PM
To: dev@geode.apache.org 
Subject: Re: Help with Concourse-CI redis integration test failing

Looks like GitHub creates a branch for the revert as soon as someone clicks 
Revert (I didn’t mean to, I just wanted to see what the next screen looks like, 
I didn’t click Create Pull Request).

If I create the pull request, then we need to find a 3rd person to approve it, 
so hoping you can still do it...

Alberto, will it let you click "Create pull request" from here:
https://github.com/apache/geode/compare/revert-4937-feature_parity_redis_srem_command

> On Apr 11, 2020, at 11:09 AM, Alberto Gomez  wrote:
>
> Thanks Owen.
>
> It seems you already initiated the revert.
>
> Best regards,
>
> -Alberto
> 
> From: Owen Nichols 
> Sent: Saturday, April 11, 2020 8:04 PM
> To: dev@geode.apache.org 
> Subject: Re: Help with Concourse-CI redis integration test failing
>
> Hi Alberto, it looks like Integration tests have been broken since PR#4937 
> was merged at the end of the day Friday.  However this PR had passed all its 
> PR checks (as of 5 hours before it was merged) so I suspect this was just an 
> unexpected interaction with the PR#4912 that came in just 3 hours before it).
>
> In general there is an expectation in the Geode community that when a commit 
> breaks the pipeline, it should be reverted first, then fixed.
>
> Alberto, since I don’t think you’re a committer, how about you initiate the 
> revert and I’ll approve it?  GitHub makes this pretty easy: go to 
> https://github.com/apache/geode/pull/4937 and click Revert near the 
> bottom-right of the page then click Create pull request.
>
>> On Apr 11, 2020, at 10:01 AM, Alberto Gomez  wrote:
>>
>> Hi,
>>
>> I am seeing the following integration test case fail in the Concourse-CI 
>> when executed on my pull request (https://github.com/apache/geode/pull/4928):
>>
>> org.apache.geode.redis.SetsIntegrationTest > 
>> testSRem_should_notRemoveMembersOfSetNotSpecified FAILED
>>
>> Interestingly, if I run the test locally it passes. Also, the line of code 
>> where the failure occurs does not point to a valid line in the code I see:
>>
>> at 
>> org.apache.geode.redis.SetsIntegrationTest.testSRem_should_notRemoveMembersOfSetNotSpecified(SetsIntegrationTest.java:784
>>
>> Any idea on what could be happening?
>>
>> Thanks in advance,
>>
>> -Alberto G.
>



Reviewers for GEODE-7971 Gateway sender to deliver transaction events atomically to gateway receivers

2020-04-14 Thread Alberto Gomez
Hi,

Could I ask for a review on https://github.com/apache/geode/pull/4928?

This PR is about "Gateway sender to deliver transaction events atomically to 
gateway receivers" (https://issues.apache.org/jira/browse/GEODE-7971).


Thanks,

/Alberto G.



Re: Reviewers for GEODE-7971 Gateway sender to deliver transaction events atomically to gateway receivers

2020-04-16 Thread Alberto Gomez
Hi,

Friendly reminder - No reviewers yet for this PR.

-Alberto G.

From: Alberto Gomez 
Sent: Tuesday, April 14, 2020 3:34 PM
To: dev@geode.apache.org 
Subject: Reviewers for GEODE-7971 Gateway sender to deliver transaction events 
atomically to gateway receivers

Hi,

Could I ask for a review on https://github.com/apache/geode/pull/4928?

This PR is about "Gateway sender to deliver transaction events atomically to 
gateway receivers" (https://issues.apache.org/jira/browse/GEODE-7971).


Thanks,

/Alberto G.



About Geode rolling downgrade

2020-04-16 Thread Alberto Gomez
Hi,

Some months ago I posted a question on this list (see [1]) about the 
possibility of supporting "rolling downgrade" in Geode in order to downgrade a 
Geode system to an older version, similar to the "rolling upgrade" currently 
supported.
With your answers and my investigations, my conclusion was that the main 
stumbling block to supporting "rolling downgrades" was persistent file 
compatibility, which was very hard to achieve because old members would need 
to be prepared to read newer versions of persistent files.

We have come up with a new approach to support rolling downgrades in Geode 
which consists of the following procedure:

- For each locator:
  - Stop locator
  - Remove locator files
  - Start locator in older version

- For each server:
  - Stop server
  - Remove server files
  - Revoke missing-disk-stores for server
  - Start server in older version

Some extra details about this procedure:
- The starting and stopping of processes may not be possible using gfsh, as 
gfsh does not allow managing members running a version different from its own.
- Redundancy in servers is required
- More than one locator is required
- The allow_old_members_to_join_for_testing needs to be passed to the members.

I would like to ask two questions regarding this procedure:
- Do you see any issue not considered by this procedure or any alternative to 
it?
- Would it be reasonable to make the 
"allow_old_members_to_join_for_testing" parameter public (with a new name) so 
that it might be a valid option for production systems to support, for example, 
the procedure proposed?

Thanks in advance for your answers.

Best regards,

-Alberto G.


[1]
 
http://mail-archives.apache.org/mod_mbox/geode-dev/201910.mbox/%3cb080e98c-5df4-e494-dcbd-383f6d979...@est.tech%3E


Re: About Geode rolling downgrade

2020-04-17 Thread Alberto Gomez
Hi Bruce,

Thanks a lot for your answer. We had not thought about the changes in 
distributed algorithms when analyzing rolling downgrades.

Rolling downgrade is a pretty important requirement for our customers, so we 
would not like to close the discussion here; instead, we would like to see if 
it is still reasonable to propose it for Geode, perhaps relaxing the 
expectations a bit and clarifying some things.

First, I think supporting rolling downgrade does not mean making it impossible 
to upgrade distributed algorithms. It means that you need to support the new 
and the old algorithms (just as it is done today with rolling upgrades) in the 
upgraded version and also support the possibility of switching to the old 
algorithm in a fully upgraded system.

Second of all, I would say it is not very common to upgrade distributed 
algorithms, or at least, it does not seem to have been the case so far in 
Geode. Therefore, the burden of adding the logic to support the rolling 
downgrade would not be something to be carried in every release. In my opinion, 
it will be some extra percentage of work to be added to the work to support the 
rolling upgrade of the algorithm as the rolling downgrade will probably be 
using the mechanisms implemented for the rolling upgrade.

Third, we do not need to support the rolling downgrade from any release to any 
other older release. We could just support the rolling downgrade (at least when 
distributed algorithms are changed) between consecutive versions. They could be 
considered special cases like those when it is required to provide a tool to 
convert files in order to assure compatibility.

-Alberto



From: Bruce Schuchardt 
Sent: Thursday, April 16, 2020 5:04 PM
To: dev@geode.apache.org 
Subject: Re: About Geode rolling downgrade

-1

Another reason that we should not support rolling downgrade is that it makes it 
impossible to upgrade distributed algorithms.

When we added rolling upgrade support we pretty much immediately ran into a 
distributed hang when a test started a Locator using an older version.  In that 
release we also introduced the cluster configuration service and along with 
that we needed to upgrade the distributed lock service's notion of the "elder" 
member of the cluster.  Prior to that change a Locator could not fill this 
role, but the CCS needed to be able to use locking and needed a Locator to be 
able to fill this role.  During upgrade we used the old "elder" algorithm but 
once the upgrade was finished we switched to the new algorithm.  If you 
introduced an older Locator into this upgraded cluster it wouldn't think that 
it should be the "elder" but the rest of the cluster would expect it to be the 
elder.

You could support rolling downgrade in this scenario with extra logic and extra 
testing, but I don't think that will always be the case.  Rolling downgrade 
support would place an immense burden on developers in extra development and 
testing in order to ensure that older algorithms could always be brought back 
on-line.

On 4/16/20, 4:24 AM, "Alberto Gomez"  wrote:

Hi,

Some months ago I posted a question on this list (see [1]) about the 
possibility of supporting "rolling downgrade" in Geode in order to downgrade a 
Geode system to an older version, similar to the "rolling upgrade" currently 
supported.
With your answers and my investigations, my conclusion was that the main 
stumbling block to supporting "rolling downgrades" was the compatibility of 
persistent files, which was very hard to achieve because old members would 
need to be prepared to support newer versions of persistent files.

We have come up with a new approach to support rolling downgrades in Geode 
which consists of the following procedure:

- For each locator:
  - Stop locator
  - Remove locator files
  - Start locator in older version

- For each server:
  - Stop server
  - Remove server files
  - Revoke missing-disk-stores for server
  - Start server in older version

Some extra details about this procedure:
- The starting and stopping of processes may not be possible using gfsh, as 
gfsh does not allow managing members on a version different from its own.
- Redundancy in servers is required
- More than one locator is required
- The allow_old_members_to_join_for_testing parameter needs to be passed to 
the members.

I would like to ask two questions regarding this procedure:
- Do you see any issue not considered by this procedure or any alternative 
to it?
- Would it be reasonable to make the 
"allow_old_members_to_join_for_testing" parameter public (with a new name) so 
that it might be a valid option for production systems to support, for 
example, the procedure proposed?

Thanks in advance for your answers.

Best regards,

-Alberto G.


[1]
 
http

Re: Reviewers for GEODE-7971 Gateway sender to deliver transaction events atomically to gateway receivers

2020-04-24 Thread Alberto Gomez
Hi again,

The review for this PR is still pending.

I've seen that several people have been appointed to do it but there's been no 
progress yet. Could you please check it?

Thanks in advance,

-Alberto G.
____
From: Alberto Gomez 
Sent: Thursday, April 16, 2020 9:15 AM
To: dev@geode.apache.org 
Subject: Re: Reviewers for GEODE-7971 Gateway sender to deliver transaction 
events atomically to gateway receivers

Hi,

Friendly reminder - No reviewers yet for this PR.

-Alberto G.
____
From: Alberto Gomez 
Sent: Tuesday, April 14, 2020 3:34 PM
To: dev@geode.apache.org 
Subject: Reviewers for GEODE-7971 Gateway sender to deliver transaction events 
atomically to gateway receivers

Hi,

Could I ask for a review on https://github.com/apache/geode/pull/4928?

This PR is about "Gateway sender to deliver transaction events atomically to 
gateway receivers" (https://issues.apache.org/jira/browse/GEODE-7971).


Thanks,

/Alberto G.



Re: About Geode rolling downgrade

2020-05-07 Thread Alberto Gomez
Hi again,


Considering that Geode does not support online rollback for the time being, and 
since we need to be able to roll back even a standalone system, we were 
thinking of a procedure to downgrade a Geode cluster tolerating downtime, but 
without a need to:

  *   spin another cluster to sync from,
  *   do a restore or
  *   import data snapshot.



The procedure we came up with is:

  1.  First step - downgrade locators:

 *   While still on the newer version, export cluster configuration.
 *   Shutdown all locators. Existing clients will continue using their 
server connections. New clients/connections are not possible.
 *   Start new locators using the old SW version and import cluster 
configuration. They will form a new cluster. Existing client connections should 
still work, but new client connections are not yet possible (no servers 
connected to locators).

  2.  Second step – downgrade servers:

 *   First shutdown all servers in parallel. This marks the beginning of 
total downtime.
 *   Now start all servers in parallel but still on the new software 
version. Servers connect to the cluster formed by the downgraded locators. When 
servers are up, downtime ends. New client connections are possible. The rest of 
the rollback should be fully online.
 *   Now per server:

        i.  Shut it down, revoke its disk-stores and delete its file system.

        ii. Start the server using the old SW version. When up, the server will
            take over the cluster configuration and pick up replicated data and
            partitioned-region buckets satisfying region redundancy
            (essentially it will hold exactly the same data the previous server
            had).



The above has some important prerequisites:

  1.  Partitioned regions have redundancy and region configuration allows 
recovery as described above.
  2.  Client versions allow connection to both the new and the old cluster - 
i.e. clients must not use a newer version at the moment the procedure starts.
  3.  Geode guarantees cluster configuration exported from a newer system can 
be imported into an older system. In case of incompatibility, I expect we could 
even manually edit the configuration to adapt it to the older system, but it is 
an open question how new servers will react when they connect (in step 2b).
  4.  Geode guarantees communication between peers with different SW version 
works and recovery of region data works.



Could we have opinions on this offline procedure? It seems to work well but 
probably has caveats we do not see at the moment.



What about prerequisites 3 and 4? They are valid in the upgrade case, but we 
are not sure whether they hold in this rollback case.


Best regards,


-Alberto G.


From: Anilkumar Gingade 
Sent: Thursday, April 23, 2020 12:59 AM
To: geode 
Subject: Re: About Geode rolling downgrade

That's right; the no-downtime requirement is almost always managed by having
replicated cluster setups (disaster-recovery/backup site). The data is
either pushed to both systems through the data ingesters or by using a WAN
setup.
The clusters are upgraded one at a time. If there is a failure during
upgrade or needs to be rolled back; one system will be always up
and running.

-Anil.





On Wed, Apr 22, 2020 at 1:51 PM Anthony Baker  wrote:

> Anil, let me see if I understand your perspective by stating it this way:
>
> In cases where 100% uptime is a requirement, users are almost always
> running a disaster recovery site.  It could be active/active or
> active/standby but there are already at least 2 clusters with current
> copies of the data.  If an upgrade goes badly, the clusters can be
> downgraded one at a time without loss of availability.  This is because we
> ensure compatibility across the wan protocol.
>
> Is that correct?
>
>
> Anthony
>
>
>
> > On Apr 22, 2020, at 10:43 AM, Anilkumar Gingade 
> wrote:
> >
> >>> Rolling downgrade is a pretty important requirement for our customers
> >>> I'd love to hear what others think about whether this feature is worth
> > the overhead of making sure downgrades can always work.
> >
> > I/We haven't seen users/customers requesting rolling downgrade as a
> > critical requirement for them; most of the time they had both an old and
> > new setup to upgrade or switch back to an older setup.
> > Considering the amount of work involved, and code complexity it brings
> in;
> > while there are ways to downgrade, it is hard to justify supporting this
> > feature.
> >
> > -Anil.
>
>


Re: About Geode rolling downgrade

2020-05-14 Thread Alberto Gomez
Hi,

A friendly reminder to the community about this request for feedback.

Thanks,

-Alberto G.



Question about version checks inside fromData method in GatewaySenderEventImpl

2020-05-19 Thread Alberto Gomez
Hi,

Looking at the fromData method of GatewaySenderEventImpl I see that it contains 
a conditional reading of isConcurrencyConflict when the version is at least the 
Geode 1.9.0 one. See below:

  @Override
  public void fromData(DataInput in,
      DeserializationContext context) throws IOException, ClassNotFoundException {
    fromDataPre_GEODE_1_9_0_0(in, context);
    if (version >= Version.GEODE_1_9_0.ordinal()) {
      this.isConcurrencyConflict = DataSerializer.readBoolean(in);
    }
  }

I have looked at the implementation of this method in other classes and have 
not seen this version-checking pattern. I have also observed that if the 
"if" is removed, some backward compatibility tests fail.

Could anybody tell me why this check (the if) is necessary given that there is 
already a fromDataPre_GEODE_1_9_0 method in the class?

Thanks in advance,

-Alberto G.


Re: Question about version checks inside fromData method in GatewaySenderEventImpl

2020-05-19 Thread Alberto Gomez
Hi Juan,

Thanks for your answer.

According to 
https://cwiki.apache.org/confluence/display/GEODE/Managing+Backward+Compatibility
 there are two ways to manage backward compatibility for classes that implement 
SerializationVersions.

Either implementing `toDataPre/fromDataPre` methods that Data Serialization 
will invoke based on the version of the sender/receiver (preferred way), or 
using `fromData/toData` methods using 
`InternalDataSerializer.getVersionForDataStream`.
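
As an illustration of that preferred pattern, here is a minimal standalone sketch using only JDK streams. The class, fields, method names and version ordinals are invented stand-ins for GatewaySenderEventImpl (this is not real Geode code, and real Geode ordinals differ): toDataPre_X writes the old wire format, and toData appends newer fields only when the remote peer is at least at the version that introduced them.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class VersionedSerializationSketch {
  static final int VERSION_1_8 = 180; // invented version ordinals
  static final int VERSION_1_9 = 190;

  static class GatewayEvent {
    String key;
    boolean isConcurrencyConflict; // field added in "1.9"

    // Write the old wire format first, then gate newer fields on the peer's version.
    void toData(DataOutput out, int remoteVersion) throws IOException {
      toDataPre_1_9(out);
      if (remoteVersion >= VERSION_1_9) {
        out.writeBoolean(isConcurrencyConflict);
      }
    }

    void fromData(DataInput in, int remoteVersion) throws IOException {
      fromDataPre_1_9(in);
      if (remoteVersion >= VERSION_1_9) {
        isConcurrencyConflict = in.readBoolean();
      }
    }

    void toDataPre_1_9(DataOutput out) throws IOException {
      out.writeUTF(key); // the pre-1.9 wire format: just the key
    }

    void fromDataPre_1_9(DataInput in) throws IOException {
      key = in.readUTF();
    }
  }

  public static void main(String[] args) throws IOException {
    GatewayEvent event = new GatewayEvent();
    event.key = "k1";
    event.isConcurrencyConflict = true;

    for (int peerVersion : new int[] {VERSION_1_8, VERSION_1_9}) {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      event.toData(new DataOutputStream(bytes), peerVersion);
      GatewayEvent read = new GatewayEvent();
      read.fromData(
          new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())), peerVersion);
      // The flag survives only when both sides are on at least "1.9".
      System.out.println(peerVersion + " -> " + read.key + "," + read.isConcurrencyConflict);
    }
  }
}
```

Running it shows the new field is round-tripped for a "1.9" peer but silently dropped for a "1.8" peer, which is exactly the behavior the version gate is buying.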

In the case of this class, we have `toDataPre/fromDataPre` methods implemented 
so, according to what is described in the wiki, it should not be necessary to 
add any extra check to the `fromData/toData` methods. But there is the check I 
mentioned, which is necessary according to some backward compatibility tests in 
Geode. So my question is: why is it necessary?

Thanks,

-Alberto



From: Ju@N 
Sent: Tuesday, May 19, 2020 2:54 PM
To: dev@geode.apache.org 
Subject: Re: Question about version checks inside fromData method in 
GatewaySenderEventImpl

Hello Alberto,

It looks like the property *isConcurrencyConflict* was added as part of
*GEODE-3967* [1] and released in Geode 1.9.0; that seems to be the reason
why the check is in place: if we get an instance of
*GatewaySenderEventImpl* from a member running version 1.9.0 or higher,
then we are 100% sure that the serialized form will contain the new field,
so we can parse it; if the serialized *GatewaySenderEventImpl* comes from
an older member, the field won't be there, so we don't even try to parse it.
Hope I didn't miss anything.
Cheers.

[1]: https://issues.apache.org/jira/browse/GEODE-3967



--
Ju@N


Questions about patch releases and changes in serialization versions / messages

2020-05-22 Thread Alberto Gomez
Hi,

The recently approved RFC about patch releases 
(https://cwiki.apache.org/confluence/display/GEODE/Shipping+patch+releases) 
says the following about what changes should and should not be backported to a 
support branch:

What changes should be back ported to a support branch?

The community will exercise good judgement in the same way that important 
changes are cherry-picked onto release branches prior to shipping a new 
release.  Fixes related to data safety and consistency, cluster stability, or 
API behaviors are good candidates to be considered.

What changes should NOT be back ported to a support branch?

New features, refactoring changes, or less important and non-critical bug 
fixes.  Of course, you are always free to advocate within the community and 
state your case!

This raises the question of whether changes that fall into the first category 
according to the above guidelines, but that also contain changes in Data 
Serialization (for example, adding/removing fields from a DataSerializable 
class), would still be allowed to be backported to a support branch.

If the answer is affirmative, I wonder if/how the backward compatibility could 
be guaranteed between a newer release and a patch release in the case that both 
included the same Data Serialization change.

Example:
Imagine that 1.13.0 includes a change that adds a new field to a 
DataSerializable class. In order to support backward compatibility, the change 
will include the implementation of the corresponding fromDataPre_1_13_0 and 
toDataPre_1_13_0 methods.

Now, let's assume that this change is decided to be backported to previous 
patch releases, for example 1.12.0.1. The cherry-picked commit will need to be 
changed so that the above methods are renamed to fromDataPre_1_12_0_1 and 
toDataPre_1_12_0_1.

Problems could arise nevertheless when a Geode system on version 1.12.0.1 
(patch release) is upgraded to 1.13.0. Both Geode versions will think that the 
new field was added in their version. As a result, when a peer on version 
1.13.0 sends an instance of the modified class to a peer on version 1.12.0.1, 
it will not send the new field but, the peer on version 1.12.0.1 will expect 
it. If, on the other hand, a peer on version 1.12.0.1 sends an instance of the 
modified class to a peer on version 1.13.0, the peer on version 1.13.0 will not 
read the new field even if it was sent by the peer on 1.12.0.1.
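
The asymmetry described above can be reproduced in a self-contained sketch. The ordinals, method names and payload are invented for illustration (real Geode version ordinals and serialization code differ): both branches carry the same new field, but each believes the field first appeared in *its own* version, so the two sides gate the extra byte differently and disagree on the wire format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

public class PatchBackportMismatch {
  static final int V_1_12_0_1 = 121; // hypothetical ordinals: patch release
  static final int V_1_13_0 = 130;   // minor release

  // Sender on 1.13.0: writes the new field only to peers >= 1.13.0.
  static byte[] writeFrom_1_13_0(int remoteVersion, boolean newField) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeUTF("payload");
    if (remoteVersion >= V_1_13_0) {
      out.writeBoolean(newField);
    }
    return bytes.toByteArray();
  }

  // Receiver on 1.12.0.1: expects the new field from peers >= 1.12.0.1.
  static void readOn_1_12_0_1(byte[] data, int remoteVersion) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
    in.readUTF();
    if (remoteVersion >= V_1_12_0_1) {
      in.readBoolean(); // the sender never wrote this byte -> EOFException
    }
  }

  public static void main(String[] args) throws IOException {
    // The 1.13.0 sender sees an "older" peer and omits the field...
    byte[] data = writeFrom_1_13_0(V_1_12_0_1, true);
    try {
      // ...while the 1.12.0.1 receiver sees a "newer" peer and expects it.
      readOn_1_12_0_1(data, V_1_13_0);
      System.out.println("read ok");
    } catch (EOFException e) {
      System.out.println("deserialization failed: missing field");
    }
  }
}
```

Running the sketch hits the EOFException, which is the concrete failure mode the question is pointing at when the same serialization change lands on two branches with different version gates.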

Is there anything I am missing in my reasoning?
Has this case been contemplated?
Should these changes be prevented from support branches to avoid these problems?

Thanks in advance,

-Alberto G.


Re: Questions about patch releases and changes in serialization versions / messages

2020-05-22 Thread Alberto Gomez
Thanks for the quick answer, Owen.

Is this information published anywhere?

Cheers,

-Alberto G.

From: Owen Nichols 
Sent: Friday, May 22, 2020 5:56 PM
To: dev@geode.apache.org 
Subject: Re: Questions about patch releases and changes in serialization 
versions / messages

Serialization changes are only permitted in new minor releases (x.y.0).




CI concourse checks on a PR not triggered

2020-05-24 Thread Alberto Gomez
Hi,

Since last Friday, the concourse checks for the following PR are not being 
triggered:

https://github.com/apache/geode/pull/4928

I have tried to launch them by pushing empty commits but have not been 
successful

Could anybody give me a hand?

Thanks in advance,

-Alberto G.




Re: CI concourse checks on a PR not triggered

2020-05-25 Thread Alberto Gomez
Thanks a lot!

It is now working.

-Alberto

From: Owen Nichols 
Sent: Monday, May 25, 2020 9:03 AM
To: dev@geode.apache.org 
Subject: Re: CI concourse checks on a PR not triggered

Looks like a merge conflict: 
https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-pr/jobs/Build/builds/7594

Unfortunately the PR pipeline is unable to post statuses to the PR when there 
is a conflict.

Please rebase your PR branch to latest develop to clear this up.






Problem in rolling upgrade since 1.12

2020-06-06 Thread Alberto Gomez
Hi,

I have observed that since version 1.12, rolling upgrades to newer versions 
leave the first upgraded locator "as if" it were still on version 1.12.

This is the output from "list members" before starting the upgrade from version 
1.12:

Name | Id
---- | ---------------------------------------------------
vm2  | 192.168.0.37(vm2:29367:locator):41001 [Coordinator]
vm0  | 192.168.0.37(vm0:29260):41002
vm1  | 192.168.0.37(vm1:29296):41003


And this is the output from "list members" after upgrading the first locator 
from 1.12 to 1.13/1.14:

Name | Id
---- | --------------------------------------------------------------------------
vm2  | 192.168.0.37(vm2:1453:locator):41001(version:GEODE 1.12.0) [Coordinator]
vm0  | 192.168.0.37(vm0:810):41002(version:GEODE 1.12.0)
vm1  | 192.168.0.37(vm1:849):41003(version:GEODE 1.12.0)


Finally this is the output in gfsh once the rolling upgrade has been completed 
(locators and servers upgraded):

Name | Id
---- | --------------------------------------------------------------------------
vm2  | 192.168.0.37(vm2:1453:locator):41001(version:GEODE 1.12.0) [Coordinator]
vm0  | 192.168.0.37(vm0:2457):41002
vm1  | 192.168.0.37(vm1:2576):41003

I verified this by running manual tests and also by running the following 
upgrade test (had to stop it in the middle to connect via gfsh and get the gfsh 
outputs):
RollingUpgradeRollServersOnPartitionedRegion_dataserializable.testRollServersOnPartitionedRegion_dataserializable

After the rolling upgrade, the shutdown command fails with the following error:
Member 192.168.0.37(vm2:1453:locator):41001 could not be found.  Please 
verify the member name or ID and try again.

The only way I have found to come out of the situation is by restarting the 
locator.
Once it is restarted, the output of gfsh shows that all members are upgraded to 
the new version, i.e. the locator no longer shows that it is on version 
GEODE 1.12.0.

Anybody has any clue why this is happening?

Thanks in advance,

/Alberto G.


Re: Problem in rolling upgrade since 1.12

2020-06-08 Thread Alberto Gomez
Hi,

I attach a diff for the modified test case in case you would like to use it to 
check the problem I mentioned.

BR,

Alberto

From: Alberto Gomez 
Sent: Saturday, June 6, 2020 4:06 PM
To: dev@geode.apache.org 
Subject: Problem in rolling upgrade since 1.12

Hi,

I have observed that since version 1.12 rolling upgrades to future versions 
leave the first upgraded locator "as if" it was still on version 1.12.

This is the output from "list members" before starting the upgrade from version 
1.12:

Name | Id
 | ---
vm2  | 192.168.0.37(vm2:29367:locator):41001 [Coordinator]
vm0  | 192.168.0.37(vm0:29260):41002
vm1  | 192.168.0.37(vm1:29296):41003


And this is the output from "list members" after upgrading the first locator 
from 1.12 to 1.13/1.14:

Name | Id
 | 

vm2  | 192.168.0.37(vm2:1453:locator):41001(version:GEODE 1.12.0) 
[Coordinator]
vm0  | 192.168.0.37(vm0:810):41002(version:GEODE 1.12.0)
vm1  | 192.168.0.37(vm1:849):41003(version:GEODE 1.12.0)


Finally this is the output in gfsh once the rolling upgrade has been completed 
(locators and servers upgraded):

Name | Id
 | 

vm2  | 192.168.0.37(vm2:1453:locator):41001(version:GEODE 1.12.0) 
[Coordinator]
vm0  | 192.168.0.37(vm0:2457):41002
vm1  | 192.168.0.37(vm1:2576):41003

I verified this by running manual tests and also by running the following 
upgrade test (had to stop it in the middle to connect via gfsh and get the gfsh 
outputs):
RollingUpgradeRollServersOnPartitionedRegion_dataserializable.testRollServersOnPartitionedRegion_dataserializable

After the rolling upgrade, the shutdown command fails with the following error:
Member 192.168.0.37(vm2:1453:locator):41001 could not be found.  Please 
verify the member name or ID and try again.

The only way I have found to come out of the situation is by restarting the 
locator.
Once restarted again, the output of gfsh shows that all members are upgraded to 
the new version, i.e. the locator does not show anymore that it is on version 
GEODE 1.12.0.

Anybody has any clue why this is happening?

Thanks in advance,

/Alberto G.
diff --git a/geode-core/src/upgradeTest/java/org/apache/geode/internal/cache/rollingupgrade/RollingUpgradeDUnitTest.java b/geode-core/src/upgradeTest/java/org/apache/geode/internal/cache/rollingupgrade/RollingUpgradeDUnitTest.java
index 089b4ffd9d..a501d89a9f 100644
--- a/geode-core/src/upgradeTest/java/org/apache/geode/internal/cache/rollingupgrade/RollingUpgradeDUnitTest.java
+++ b/geode-core/src/upgradeTest/java/org/apache/geode/internal/cache/rollingupgrade/RollingUpgradeDUnitTest.java
@@ -14,6 +14,10 @@
  */
 package org.apache.geode.internal.cache.rollingupgrade;
 
+import static org.apache.geode.distributed.ConfigurationProperties.JMX_MANAGER;
+import static org.apache.geode.distributed.ConfigurationProperties.JMX_MANAGER_PORT;
+import static org.apache.geode.distributed.ConfigurationProperties.JMX_MANAGER_START;
+import static org.apache.geode.internal.AvailablePortHelper.getRandomAvailableTCPPorts;
 import static org.apache.geode.test.awaitility.GeodeAwaitility.await;
 import static org.junit.Assert.assertTrue;
 
@@ -27,6 +31,7 @@ import java.util.List;
 import java.util.Properties;
 
 import org.apache.commons.io.FileUtils;
+import org.junit.Rule;
 import org.junit.runner.RunWith;
 import org.junit.runners.Parameterized;
 import org.junit.runners.Parameterized.Parameter;
@@ -50,7 +55,6 @@ import org.apache.geode.distributed.internal.DistributionConfig;
 import org.apache.geode.distributed.internal.InternalLocator;
 import org.apache.geode.distributed.internal.membership.InternalDistributedMember;
 import org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave;
-import org.apache.geode.internal.AvailablePortHelper;
 import org.apache.geode.internal.serialization.Version;
 import org.apache.geode.test.dunit.DistributedTestUtils;
 import org.apache.geode.test.dunit.Host;
@@ -59,6 +63,7 @@ import org.apache.geode.test.dunit.Invoke;
 import org.apache.geode.test.dunit.NetworkUtils;
 import org.apache.geode.test.dunit.VM;
 import org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase;
+import org.apache.geode.test.junit.rules.GfshCommandRule;
 import org.apache.geode.test.junit.runners.CategoryWithParameterizedRunnerFactory;
 import org.apache.geode.test.version.VersionManager;
 
@@ -78,10 +83,14 @@ import org.apache.geode.test.version.VersionManager;
  * @author jhuynh
  */
 
+
 @RunWith(Parameterized.class)
 @UseParametersRunnerFactory(CategoryWithParameterizedRunnerFactory.class)
 public abstract class RollingUpgradeDUnitTest extends JUnit4DistributedTestCase {
 
+  @Rule
+  public transient GfshCommandRule gfsh = new GfshCommandRule();
+
   @Parameters

Re: Problem in rolling upgrade since 1.12

2020-06-08 Thread Alberto Gomez
Hi Ernie,

I have seen this problem in the support/1.13 branch and also on develop.

Interestingly, the patch I sent applies seamlessly in my local repo on the 
develop branch.

The patch modifies the 
RollingUpgradeRollServersOnPartitionedRegion_dataserializable test case so 
that it runs "list members" on an upgraded system. I run it manually with the 
following command:

./gradlew geode-core:upgradeTest 
--tests=RollingUpgradeRollServersOnPartitionedRegion_dataserializable

I see it failing when upgrading from 1.12.

I created a draft PR where you can also see the changes in the test case that 
manifest the problem.

See: https://github.com/apache/geode/pull/5224


Please, let me know if you need any more information.

BR,

Alberto

From: Ernie Burghardt 
Sent: Monday, June 8, 2020 9:04 PM
To: dev@geode.apache.org 
Subject: Re: Problem in rolling upgrade since 1.12

Hi Alberto,

I'm looking at this and came across a couple of blockers…
Do you have a branch that exhibits this problem? A draft PR maybe?
I tried to apply your patch to latest develop, but the patch doesn't pass git 
apply's check.
Also, these tests pass on develop; would you be able to check against the 
latest and update the diff?
I’m very interested in reproducing the issue you have observed.

Thanks,
Ernie

From: Alberto Gomez 
Reply-To: "dev@geode.apache.org" 
Date: Monday, June 8, 2020 at 12:31 AM
To: "dev@geode.apache.org" 
Subject: Re: Problem in rolling upgrade since 1.12

Hi,

I attach a diff for the modified test case in case you would like to use it to 
check the problem I mentioned.

BR,

Alberto
________
From: Alberto Gomez 
Sent: Saturday, June 6, 2020 4:06 PM
To: dev@geode.apache.org 
Subject: Problem in rolling upgrade since 1.12

Hi,

I have observed that since version 1.12 rolling upgrades to future versions 
leave the first upgraded locator "as if" it was still on version 1.12.

This is the output from "list members" before starting the upgrade from version 
1.12:

Name | Id
---- | ---------------------------------------------------
vm2  | 192.168.0.37(vm2:29367:locator):41001 [Coordinator]
vm0  | 192.168.0.37(vm0:29260):41002
vm1  | 192.168.0.37(vm1:29296):41003


And this is the output from "list members" after upgrading the first locator 
from 1.12 to 1.13/1.14:

Name | Id
---- | -----------------------------------------------------------------------
vm2  | 192.168.0.37(vm2:1453:locator):41001(version:GEODE 1.12.0) [Coordinator]
vm0  | 192.168.0.37(vm0:810):41002(version:GEODE 1.12.0)
vm1  | 192.168.0.37(vm1:849):41003(version:GEODE 1.12.0)


Finally this is the output in gfsh once the rolling upgrade has been completed 
(locators and servers upgraded):

Name | Id
---- | -----------------------------------------------------------------------
vm2  | 192.168.0.37(vm2:1453:locator):41001(version:GEODE 1.12.0) [Coordinator]
vm0  | 192.168.0.37(vm0:2457):41002
vm1  | 192.168.0.37(vm1:2576):41003

I verified this by running manual tests and also by running the following 
upgrade test (had to stop it in the middle to connect via gfsh and get the gfsh 
outputs):
RollingUpgradeRollServersOnPartitionedRegion_dataserializable.testRollServersOnPartitionedRegion_dataserializable

After the rolling upgrade, the shutdown command fails with the following error:
Member 192.168.0.37(vm2:1453:locator):41001 could not be found.  Please 
verify the member name or ID and try again.

The only way I have found to come out of the situation is by restarting the 
locator.
Once restarted again, the output of gfsh shows that all members are upgraded to 
the new version, i.e. the locator does not show anymore that it is on version 
GEODE 1.12.0.

Anybody has any clue why this is happening?

Thanks in advance,

/Alberto G.


Re: Problem in rolling upgrade since 1.12

2020-06-11 Thread Alberto Gomez
Thanks for the info, Bill.

I have found another issue in rolling upgrade since 1.12.

I have observed that when there are custom jars previously deployed, the 
locator cannot be started in the new version and the following exception is 
thrown:

Exception in thread "main" org.apache.geode.SerializationException: Could not 
create an instance of 
org.apache.geode.management.internal.configuration.domain.Configuration .

I have pushed another commit in the draft I sent before containing the new test 
case.

/Alberto G.

From: Bill Burcham 
Sent: Thursday, June 11, 2020 1:53 AM
To: dev@geode.apache.org 
Subject: Re: Problem in rolling upgrade since 1.12

Ernie made us a ticket for this issue:
https://issues.apache.org/jira/browse/GEODE-8240

On Mon, Jun 8, 2020 at 12:59 PM Alberto Gomez 
wrote:

> Hi Ernie,
>
> I have seen this problem in the support/1.13 branch and also on develop.
>
> Interestingly, the patch I sent is applied seamlessly in my local repo set
> to the develop branch.
>
> The patch modifies the
> RollingUpgradeRollServersOnPartitionedRegion_dataserializable test case by
> running "list members" on an upgraded system is
> RollingUpgradeRollServersOnPartitionedRegion_dataserializable. I run it
> manually with the following command:
>
> ./gradlew geode-core:upgradeTest
> --tests=RollingUpgradeRollServersOnPartitionedRegion_dataserializable
>
> I see it failing when upgrading from 1.12.
>
> I created a draft PR where you can see also the changes in the test case
> that manifest the problem.
>
> See: https://github.com/apache/geode/pull/5224
>
>
> Please, let me know if you need any more information.
>
> BR,
>
> Alberto
> 
> From: Ernie Burghardt 
> Sent: Monday, June 8, 2020 9:04 PM
> To: dev@geode.apache.org 
> Subject: Re: Problem in rolling upgrade since 1.12
>
> Hi Alberto,
>
> I’m looking at this, came across a couple blockers…
> Do you have branch that exhibits this problem? Draft PR maybe?
> I tried to apply you patch to latest develop, but the patch doesn’t pass
> git apply’s check….
> Also these tests pass on develop, would you be able to check against the
> latest and update the diff?
> I’m very interested in reproducing the issue you have observed.
>
> Thanks,
> Ernie
>
> From: Alberto Gomez 
> Reply-To: "dev@geode.apache.org" 
> Date: Monday, June 8, 2020 at 12:31 AM
> To: "dev@geode.apache.org" 
> Subject: Re: Problem in rolling upgrade since 1.12
>
> Hi,
>
> I attach a diff for the modified test case in case you would like to use
> it to check the problem I mentioned.
>
> BR,
>
> Alberto
> 
> From: Alberto Gomez 
> Sent: Saturday, June 6, 2020 4:06 PM
> To: dev@geode.apache.org 
> Subject: Problem in rolling upgrade since 1.12
>
> Hi,
>
> I have observed that since version 1.12 rolling upgrades to future
> versions leave the first upgraded locator "as if" it was still on version
> 1.12.
>
> This is the output from "list members" before starting the upgrade from
> version 1.12:
>
> Name | Id
> ---- | ---------------------------------------------------
> vm2  | 192.168.0.37(vm2:29367:locator):41001 [Coordinator]
> vm0  | 192.168.0.37(vm0:29260):41002
> vm1  | 192.168.0.37(vm1:29296):41003
>
>
> And this is the output from "list members" after upgrading the first
> locator from 1.12 to 1.13/1.14:
>
> Name | Id
> ---- | -----------------------------------------------------------------------
> vm2  | 192.168.0.37(vm2:1453:locator):41001(version:GEODE 1.12.0) [Coordinator]
> vm0  | 192.168.0.37(vm0:810):41002(version:GEODE 1.12.0)
> vm1  | 192.168.0.37(vm1:849):41003(version:GEODE 1.12.0)
>
>
> Finally this is the output in gfsh once the rolling upgrade has been
> completed (locators and servers upgraded):
>
> Name | Id
> ---- | -----------------------------------------------------------------------
> vm2  | 192.168.0.37(vm2:1453:locator):41001(version:GEODE 1.12.0) [Coordinator]
> vm0  | 192.168.0.37(vm0:2457):41002
> vm1  | 192.168.0.37(vm1:2576):41003
>
> I verified this by running manual tests and also by running the following
> upgrade test (had to stop it in the middle to connect via gfsh and get the
> gfsh outputs):
>
> RollingUpgradeRollServersOnPartitionedRegion_dataserializable.testRollServersOnPartitionedRegion_dataserializable
>
> After the rolling upgrade, the shutdown command fails with the following
> error:
> Member 192.168.0.37(vm2:1453:locator):41001 could not be found.
> Please verify the member name or ID and try again.
>
> The only way I have found to come out of the situation is by restarting
> the locator.
> Once restarted again, the output of gfsh shows that all members are
> upgraded to the new version, i.e. the locator does not show anymore that it
> is on version GEODE 1.12.0.
>
> Anybody has any clue why this is happening?
>
> Thanks in advance,
>
> /Alberto G.
>


Re: About Geode rolling downgrade

2020-06-12 Thread Alberto Gomez
Hi Naba!

Did you manage to comment this topic with some engineers?

Cheers,

/Alberto G.

From: Nabarun Nag 
Sent: Friday, June 5, 2020 11:00 AM
To: dev@geode.apache.org 
Subject: Re: About Geode rolling downgrade


Hi Mario and Alberto,

I will sync up with a couple of engineers and get you feedback within a couple 
of days.

@Barry, Jason and I were discussing once whether your idea of WAN GII could 
achieve the downgrade: create a DS with the old version and let it do a GII 
from the newer-version cluster, then shut down the new-version DS. Now we have 
a DS with the lower version.


Regards
Naba


From: Mario Ivanac 
Sent: Friday, June 5, 2020 1:19:42 AM
To: geode 
Subject: Re: About Geode rolling downgrade

Hi all,

just a reminder that Alberto is still waiting for feedback regarding his 
question.

BR,
Mario

From: Alberto Gomez 
Sent: May 14, 2020 14:45
To: geode 
Subject: Re: About Geode rolling downgrade

Hi,

A friendly reminder to the community about this request for feedback.

Thanks,

-Alberto G.

From: Alberto Gomez 
Sent: Thursday, May 7, 2020 10:44 AM
To: geode 
Subject: Re: About Geode rolling downgrade

Hi again,


Considering Geode does not support online rollback for the time being, and since 
we have the need to roll back even a standalone system, we were thinking of a 
procedure to downgrade a Geode cluster tolerating downtime, but without a need to:

  *   spin another cluster to sync from,
  *   do a restore or
  *   import data snapshot.



The procedure we came up with is:

  1.  First step - downgrade locators:

 *   While still on the newer version, export the cluster configuration.
 *   Shut down all locators. Existing clients will continue using their 
server connections. New clients/connections are not possible.
 *   Start new locators using the old SW version and import the cluster 
configuration. They will form a new cluster. Existing client connections should 
still work, but new client connections are not yet possible (no servers 
connected to locators).

  2.  Second step – downgrade servers:

 *   First, shut down all servers in parallel. This marks the beginning of 
total downtime.
 *   Now start all servers in parallel, but still on the new software 
version. Servers connect to the cluster formed by the downgraded locators. When 
servers are up, downtime ends. New client connections are possible. The rest of 
the rollback should be fully online.
 *   Now, per server:

   i.  Shut it down, revoke its disk-stores and delete its file system.

  ii.  Start the server using the old SW version. When up, the server will take 
over cluster configuration and pick up replicated data and partitioned region 
buckets satisfying region redundancy (essentially it will hold exactly the same 
data the previous server had).
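

For illustration only, the two steps above might look roughly like the following 
gfsh sequence. The locator/server names, ports, disk-store id and paths are 
placeholders I made up, and `new/` and `old/` stand for the newer and older 
Geode installations; this is a sketch of the procedure, not a tested script:

```shell
# Step 1 - downgrade locators
new/bin/gfsh -e "connect --locator=loc1[10334]" \
             -e "export cluster-configuration --zip-file-name=/tmp/config.zip"
new/bin/gfsh -e "connect --locator=loc1[10334]" -e "stop locator --name=loc1"  # repeat per locator
old/bin/gfsh -e "start locator --name=loc1 --port=10334"                       # old SW version
old/bin/gfsh -e "connect --locator=loc1[10334]" \
             -e "import cluster-configuration --zip-file-name=/tmp/config.zip"

# Step 2 - downgrade servers
# 2a/2b: stop all servers in parallel (downtime starts), then restart them in
# parallel still on the NEW version against the downgraded locators.
# 2c: then, one server at a time:
new/bin/gfsh -e "connect --locator=loc1[10334]" -e "stop server --name=srv1"
old/bin/gfsh -e "connect --locator=loc1[10334]" \
             -e "revoke missing-disk-store --id=<disk-store-id>"
rm -rf /data/srv1                        # delete the server's file system
old/bin/gfsh -e "start server --name=srv1 --locators=loc1[10334]"
```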



The above has some important prerequisites:

  1.  Partitioned regions have redundancy and the region configuration allows 
recovery as described above.
  2.  Client versions allow connections to both the new and the old cluster - 
i.e. clients must not use a newer version at the moment the procedure starts.
  3.  Geode guarantees that a cluster configuration exported from the newer 
system can be imported into the older system. In case of incompatibility I 
expect we could even manually edit the configuration to adapt it to the older 
system, but it is a question how the new servers will react when they connect 
(in step 2b).
  4.  Geode guarantees that communication between peers with different SW 
versions works and that recovery of region data works.



Could we have opinions on this offline procedure? It seems to work well but 
probably has caveats we do not see at the moment.

What about prerequisites 3 and 4? They hold in the upgrade case, but we are not 
sure whether they hold in this rollback case.


Best regards,


-Alberto G.


From: Anilkumar Gingade 
Sent: Thursday, April 23, 2020 12:59 AM
To: geode 
Subject: Re: About Geode rolling downgrade

That's right; in most cases (if not always) the no-downtime requirement is 
managed by having replicated cluster setups (disaster-recovery/backup site). 
The data is either pushed to both systems through the data ingesters or by 
using a WAN setup.
The clusters are upgraded one at a time. If there is a failure during the 
upgrade, or the upgrade needs to be rolled back, one system will always be up 
and running.

-Anil.





On Wed, Apr 22, 2020 at 1:51 PM Anthony Baker  wrote:

> Anil, let me see if I understand your perspective by stating it this way:
>
> If cases where 100% uptime is a requirement, users are almost always
> running a disaster recovery site.  It could be active/active or
> active/standby but there are already at least 2 clusters with current
> copies of the data.  If an upgrade goes badly, the clust

Reviewers for GEODE-8231: C++ native client keeps trying to connect to down cache server hosting a partitioned region

2020-06-15 Thread Alberto Gomez
Hi,

Can someone please review my PR about 
https://issues.apache.org/jira/browse/GEODE-8231 (C++ native client keeps 
trying to connect to down cache server hosting a partitioned region)?

Here is the link to the PR: https://github.com/apache/geode-native/pull/615

Thanks,

/Alberto G.


Heap memory used by gateway sender queues way above configured limit after server restart

2020-06-18 Thread Alberto Gomez
Hi,

I have found an issue with heap memory consumed by gateway sender queues way 
above the configured limit after a server is restarted (on the restarted 
server).

The problem is described in the following ticket:

https://issues.apache.org/jira/browse/GEODE-8278

I would highly appreciate some help from the community on where to look in the 
code (or any other hint) in order to implement a solution.

Thanks in advance,

Alberto G.


Re: Reviewers for GEODE-8231: C++ native client keeps trying to connect to down cache server hosting a partitioned region

2020-06-22 Thread Alberto Gomez
Hi,

I have no complete reviews yet. Any volunteers?

Thanks,

Alberto

From: Alberto Gomez 
Sent: Monday, June 15, 2020 1:31 PM
To: dev@geode.apache.org 
Subject: Reviewers for GEODE-8231: C++ native client keeps trying to connect to 
down cache server hosting a partitioned region

Hi,

Can someone please review my PR about 
https://issues.apache.org/jira/browse/GEODE-8231 (C++ native client keeps 
trying to connect to down cache server hosting a partitioned region)?

Here is the link to the PR: https://github.com/apache/geode-native/pull/615

Thanks,

/Alberto G.


Re: Successful build on windows

2020-06-25 Thread Alberto Gomez
Hi Kirk,

I build on Ubuntu 18.04 and occasionally see the partial stack traces you 
mentioned on geode-wan:test, so it is not just a Windows thing.

I never figured out what provoked them, nor how to reproduce them consistently.

BR,

Alberto


From: Kirk Lund 
Sent: Thursday, June 25, 2020 11:53 PM
To: dev@geode.apache.org 
Subject: Successful build on windows

In case anyone is interested in the developer experience building with unit
tests on windows:

It succeeds (after a couple tries) but something in geode-wan:test spits
out a partial stack trace. Since all the tests passed, I don't really see a
way to find out which test generated it.

C:\Users\kirkl\dev\geode>gradlew.bat build



> Task :geode-wan:test
  at org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue$BatchRemovalThread.checkCancelled(ParallelGatewaySenderQueue.java:1780)
  at org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue$BatchRemovalThread.run(ParallelGatewaySenderQueue.java:1879)

> Task :combineReports
All test reports at C:\Users\kirkl\dev\geode\build/reports/combined

Deprecated Gradle features were used in this build, making it incompatible
with Gradle 6.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See
https://docs.gradle.org/5.4/userguide/command_line_interface.html#sec:command_line_warnings

BUILD SUCCESSFUL in 3m 52s
532 actionable tasks: 97 executed, 435 up-to-date


Re: Fate of master branch

2020-06-26 Thread Alberto Gomez
I agree also on removing the master branch.

As a relatively new member of the community it's been a source of confusion to 
me when looking at what is said in the wiki about it 
(https://cwiki.apache.org/confluence/display/GEODE/Versioning+and+Branching) 
and comparing it with the actual practice.

Alberto G.

From: Jacob Barrett 
Sent: Friday, June 26, 2020 5:26 PM
To: dev@geode.apache.org 
Subject: Re: Fate of master branch

I am 100% in favor of dropping the master branch completely. I felt like it was 
always a source of confusion: was it the most recent release or the latest 
version? I know we have had issues with even correctly merging the latest 
version back into it sometimes.

I really can’t see any reason for keeping it around.

-Jake



> On Jun 26, 2020, at 8:05 AM, Blake Bender  wrote:
>
> Apologies if this has been addressed already and I missed it.  In keeping 
> with other OSS projects, I believe it’s time we did something about removing 
> the insensitive term master from Geode repositories.
>
> One choice a lot of projects appear to be going with is a simple rename from 
> master to main.  In our own case, however, master isn't really in use for 
> anything vital.  We track releases with a tag and a branch to backport fixes 
> to, and the develop branch is the “source of truth” latest-and-greatest 
> version of the code.  We could thus simply delete master with no loss I’m 
> aware of.  Any opinions?
>
> Thanks,
>
> Blake
>



Question about gateway sender stopped and memory consumption

2020-07-02 Thread Alberto Gomez
Hi,

We have observed that when a gateway sender is stopped in a site, all the 
events received while it is stopped are stored in the 
'AbstractGatewaySender.tmpDroppedEvents' queue of the primary sender. The 
elements of this queue are not removed until the sender is started again.

This behavior implies that if the gateway sender is stopped for a long time, 
there is a risk of heap exhaustion in the members hosting primary senders.

Under split-brain situations, if they last long enough, there could be heap 
exhaustion problems in servers due to the memory used by the gateway sender 
queues, even if overflow to disk is used - given that part of each event is 
always stored in memory.
For those situations we had thought about stopping gateway senders when the 
memory used by the gateway sender queues reached a certain memory threshold. 
But according to the above, stopping the gateway senders would only make things 
worse.

Would it make sense for the gateway sender not to store the received events in 
tmpDroppedEvents while it is stopped?

Any suggestion on how to approach the problem of heap exhaustion due to the 
growth of gateway sender queues in long lasting split brain situations?
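
To make the memory risk concrete, here is a minimal standalone sketch of the 
behavior in question. The class and method names are illustrative stand-ins I 
made up, not Geode's actual internals:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative stand-in for a gateway sender; names are NOT Geode's real API.
class SenderSketch {
    private final Queue<String> tmpDroppedEvents = new ArrayDeque<>();
    private volatile boolean running = false;

    // Current behavior: while the sender is stopped, every received event is
    // queued, so a long outage grows the queue without bound.
    void onEventCurrent(String event) {
        if (!running) {
            tmpDroppedEvents.add(event);
        }
    }

    // Proposed behavior: skip queueing once the sender is fully stopped and
    // the secondaries no longer hold events that need to be drained.
    void onEventProposed(String event, boolean secondariesStillQueueing) {
        if (!running && secondariesStillQueueing) {
            tmpDroppedEvents.add(event);
        }
    }

    int queuedDroppedEvents() {
        return tmpDroppedEvents.size();
    }
}
```

With the current behavior, every event received during a long stop accumulates 
in tmpDroppedEvents; the question above amounts to guarding that enqueue as in 
onEventProposed.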

Thanks in advance,

Alberto G.




Re: Question about gateway sender stopped and memory consumption

2020-07-02 Thread Alberto Gomez
Thanks for your answer, Kirk.

If we persist the unsent events in a persistent region, the memory consumed 
would not be as high, but it still would not solve our problem with a 
long-lasting split brain, as the persistent region would also take some memory 
to store those events even if they were overflowed to disk.

Ideally it should be backed up in a queue that does not use any memory.

Best regards,

Alberto


From: Kirk Lund 
Sent: Thursday, July 2, 2020 7:41 PM
To: dev@geode.apache.org 
Subject: Re: Question about gateway sender stopped and memory consumption

I would have expected unsent events to be stored in a queue that is backed
by a persistent region or something on disk. If that's not currently true,
then it seems like a good direction might be to make tmpDroppedEvents use a
durable queue of some sort that overflows to disk.



On Thu, Jul 2, 2020 at 10:33 AM Alberto Gomez 
wrote:

> Hi,
>
> We have observed that when a gateway sender is stopped in a site, all the
> events received while it is stopped are stored in the
> 'AbstractGatewaySender.tmpDroppedEvents' queue of the primary sender. The
> elements of this queue are not removed from this queue until the sender is
> started back again.
>
> This behavior implies that if the gateway sender is stopped for a long
> time, there is a risk of heap exhaustion in the members hosting primary
> senders.
>
> Under split brain situations, if lasting long enough, there could be heap
> exhaustion problems in servers due to the memory used by the gateway sender
> queues, even if overflow to disk is used -given that part of the event is
> always stored in memory.
> For those situations we had thought about stopping gateway senders when
> the memory used by the gateway sender queues reached a certain memory
> threshold. But according to the above, stopping the gateway senders would
> only make things worse.
>
> Would it make sense for the gateway sender not to store the received
> events in tmpDroppedEvents while it is stopped?
>
> Any suggestion on how to approach the problem of heap exhaustion due to
> the growth of gateway sender queues in long lasting split brain situations?
>
> Thanks in advance,
>
> Alberto G.
>
>
>


Re: Question about gateway sender stopped and memory consumption

2020-07-02 Thread Alberto Gomez
Thanks Juan!

I will check it.

Alberto

From: Ju@N 
Sent: Thursday, July 2, 2020 7:46 PM
To: dev@geode.apache.org 
Subject: Re: Question about gateway sender stopped and memory consumption

I recall some discussion about this in the past, there even was an "RFC"
that never got implemented:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=80452478.
Best regards.

On Thu, 2 Jul 2020 at 18:41, Kirk Lund  wrote:

> I would have expected unsent events to be stored in a queue that is backed
> by a persistent region or something on disk. If that's not currently true,
> then it seems like a good direction might be to make tmpDroppedEvents use a
> durable queue of some sort that overflows to disk.
>
>
>
> On Thu, Jul 2, 2020 at 10:33 AM Alberto Gomez 
> wrote:
>
> > Hi,
> >
> > We have observed that when a gateway sender is stopped in a site, all the
> > events received while it is stopped are stored in the
> > 'AbstractGatewaySender.tmpDroppedEvents' queue of the primary sender. The
> > elements of this queue are not removed from this queue until the sender
> is
> > started back again.
> >
> > This behavior implies that if the gateway sender is stopped for a long
> > time, there is a risk of heap exhaustion in the members hosting primary
> > senders.
> >
> > Under split brain situations, if lasting long enough, there could be heap
> > exhaustion problems in servers due to the memory used by the gateway
> sender
> > queues, even if overflow to disk is used -given that part of the event is
> > always stored in memory.
> > For those situations we had thought about stopping gateway senders when
> > the memory used by the gateway sender queues reached a certain memory
> > threshold. But according to the above, stopping the gateway senders would
> > only make things worse.
> >
> > Would it make sense for the gateway sender not to store the received
> > events in tmpDroppedEvents while it is stopped?
> >
> > Any suggestion on how to approach the problem of heap exhaustion due to
> > the growth of gateway sender queues in long lasting split brain
> situations?
> >
> > Thanks in advance,
> >
> > Alberto G.
> >
> >
> >
>


--
Ju@N


[DISCUSS] RFC - Avoid the queueing of dropped events by the primary gateway sender when the gateway sender is stopped

2020-07-06 Thread Alberto Gomez
Hi,

I have published a new RFC in the Apache Geode wiki with the following title: 
"Avoid the queueing of dropped events by the primary gateway sender when the 
gateway sender is stopped".

https://cwiki.apache.org/confluence/display/GEODE/Avoid+the+queuing+of+dropped+events+by+the+primary+gateway+sender+when+the+gateway+sender+is+stopped

Could you please give comments by Thursday, July 9th, 2020?

Thanks in advance,

Alberto G.


Re: [DISCUSS] RFC - Avoid the queueing of dropped events by the primary gateway sender when the gateway sender is stopped

2020-07-08 Thread Alberto Gomez
Thanks for your comments, Eric.

Limiting the size of the queue would be a simple solution, but I think it would 
pose several problems for whoever configures and operates Geode:

  *   How big should the queue be? Probably not easy to dimension. Should the 
limit be on the memory occupied by the elements or on the number of elements in 
the queue (in which case, depending on the size of the elements, the memory 
used could vary a lot)?
  *   What to do when the limit has been reached? How do we notify that it was 
reached? What to do afterwards? How would we know which dropped events did not 
make it to the queue but should have been removed from the secondary's queue?

I think the solution proposed in the RFC is simple enough and also addresses a 
possible confusion with the semantics of the gateway sender stop command.
Stopping a gateway sender currently means that all events received while the 
sender is stopped are dropped; but at the same time, unlimited memory may be 
consumed by those dropped events. We could put a limit on the amount of memory 
used by the queued dropped events, but what would be the point of storing them 
in the first place if those events will not be sent to the remote site anyway?
I would expect that after stopping a gateway sender no resources (or at least a 
minimal part) would be consumed by it. Otherwise we may as well not stop it or 
use the pause command depending on what we want to achieve.

From what I have seen, queuing dropped events has its place while the gateway 
sender is starting and while it is stopping, but if it is done in a sender to 
be started manually or in a manually stopped server it could provoke an 
unexpected memory exhaustion.

I really think the solution proposed makes the behavior of the gateway sender 
command more logical.

Best regards,

Alberto

From: Eric Shu 
Sent: Wednesday, July 8, 2020 7:32 PM
To: dev@geode.apache.org 
Subject: Re: [DISCUSS] RFC - Avoid the queueing of dropped events by the 
primary gateway sender when the gateway sender is stopped

It seems that I was not able to comment on the RFC in the wiki yet.

Just trying to find out if we have a simple solution for the issue you raised -- 
can we have an upper limit for the tmpDroppedEvents queue in question?

Always check the limit before adding to the queue, so that the tmp queue is 
not unbounded?
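
For illustration, the upper limit could be a simple capacity check on enqueue. 
This is a standalone sketch with made-up names, not Geode's actual 
implementation:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative bounded variant of tmpDroppedEvents; names are made up.
class BoundedDroppedEvents {
    private final BlockingQueue<String> queue;
    private long rejected = 0;

    BoundedDroppedEvents(int capacity) {
        queue = new ArrayBlockingQueue<>(capacity);
    }

    // offer() returns false instead of blocking when the queue is full, so
    // memory stays bounded; rejected events can no longer be used to drain
    // the secondaries' queues, which is the trade-off discussed in this thread.
    boolean add(String event) {
        if (queue.offer(event)) {
            return true;
        }
        rejected++;
        return false;
    }

    long rejectedCount() { return rejected; }
    int size() { return queue.size(); }
}
```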

Regards,
Eric
________
From: Alberto Gomez 
Sent: Monday, July 6, 2020 8:24 AM
To: geode 
Subject: [DISCUSS] RFC - Avoid the queueing of dropped events by the primary 
gateway sender when the gateway sender is stopped

Hi,

I have published a new RFC in the Apache Geode wiki with the following title: 
"Avoid the queueing of dropped events by the primary gateway sender when the 
gateway sender is stopped".

https://cwiki.apache.org/confluence/display/GEODE/Avoid+the+queuing+of+dropped+events+by+the+primary+gateway+sender+when+the+gateway+sender+is+stopped

Could you please give comments by Thursday, July 9th, 2020?

Thanks in advance,

Alberto G.


Re: [DISCUSS] RFC - Avoid the queueing of dropped events by the primary gateway sender when the gateway sender is stopped

2020-07-09 Thread Alberto Gomez
Hi Eric,

I agree that the only case in which the memory issue may occur is when all 
gateway senders instances are stopped. And that is what the solution proposed 
in the RFC is targeted at, and also that is why the stop gateway sender command 
is intended to be updated to fix the issue.

Note that while stopping all the gateway sender instances, there may be events 
stored in the secondary senders that will be dropped by the primary sender. 
Those dropped events need to be queued while the secondaries are still up so 
that when the sender is started again, the secondaries' queues can be drained 
accordingly.
If we go for the option of setting a limit on the dropped events and the limit 
is set too small, there could be dropped events that should have been queued 
but were not, due to the limit having been reached, and which would therefore 
never be sent to the secondaries to drain their queues completely (this is the 
case in which I meant that a notification must be sent to the operator of the 
system, so that he knows that a possible issue is present in the system: queues 
with events that would stay there forever). On the other hand, if the limit is 
too high, the memory consumed by the queued dropped events could cause a 
problem of memory exhaustion.

I think the right balance is to stop queueing dropped events when all the 
gateway sender instances are stopped.

BR,

Alberto


From: Eric Shu 
Sent: Wednesday, July 8, 2020 9:25 PM
To: dev@geode.apache.org 
Subject: Re: [DISCUSS] RFC - Avoid the queueing of dropped events by the 
primary gateway sender when the gateway sender is stopped

I think the only case in which the memory issue occurs is when all gateway 
senders are stopped in the WAN site. Otherwise another member would assume the 
primary queue role, and no more events would be enqueued in tmpDroppedEvents on 
the member with the original primary queue. (For a parallel WAN queue, I do not 
think stopping one gateway queue is a valid case to support.)


For the case where all gateway senders are stopped, there is no need to notify 
any other members in the WAN site if the limit is reached. The tmpDroppedEvents 
queue is only used to remove events from the secondary queue. If no events are 
enqueued in the secondary queue, there is no need to add to tmpDroppedEvents at 
all. To me, it should be used for only a limited number of queued events.

Regards,
Eric
____
From: Alberto Gomez 
Sent: Wednesday, July 8, 2020 12:02 PM
To: dev@geode.apache.org 
Subject: Re: [DISCUSS] RFC - Avoid the queueing of dropped events by the 
primary gateway sender when the gateway sender is stopped

Thanks for your comments, Eric.

Limiting the size of the queue would be a simple solution but I think it would 
pose several problems on the the one configuring and operating Geode:

  *   How big should the queue be? Probably not easy to dimension. Should the 
limit by on the memory occupied by the elements or on the number of elements in 
the queue (in which case, depending on the size of the elements, the memory 
used could vary a lot)?
  *   What  to do when the limit has been reached? how do we notify that it was 
reached, what to do afterwards, how would we know what dropped events did not 
make it to the queue but should have been removed from the secondary's queue...

I think the solution proposed in the RFC is simple enough and also addresses a 
possible confusion with the semantics of the gateway sender stop command.
Stopping a gateway sender currently makes that all events received while the 
sender is stopped are dropped; but at the same time, unlimited memory may be 
consumed by the dropped events. We could put a limit on the amount of memory 
used by the queued dropped events but what would be the point in the first 
place to store them if those events will not be sent to the remote site anyway?
I would expect that after stopping a gateway sender no resources (or at least a 
minimal part) would be consumed by it. Otherwise we may as well not stop it or 
use the pause command depending on what we want to achieve.

From what I have seen, queuing dropped events has its place while the gateway 
sender is starting and while it is stopping, but if it is done in a sender to 
be started manually or in a manually stopped server it could provoke an 
unexpected memory exhaustion.

I really think the solution proposed makes the behavior of the gateway sender 
command more logical.

Best regards,

Alberto

From: Eric Shu 
Sent: Wednesday, July 8, 2020 7:32 PM
To: dev@geode.apache.org 
Subject: Re: [DISCUSS] RFC - Avoid the queueing of dropped events by the 
primary gateway sender when the gateway sender is stopped

It seems that I was not able to comment on the RFC in the wiki yet.

Just try to find out if we have a simple solution for the issue you raised -- 
can we have a up-limit for the tmpDroppedEvents queue in question?

Always check the limit before adding to the queue 

Re: [DISCUSS] RFC - Avoid the queueing of dropped events by the primary gateway sender when the gateway sender is stopped

2020-07-09 Thread Alberto Gomez
Hi Alexander,

Yes, sure. I am extending the deadline for comments to next Thursday, July the 
16th.

Cheers,

Alberto G.

From: Alexander Murmann 
Sent: Thursday, July 9, 2020 1:42 AM
To: dev@geode.apache.org 
Subject: Re: [DISCUSS] RFC - Avoid the queueing of dropped events by the 
primary gateway sender when the gateway sender is stopped

Hi Alberto,

The timing on this RFC feels really tight. Would you be open to extending
this to next week?

On Wed, Jul 8, 2020 at 1:04 PM Eric Shu  wrote:

> I think the only case the memory issue occurred is when all gateway
> senders are stopped in the wan-site. Otherwise another member would assume
> to be the primary queue. No more events will be enqueued in
> tmpDroppedEvents on the member with original primary queue. (For parallel
> wan queue, I do not think stopping one gateway queue is a valid case to
> support.)
>
> For the case where all gateway senders are stopped, there is no need to notify
> any other members in the wan site if the limit is reached. The tmpDroppedEvents
> is only used to remove events from the secondary queue. If no events are
> enqueued in the secondary queue, there is no need to add into
> tmpDroppedEvents at all. To me, it should be only used for limited events
> to be queued.
>
> Regards,
> Eric
> ____
> From: Alberto Gomez 
> Sent: Wednesday, July 8, 2020 12:02 PM
> To: dev@geode.apache.org 
> Subject: Re: [DISCUSS] RFC - Avoid the queueing of dropped events by the
> primary gateway sender when the gateway sender is stopped
>
> Thanks for your comments, Eric.
>
> Limiting the size of the queue would be a simple solution but I think it
> would pose several problems for the person configuring and operating Geode:
>
>   *   How big should the queue be? Probably not easy to dimension. Should
> the limit be on the memory occupied by the elements or on the number of
> elements in the queue (in which case, depending on the size of the
> elements, the memory used could vary a lot)?
>   *   What to do when the limit has been reached? How do we notify that
> it was reached, what do we do afterwards, and how would we know which
> dropped events did not make it to the queue but should have been removed
> from the secondary's queue?
>
> I think the solution proposed in the RFC is simple enough and also
> addresses a possible confusion with the semantics of the gateway sender
> stop command.
> Stopping a gateway sender currently means that all events received while
> the sender is stopped are dropped; but at the same time, unlimited memory
> may be consumed by the dropped events. We could put a limit on the amount
> of memory used by the queued dropped events but what would be the point in
> the first place to store them if those events will not be sent to the
> remote site anyway?
> I would expect that after stopping a gateway sender no resources (or at
> least a minimal part) would be consumed by it. Otherwise we may as well not
> stop it or use the pause command depending on what we want to achieve.
>
> From what I have seen, queuing dropped events has its place while the
> gateway sender is starting and while it is stopping but if it is done in a
> sender to be started manually or in a manually stopped server it could
> provoke an unexpected memory exhaustion.
>
> I really think the solution proposed makes the behavior of the gateway
> sender command more logical.
>
> Best regards,
>
> Alberto
> 
> From: Eric Shu 
> Sent: Wednesday, July 8, 2020 7:32 PM
> To: dev@geode.apache.org 
> Subject: Re: [DISCUSS] RFC - Avoid the queueing of dropped events by the
> primary gateway sender when the gateway sender is stopped
>
> It seems that I was not able to comment on the RFC in the wiki yet.
>
> Just trying to find out if we have a simple solution for the issue you raised
> -- can we have an upper limit for the tmpDroppedEvents queue in question?
>
> Always check the limit before adding to the queue -- so that the tmp queue
> is not unbounded?
>
> Regards,
> Eric
> 
> From: Alberto Gomez 
> Sent: Monday, July 6, 2020 8:24 AM
> To: geode 
> Subject: [DISCUSS] RFC - Avoid the queueing of dropped events by the
> primary gateway sender when the gateway sender is stopped
>
> Hi,
>
> I have published a new RFC in the Apache Geode wiki with the following
> title: "Avoid the queueing of dropped events by the primary gateway sender
> when the gateway sender is stopped".
>
>
> https://cwiki.apache.org/confluence/display/GEODE/Avoid+the+queuing+of+dropped+events+by+the+primary+gateway+sender+when+th
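Eric's suggestion above (checking an upper limit before adding to the tmpDroppedEvents queue) could be sketched roughly as follows. This is a hypothetical illustration, not Geode's actual queue code; the class name and the discard-counting policy are assumptions, and, as the thread notes, deciding what to do with the events that never make it into the queue is the hard part.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of a bounded queue for dropped events: once the cap
// is reached, further events are discarded (and counted) instead of
// growing the queue without bound.
class BoundedDroppedEventQueue<E> {
    private final Queue<E> events = new ArrayDeque<>();
    private final int limit;
    private long discarded = 0;

    BoundedDroppedEventQueue(int limit) {
        this.limit = limit;
    }

    /** Returns true if the event was queued, false if the cap was hit. */
    synchronized boolean offer(E event) {
        if (events.size() >= limit) {
            discarded++; // a real implementation would likely log a warning here
            return false;
        }
        return events.add(event);
    }

    synchronized long discardedCount() {
        return discarded;
    }

    synchronized int size() {
        return events.size();
    }
}
```

The bound itself is trivial; the open questions raised in the thread (how to size the limit, and how to reconcile discarded events with the secondary's queue) are what the RFC's alternative approach avoids.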

Re: [Proposal] - RFC etiquette

2020-07-10 Thread Alberto Gomez
Hi Geode Devs,

First of all, Udo, thanks for your proposal. I am all up for what you are 
aiming at: "better round out each RFC. Causing less delays later in the process 
and allowing all community members to actively participate in the review 
process regardless of technical skill level."

Secondly, I think I am to blame for having given too little time to review the 
latest RFC I have published. I apologize for it. I felt the changes were too 
small, assumed that the solution was not problematic and as a result gave less 
than a week to review which I now think is too little even if the RFC content 
was small. This has probably triggered Udo's proposal so, in a way, it has not 
been such a bad thing 😉.

Regarding the concrete proposal to achieve the goal, I think the 2 week minimum 
period is very reasonable. The new use case section may help to have more 
community members actively participating but I am not sure that it will be the 
definitive measure. I feel that sometimes the lack of participation comes from 
lack of time because we're busy with other things and not so much with how the 
RFC proposal has been written. Anyhow, having an example of what this new 
section should look like would be helpful for new RFCs to be written.

Alberto


From: Udo Kohlmeyer 
Sent: Thursday, July 9, 2020 10:18 PM
To: geode 
Subject: [Proposal] - RFC etiquette

Hi there Geode Dev's

I would like to propose the following changes to the RFC process that we have 
in place at the moment.

  1.  All submitted RFCs will provide a minimum 2 week review period. This is 
to allow the community to review the RFC in a reasonable timeframe. If we rush 
things, we will miss things. I’d rather have a little more time spent on the 
RFC review and getting the approach “correct” than rushing the RFC and then at 
a later point in time (either at PR review or worse production issue) find out 
that the approach was less than optimal.
  2.  Add a new section to the RFC. I would like to propose this section to be 
labelled “Use Cases”. In this section I would like all submitters to describe 
the use case that this RFC is to fulfill. This would include all possible 
combinations (success and failure) and expected outcomes of each.

I hope with the additions to the RFC process and template we can better round 
out each RFC. Causing less delays later in the process and allowing all 
community members to actively participate in the review process regardless of 
technical skill level.

Thoughts or comments?

—Udo


Re: [DISCUSS] RFC - Avoid the queueing of dropped events by the primary gateway sender when the gateway sender is stopped

2020-07-10 Thread Alberto Gomez
Hi Xiaojian,

No problem, I had already extended the deadline for comments to next Thursday 
(July the 16th). If more time is needed to get all the relevant comments, we 
can extend it further.

Thanks,

Alberto

From: Xiaojian Zhou 
Sent: Friday, July 10, 2020 6:32 PM
To: dev@geode.apache.org 
Subject: Re: [DISCUSS] RFC - Avoid the queueing of dropped events by the 
primary gateway sender when the gateway sender is stopped

Hi, Alberto:

I was the original author who introduced the tmpDroppedEvents. Due to other 
work, I only got a chance to read the issue on Thursday, which is your deadline. 
Can you hold on a little bit longer, until Monday?

I have been thinking of history of the code changes and issues you encountered. 
I will try to find a light-weight solution with minimum impact to current code.

Regards
Xiaojian Zhou

On 7/8/20, 1:05 PM, "Eric Shu"  wrote:

I think the only case the memory issue occurred is when all gateway senders 
are stopped in the wan-site. Otherwise another member would assume to be the 
primary queue. No more events will be enqueued in tmpDroppedEvents on the 
member with original primary queue. (For parallel wan queue, I do not think 
stopping one gateway queue is a valid case to support.)

For the case where all gateway senders are stopped, there is no need to notify 
any other members in the wan site if the limit is reached. The tmpDroppedEvents is only 
used to remove events from the secondary queue. If no events are enqueued in the 
secondary queue, there is no need to add into tmpDroppedEvents at all. To me, 
it should be only used for limited events to be queued.

Regards,
Eric


Review needed for PR "Different behavior in transactions on partitioned regions between creating the region with a parallel gateway sender vs altering the region to add the parallel gateway sender"

2020-08-26 Thread Alberto Gomez
Hi Geode devs,

I'd appreciate some reviews for PR https://github.com/apache/geode/pull/5476 
related to GEODE-8455 "Different behavior in transactions on partitioned 
regions between creating the region with a parallel gateway sender vs altering 
the region to add the parallel gateway sender".


Thanks in advance,

Alberto


Re: [Discussion] - ClassLoaderService RFC proposal

2020-09-15 Thread Alberto Gomez
Nice proposal, Udo.

Here come some questions:

Is the ClassLoader isolation RFC implemented? I have not seen any references to 
it in the doc or code. To me this RFC seems like a part of the ClassLoader 
isolation RFC as, without it, the original one would not work completely. Is 
this right?

If I understand correctly, to start with there will be two implementations of 
the ClassLoaderService, the DefaultClassLoaderService (with the current 
ClassLoader functionality) and another one with the modular ClassLoader 
functionality as provided by JBoss modules. The latter one will be used when 
the --experimental-classloader flag is used in GFSH. Is this right?

Thanks,

-Alberto G.

From: Udo Kohlmeyer 
Sent: Monday, September 14, 2020 12:42 PM
To: geode 
Subject: [Discussion] - ClassLoaderService RFC proposal

Hi there Apache Geode Devs, (try 2)

Please find attached a proposal for a ClassLoaderService. Please review and 
ponder on it.

https://cwiki.apache.org/confluence/display/GEODE/Introduction+of+ClassLoaderService+into+Geode

All comments are please to be made in this mail thread.

—Udo


Re: PR process and etiquette

2020-10-28 Thread Alberto Gomez
+1 to draft PRs.

By the way, @Blake Bender, I am the one with the 
draft PR for GEODE-8318.

Alberto G.

From: Blake Bender 
Sent: Wednesday, October 28, 2020 2:28 PM
To: dev@geode.apache.org 
Subject: Re: PR process and etiquette

+1 for draft PRs.  Native has been using these for a few months now, and 
they're quite effective.  Right now, for example, we have 6 PRs up, 3 of which 
are draft.  They also turn out to be a convenient way to share work, in certain 
circumstances.  Mario, for instance, has a draft up for GEODE-8318 that is 
strictly WIP.  By having it up as a draft PR, I get notifications when changes 
are pushed, and can run internal tooling and let him know if I find issues.

Thanks,

Blake


On 10/28/20, 1:03 AM, "Udo Kohlmeyer"  wrote:

Great information Darrel. Thank you for sharing that.

--Udo

From: Darrel Schneider 
Date: Wednesday, October 28, 2020 at 3:32 PM
To: dev@geode.apache.org 
Subject: Re: PR process and etiquette
+1 to your idea of using "draft" mode until things are green. Something to 
be aware of is that if your pr branch has conflicts and it is in draft mode 
then your pr tests will not run and the pr page will not tell you that 
conflicts exist. If you see that the pr tests are not actually running and it 
is in draft mode then try merging develop to your pr branch and resolve the 
conflicts.

From: Owen Nichols 
Sent: Tuesday, October 27, 2020 6:03 PM
To: dev@geode.apache.org 
Subject: Re: PR process and etiquette

+1 for using GitHub's draft status to indicate work-in-progress.

Many great suggestions here, however I generally prefer that we don't 
squash commits at any point except the final Squash and Merge to develop.  I 
find it insightful to see how the work evolved.  I also find that review 
comments may start coming in even before you are "ready" for review, and a 
squash or force-push "loses" those comments.

One thing I would like to see more of is PR summaries that explain *why* 
the change is being made, not just *what* is being changed.

Thanks Udo for looking for ways to make the community process work even 
better!

On 10/27/20, 5:41 PM, "Udo Kohlmeyer"  wrote:

Dear Apache Geode Devs,
It is really great going through all the PRs that been submitted. As 
Josh Long is known to say: "I work for PRs".
Whilst going through some of the PRs I do see that there are many PRs 
that have multiple commits against the PR.
I know that the PR submission framework kicks off more testing than we 
do on our own local machines. It is extremely uncommon to submit a PR the first 
time and have all tests go green. Which means we invariably iterate over 
commits to make the build go green.
In this limbo time period, it is hard for any reviewer to know when the 
ticket is ready to be reviewed.
I want to propose that when submitting a PR, it is initially submitted 
as a DRAFT PR. This way, all tests can still be run and work can be done to make 
sure "green" is achieved. Once "green" status has been achieved, the draft can 
then be upgraded to a final PR by pressing the "Ready For Review" button. At 
this point all commits on the branch can then once again be squashed into a 
single commit.
Project committers will then know that the PR is in a state where it 
can be reviewed for completeness and functionality.
In addition, it will be tremendously helpful if anyone submitting a 
PR monitors their PR for activity. If there is no activity for a few days, 
please feel free to ping the Apache Geode Dev community for review. If a review 
is requested, please prioritize some time to address the feedback, as reviewers 
spend time reviewing code and getting an understanding what the code is doing. 
If too much time goes by, between review and addressing the review comments, 
not only does the reviewer lose context, possibly requiring them to spend time 
again to understand what the code was/is supposed to do, but also possibly lose 
interest, as the ticket has now become cold or dropped down the list of PRs.
There are currently many PRs that are in a cold state, where the time 
between review and response has been so long, that both parties (reviewer and 
submitter) have forgotten about the PR.
In the case that the reviews will take more time to address than 
expected, please downgrade the PR to draft status again. At this point, it does 
not mean that reviewers will not be able to help anymore, as you can request a 
reviewer to help with feedback and comments, until one feels that the PR is 
back in a state of final submission.
So, what I'm really asking from the Dev Community:
If you submit a PR, it would be great if you can nudge the 
community if there is no review on the PR. If feedback is provided on a PR, 
please addr

Re: PR process and etiquette

2020-10-29 Thread Alberto Gomez
Hi there,

Here come my 2 cents.

@Udo Kohlmeyer, thanks for your proposals to make this 
community better, and also for your willingness to get feedback from people who 
are new to the community.

In my experience, one of the tricky parts of working in the community is getting 
reviewers for PRs. It is a bit of a mystery what will happen once you submit a 
PR. Sometimes you get a review in a few hours. Sometimes you do not and ask on 
the list for reviewers; after that, sometimes you get reviewers soon, and 
sometimes you don't and need to insist on the list. I have sometimes asked 
for a review from a particular person via e-mail, as I do not have permissions to 
assign reviewers to PRs, and sometimes have not received any answer.
I figure the response time is very dependent on how busy people are. Anyhow, it is 
the uncertainty of what is going on behind the scenes that makes things hard.
If any proposal makes the review process more predictable, I am up for it. I 
think Udo's reflections to come up with a consistent approach to the review 
process are very valuable.

In my opinion, it is a good idea to submit draft PRs while we do not have the 
green light from the CI. I have many times submitted PRs, gone to sleep just to 
realize the morning after that some test cases in the CI failed (either due to 
flaky test cases or due to my changes). Sometimes I had already gotten a review 
and I would have preferred to have it once the CI was clear. Other times I did 
not get a review and I wondered if that (or those) failures would keep 
reviewers away from my PR given that they once looked at it and had test cases 
failures.

Alberto G.

From: Udo Kohlmeyer 
Sent: Thursday, October 29, 2020 1:50 AM
To: dev@geode.apache.org 
Subject: Re: PR process and etiquette

So far I would like to thank everyone for their thoughts and input.

@Dave, I would love to find a solution to the partial sign-off. I’ve been 
experimenting with the “Projects” setting. I wonder if we cannot have a 
“Documentation Check” project, that is added to every PR as a default project. 
We could have different states with the project, which would allow the docs 
folk to know what PRs are new and which still need to be reviewed for docs 
changes.

Now, I don’t know if we can restrict the merging of a PR based upon a state in 
the Project, but at the very least it will provide the ability to have an 
overview of PR with/without docs review. You can have a look at the “Quality 
Review” project I have created. Which I use to track all PRs that I would like 
to review for quality purposes. (code, structure, tests, etc)… I think Docs 
could have something similar.

@Bruce, I’m not trying to create another rule for the sake of creating a rule. 
Why do you believe that we as a community will give any submitter a stink-eye 
just because they did not submit a draft? I certainly would not. I would 
suggest that the submitter maybe submit a draft IF the PR is not in a ready 
state and needs a few more iterations to get to a ready state.

I believe it is easier and better for committers to go through a list of PRs to 
review if they know that the PR passes all of the testing checks: a failure 
in one area might actually cause some code components to change, which might 
void an earlier review of the code. Also, I’m not suggesting that there are no 
reviews before the commit checks go green. You can easily request someone else 
to review whilst in a draft state.

As for knowing which reviewers to tag for a review, that is more limiting. How would I 
as a new PR submitter know WHO I should tag in the PR? Over time we have built 
up a great understanding of who might be a good person to review our code. But 
for a new community member, they do not know this. For them, they submit the 
PR, and someone in the community will review it.

I would also like everyone to think back on their own approach on deciding what 
PRs to review.

Do you look at the PR and decide to wait until all commit checks are green?
Do you go through the list and find one, that you think you can review, whilst 
the commit checks are still running?
Do you only review PRs in which you have been explicitly tagged?
Do you scan the PRs for a commit in an area of “expertise”?
Do you scan the PRs for committers that you know?

Whatever approach we take, I would like us to come up with an approach, that we 
as a community follow, to have a consistent approach to the review.
A consistent way we can evaluate if the code is in a “ready” state?
A consistent way, that the community will know, that when they submit the PR it 
will be looked at.
A consistent way that I, as a committer, will know that if I spend the time to 
review the PR will not be a waste of my time, because it wasn’t ready.

I don’t think community members are repulsed by a project with structure, but I 
do know that I question a project without structure and one where it takes a 
l

Re: Please review and contribute: draft of Nov 2020 Apache board report

2020-11-10 Thread Alberto Gomez
Hi Karen,

According to the membership data I'd say the Committer-to-PMC ratio is closer 
to 2:1 than to 7:4.

Alberto

From: Karen Miller 
Sent: Monday, November 9, 2020 8:25 PM
To: dev@geode.apache.org 
Subject: Please review and contribute: draft of Nov 2020 Apache board report

All, our board report is due in less than 48 hours.  I've included a
first draft below.
Please help correct my mistakes and let me know of blog posts and presentations
that are not yet on the list.

I think that we might also add to our Project Activity category mentions of the
community's focus.  Would it be a good idea to add something like this?
  - We're actively working on v1.13.1, which will contain many bug fixes.

Since the report is due quite soon, please get corrections/additions
to me before
Tuesday Nov 10 (tomorrow) at 3pm Pacific time.
Thanks.
Karen Miller
I work for VMware.
This email is written in my capacity as Chair of the Apache Geode PMC.


## Description:
The mission of Apache Geode is the creation and maintenance of software related
to a data management platform that provides real-time, consistent access to
data-intensive applications throughout widely distributed cloud architectures.

## Issues:
There are no issues requiring board attention.

## Membership Data:
Apache Geode was founded 2016-11-15 (4 years ago)
There are currently 110 committers and 54 PMC members in this project.
The Committer-to-PMC ratio is roughly 7:4.

Community changes, past quarter:
- No new PMC members. Last addition was Alexander Murmann on 2020-03-26.
- Sarah Abbey was added as committer on 2020-09-29

## Project Activity:
- Apache Geode v1.13.0 was released on 2020-09-10.
- 10 community members presented 8 talks at ApacheCon 2020. See
  https://www.youtube.com/playlist?list=PLU2OcwpQkYCxKxd7dVETcwEtx5AEDIp1j for
  a playlist that includes all 8 talks.

## Community Health:
- 259 issues opened in JIRA, past quarter (-15% decrease)
- 212 issues closed in JIRA, past quarter (-16% decrease)
- 463 commits in the past quarter (-19% decrease)
- 57 code contributors in the past quarter (-6% decrease)
- 324 PRs opened on GitHub, past quarter (-16% decrease)
- 325 PRs closed on GitHub, past quarter (-14% decrease)


Review for "C++ native client Function.execute() with onServers does not throw exception if one of the servers goes down while executing the function."

2020-11-16 Thread Alberto Gomez
Hi,

Could somebody review PR https://github.com/apache/geode-native/pull/690  
(https://issues.apache.org/jira/browse/GEODE-8693?filter=-2).

Thanks,

/Alberto G.


apache-geode-1.13.0.tgz not found in LGTM analysis

2020-11-19 Thread Alberto Gomez
Hi,

I am getting the following error in the LGTM analysis of some pull requests 
since yesterday (for example https://github.com/apache/geode-native/pull/690):

[2020-11-19 07:25:41] [build-err] + wget -O apache-geode.tgz 
http://mirror.transip.net/apache/geode/1.13.0/apache-geode-1.13.0.tgz
[2020-11-19 07:25:41] [build-err] --2020-11-19 07:25:41--  
http://mirror.transip.net/apache/geode/1.13.0/apache-geode-1.13.0.tgz
[2020-11-19 07:25:41] [build-err] Resolving mirror.transip.net 
(mirror.transip.net)... 149.210.210.109, 2a01:7c8:1337::100
[2020-11-19 07:25:41] [build-err] Connecting to mirror.transip.net 
(mirror.transip.net)|149.210.210.109|:80... connected.
[2020-11-19 07:25:41] [build-err] HTTP request sent, awaiting response... 404 
Not Found

It seems the tgz file is not available anymore.

Any idea how to fix it?

Thanks,

/Alberto G.


Re: apache-geode-1.13.0.tgz not found in LGTM analysis

2020-11-19 Thread Alberto Gomez
Thanks for the info, Owen.

I have created a JIRA and a PR to update the .lgtm.yml file in the geode-native 
repo: https://github.com/apache/geode-native/pull/698

Any volunteer to review it?

BR,

Alberto

From: Owen Nichols 
Sent: Thursday, November 19, 2020 11:50 AM
To: dev@geode.apache.org 
Subject: Re: apache-geode-1.13.0.tgz not found in LGTM analysis

It looks like it was hardcoded[1] that way recently.  Geode 1.13.1 was just 
announced[2] so you are correct, 1.13.0 is archived and no longer on the 
mirrors.

If maintaining a hardcoded Geode version number in geode-native is necessary, 
the set_versions[3] script should be updated to keep it in sync.

[1] https://github.com/apache/geode-native/blame/develop/.lgtm.yml
[2] 
https://lists.apache.org/x/thread.html/rf937beb3783dc7f2e27a2618586d8cacd8b231793cccab863f4632e3@%3Cdev.geode.apache.org%3E
[3] 
https://github.com/apache/geode/blob/develop/dev-tools/release/set_versions.sh

-Owen

From: Alberto Gomez 
Date: Thursday, November 19, 2020 at 2:39 AM
To: dev@geode.apache.org 
Subject: apache-geode-1.13.0.tgz not found in LGTM analysis
Hi,

I am getting the following error in the LGTM analysis of some pull requests 
since yesterday (for example 
https://github.com/apache/geode-native/pull/690):

[2020-11-19 07:25:41] [build-err] + wget -O apache-geode.tgz 
http://mirror.transip.net/apache/geode/1.13.0/apache-geode-1.13.0.tgz
[2020-11-19 07:25:41] [build-err] --2020-11-19 07:25:41--  
http://mirror.transip.net/apache/geode/1.13.0/apache-geode-1.13.0.tgz
[2020-11-19 07:25:41] [build-err] Resolving mirror.transip.net 
(mirror.transip.net)... 149.210.210.109, 2a01:7c8:1337::100
[2020-11-19 07:25:41] [build-err] Connecting to mirror.transip.net 
(mirror.transip.net)|149.210.210.109|:80... connected.
[2020-11-19 07:25:41] [build-err] HTTP request sent, awaiting response... 404 
Not Found

It seems the tgz file is not available anymore.

Any idea how to fix it?

Thanks,

/Alberto G.


Review for GEODE-8765: Fix NullPointerException when group-transaction-events and events in and not in transactions are sent.

2020-12-10 Thread Alberto Gomez
Hi,

Could I get some reviewers for PR: https://github.com/apache/geode/pull/5829

Thanks in advance,

/Alberto G.


Question about -Dgemfire.GatewaySender.REMOVE_FROM_QUEUE_ON_EXCEPTION

2020-12-10 Thread Alberto Gomez
Hi,

I have recently discovered the 
"gemfire.GatewaySender.REMOVE_FROM_QUEUE_ON_EXCEPTION" Geode system property, 
which changes the default behavior of gateway senders so that when an 
exception occurs while handling an event, instead of proceeding with the rest of 
the events in the batch and sending back to the sender an exception for the event 
(which is ignored), the gateway receiver sleeps for half a second and retries 
applying the event until it succeeds.

As this property is not in the documentation (as far as I know) and it can only 
be activated by using the above property when starting the servers hosting 
gateway senders, I would like to know if it is safe to use, i.e. it will not be 
removed in the future and also if there are any considerations to make when 
using it, given that it is not Geode's default behavior.

I have noticed that there are test cases specifically testing this property 
(see KeepEventsOnGatewaySenderQueueDUnitTest.java).

Thanks in advance,

/Alberto G.
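Since this is a JVM system property, it has to be passed when starting the server JVMs that host gateway senders. A hypothetical example via gfsh (the member name and working directory are placeholders, not taken from this thread):

```shell
# Hypothetical example: passing the undocumented system property to a
# server JVM through gfsh's --J option. Name and directory are placeholders.
gfsh start server --name=server1 --dir=/tmp/server1 \
  --J=-Dgemfire.GatewaySender.REMOVE_FROM_QUEUE_ON_EXCEPTION=true
```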


Re: Question about -Dgemfire.GatewaySender.REMOVE_FROM_QUEUE_ON_EXCEPTION

2020-12-11 Thread Alberto Gomez
Barry, thanks a lot for the clarifications.

Best regards,

/Alberto G.

From: Barrett Oglesby 
Sent: Thursday, December 10, 2020 8:20 PM
To: dev@geode.apache.org 
Subject: Re: Question about 
-Dgemfire.GatewaySender.REMOVE_FROM_QUEUE_ON_EXCEPTION

Alberto,

There are a lot of applications that use this property, so I wouldn't expect it 
to be removed.

Here is some additional detail regarding this property:

The default behavior is if a sender in one site can connect successfully to a 
receiver in another and send batches to it, the acks received for those batches 
cause them to be removed from the sender queue. It doesn't matter what happens 
on that remote site. Every event could fail; every event could succeed. If an 
ack is received, the batch is removed.

The gemfire.GatewaySender.REMOVE_FROM_QUEUE_ON_EXCEPTION system property 
reverses that behavior and keeps batches on the sender queue until all of their 
events are applied successfully by the receiver in the remote site.

The default behavior can cause sites to be out of sync, and the system property 
behavior can cause infinite retries and OOM in the sender if the receiver in 
the remote site never successfully processes a batch.

There is a draft proposal for a callback that is a middle ground between these 
two behaviors, but it hasn't been implemented at this point.
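As an aside, the property is a plain JVM system property (only its name comes from this thread; how Geode consumes it internally is not shown here), so it is supplied with -D when the server JVM starts. A minimal, self-contained sketch of how such a flag is set and read back:

```java
public class RemoveFromQueueFlagDemo {
    // Property name as given in this thread. In practice it would be passed on
    // the server command line, e.g. via gfsh:
    //   start server --name=server1 --J=-Dgemfire.GatewaySender.REMOVE_FROM_QUEUE_ON_EXCEPTION=true
    private static final String PROP =
        "gemfire.GatewaySender.REMOVE_FROM_QUEUE_ON_EXCEPTION";

    public static void main(String[] args) {
        // Set programmatically here only so the example is self-contained.
        System.setProperty(PROP, "true");
        // Boolean.getBoolean returns true only when the property is exactly "true".
        System.out.println(Boolean.getBoolean(PROP));
    }
}
```

This only illustrates the JVM system-property mechanism; which value enables which gateway-sender behavior is described in Barry's reply above, not by this sketch.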

Barry
____
From: Alberto Gomez 
Sent: Thursday, December 10, 2020 7:58 AM
To: dev@geode.apache.org 
Subject: Question about -Dgemfire.GatewaySender.REMOVE_FROM_QUEUE_ON_EXCEPTION

Hi,

I have recently discovered the 
"gemfire.GatewaySender.REMOVE_FROM_QUEUE_ON_EXCEPTION" Geode System property 
that allows changing the default behavior of Gateway Senders so that when an 
exception occurs while handling an event, instead of proceeding with the rest of 
the events in the batch and sending back to the sender an exception for the event 
(which is ignored), the Gateway Receiver sleeps for half a second and retries 
applying the event until it succeeds.

As this property is not in the documentation (as far as I know) and can only 
be activated by setting it when starting the servers hosting gateway 
senders, I would like to know if it is safe to use, i.e. that it will not be 
removed in the future, and also whether there are any considerations to make when 
using it, given that it enables behavior that is not Geode's default.

I have noticed that there are test cases specifically testing this property 
(see KeepEventsOnGatewaySenderQueueDUnitTest.java).

Thanks in advance,

/Alberto G.


[DISCUSS] RFC - Add option to allow newer Geode clients to connect to older Geode servers

2021-01-22 Thread Alberto Gomez
Hi Geode devs,

I have just published the following RFC in the Geode wiki: "Add option to allow 
newer Geode clients to connect to older Geode servers"

https://cwiki.apache.org/confluence/display/GEODE/Add+option+to+allow+newer+Geode+clients+to+connect+to+older+Geode+servers

Could you please provide feedback by Tuesday, February 2nd, 2021?

Thanks,

Alberto G.



Re: [DISCUSS] RFC - Add option to allow newer Geode clients to connect to older Geode servers

2021-01-22 Thread Alberto Gomez
Thanks for your comments, Patrick.

Do you mean have the client always use in the handshake the oldest server 
version it is compatible with?

Sounds like a reasonable simplification. In that case, I would use a flag to 
activate this behavior so that the current behavior (the client sends the 
current version in the handshake) is kept when the flag is not used.

On the other hand if in the future we have clients that are partially 
compatible with an older server version, the System Property with the version 
could allow these clients to connect to that server version assuming that they 
will not use any incompatible feature.

Alberto



From: Patrick Johnson 
Sent: Friday, January 22, 2021 8:35 PM
To: dev@geode.apache.org 
Subject: Re: [DISCUSS] RFC - Add option to allow newer Geode clients to connect 
to older Geode servers

It sounds like you intend to test which versions are compatible with each other 
and maintain a list the client can use to reject the setting of force-version 
when set to an incompatible version. If that’s the case, why not just have the 
handshake look at that list and automatically connect with any versions that it 
is known to be compatible with? Then you wouldn’t even have to set the property.

> On Jan 22, 2021, at 11:05 AM, Alberto Gomez  wrote:
>
> Hi Geode devs,
>
> I have just published the following RFC in the Geode wiki: "Add option to 
> allow newer Geode clients to connect to older Geode servers"
>
> https://cwiki.apache.org/confluence/display/GEODE/Add+option+to+allow+newer+Geode+clients+to+connect+to+older+Geode+servers
>
> Could you please provide feedback by Tuesday, February 2nd, 2021?
>
> Thanks,
>
> Alberto G.
>



Re: [DISCUSS] RFC - Add option to allow newer Geode clients to connect to older Geode servers

2021-01-26 Thread Alberto Gomez
Hi,

I have updated the proposal in the RFC by adding Patrick's suggestion (if I 
have understood it correctly).

Best regards,

Alberto

From: Alberto Gomez 
Sent: Friday, January 22, 2021 10:41 PM
To: dev@geode.apache.org 
Subject: Re: [DISCUSS] RFC - Add option to allow newer Geode clients to connect 
to older Geode servers

Thanks for your comments, Patrick.

Do you mean have the client always use in the handshake the oldest server 
version it is compatible with?

Sounds like a reasonable simplification. In that case, I would use a flag to 
activate this behavior so that the current behavior (the client sends the 
current version in the handshake) is kept when the flag is not used.

On the other hand if in the future we have clients that are partially 
compatible with an older server version, the System Property with the version 
could allow these clients to connect to that server version assuming that they 
will not use any incompatible feature.

Alberto



From: Patrick Johnson 
Sent: Friday, January 22, 2021 8:35 PM
To: dev@geode.apache.org 
Subject: Re: [DISCUSS] RFC - Add option to allow newer Geode clients to connect 
to older Geode servers

It sounds like you intend to test which versions are compatible with each other 
and maintain a list the client can use to reject the setting of force-version 
when set to an incompatible version. If that’s the case, why not just have the 
handshake look at that list and automatically connect with any versions that it 
is known to be compatible with? Then you wouldn’t even have to set the property.

> On Jan 22, 2021, at 11:05 AM, Alberto Gomez  wrote:
>
> Hi Geode devs,
>
> I have just published the following RFC in the Geode wiki: "Add option to 
> allow newer Geode clients to connect to older Geode servers"
>
> https://cwiki.apache.org/confluence/display/GEODE/Add+option+to+allow+newer+Geode+clients+to+connect+to+older+Geode+servers
>
> Could you please provide feedback by Tuesday, February 2nd, 2021?
>
> Thanks,
>
> Alberto G.
>



Re: [DISCUSS] RFC - Add option to allow newer Geode clients to connect to older Geode servers

2021-01-29 Thread Alberto Gomez
Hi Dan,

Thanks a lot for your comments.

The scope of the RFC is not very ambitious. As I pointed out in it, the idea is 
not to implement backward compatibility of clients with older servers. 
Rather, the aim is to take advantage of the fact that serialization changes, or 
other types of changes that may break this compatibility, are not very frequent. 
For the cases where there have been no incompatible changes, one of the 
proposed System Properties would make it possible for a client to communicate 
with an older compatible server without implementing anything 
extra. And we would have the test cases in place to ensure this. For the 
cases where compatibility has been broken, it will not be possible for the 
client to communicate with the older server, and we would also have the tests 
showing that this communication is not possible even if the proposed System 
Property is used.

I do not know how costly it would be to implement and maintain the alternative 
approach you suggest, with the negotiation required to support full backward 
compatibility. I would leave that to a different RFC. The good thing is that 
the current RFC could serve as a first step toward the second, if it is 
agreed that this second feature is worth putting in Geode.

Best regards,

Alberto

From: Dan Smith 
Sent: Friday, January 29, 2021 1:56 AM
To: dev@geode.apache.org 
Subject: Re: [DISCUSS] RFC - Add option to allow newer Geode clients to connect 
to older Geode servers

I think just sending the old version will only work until we actually make any 
changes to the protocol. Once we do, serialization will break unless we also 
change the client to pretend to be that old version, including the way it 
serializes and deserializes messages. With this proposal there will be no way 
for the client to use new features with a newer server since the version number 
of the client is set with a system property.

An alternative would be to have the client and the server need a way to 
negotiate which protocol they are going to communicate over. We do this already 
for WAN. WAN senders can be a higher version than receivers, otherwise we 
couldn't upgrade an Active/Active WAN. What happens is that the WAN receiver 
will accept a newer versioned client, and it sends back its own version. The 
client reads the receivers version and adjusts accordingly. You can see this in 
ClientSideHandshakeImpl.handshakeWithServer.

This will require a lot of testing to make sure that users won't see strange 
corruption related errors related to serialization changes.

-Dan
____
From: Alberto Gomez 
Sent: Tuesday, January 26, 2021 6:45 AM
To: dev@geode.apache.org 
Subject: Re: [DISCUSS] RFC - Add option to allow newer Geode clients to connect 
to older Geode servers

Hi,

I have updated the proposal in the RFC by adding Patrick's suggestion (if I 
have understood it correctly).

Best regards,

Alberto
____
From: Alberto Gomez 
Sent: Friday, January 22, 2021 10:41 PM
To: dev@geode.apache.org 
Subject: Re: [DISCUSS] RFC - Add option to allow newer Geode clients to connect 
to older Geode servers

Thanks for your comments, Patrick.

Do you mean have the client always use in the handshake the oldest server 
version it is compatible with?

Sounds like a reasonable simplification. In that case, I would use a flag to 
activate this behavior so that the current behavior (the client sends the 
current version in the handshake) is kept when the flag is not used.

On the other hand if in the future we have clients that are partially 
compatible with an older server version, the System Property with the version 
could allow these clients to connect to that server version assuming that they 
will not use any incompatible feature.

Alberto



From: Patrick Johnson 
Sent: Friday, January 22, 2021 8:35 PM
To: dev@geode.apache.org 
Subject: Re: [DISCUSS] RFC - Add option to allow newer Geode clients to connect 
to older Geode servers

It sounds like you intend to test which versions are compatible with each other 
and maintain a list the client can use to reject the setting of force-version 
when set to an incompatible version. If that’s the case, why not just have the 
handshake look at that list and automatically connect with any versions that it 
is known to be compatible with? Then you wouldn’t even have to set the property.

> On Jan 22, 2021, at 11:05 AM, Alberto Gomez  wrote:
>
> Hi Geode devs,
>
> I have just published the following RFC in the Geode wiki: "Add option to 
> allow newer Geode clients to connect to older Geode servers"
>
> https://cwiki.apache.org/confluence/display/GEODE/Add+option+to+allow+newer+Geode+clients+to+connect+to+older+Geode+servers

Re: [DISCUSS] RFC - Add option to allow newer Geode clients to connect to older Geode servers

2021-02-04 Thread Alberto Gomez
After thinking about the pros and cons of the RFC solution and the 
alternatives, taking into account the comments received, I have decided to move 
it to the Icebox and not implement it for the time being.

Thanks all for the valuable feedback.

Best regards,

Alberto G.

From: Bruce Schuchardt 
Sent: Tuesday, February 2, 2021 5:41 PM
To: dev@geode.apache.org 
Subject: Re: [DISCUSS] RFC - Add option to allow newer Geode clients to connect 
to older Geode servers

Oh, but I forgot about WAN changes that may have been made to the handshake to 
allow different versions in different clusters.  Jake might be right about this.

On 2/2/21, 8:31 AM, "Bruce Schuchardt"  wrote:

I think it's only the locator connections that do this.  Regular 
client->server connections using the handshake code just send the client's 
current version, which must not be newer than the server's version.

On 2/1/21, 9:53 AM, "Jacob Barrett"  wrote:

Having just spent some time yanking out some of the really, really old 
version support, I think a naive version-knocking approach would work. During 
the client handshake, the server will reject and close the connection of any 
client with a newer version number than it supports. The client could use this 
as a signal to downgrade its version and try again. This could continue until the 
server accepts the client. We would need to decide whether we expect the 
entire membership to be at the same version or whether the version knocking needs 
to be done on a per-member basis. Obviously, knocking for every connection is not 
ideal, so some sort of heuristic should be maintained for the life of the client.

Interestingly enough, the clients sort of did this up until the merge of 
the version cleanups. All clients first made connections using the very old 
protocol version so that the server would send its version back. Then the 
client would disconnect and reconnect using its current version. The same could 
be done today with the current protocol version: the clients could make a first 
connection with v1.0.0, get the server version, close, and reconnect identifying 
themselves at that server version.
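Under the stated assumption that a server simply rejects handshakes from clients whose version ordinal is newer than its own, the knocking loop could look roughly like the sketch below. All names here (ORDINALS, tryHandshake, negotiate) are illustrative, not Geode APIs:

```java
import java.util.List;

public class VersionKnock {
    // Candidate protocol ordinals, newest first (illustrative values only).
    static final List<Integer> ORDINALS = List.of(120, 115, 110, 100);

    // Stand-in for the real handshake: the server accepts any client
    // ordinal that is not newer than its own, per the assumption above.
    static boolean tryHandshake(int clientOrdinal, int serverOrdinal) {
        return clientOrdinal <= serverOrdinal;
    }

    // "Knock" with progressively older ordinals until one is accepted.
    static int negotiate(int serverOrdinal) {
        for (int ordinal : ORDINALS) {
            if (tryHandshake(ordinal, serverOrdinal)) {
                // Cache this result for the life of the client, so that
                // subsequent connections skip the knocking.
                return ordinal;
            }
        }
        throw new IllegalStateException("no compatible protocol version");
    }

    public static void main(String[] args) {
        // A client that supports up to ordinal 120 talking to a server at 115.
        System.out.println(negotiate(115)); // prints 115
    }
}
```

The per-member vs. whole-membership question from the thread would decide whether the cached ordinal is kept per server endpoint or once for the whole pool.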

-Jake


On Jan 29, 2021, at 3:35 PM, Dan Smith <dasm...@vmware.com> wrote:

Well, I have no objection to adding a system property for this if you 
want to try it. Since those properties aren't technically part of the public 
API I don't think we need to offer full support for what happens when the 
setting breaks. I'm just thinking ahead to what will happen when the protocol 
does change. At that point setting the system property will not work, unless 
the client has the capability to negotiate and discover the server version and 
use the old protocol the way that WAN does.

Do keep in mind that failures may not be obvious if the serialization 
protocol changes and your client is pretending to be a different version. I 
think it's possible that the errors might show up only in log messages or 
corrupted values, and only if you are using whatever features are affected by a 
protocol change.

-Dan

From: Alberto Gomez <alberto.go...@est.tech>
Sent: Friday, January 29, 2021 11:40 AM
To: dev@geode.apache.org
Subject: Re: [DISCUSS] RFC - Add option to allow newer Geode clients to 
connect to older Geode servers

Hi Dan,

Thanks a lot for your comments.

The scope of the RFC is not very ambitious. As I pointed out in it, the 
idea is not to implement backward compatibility of clients with older 
servers. Rather, the aim is to take advantage of the fact that 
serialization changes, or other types of changes that may break this compatibility, 
are not very frequent. For the cases where there have been no incompatible 
changes, one of the proposed System Properties would make it possible for a 
client to communicate with an older compatible server without 
implementing anything extra. And we would have the test cases in place to 
ensure this. For the cases where compatibility has been broken, it will not 
be possible for the client to communicate with the older server, and we would also 
have the tests showing that this communication is not possible even if the 
proposed System Property is used.

I do not know how costly it would be to implement and maintain the 
alternative approach you suggest, with the negotiation required to support full 
backward compatibility. I would leave that to a different RFC. The good thing 
is that the current RFC could serve as a first step toward the second, if 
it is agreed that this second feature is worth putting in Geode.

Best regards,

Alberto

Question about Map indexes

2021-02-11 Thread Alberto Gomez
Hi,

We have observed that creating an index on a Map field causes the creation of 
an index entry for every entry created in the region containing the Map, no 
matter whether the Map field contains the key used in the index.
Nevertheless, we would expect only entries whose Map field contains the key 
used in the index to have a corresponding index entry. With the current behavior, 
the memory consumed by the index can be much higher than needed, depending on 
the percentage of entries whose Map field contains the key used in the index.

---
Example:
We have a region with entries whose key type is a String and the value type is 
an object with a field called "field1" of Map type.

We expect to run queries on the region like the following:

SELECT * from /example-region1 p WHERE p.field1['mapkey1']=$1"

We create a Map index to speed up the above queries:

gfsh> create index --name=myIndex --expression="r.field1['mapkey1']" 
--region="/example-region1 r"

We do the following puts:
- Put entry with key="key1" and with value=
- Put entry with key="key2" and with value=

The observation is that Geode creates an index entry for each of the two region 
entries. For the first entry, the internal indexKey is "key1" and for the second one, the 
internal indexKey is null.

These are the stats shown by gfsh after doing the above puts:

gfsh>list indexes --with-stats=yes
Member Name | Member ID                           | Region Path      | Name     | Type  | Indexed Expression  | From Clause        | Valid Index | Uses | Updates | Update Time | Keys | Values
----------- | ----------------------------------- | ---------------- | -------- | ----- | ------------------- | ------------------ | ----------- | ---- | ------- | ----------- | ---- | ------
server1     | 192.168.0.26(server1:1109606):41000 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 1    | 1       | 0           | 1    | 1
server2     | 192.168.0.26(server2:1109695):41001 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 1    | 1       | 0           | 1    | 1
---

Is there any reason why Geode would create an index entry for the second entry 
given that the Map field does not contain the key in the Map index?

I have created a draft pull request changing the behavior of Geode to not 
create the index entry when the Map field does not contain the key used in the 
index. Only two Unit test cases had to be adjusted. Please see: 
https://github.com/apache/geode/pull/6028

With this change and the same scenario as the one in the example, only one 
index entry is created. The stats shown by gfsh after the change are the 
following:

gfsh>list indexes --with-stats=yes
Member Name | Member ID                           | Region Path      | Name     | Type  | Indexed Expression  | From Clause        | Valid Index | Uses | Updates | Update Time | Keys | Values
----------- | ----------------------------------- | ---------------- | -------- | ----- | ------------------- | ------------------ | ----------- | ---- | ------- | ----------- | ---- | ------
server1     | 192.168.0.26(server1:1102192):41000 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 2    | 1       | 0           | 0    | 0
server2     | 192.168.0.26(server2:1102279):41001 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 2    | 1       | 0           | 1    | 1


Could someone tell me whether the current behavior is incorrect, or whether I am 
missing something and something else will stop working with the change I am proposing?

Thanks in advance,

/Alberto G.


Re: Question about Map indexes

2021-02-13 Thread Alberto Gomez
Jason, thanks for the help.

I added a new commit to the pull request that solves the issue without 
(apparently) breaking anything.

The problem was that when adding an index entry we need to distinguish the 
case where the Map does not contain the key from the case where the Map 
contains the key but the value for the key is null. If we use Map.get() we get 
null in both cases, but we should only add the index entry in the latter case 
(when the map contains the key but the value corresponding to it is null).
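The ambiguity can be reproduced with a plain java.util.Map; this small sketch shows why get() alone cannot distinguish the two cases and how containsKey() resolves it:

```java
import java.util.HashMap;
import java.util.Map;

public class MapKeyDemo {
    public static void main(String[] args) {
        Map<String, String> m = new HashMap<>();
        m.put("mapkey1", null); // key present, value explicitly null

        // Map.get() cannot tell "absent key" apart from "key mapped to null":
        System.out.println(m.get("mapkey1")); // null (key present)
        System.out.println(m.get("mapkey2")); // null (key absent)

        // containsKey() makes the distinction explicit:
        System.out.println(m.containsKey("mapkey1")); // true
        System.out.println(m.containsKey("mapkey2")); // false
    }
}
```

A containsKey() check before get() would avoid the arbitrary-exception workaround mentioned below, at the cost of a second lookup (or a single getOrDefault() with a sentinel value).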

I am not particularly proud of the solution because I use an arbitrary 
exception to distinguish the two cases. Anyway, could you please check 
whether we are going in the right direction?

Thanks,

Alberto



From: Jason Huynh 
Sent: Thursday, February 11, 2021 10:57 PM
To: dev@geode.apache.org 
Subject: Re: Question about Map indexes

Hi Alberto,

I haven't checked the PR yet, just read through the email.  The first thought 
that comes to mind is when someone does a != query.  The index still has to 
supply the correct answer to the query (all entries with null or undefined 
values possibly)

I'll try to think of other cases where it might matter.  There may be other 
ways to execute the query but it would probably take a bit of reworking.. (ill 
check your pr to see if this is already addressed.   Sorry if it is!)

-Jason

On 2/11/21, 8:28 AM, "Alberto Gomez"  wrote:

Hi,

We have observed that creating an index on a Map field causes the creation 
of an index entry for every entry created in the region containing the Map, no 
matter if the Map field contained the key used in the index.
Nevertheless, we would expect that only entries whose Map field contain the 
key used in the index would have the corresponding index entry. With this 
behavior, the memory consumed by the index could be much higher than needed 
depending on the percentage of entries whose Map field contained the key in the 
index.

---
Example:
We have a region with entries whose key type is a String and the value type 
is an object with a field called "field1" of Map type.

We expect to run queries on the region like the following:

SELECT * from /example-region1 p WHERE p.field1['mapkey1']=$1"

We create a Map index to speed up the above queries:

gfsh> create index --name=myIndex --expression="r.field1['mapkey1']" 
--region="/example-region1 r"

We do the following puts:
- Put entry with key="key1" and with value=
- Put entry with key="key2" and with value=

The observation is that Geode creates two index entries for each entry. For 
the first entry, the internal indexKey is "key1" and for the second one, the 
internal indexKey is null.

These are the stats shown by gfsh after doing the above puts:

gfsh>list indexes --with-stats=yes
Member Name | Member ID                           | Region Path      | Name     | Type  | Indexed Expression  | From Clause        | Valid Index | Uses | Updates | Update Time | Keys | Values
----------- | ----------------------------------- | ---------------- | -------- | ----- | ------------------- | ------------------ | ----------- | ---- | ------- | ----------- | ---- | ------
server1     | 192.168.0.26(server1:1109606):41000 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 1    | 1       | 0           | 1    | 1
server2     | 192.168.0.26(server2:1109695):41001 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 1    | 1       | 0           | 1    | 1
---

Is there any reason why Geode would create an index entry for the second 
entry given that the Map field does not contain the key in the Map index?

I have created a draft pull request changing the behavior of Geode to not 
create the index entry when the Map field does not contain the key used in the 
index. Only two Unit test cases had to be adjusted. Please see: 
https://github.com/apache/geode/pull/6028

With this change and the same scenario as the one in the example, only one 
index entry is created. The stats shown by gfsh after the change are the 
following:

gfsh>list indexes --with-stats=yes
Member Name |Member ID 

Re: Question about Map indexes

2021-02-16 Thread Alberto Gomez
Hi again,

After investigating a bit more how Map indexes work, I have seen that they are 
also used to support queries that use "!=", as Jason pointed out.

I would have expected queries using != or NOT not to make use of 
indexes, as is common practice in databases ([1]) and as the Geode 
documentation seems to suggest ([2]):

"Indexes are not used in expressions that contain NOT, so in a WHERE clause of 
a query, qty >= 10 could have an index on qty applied for efficiency. However, 
NOT(qty < 10) could not have the same index applied."

Could somebody please confirm or deny whether what the documentation states above is 
true, and whether the conclusion can also be extended to the use of the 
!= operator?

I also think that the documentation about indexes could be improved in at least 
two areas:

  *   Information about range indexes. While there is a section for the 
deprecated Hash Indexes, there is no specific section for Range Indexes.
  *   Information about Map indexes. The information about these indexes lacks 
detail. For example, how does the index work when the entry does not 
contain the Map field for which there is an index? How does it behave when the 
Map field does not have the key used in the index? How does it behave when the key 
is null, or when the value is null?

Does anyone have plans to extend the information about indexes in Geode?

Thanks,

Alberto G.

[1] 
https://stackoverflow.com/questions/1759476/database-index-not-used-if-the-where-criteria-is
[2] 
https://geode.apache.org/docs/guide/19/developing/query_index/indexing_guidelines.html


________
From: Alberto Gomez 
Sent: Saturday, February 13, 2021 5:40 PM
To: dev@geode.apache.org 
Subject: Re: Question about Map indexes

Jason, thanks for the help.

I added a new commit to the pull request that solves the issue without 
(apparently) breaking anything.

The problem was that when adding an index entry we need to distinguish between 
the case where the Map does not contain the key from the case where the Map 
contains the key but the value for the key is null. If we use Map.get() we get 
in both cases null but we should only add the index entry in the latter case 
(when the map contains the key but the value corresponding to it is null).

I am not particularly proud of the solution because I use of an arbitrary 
exception to be able to distinguish both cases. Anyway, could you please check 
if we are in the right direction?

Thanks,

Alberto



From: Jason Huynh 
Sent: Thursday, February 11, 2021 10:57 PM
To: dev@geode.apache.org 
Subject: Re: Question about Map indexes

Hi Alberto,

I haven't checked the PR yet, just read through the email.  The first thought 
that comes to mind is when someone does a != query.  The index still has to 
supply the correct answer to the query (all entries with null or undefined 
values possibly)

I'll try to think of other cases where it might matter.  There may be other 
ways to execute the query but it would probably take a bit of reworking.. (ill 
check your pr to see if this is already addressed.   Sorry if it is!)

-Jason

On 2/11/21, 8:28 AM, "Alberto Gomez"  wrote:

Hi,

We have observed that creating an index on a Map field causes the creation 
of an index entry for every entry created in the region containing the Map, no 
matter if the Map field contained the key used in the index.
Nevertheless, we would expect that only entries whose Map field contain the 
key used in the index would have the corresponding index entry. With this 
behavior, the memory consumed by the index could be much higher than needed 
depending on the percentage of entries whose Map field contained the key in the 
index.

---
Example:
We have a region with entries whose key type is a String and the value type 
is an object with a field called "field1" of Map type.

We expect to run queries on the region like the following:

SELECT * from /example-region1 p WHERE p.field1['mapkey1']=$1"

We create a Map index to speed up the above queries:

gfsh> create index --name=myIndex --expression="r.field1['mapkey1']" 
--region="/example-region1 r"

We do the following puts:
- Put entry with key="key1" and with value=
- Put entry with key="key2" and with value=

The observation is that Geode creates two index entries for each entry. For 
the first entry, the internal indexKey is "key1" and for the second one, the 
internal indexKey is null.

These are the stats shown by gfsh after doing the above puts:

gfsh>list indexes --with-stats=yes
Member Name |Member ID|   Re

Re: [DISCUSS] client/server communications and versioning

2021-02-23 Thread Alberto Gomez
+1

This proposal makes a lot of sense.

Besides, I recently sent a proposal to allow clients to communicate with 
servers on an older version in cases where compatibility was not broken in the new 
version of the client ([1]). With your proposal, the aim of that RFC could also 
be achieved. Following the example you added to the JIRA ticket, a client 
with version 1.17 would be able to communicate with servers on version 1.15 
or 1.16, given that the client/server protocol for the client would be 1.15.

[1] 
https://cwiki.apache.org/confluence/display/GEODE/Add+option+to+allow+newer+Geode+clients+to+connect+to+older+Geode+servers

BR,

Alberto G.

From: Bruce Schuchardt 
Sent: Tuesday, February 23, 2021 6:38 PM
To: dev@geode.apache.org 
Subject: [DISCUSS] client/server communications and versioning

I’m considering a change in client/server communications that I would like 
feedback on.

We haven’t changed on-wire client/server communications since v1.8 yet we tie 
these communications to the current version.  The support/1.14 branch 
identifies clients as needing v1.14 for serialization/deserialization, for 
instance, even though nothing has changed in years.

If we put out a patch release, say v1.12.1, clients running that patch version 
cannot communicate with servers running v1.12.0.  They also can’t communicate 
with a server running v1.13.0 because that server doesn’t know anything about 
v1.12.1 and will reject the client.  To solve that problem we currently have to 
issue a new 1.13 release that knows about v1.12.1 and users have to roll their 
servers to the new v1.13.1.

I propose to change this so that the client’s on-wire version is decoupled from 
the “current version”.  A client can be running v1.14.0 but could use v1.8.0 as 
its protocol version for communications.

This would have an impact on contributors to the project.  If you need to 
change the client/server protocol version you will need to modify 
KnownVersion.java to specify the change, and should let everyone know about the 
change.

See https://issues.apache.org/jira/browse/GEODE-8963


Question about closing of all connections towards an endpoint in C++ native client

2021-02-24 Thread Alberto Gomez
Hi,

Running some tests with the C++ native client and looking at the code, I have 
observed that when an error (timeout, IO error) is detected on a connection 
towards an endpoint, not only is the faulty connection closed, but the endpoint 
is also set to "not connected" status, which eventually causes all other open 
connections towards that endpoint to be closed when they are used.

I have not seen that behavior in the Java client, i.e., when the Java client 
detects an error in a connection towards an endpoint, it closes that 
connection but does not act on other connections towards that endpoint.

Are my observations correct?

If so, shouldn't the C++ native client be aligned with the Java client?

Thanks,

Alberto G.


Re: Question about closing of all connections towards an endpoint in C++ native client

2021-02-24 Thread Alberto Gomez
Thanks Jake. I totally agree with you.

Interestingly, that logic was recently removed from the C++ client when we 
switched from ACE_SOCK to boost::asio, so what I said in my previous e-mail 
pertains to C++ client versions 1.13 and older.

Alberto

From: Jacob Barrett 
Sent: Wednesday, February 24, 2021 4:24 PM
To: dev@geode.apache.org 
Subject: Re: Question about closing of all connections towards an endpoint in 
C++ native client

The Java client does the same thing under certain conditions. Neither of the 
clients should do this, though: the model is way too aggressive and I think we 
should remove that logic entirely. If we want something that proactively 
checks the other connections to that server, we could have a background thread 
go through and send a ping request on the next connection in the queue. If it 
doesn’t respond, terminate that connection, and continue until a pong response 
is received.

-Jake
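
The background check described above could look roughly like this. The names 
(Conn, pruneUntilPong, fake) are hypothetical and the real client connection 
types are not shaped like this; the point is that instead of closing every 
connection to an endpoint after one failure, we ping the pooled connections 
and close only the dead ones.

```java
// Hypothetical sketch of ping-based pruning of one endpoint's connection pool.
import java.util.Queue;

public class EndpointHealthCheck {
    interface Conn {
        boolean ping();  // true if a pong came back in time
        void close();
    }

    /**
     * Ping queued connections to one endpoint, closing dead ones, until a
     * connection answers. Returns how many connections were closed.
     */
    static int pruneUntilPong(Queue<Conn> pool) {
        int closed = 0;
        while (!pool.isEmpty()) {
            if (pool.peek().ping()) {
                break;               // endpoint answered; keep the rest
            }
            pool.poll().close();     // no pong: drop this connection, try next
            closed++;
        }
        return closed;
    }

    /** Test double standing in for a real connection. */
    static Conn fake(boolean alive) {
        return new Conn() {
            public boolean ping() { return alive; }
            public void close() { }
        };
    }
}
```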




[DISCUSS] CODEOWNERS mechanism feedback

2021-03-17 Thread Alberto Gomez
Hi,

It's been more than two months since the CODEOWNERS file was put in place to 
automatically add reviewers to pull requests. While we have seen the great 
benefit of having the experts on each area automatically assigned as reviewers 
to every pull request, I have the feeling that the review process is taking 
longer now. Some possible reasons could be:
1. Some code owners might be getting more reviews than they can cope with and 
they have become a bottleneck.
2. While prior to this change only two approvals were necessary, with the new 
process the number of approvals from reviewers required to approve a pull 
request can be much higher than two, depending on the number of areas touched 
by the PR.

Again, this might just be my feeling or something incidental and only related 
to the pull requests I have been working on. In any case, I would like to know 
if others are experiencing this slowdown in the review of their pull requests.

Also, I do not know if there are metrics available for the review process, for 
example the average time from when a pull request is submitted (or a change is 
made to it) until a review arrives. Having these kinds of metrics would be 
very useful: they would let us evaluate this mechanism from perspectives other 
than the quality of the reviews, and propose corrective actions if necessary.
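
For illustration only, a minimal sketch of how such a metric could be computed 
once the (opened, first review) timestamps have been exported, e.g. from the 
GitHub REST API. ReviewLatency and its method name are made up for this 
example:

```java
// Illustrative only: average wall-clock time from PR submission to first review.
import java.time.Duration;
import java.time.Instant;
import java.util.List;

public class ReviewLatency {
    /** Each entry is a pair: [0] = PR opened, [1] = first review posted. */
    static Duration averageTimeToFirstReview(List<Instant[]> prs) {
        long totalSeconds = 0;
        for (Instant[] pr : prs) {
            totalSeconds += Duration.between(pr[0], pr[1]).getSeconds();
        }
        return Duration.ofSeconds(totalSeconds / prs.size());
    }
}
```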

Best regards,

Alberto


CODEWATCHERS file effects

2021-03-23 Thread Alberto Gomez
Hi,

I have recently added myself to the CODEWATCHERS file to be assigned as a 
reviewer to PRs touching certain areas of the code, but it seems that I am 
being added to many more PRs than I intended, even to Draft PRs.

Is anybody else experiencing the same?

Thanks,

Alberto


Re: CODEWATCHERS file effects

2021-03-24 Thread Alberto Gomez
Hi Owen,

Here are some PRs I feel I was added to in error:

https://github.com/apache/geode/pull/6177
https://github.com/apache/geode/pull/6156
https://github.com/apache/geode/pull/6153
https://github.com/apache/geode/pull/6151
https://github.com/apache/geode/pull/6075
https://github.com/apache/geode/pull/6116

Best regards,

Alberto


From: Owen Nichols 
Sent: Wednesday, March 24, 2021 2:30 AM
To: dev@geode.apache.org 
Subject: Re: CODEWATCHERS file effects

Hi Alberto, is there a specific PR you feel you were added to in error?  I 
spot-checked #6179 and there was one test change in geode-wan so that one seems 
correct.

I am looking for a solution to avoid adding watchers to draft PRs until they 
are taken out of draft mode, but it's non-trivial so I don't have an ETA yet.



