[ https://issues.apache.org/jira/browse/GEODE-7830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052311#comment-17052311 ]

Aaron Lindsey commented on GEODE-7830:
--------------------------------------

The "operator" in my case is actually a Kubernetes controller. It calls rebalance each time Kubernetes tries to stop a Geode server to ensure data is not lost. In this case it is very common to call rebalance when there are no regions, e.g. during a scaling operation before the user has created any regions. Right now we have to parse the status message to determine if the rebalance failed due to the no-op error, and then ignore it.

Do you know if having no regions is the only reason the rebalance API will return the no-op error? If we were sure of that, then we could call list regions to make sure regions exist before calling rebalance.

FWIW, I think it would be best to assume that consumers of this REST API will be programs, not humans, and therefore we should design it in such a way that it would be easy to consume programmatically. It's much more reliable to programmatically check the size of an array rather than parse a status message to determine if the rebalance succeeded.

> Management REST API rebalance endpoints return confusing operationResults
> -------------------------------------------------------------------------
>
>                 Key: GEODE-7830
>                 URL: https://issues.apache.org/jira/browse/GEODE-7830
>             Project: Geode
>          Issue Type: Bug
>          Components: management
>            Reporter: Aaron Lindsey
>            Assignee: Darrel Schneider
>            Priority: Major
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We observed odd behavior regarding the operationResult object returned in the
> rebalance API:
> # It contains success=false if the cluster has no regions or has no servers.
> This is confusing because the rebalance didn't fail; it just didn't have
> anything to rebalance, so it was basically a no-op. As a consumer of this API,
> I need to be able to distinguish between "real" failures and this "no-op"
> failure, and I should not have to write code to parse the "statusMessage" to
> do that.
> # Sometimes, success=true and other times success=false for the same
> statusMessage: "Distributed system has no regions that can be rebalanced."
> This is confusing because I don't know why it sometimes considers this a
> failure and other times considers it a success. If #1 above is fixed, then
> this would not be an issue because it would always return success=true for
> this particular statusMessage.
>
> Here is an example of two confusing operationResults we observed:
> {code:json}
> {
>   "result": [
>     {
>       "statusCode": "OK",
>       "links": {
>         "self": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/15dfe6ef-acaf-4a45-9b55-1d855a977ba8",
>         "list": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"
>       },
>       "operationStart": "2020-02-25T18:53:34.058Z",
>       "operationEnd": "2020-02-25T18:53:34.063Z",
>       "operationId": "15dfe6ef-acaf-4a45-9b55-1d855a977ba8",
>       "operation": {
>         "simulate": false
>       },
>       "operationResult": {
>         "statusMessage": "Distributed system has no regions that can be rebalanced.",
>         "success": true
>       }
>     },
>     {
>       "statusCode": "OK",
>       "links": {
>         "self": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/8218ce0d-e3b8-4c49-b925-665a28e821c3",
>         "list": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"
>       },
>       "operationStart": "2020-02-25T18:53:45.650Z",
>       "operationEnd": "2020-02-25T18:53:45.654Z",
>       "operationId": "8218ce0d-e3b8-4c49-b925-665a28e821c3",
>       "operation": {
>         "simulate": false
>       },
>       "operationResult": {
>         "statusMessage": "Distributed system has no regions that can be rebalanced.",
>         "success": false
>       }
>     }
>   ],
>   "statusCode": "OK"
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
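
For reference, the status-message workaround mentioned in the comment (parse "statusMessage" and ignore the no-op case) might look roughly like the sketch below. This is only an illustration, not Geode code: the helper name classify_rebalance and the labels "ok", "noop" and "failed" are made up here, and the only message treated as a no-op is the one quoted in this issue. Whether other messages can also signal a no-op is exactly the open question above, so the sketch treats anything else with success != true as a real failure.

```python
# Hypothetical client-side helper; nothing here is part of the Geode API.
# The message text is copied from the operationResult examples in this issue.
NO_OP_MESSAGE = "Distributed system has no regions that can be rebalanced."


def classify_rebalance(operation: dict) -> str:
    """Classify one entry of the "result" array as "ok", "noop" or "failed"."""
    result = operation.get("operationResult") or {}
    # Collapse whitespace so a wrapped or padded message still matches.
    message = " ".join(str(result.get("statusMessage", "")).split())
    if message == NO_OP_MESSAGE:
        # The "success" flag is ignored here because the issue shows both
        # true and false being returned with this same message.
        return "noop"
    # A missing operationResult or missing flag is treated as a failure.
    return "ok" if result.get("success") is True else "failed"
```

Both operationResults quoted in the description would be classified as "noop" by this helper, regardless of their "success" value. It stays fragile in the way the comment describes: it breaks as soon as the message wording changes.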