I had another "incorrect" failover this morning. The secondary server was
responding and processing user requests fine (as far as customers and
monitoring reported) but there is a nearly 2 minute gap in the sql log-where of
course no updates are made to the servgrp_board table. According to what you
say here, that would cause the failover alright...but what would cause the
"hang-up" like this? There are actually no logs at all for this 2 minute time
period....That is obviously a different issue.
BUT the other intervals it appears to be finding delinquency, I don't see a gap
like that. All of the updates take place (see example below). Does it matter
what order they happen in? What's the logic of figuring out delinquency?
>From SQL log on ncc392:
000000 > <USER: > /* Wed Jun 22
2011 09:59:00.9944 */SELECT checkInterval FROM servgrp_config FOR UPDATE
000000 > <USER: > /* Wed Jun 22
2011 09:59:00.9956 */OK
000000 > <USER: > /* Wed Jun 22
2011 09:59:00.9957 */UPDATE servgrp_board SET intervalCount = intervalCount
000000 > <USER: > /* Wed Jun 22
2011 09:59:00.9964 */OK
000000 > <USER: > /* Wed Jun 22
2011 09:59:00.9964 */SELECT serverName,serverPort,intervalCount,pendingSigna
000000 > <USER: > /* Wed Jun 22
2011 09:59:00.9976 */OK
000000 > <USER: > /* Wed Jun 22
2011 09:59:00.9981 */COMMIT WORK
000000 > <USER: > /* Wed Jun 22
2011 09:59:31.0016 */SELECT checkInterval FROM servgrp_config FOR UPDATE
000000 > <USER: > /* Wed Jun 22
2011 09:59:31.0248 */OK
000000 > <USER: > /* Wed Jun 22
2011 09:59:31.0249 */UPDATE servgrp_board SET intervalCount = intervalCount
000000 > <USER: > /* Wed Jun 22
2011 09:59:31.0255 */OK
000000 > <USER: > /* Wed Jun 22
2011 09:59:31.0256 */SELECT serverName,serverPort,intervalCount,pendingSigna
000000 > <USER: > /* Wed Jun 22
2011 09:59:31.0267 */OK
000000 > <USER: > /* Wed Jun 22
2011 09:59:31.0272 */COMMIT WORK
But in the server group log on ncc391:
<SGRP> /* Wed Jun 22 2011 09:59:31.0224 */ Server ncc392.its.state.nc.us
determined to be delinquent with 4 operation(s) owned
<SGRP> /* Wed Jun 22 2011 10:00:01.0659 */ Server ncc392.its.state.nc.us
determined to be operational
In the sql log on ncc391:
Wed Jun 22 2011 09:59:00.9937 */SELECT checkInterval FROM servgrp_config FOR
UPDATE
Wed Jun 22 2011 09:59:01.0138 */OK
Wed Jun 22 2011 09:59:01.0139 */UPDATE servgrp_board SET intervalCount =
intervalCount + 1 WHERE serverName = <servername>
Wed Jun 22 2011 09:59:01.0145 */OK
Wed Jun 22 2011 09:59:01.0145 */SELECT
serverName,serverPort,intervalCount,pendingSignals,opFlags FROM servgrp_board
ORDER BY 1
Wed Jun 22 2011 09:59:01.0150 */OK
Wed Jun 22 2011 09:59:01.0160 */COMMIT WORK
Wed Jun 22 2011 09:59:31.0188 */SELECT checkInterval FROM servgrp_config FOR
UPDATE
Wed Jun 22 2011 09:59:31.0194 */OK
Wed Jun 22 2011 09:59:31.0200 */OK
Wed Jun 22 2011 09:59:31.0200 */UPDATE servgrp_board SET intervalCount =
intervalCount + 1 WHERE serverName = <servername>
Wed Jun 22 2011 09:59:31.0202 */SELECT C18058 FROM T68 WHERE C1 =
'000000000000015'
Wed Jun 22 2011 09:59:31.0207 */OK
Wed Jun 22 2011 09:59:31.0207 */SELECT
serverName,serverPort,intervalCount,pendingSignals,opFlags FROM servgrp_board
ORDER BY 1
Wed Jun 22 2011 09:59:31.0209 */OK
Wed Jun 22 2011 09:59:31.0212 */UPDATE T69 SET
C18099=1,C18103=NULL,C5='AREMAILUSER_PQHg0JAAPwAA',C6=1308751171 WHERE C1 =
'000000010478546'
Wed Jun 22 2011 09:59:31.0215 */OK
Wed Jun 22 2011 09:59:31.0220 */OK
Wed Jun 22 2011 09:59:31.0220 */UPDATE T69 SET C18087 = NULL WHERE C1 IN
('000000010478546')
Wed Jun 22 2011 09:59:31.0226 */COMMIT WORK
Wed Jun 22 2011 09:59:31.0226 */OK
Wed Jun 22 2011 09:59:31.0227 */COMMIT WORK
Anne Ramey
***********************************
E-mail correspondence to and from this address may be subject to the North
Carolina Public Records Law and may be disclosed to third parties only by an
authorized State Official.
From: Action Request System discussion list(ARSList)
[mailto:[email protected]] On Behalf Of Walters, Mark
Sent: Monday, May 23, 2011 10:51 AM
To: [email protected]
Subject: Re: server group delinquency, inaccurate
**
The signalling options should not be relevant for delinquency checks - this is
done via updates to the database. In the SERVGRP_BOARD table there is a record
for each server in your group. Every check interval a server will increment
its own INTERVALCOUNT value and note the values of the other servers. At the
next check interval the stored values are compared to those in the database
and, if any of them have not changed, that server is considered delinquent. If
this happens X successive times, where X is the delinquency threshold, then the
server is considered to have failed and a failover of the services to the next
highest ranked server takes place.
Although there's a dedicated server group thread that handles this process it's
not uncommon to see an occasional log entry reporting a delinquent server.
This happens due to slight timing variations of when the intervalcount update
process is performed on each server. Each server has an exclusive lock on the
SERVGRP tables during the update process and this can lead to one server
running 2 check interval updates before another server has been able to perform
its update.
However, as you had a failover, something must have prevented the intervalcount
update on the server called secondary. You could try increasing the
delinquency threshold to 3 to give each server more time for the update to run
or enabling SQL logging and check the updates to the SERVGRP_BOARD table. You
should see something like the following on each server every checkinterval;
SELECT checkInterval FROM servgrp_config FOR UPDATE
OK
UPDATE servgrp_board SET intervalCount = intervalCount + 1 WHERE serverName =
'myservername'
OK
SELECT serverName,serverPort,intervalCount,pendingSignals,opFlags FROM
servgrp_board ORDER BY 1
OK
COMMIT WORK
The SELECT FOR UPDATE is used to serialise access to the tables and provide
exclusive access for the server during the update. If the clocks on the
servers are sync'd you should be able to explain most of the delinquency
reports in the server group log and perhaps understand why one was missed.
Mark
I work for BMC, I don't speak for them.
From: Action Request System discussion list(ARSList)
[mailto:[email protected]] On Behalf Of Ramey, Anne
Sent: 21 May 2011 02:14
To: [email protected]
Subject: server group delinquency, inaccurate
**
Since we upgraded from ARS 7.1 to ARS 7.5 (patch 7), we've seen a lot of
"incorrect" delinquency messages in the server group logs-each server detecting
the other as delinquent quite frequently, even when they are responding just
fine. It happens at all hours, even after backups are finished and no one is
in the office for the morning yet. The systems are NOT busy, but they are
being "delinquent". We are using the same settings of a 30 second interval with
2 intervals before failover that we used before the upgrade.
Server-Group-Signal-Option is already false. We don't want to go all the way
to Disable-ARSignals. It got bad enough today to happen for two intervals and
cause a failover during the work day. This, of course, caused the system to
hang for a moment, and it managed to kill the email engine-which didn't fail
back over properly. Has anyone else with ARS 7.5 p 7 seen something like this?
Primary:
<SGRP> /* Fri May 20 2011 11:57:37.5506 */ Server <secondary> determined to be
operational
<SGRP> /* Fri May 20 2011 12:00:07.6808 */ Server <secondary> determined to be
delinquent with 4 operation(s) owned
<SGRP> /* Fri May 20 2011 12:00:37.7094 */ Server <secondary> determined to be
operational
<SGRP> /* Fri May 20 2011 16:49:52.4853 */ Server <secondary> determined to be
delinquent with 4 operation(s) owned
<SGRP> /* Fri May 20 2011 16:50:22.4912 */ Server <secondary> delinquent for 2
intervals with 4 operation(s) owned
<SGRP> /* Fri May 20 2011 16:50:22.4913 */ Transferring pending full text
indexing tasks from server <secondary> to server <primary>
<SGRP> /* Fri May 20 2011 16:50:22.4939 */ Transferred 0 pending full text
indexing tasks from server <secondary> to server <primary>
<SGRP> /* Fri May 20 2011 16:50:22.4939 */ Failing over E-Mail Engine operation
from server <secondary> to server <primary>
<SGRP> /* Fri May 20 2011 16:50:22.4956 */ Resuming E-Mail Engine operation
<SGRP> /* Fri May 20 2011 16:50:22.4993 */ Failing over Flashboards operation
from server <secondary> to server <primary>
<SGRP> /* Fri May 20 2011 16:50:22.5011 */ Resuming Flashboards operation
<SGRP> /* Fri May 20 2011 16:50:22.5038 */ Failing over Assignment Engine
operation from server <secondary> to server <primary>
<SGRP> /* Fri May 20 2011 16:50:22.5055 */ Resuming Assignment Engine operation
<SGRP> /* Fri May 20 2011 16:50:22.5096 */ Failing over Reconciliation Engine
operation from server <secondary> to server <primary>
<SGRP> /* Fri May 20 2011 16:50:22.5113 */ Resuming Reconciliation Engine
operation
<SGRP> /* Fri May 20 2011 16:54:22.6434 */ Server <secondary> determined to be
operational
<SGRP> /* Fri May 20 2011 16:54:22.6435 */ Server <primary> relinquishing
E-Mail Engine operation to server <secondary>
<SGRP> /* Fri May 20 2011 16:54:22.6435 */ Suspending E-Mail Engine operation
<SGRP> /* Fri May 20 2011 16:54:22.6460 */ Server <primary> relinquishing
Flashboards operation to server <secondary>
<SGRP> /* Fri May 20 2011 16:54:22.6461 */ Suspending Flashboards operation
<SGRP> /* Fri May 20 2011 16:54:22.6544 */ Server <primary> relinquishing
Assignment Engine operation to server <secondary>
<SGRP> /* Fri May 20 2011 16:54:22.6544 */ Suspending Assignment Engine
operation
<SGRP> /* Fri May 20 2011 16:54:22.6615 */ Server <primary> relinquishing
Reconciliation Engine operation to server <secondary>
<SGRP> /* Fri May 20 2011 16:54:22.6616 */ Suspending Reconciliation Engine
operation
<SGRP> /* Fri May 20 2011 18:44:11.7233 */ Server <secondary> determined to be
delinquent with 4 operation(s) owned
<SGRP> /* Fri May 20 2011 18:44:41.7522 */ Server <secondary> determined to be
operational
<SGRP> /* Fri May 20 2011 18:56:42.0804 */ Server <secondary> determined to be
delinquent with 4 operation(s) owned
<SGRP> /* Fri May 20 2011 18:57:12.0894 */ Server <secondary> determined to be
operational
<SGRP> /* Fri May 20 2011 18:58:42.1558 */ Server <secondary> determined to be
delinquent with 4 operation(s) owned
<SGRP> /* Fri May 20 2011 18:59:12.2134 */ Server <secondary> determined to be
operational
Secondary:
<SGRP> /* Fri May 20 2011 11:56:37.4981 */ Server <primary> determined to be
operational
<SGRP> /* Fri May 20 2011 11:59:37.6687 */ Server <primary> determined to be
delinquent with 8 operation(s) owned
<SGRP> /* Fri May 20 2011 12:00:07.9562 */ Server <primary> determined to be
operational
<SGRP> /* Fri May 20 2011 16:54:02.0895 */ Ownership of the E-Mail Engine
operation removed unexpectedly from server <secondary>
<SGRP> /* Fri May 20 2011 16:54:02.0896 */ Suspending E-Mail Engine operation
<SGRP> /* Fri May 20 2011 16:54:02.0932 */ Ownership of the Flashboards
operation removed unexpectedly from server <secondary>
<SGRP> /* Fri May 20 2011 16:54:02.0933 */ Suspending Flashboards operation
<SGRP> /* Fri May 20 2011 16:54:02.0963 */ Ownership of the Assignment Engine
operation removed unexpectedly from server <secondary>
<SGRP> /* Fri May 20 2011 16:54:02.0963 */ Suspending Assignment Engine
operation
<SGRP> /* Fri May 20 2011 16:54:02.1026 */ Ownership of the Reconciliation
Engine operation removed unexpectedly from server <secondary>
<SGRP> /* Fri May 20 2011 16:54:02.1027 */ Suspending Reconciliation Engine
operation
<SGRP> /* Fri May 20 2011 16:54:32.1604 */ Server <secondary> assuming
responsibility for unowned E-Mail Engine operation
<SGRP> /* Fri May 20 2011 16:54:32.1605 */ Resuming E-Mail Engine operation
<SGRP> /* Fri May 20 2011 16:54:32.1634 */ Server <secondary> assuming
responsibility for unowned Flashboards operation
<SGRP> /* Fri May 20 2011 16:54:32.1635 */ Resuming Flashboards operation
<SGRP> /* Fri May 20 2011 16:54:32.1683 */ Server <secondary> assuming
responsibility for unowned Assignment Engine operation
<SGRP> /* Fri May 20 2011 16:54:32.1684 */ Resuming Assignment Engine operation
<SGRP> /* Fri May 20 2011 16:54:32.1719 */ Server <secondary> assuming
responsibility for unowned Reconciliation Engine operation
<SGRP> /* Fri May 20 2011 16:54:32.1719 */ Resuming Reconciliation Engine
operation
<SGRP> /* Fri May 20 2011 18:51:41.9330 */ Server <primary> determined to be
delinquent with 8 operation(s) owned
<SGRP> /* Fri May 20 2011 18:52:11.9556 */ Server <primary> determined to be
operational
<SGRP> /* Fri May 20 2011 18:57:42.1122 */ Server <primary> determined to be
delinquent with 8 operation(s) owned
<SGRP> /* Fri May 20 2011 18:58:12.1257 */ Server <primary> determined to be
operational
Anne Ramey
***********************************
E-mail correspondence to and from this address may be subject to the North
Carolina Public Records Law and may be disclosed to third parties only by an
authorized State Official.
_attend WWRUG11 www.wwrug.com<http://www.wwrug.com> ARSlist: "Where the Answers
Are"_
_attend WWRUG11 www.wwrug.com ARSlist: "Where the Answers Are"_
_______________________________________________________________________________
UNSUBSCRIBE or access ARSlist Archives at www.arslist.org
attend wwrug11 www.wwrug.com ARSList: "Where the Answers Are"