Tuesday, January 18, 2011

Exchange 2010 DAG DR Test: What Happened


We have four mailbox servers at two sites: MBP01/03 at site 1 and MBP02/04 at site 2. For the DAG, the witness server HUP01 is at site 1 and the secondary (alternate) witness server HUP02 is at site 2.


The DR test started at 3:18 PM. Here are the steps and logs from the start:


3:18
Power off mbp01/mbp03
Cluster node 'MBP03' was removed
Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          11/15/2010 3:18:57 PM
Event ID:      1135
Task Category: Node Mgr
Level:         Critical
Keywords:     
User:          SYSTEM
Computer:      mbp04.na.Domain.corp
Description:
Cluster node 'MBP03' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster.
File Share Witness failed

Logged 74 times between 3:21:03 and 3:22:33 PM
Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          11/15/2010 3:21:03 PM
Event ID:      1069
Task Category: Resource Control Manager
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      MBP02.na.Domain.corp
Description:
Cluster resource 'File Share Witness (\\hup01.na.Domain.corp\DAG01.na.Domain.corp)' in clustered service or application 'Cluster Group' failed.
The Cluster service is shutting down
Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          11/15/2010 3:22:33 PM
Event ID:      1177
Task Category: Quorum Manager
Level:         Critical
Keywords:     
User:          SYSTEM
Computer:      mbp04.na.Domain.corp
Description:
The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.


Log Name:      System
Source:        Service Control Manager
Date:          11/15/2010 3:22:34 PM
Event ID:      7024
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      MBP02.na.Domain.corp
Description:
The Cluster Service service terminated with service-specific error A quorum of cluster nodes was not present to form a cluster..
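
The quorum loss above follows from simple vote counting. A minimal sketch, assuming the DAG cluster uses the Node and File Share Majority model, where each of the four nodes and the file share witness holds one vote (the function name `has_quorum` is hypothetical, for illustration only):

```python
def has_quorum(votes_online: int, total_votes: int) -> bool:
    """Majority rule: more than half of all configured votes must be reachable."""
    return votes_online > total_votes // 2


# Four DAG members plus one file share witness (assumed: one vote each).
TOTAL_VOTES = 4 + 1

# Before the test: all four nodes and the witness on HUP01 are reachable.
print(has_quorum(5, TOTAL_VOTES))

# After site 1 is powered off: only MBP02 and MBP04 remain,
# and the witness share on HUP01 cannot be reached.
print(has_quorum(2, TOTAL_VOTES))
```

With only two of five votes left at site 2, the surviving nodes cannot form a majority, which matches event 1177.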
Run Stop-DatabaseAvailabilityGroup (Stop-DAG)
[2010-11-15T20:30:50] mbp01 added to stopped list
[2010-11-15T20:30:50] mbp01 removed from started list
[2010-11-15T20:53:58] mbp03 added to stopped list
[2010-11-15T20:53:58] mbp03 removed from started list

[2010-11-15T21:02:29] updated the started servers list in AD:
[2010-11-15T21:02:29]     mbp04
[2010-11-15T21:02:29]     mbp02
[2010-11-15T21:02:29] updated the stopped servers list in AD:
[2010-11-15T21:02:29]     mbp01
[2010-11-15T21:02:29]     mbp03
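
The bracketed timestamps in the cmdlet logs do not match the local step times (the step noted later at 4:10 is logged at 21:10). A small sketch for lining the two up, assuming the cmdlet logs are written in UTC and the sites are at UTC-5; both are inferences from that five-hour gap, and the helper name `log_time_to_local` is hypothetical:

```python
from datetime import datetime, timedelta, timezone

# Assumption: cmdlet logs are in UTC and local time is UTC-5 (no DST in mid-November).
LOCAL = timezone(timedelta(hours=-5))


def log_time_to_local(stamp: str) -> str:
    """Convert a bracketed log timestamp like '[2010-11-15T21:10:15]' to local HH:MM:SS."""
    utc = datetime.strptime(stamp.strip("[]"), "%Y-%m-%dT%H:%M:%S")
    utc = utc.replace(tzinfo=timezone.utc)
    return utc.astimezone(LOCAL).strftime("%H:%M:%S")


print(log_time_to_local("[2010-11-15T20:30:50]"))  # 15:30:50
```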
Enable trace
ExTRA.exe





The .etl file is a binary file and can be converted only with tools available to Microsoft support engineers, or by using tracerpt to convert the .etl file to an XML file:
tracerpt logfile1.etl -o -report
4:10 Restore DAG
[2010-11-15T21:10:15] Successfully resolved the servers based on the stopped servers list.
[2010-11-15T21:10:15] The following servers are in the StartedServers list (The list is all of the servers that are not in the StoppedServers list.):
[2010-11-15T21:10:15]     mbp04
[2010-11-15T21:10:15]     mbp02
[2010-11-15T21:10:15] The following servers are in the StoppedServers list:
[2010-11-15T21:10:15]     mbp01
[2010-11-15T21:10:15]     mbp03
[2010-11-15T21:10:15] Checking if clussvc is running on MBP04...
[2010-11-15T21:10:15] Clussvc is Stopped on MBP04.
[2010-11-15T21:10:15] Checking if clussvc is running on MBP02...
[2010-11-15T21:10:15] Clussvc is Stopped on MBP02.
[2010-11-15T21:10:15] None of the 2 servers in the started list had clussvc running, so it is safe to continue with restore-dag, but forcequorum will be necessary.

[2010-11-15T21:10:17] ForceQuorumIfNecessary: Checking if clussvc is running on MBP04...
[2010-11-15T21:10:17] ForceQuorum: clussvc is stopped on node MBP04. Starting it.
[2010-11-15T21:10:17] Starting cluster service with /fq (force quorum).
[2010-11-15T21:10:17] Waiting up to 00:01:00 for clussvc to be in the Running state.
[2010-11-15T21:10:17] ForceQuorumIfNecessary: Checking if clussvc is running on MBP02...
[2010-11-15T21:10:17] ForceQuorum: clussvc is stopped on node MBP02. Starting it.
[2010-11-15T21:10:17] Starting cluster service in normal mode.
[2010-11-15T21:10:17] Waiting up to 00:01:00 for clussvc to be in the Running state.

Cluster Info:
Nodes
[2010-11-15T21:10:20]     node: MBP01.na.Domain.corp [ state = Down ]
[2010-11-15T21:10:20]     node: MBP02.na.Domain.corp [ state = Joining ]
[2010-11-15T21:10:20]     node: MBP03.na.Domain.corp [ state = Down ]
[2010-11-15T21:10:20]     node: MBP04.na.Domain.corp [ state = Up ]


Eviction operation:
[2010-11-15T21:10:21] Running the eviction operation by issuing an RPC to the replay service on 'MBP04.na.Domain.corp'...
[2010-11-15T21:10:43] Server 'mbp03' is still a node in the cluster, and will have to be evicted.
[2010-11-15T21:10:43] Updated Progress 'Evicting MBP03.' 55%.
[2010-11-15T21:10:43] Working
[2010-11-15T21:10:43] Running the eviction operation by issuing an RPC to the replay service on 'MBP04.na.Domain.corp'...
[2010-11-15T21:11:06] The following log entry comes from a different process that's running on machine 'MBP04.na.Domain.corp'.

Check File Share Witness:
[2010-11-15T21:11:06] Checking that the file share witness server (hup02.na.Domain.corp) isn't one of the servers in the DAG.
[2010-11-15T21:11:06] Checking if the FSW server can be queried with WMI.
[2010-11-15T21:11:06] The boot time of the FSW was '11/15/2010 7:30:40 PM' (MaxValue if the call failed).
[2010-11-15T21:11:06] Creating the file share of the FSW...
[2010-11-15T21:11:06] There are 2 started servers in the cluster, which is an even number. That requires a file share witness!
[2010-11-15T21:11:06] Working
[2010-11-15T21:11:11] CreateFileShareWitnessQuorum: There is already a FSW, but the current share path (\\hup01.na.Domain.corp\DAG01.na.Domain.corp) is not what's desired (\\hup02.na.Domain.corp\DAG01.na.Domain.corp). Will try to fix it.
[2010-11-15T21:11:11] The fsw resource is now in state Online.
[2010-11-15T21:11:11] The FSW resource is now in state Online.
[2010-11-15T21:11:11] The current quorum resource is 'File Share Witness (\\hup01.na.Domain.corp\DAG01.na.Domain.corp)'. About to set it to the FSW.
[2010-11-15T21:11:11] The quorum resource is now 'File Share Witness (\\hup01.na.Domain.corp\DAG01.na.Domain.corp)'.


For DAGs with only two members, DAC mode also uses the boot time of the DAG's witness server to determine whether it can mount databases on startup. The boot time of the witness server is compared to the time when the DACP bit was set to 1.
· If the time the DACP bit was set is earlier than the boot time of the witness server, the system assumes that the DAG member and witness server were rebooted at the same time (perhaps because of power loss in the primary datacenter), and the DAG member isn't permitted to mount databases.
· If the time that the DACP bit was set is more recent than the boot time of the witness server, the system assumes that the DAG member was rebooted for some other reason (perhaps a scheduled outage in which maintenance was performed or perhaps a system crash or power loss isolated to the DAG member), and the DAG member is permitted to mount databases.
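
The two rules above reduce to one timestamp comparison. A minimal sketch of that decision, not Exchange's actual implementation: the function name `may_mount_databases` and the sample DACP times are hypothetical, and treating equal timestamps as "not permitted" is an assumption, since the text does not cover that case.

```python
from datetime import datetime


def may_mount_databases(dacp_set_time: datetime, witness_boot_time: datetime) -> bool:
    """Permit mounting only if the DACP bit was set after the witness server booted."""
    return dacp_set_time > witness_boot_time


# Witness boot time as reported in the log ('11/15/2010 7:30:40 PM').
witness_boot = datetime(2010, 11, 15, 19, 30, 40)

# Hypothetical: DACP bit set before the witness booted, so both are assumed
# to have restarted together and mounting is refused.
print(may_mount_databases(datetime(2010, 11, 15, 18, 0, 0), witness_boot))

# Hypothetical: DACP bit set after the witness booted, so mounting is allowed.
print(may_mount_databases(datetime(2010, 11, 15, 20, 0, 0), witness_boot))
```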





[2010-11-15T21:11:11] Bringing the quorum resource online...
MBP01/03 back online
[2010-11-15T21:25:31] commandline:         $scriptCmd = {& $wrappedCmd @PSBoundParameters }
[2010-11-15T21:25:31] Option 'Identity' = 'dag01'.
[2010-11-15T21:25:31] Option 'MailboxServer' = 'mbp01'.
[2010-11-15T21:25:39] Updated Progress 'Joined server 'mbp01' to the cluster
[2010-11-15T21:26:17] Updated Progress 'Joining mbp03
Force Witness
During the datacenter switchover process, the DAG was configured to use an alternate witness server. The DAG must be reconfigured to use a witness server in the primary datacenter. If you are using the same witness server and witness directory that was used prior to the primary datacenter outage, you can run the Set-DatabaseAvailabilityGroup -Identity DAGName command. If you plan on using a witness server or witness directory that is different from the original witness server and directory, use the Set-DatabaseAvailabilityGroup command to configure the witness server and witness directory parameters with the appropriate values.

