Just do IT@stephenit@gmail.com: 2011

During our Exchange 2010 DR test, we lost all the nodes in the cluster. Below are the cleanup steps we used to remove the old settings and re-add the nodes and databases to the DAG:

Get DAG info	Get-DatabaseAvailabilityGroup shows nodes 1 and 3 as stopped and node 2 as started, but node 4 is missing.
Remove DAG
    Remove-DatabaseAvailabilityGroupServer -Identity DAG01 -ConfigurationOnly:$TRUE -MailboxServer MBP01

    WARNING: The operation wasn't successful because an error was encountered. You may find more details in log file
    "C:\ExchangeSetupLogs\DagTasks\dagtask_2010-11-15_14-15-28.839_remove-databaseavailabiltygroupserver.log".
    Mailbox server 'MBP01' cannot be removed from the database availability group because mailbox database 'DB01' has
    multiple copies. Use Remove-MailboxDatabaseCopy either to remove the copy from this server or to remove the copies
    from other servers in the database availability group.
        + CategoryInfo          : InvalidArgument: (:) [Remove-DatabaseAvailabilityGroupServer], RemoveDagServer...icatedException
        + FullyQualifiedErrorId : C26FF955,Microsoft.Exchange.Management.SystemConfigurationTasks.RemoveDatabaseAvailabilityGroupServer

    [PS] C:\Windows\system32>Get-MailboxDatabaseCopyStatus *\MBP01

    Name        Status      CopyQueueLength  ReplayQueueLength  LastInspectedLogTime    ContentIndexState
    ----        ------      ---------------  -----------------  --------------------    -----------------
    DB01\MBP01  Dismounted  0                0                                          Failed
    DB02\MBP01  Failed      0                0                  11/12/2010 11:50:57 AM  Failed
    DB03\MBP01  Failed      0                0                  11/12/2010 11:47:36 AM  Failed
    DB07\MBP01  Failed      0                0                  11/12/2010 11:47:23 AM  Failed
Remove DB
    Get-MailboxDatabaseCopyStatus *\MBP01

    Name        Status      CopyQueueLength  ReplayQueueLength  LastInspectedLogTime    ContentIndexState
    ----        ------      ---------------  -----------------  --------------------    -----------------
    DB01\MBP01  Dismounted  0                0                                          Failed
    DB02\MBP01  Failed      0                0                  11/12/2010 11:50:57 AM  Failed
    DB03\MBP01  Failed      0                0                  11/12/2010 11:47:36 AM  Failed
    DB07\MBP01  Failed      0                0                  11/12/2010 11:47:23 AM  Failed

    Get-MailboxDatabaseCopyStatus *\MBP01 | Remove-MailboxDatabaseCopy
    Get-MailboxDatabaseCopyStatus *\MBP03 | Remove-MailboxDatabaseCopy
    Get-MailboxDatabaseCopyStatus *\MBP04 | Remove-MailboxDatabaseCopy
    Get-MailboxDatabaseCopyStatus *\MBP02 | Remove-MailboxDatabaseCopy

    Example log (the active copy of a database cannot be removed this way):
    The database "DB01" is currently hosted on server "MBP01". Use Move-ActiveMailboxDatabase to move the active copy
    of the database to a different server. You can use the Remove-MailboxDatabase task if this is the only copy.
    Confirm
Remove Server
    Remove-DatabaseAvailabilityGroupServer -Identity DAG01 -ConfigurationOnly:$TRUE -MailboxServer MBP01
    Remove-DatabaseAvailabilityGroupServer -Identity DAG01 -ConfigurationOnly:$TRUE -MailboxServer MBP02
    Remove-DatabaseAvailabilityGroupServer -Identity DAG01 -ConfigurationOnly:$TRUE -MailboxServer MBP03
    Remove-DatabaseAvailabilityGroupServer -Identity DAG01 -ConfigurationOnly:$TRUE -MailboxServer MBP04

    Confirm
    Are you sure you want to perform this action?
    Removing Mailbox server "MBP01" from database availability group "DAG01".
    [Y] Yes  [A] Yes to All  [N] No  [L] No to All  [?] Help (default is "Y"): a
Clean up cluster
    On each node:
    cluster node /force

    Attempting to clean up node '' ...
    Clean up successfully completed.
Disable DAG computer account
    In the AD computer OU:
    · Disable the DAG01 computer object
    · Assign “Exchange Trusted Subsystem” full control of the object
Remove failover cluster role
    On each node:
    servermanagercmd -r failover-clustering

    Servermanagercmd.exe is deprecated, and is not guaranteed to be supported in future releases of Windows. We
    recommend that you use the Windows PowerShell cmdlets that are available for Server Manager.
    Start Removal...
    [Removal] Succeeded: [Failover Clustering] Failover Clustering.
    Success: Removal succeeded.
Add servers back to DAG
    Add-DatabaseAvailabilityGroupServer -Identity DAG01 -MailboxServer MBP01
    Add-DatabaseAvailabilityGroupServer -Identity DAG01 -MailboxServer MBP02
    Add-DatabaseAvailabilityGroupServer -Identity DAG01 -MailboxServer MBP03
    Add-DatabaseAvailabilityGroupServer -Identity DAG01 -MailboxServer MBP04
Test Cluster
    Move the cluster resource group to each node in turn:
    cluster group "cluster group" /moveto:mbp02

    Group                Node            Status
    -------------------- --------------- ------
    cluster group        MBP02           Online
Get info
    Get-MailboxDatabaseCopyStatus *

    Name                               Status      CopyQueueLength  ReplayQueueLength  LastInspectedLogTime  ContentIndexState
    ----                               ------      ---------------  -----------------  --------------------  -----------------
    DB03\MBP02                         Mounted     0                0                                        Healthy
    DB04\MBP02                         Dismounted  0                0                                        Failed
    DB07\MBP04                         Mounted     0                0                                        Healthy
    DB08\MBP04                         Dismounted  0                0                                        Failed
    DB02\MBP04                         Dismounted  0                0                                        Failed
    DB06\MBP04                         Dismounted  0                0                                        Failed
    DB05\MBP03                         Mounted     0                0                                        Healthy
    DB01\MBP01                         Dismounted  0                0                                        Failed
    DR_DB08\UMP01                      Mounted     0                0                                        Healthy
    drdb01\UMP01                       Mounted     0                0                                        Healthy
    Mailbox Database 1150652460\UMP01  Dismounted  0                0                                        Failed
    dr_db02\UMP01                      Mounted     0                0                                        Healthy
    dr_db03\UMP01                      Mounted     0                0                                        Healthy
    dr_db04\UMP01                      Mounted     0                0                                        Healthy
    DR_DB07\UMP01                      Mounted     0                0                                        Healthy
    dr_db05\UMP01                      Mounted     0                0                                        Healthy
    DR_DB06\UMP01                      Mounted     0                0                                        Healthy
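With this many copies, it is easy to overlook one that is not mounted. As a rough sketch only, the Python below scans pasted Get-MailboxDatabaseCopyStatus text and lists the copies whose status is neither Mounted nor Healthy. The `ROW` pattern and the `unhealthy_copies` helper are hypothetical names for this illustration, and the pattern assumes the column layout shown in this post.

```python
import re

# Assumed row layout: Name\Server, Status, two queue lengths,
# an optional LastInspectedLogTime, then the content index state.
ROW = re.compile(
    r"^(?P<db>.+)\\(?P<server>\S+)\s+(?P<status>\S+)\s+(?P<copyq>\d+)\s+(?P<replayq>\d+)"
    r"\s+(?:(?P<inspected>.+?)\s+)?(?P<index>\S+)$"
)

def unhealthy_copies(text):
    """Return 'Database\\Server' for every copy that is not Mounted or Healthy."""
    bad = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m and m["status"] not in ("Mounted", "Healthy"):
            bad.append(f"{m['db']}\\{m['server']}")
    return bad

# A few rows taken from the output in this post.
SAMPLE = r"""
DB03\MBP02 Mounted 0 0 Healthy
DB04\MBP02 Dismounted 0 0 Failed
DB02\MBP01 Failed 0 0 11/12/2010 11:50:57 AM Failed
Mailbox Database 1150652460\UMP01 Dismounted 0 0 Failed
"""

for copy in unhealthy_copies(SAMPLE):
    print(copy)
```

Header and separator lines contain no backslash, so the pattern skips them without special handling.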
Remove old database
    · Clean up all move requests:
      Get-MoveRequest | Remove-MoveRequest
    · Move users to the DR database:
      Get-MailboxDatabase | Mount-Database
      Get-Mailbox -Database DB08 | New-MoveRequest -TargetDatabase dr_db08
      Get-MoveRequest | Get-MoveRequestStatistics
      Get-MoveRequest | Remove-MoveRequest
    · Mount the public folder database:
      Get-PublicFolderDatabase | Mount-Database
    · Format all the database drives on each node
Create DB
    Create the “Logs” folder in each database folder, then run:
    New-MailboxDatabase -Server 'MBP01' -Name 'DB02' -EdbFilePath 'D:\MPS\DB02\DB02.EDB' -LogFolderPath 'D:\MPS\DB02\LOGS'
    New-MailboxDatabase -Server 'MBP02' -Name 'DB03' -EdbFilePath 'D:\MPS\DB03\DB03.EDB' -LogFolderPath 'D:\MPS\DB03\LOGS'
    New-MailboxDatabase -Server 'MBP02' -Name 'DB04' -EdbFilePath 'D:\MPS\DB04\DB04.EDB' -LogFolderPath 'D:\MPS\DB04\LOGS'
    New-MailboxDatabase -Server 'MBP03' -Name 'DB05' -EdbFilePath 'D:\MPS\DB05\DB05.EDB' -LogFolderPath 'D:\MPS\DB05\LOGS'
    New-MailboxDatabase -Server 'MBP03' -Name 'DB06' -EdbFilePath 'D:\MPS\DB06\DB06.EDB' -LogFolderPath 'D:\MPS\DB06\LOGS'
    New-MailboxDatabase -Server 'MBP04' -Name 'DB07' -EdbFilePath 'D:\MPS\DB07\DB07.EDB' -LogFolderPath 'D:\MPS\DB07\LOGS'
    New-MailboxDatabase -Server 'MBP04' -Name 'DB08' -EdbFilePath 'D:\MPS\DB08\DB08.EDB' -LogFolderPath 'D:\MPS\DB08\LOGS'
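Typing seven near-identical commands by hand invites copy-and-paste slips in the paths. One possible approach, shown here only as a sketch, is to generate the command text from a small server-to-database table so that both paths are always derived from the database name. The `LAYOUT` table and the `new_db_command` helper are hypothetical names for this example. The `D:\MPS` root and the server assignments are taken from the commands in this post.

```python
# Server -> databases, as used in the Create DB step of this post.
LAYOUT = {
    "MBP01": ["DB02"],
    "MBP02": ["DB03", "DB04"],
    "MBP03": ["DB05", "DB06"],
    "MBP04": ["DB07", "DB08"],
}
ROOT = "D:\\MPS"

def new_db_command(server, name, root=ROOT):
    """Build one New-MailboxDatabase command line; both paths come from `name`."""
    folder = f"{root}\\{name}"
    return (
        f"New-MailboxDatabase -Server '{server}' -Name '{name}' "
        f"-EdbFilePath '{folder}\\{name}.EDB' -LogFolderPath '{folder}\\LOGS'"
    )

commands = [
    new_db_command(server, name)
    for server, names in LAYOUT.items()
    for name in names
]

for line in commands:
    print(line)
```

The printed lines can be reviewed and then pasted into the Exchange Management Shell. The script itself never touches Exchange.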
Mount DB	Get-MailboxDatabase | Mount-Database -Force
Add DB copy
    Add-MailboxDatabaseCopy DB01 -MailboxServer mbp02
    Add-MailboxDatabaseCopy DB02 -MailboxServer mbp04
    Add-MailboxDatabaseCopy DB03 -MailboxServer mbp01
    Add-MailboxDatabaseCopy DB04 -MailboxServer mbp03
    Add-MailboxDatabaseCopy DB05 -MailboxServer mbp02
    Add-MailboxDatabaseCopy DB07 -MailboxServer mbp01 -DomainController dc04
Troubleshooting
    Get-MailboxDatabaseCopyStatus | where { $_.Status -like "fail*" } | Suspend-MailboxDatabaseCopy
    Get-MailboxDatabaseCopyStatus | Suspend-MailboxDatabaseCopy
    Get-MailboxDatabaseCopyStatus | Update-MailboxDatabaseCopy -DeleteExistingFiles

We have four mailbox servers at two sites: MBP01/03 at site 1 and MBP02/04 at site 2. For the DAG, the witness server HUP01 is at site 1 and the alternate witness server HUP02 is at site 2.

The DR test started at 3:18 PM. Here are the steps and logs from the start:

3:18	Power off mbp01/mbp03
Cluster node 'MBP03' was removed
    Log Name:      System
    Source:        Microsoft-Windows-FailoverClustering
    Date:          11/15/2010 3:18:57 PM
    Event ID:      1135
    Task Category: Node Mgr
    Level:         Critical
    Keywords:
    User:          SYSTEM
    Computer:      mbp04.na.Domain.corp
    Description:
    Cluster node 'MBP03' was removed from the active failover cluster membership. The Cluster service on this node
    may have stopped. This could also be due to the node having lost communication with other active nodes in the
    failover cluster.
File Share Witness failed 74 times, 3:21:03-3:22:33
    Log Name:      System
    Source:        Microsoft-Windows-FailoverClustering
    Date:          11/15/2010 3:21:03 PM
    Event ID:      1069
    Task Category: Resource Control Manager
    Level:         Error
    Keywords:
    User:          SYSTEM
    Computer:      MBP02.na.Domain.corp
    Description:
    Cluster resource 'File Share Witness (\\hup01.na.Domain.corp\DAG01.na.Domain.corp)' in clustered service or
    application 'Cluster Group' failed.
The Cluster service is shutting down
    Log Name:      System
    Source:        Microsoft-Windows-FailoverClustering
    Date:          11/15/2010 3:22:33 PM
    Event ID:      1177
    Task Category: Quorum Manager
    Level:         Critical
    Keywords:
    User:          SYSTEM
    Computer:      mbp04.na.Domain.corp
    Description:
    The Cluster service is shutting down because quorum was lost. This could be due to the loss of network
    connectivity between some or all nodes in the cluster, or a failover of the witness disk.

    Log Name:      System
    Source:        Service Control Manager
    Date:          11/15/2010 3:22:34 PM
    Event ID:      7024
    Task Category: None
    Level:         Error
    Keywords:      Classic
    User:          N/A
    Computer:      MBP02.na.Domain.corp
    Description:
    The Cluster Service service terminated with service-specific error A quorum of cluster nodes was not present to
    form a cluster..
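The quorum loss in these events follows from simple vote counting. The sketch below assumes a majority quorum with one vote for each of the four DAG members plus one for the file share witness, which is consistent with the restore log's remark that an even number of members requires a file share witness. The `has_quorum` function is a hypothetical helper for illustration, not part of any cluster API.

```python
def has_quorum(total_votes: int, votes_present: int) -> bool:
    """Majority quorum: strictly more than half of all configured votes."""
    return votes_present > total_votes // 2

# Assumed vote layout: 4 DAG members + 1 file share witness = 5 votes.
TOTAL_VOTES = 4 + 1

# MBP01 and MBP03 are powered off and the witness share on HUP01 is failing,
# so only MBP02 and MBP04 can still vote.
surviving_votes = 2

print(has_quorum(TOTAL_VOTES, surviving_votes))      # False
print(has_quorum(TOTAL_VOTES, surviving_votes + 1))  # True
```

With two of five votes the surviving site cannot form a majority, which matches the Cluster service shutting down on MBP02 and MBP04 once the witness resource had failed.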
Run Stop-DatabaseAvailabilityGroup
    [2010-11-15T20:30:50] mbp01 added to stopped list
    [2010-11-15T20:30:50] mbp01 removed from started list
    [2010-11-15T20:53:58] mbp03 added to stopped list
    [2010-11-15T20:53:58] mbp03 removed from started list
    [2010-11-15T21:02:29] updated the started servers list in AD:
    [2010-11-15T21:02:29] mbp04
    [2010-11-15T21:02:29] mbp02
    [2010-11-15T21:02:29] updated the stopped servers list in AD:
    [2010-11-15T21:02:29] mbp01
    [2010-11-15T21:02:29] mbp03
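The "added to" and "removed from" entries can be replayed to confirm which servers should end up in each list. This is a minimal sketch that assumes the line format shown in this log. The `ENTRY` pattern and the `replay` function are hypothetical names for this example.

```python
import re

# Matches lines such as "[2010-11-15T20:30:50] mbp01 added to stopped list".
ENTRY = re.compile(
    r"\[[^\]]+\]\s+(?P<server>\S+) (?P<action>added to|removed from) "
    r"(?P<list>stopped|started) list"
)

def replay(log_text, initially_started):
    """Apply each add/remove entry in order and return the resulting lists."""
    lists = {"started": set(initially_started), "stopped": set()}
    for m in ENTRY.finditer(log_text):
        if m["action"] == "added to":
            lists[m["list"]].add(m["server"])
        else:
            lists[m["list"]].discard(m["server"])
    return lists

SAMPLE = """
[2010-11-15T20:30:50] mbp01 added to stopped list
[2010-11-15T20:30:50] mbp01 removed from started list
[2010-11-15T20:53:58] mbp03 added to stopped list
[2010-11-15T20:53:58] mbp03 removed from started list
"""

result = replay(SAMPLE, {"mbp01", "mbp02", "mbp03", "mbp04"})
print(sorted(result["started"]))  # ['mbp02', 'mbp04']
print(sorted(result["stopped"]))  # ['mbp01', 'mbp03']
```

The result agrees with the started and stopped server lists that the task wrote to AD at 21:02:29.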
Enable trace
    Run Extra.exe. The resulting .etl file is binary and can be converted only with tools available to Microsoft
    support engineers. Alternatively, use tracerpt to convert the .etl file to XML:
    tracerpt logfile1.etl -o -report
4:10 Restore-DatabaseAvailabilityGroup
    [2010-11-15T21:10:15] Successfully resolved the servers based on the stopped servers list.
    [2010-11-15T21:10:15] The following servers are in the StartedServers list (The list is all of the servers that are not in the StoppedServers list.):
    [2010-11-15T21:10:15] mbp04
    [2010-11-15T21:10:15] mbp02
    [2010-11-15T21:10:15] The following servers are in the StoppedServers list:
    [2010-11-15T21:10:15] mbp01
    [2010-11-15T21:10:15] mbp03
    [2010-11-15T21:10:15] Checking if clussvc is running on MBP04...
    [2010-11-15T21:10:15] Clussvc is Stopped on MBP04.
    [2010-11-15T21:10:15] Checking if clussvc is running on MBP02...
    [2010-11-15T21:10:15] Clussvc is Stopped on MBP02.
    [2010-11-15T21:10:15] None of the 2 servers in the started list had clussvc running, so it is safe to continue with restore-dag, but forcequorum will be necessary.
    [2010-11-15T21:10:17] ForceQuorumIfNecessary: Checking if clussvc is running on MBP04...
    [2010-11-15T21:10:17] ForceQuorum: clussvc is stopped on node MBP04. Starting it.
    [2010-11-15T21:10:17] Starting cluster service with /fq (force quorum).
    [2010-11-15T21:10:17] Waiting up to 00:01:00 for clussvc to be in the Running state.
    [2010-11-15T21:10:17] ForceQuorumIfNecessary: Checking if clussvc is running on MBP02...
    [2010-11-15T21:10:17] ForceQuorum: clussvc is stopped on node MBP02. Starting it.
    [2010-11-15T21:10:17] Starting cluster service in normal mode.
    [2010-11-15T21:10:17] Waiting up to 00:01:00 for clussvc to be in the Running state.

    Cluster Info: Nodes
    [2010-11-15T21:10:20] node: MBP01.na.Domain.corp [ state = Down ]
    [2010-11-15T21:10:20] node: MBP02.na.Domain.corp [ state = Joining ]
    [2010-11-15T21:10:20] node: MBP03.na.Domain.corp [ state = Down ]
    [2010-11-15T21:10:20] node: MBP04.na.Domain.corp [ state = Up ]

    Eviction operation:
    [2010-11-15T21:10:21] Running the eviction operation by issuing an RPC to the replay service on 'MBP04.na.Domain.corp'...
    [2010-11-15T21:10:43] Server 'mbp03' is still a node in the cluster, and will have to be evicted.
    [2010-11-15T21:10:43] Updated Progress 'Evicting MBP03.' 55%.
    [2010-11-15T21:10:43] Working
    [2010-11-15T21:10:43] Running the eviction operation by issuing an RPC to the replay service on 'MBP04.na.Domain.corp'...
    [2010-11-15T21:11:06] The following log entry comes from a different process that's running on machine 'MBP04.na.Domain.corp'.

    Check File Share Witness:
    [2010-11-15T21:11:06] Checking that the file share witness server (hup02.na.Domain.corp) isn't one of the servers in the DAG.
    [2010-11-15T21:11:06] Checking if the FSW server can be queried with WMI.
    [2010-11-15T21:11:06] The boot time of the FSW was '11/15/2010 7:30:40 PM' (MaxValue if the call failed).
    [2010-11-15T21:11:06] Creating the file share of the FSW...
    [2010-11-15T21:11:06] There are 2 started servers in the cluster, which is an even number. That requires a file share witness!
    [2010-11-15T21:11:06] Updated Progress 'Using file share witness share '\\hup02.na.Domain.corp\DAG01.na.Domain.corp' for an even number of members in the database availability group.' 67%.
    [2010-11-15T21:11:06] Working
    [2010-11-15T21:11:11] CreateFileShareWitnessQuorum: There is already a FSW, but the current share path (\\hup01.na.Domain.corp\DAG01.na.Domain.corp) is not what's desired (\\hup02.na.Domain.corp\DAG01.na.Domain.corp). Will try to fix it.
    [2010-11-15T21:11:11] The fsw resource is now in state Online.
    [2010-11-15T21:11:11] The FSW resource is now in state Online.
    [2010-11-15T21:11:11] The current quorum resource is 'File Share Witness (\\hup01.na.Domain.corp\DAG01.na.Domain.corp)'. About to set it to the FSW.
    [2010-11-15T21:11:11] The quorum resource is now 'File Share Witness (\\hup01.na.Domain.corp\DAG01.na.Domain.corp)'.
    [2010-11-15T21:11:11] Bringing the quorum resource online...

    Note on the FSW boot time check: for DAGs with only two members, DAC mode also uses the boot time of the DAG's
    witness server to determine whether it can mount databases on startup. The boot time of the witness server is
    compared to the time when the DACP bit was set to 1.
    · If the time the DACP bit was set is earlier than the boot time of the witness server, the system assumes that
      the DAG member and witness server were rebooted at the same time (perhaps because of power loss in the primary
      datacenter), and the DAG member isn't permitted to mount databases.
    · If the time the DACP bit was set is more recent than the boot time of the witness server, the system assumes
      that the DAG member was rebooted for some other reason (perhaps a scheduled outage in which maintenance was
      performed, or a system crash or power loss isolated to the DAG member), and the DAG member is permitted to
      mount databases.
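The boot-time rule for two-member DAGs in DAC mode reduces to a single timestamp comparison. The sketch below only illustrates that rule. It is not how Exchange implements it, and `may_mount` is a hypothetical name. The witness boot time is the value reported in the log, the two DACP timestamps are invented for the example, and treating equal timestamps as "not permitted" is an assumption because the description covers only the earlier and later cases.

```python
from datetime import datetime

def may_mount(dacp_set_at: datetime, witness_boot_at: datetime) -> bool:
    """Permit mounting only if the DACP bit was set after the witness last booted.

    Assumption: equal timestamps are treated as not permitted.
    """
    return dacp_set_at > witness_boot_at

# Boot time of the FSW as reported in the restore log.
witness_boot = datetime(2010, 11, 15, 19, 30, 40)

# Hypothetical DACP timestamps, chosen only to show both outcomes.
dacp_before_boot = datetime(2010, 11, 15, 18, 0, 0)
dacp_after_boot = datetime(2010, 11, 15, 20, 0, 0)

print(may_mount(dacp_before_boot, witness_boot))  # False: member and witness likely went down together
print(may_mount(dacp_after_boot, witness_boot))   # True: member restarted for some other reason
```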
MBP01/03 back online
    [2010-11-15T21:25:31] commandline: $scriptCmd = {& $wrappedCmd @PSBoundParameters }
    [2010-11-15T21:25:31] Option 'Identity' = 'dag01'.
    [2010-11-15T21:25:31] Option 'MailboxServer' = 'mbp01'.
    [2010-11-15T21:25:39] Updated Progress 'Joined server 'mbp01' to the cluster
    [2010-11-15T21:26:17] Updated Progress 'Joining mbp03
Force Witness
    During the datacenter switchover process, the DAG was configured to use an alternate witness server. The DAG must
    be reconfigured to use a witness server in the primary datacenter.
    If you are using the same witness server and witness directory that were used before the primary datacenter
    outage, run the Set-DatabaseAvailabilityGroup -Identity DAGName command.
    If you plan to use a witness server or witness directory that is different from the original ones, use the
    Set-DatabaseAvailabilityGroup command to configure the witness server and witness directory parameters with the
    appropriate values.
    http://technet.microsoft.com/en-us/library/dd351049.aspx


Tuesday, January 18, 2011

Exchange 2010 DR test failed and recovery DAG

Exchange 2010 DAG DR test, what happened: