Tuesday, January 18, 2011

Exchange 2010 DR test failure and DAG recovery




During our Exchange 2010 DR test, we lost all the nodes in the cluster. Below are the cleanup steps we used to remove the old settings and re-add the nodes and databases to the DAG:


Get DAG info
Get-DAG shows nodes 1 and 3 as stopped and node 2 as started, but node 4 is missing
Remove DAG
Remove-DatabaseAvailabilityGroupServer -Identity DAG01 -ConfigurationOnly:$TRUE -MailboxServer MBP01
WARNING: The operation wasn't successful because an error was encountered. You may find more details in log file  "C:\ExchangeSetupLogs\DagTasks\dagtask_2010-11-15_14-15-28.839_remove-databaseavailabiltygroupserver.log".
Mailbox server 'MBP01' cannot be removed from the database availability group because mailbox database 'DB01' has  multiple copies. Use Remove-MailboxDatabaseCopy either to remove the copy from this server or to remove the copies from other servers in the database availability group.
    + CategoryInfo          : InvalidArgument: (:) [Remove-DatabaseAvailabilityGroupServer], RemoveDagServer...icatedE
   xception
    + FullyQualifiedErrorId : C26FF955,Microsoft.Exchange.Management.SystemConfigurationTasks.RemoveDatabaseAvailabili
   tyGroupServer

[PS] C:\Windows\system32>Get-MailboxDatabaseCopyStatus *\MBP01

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State       
----                                          ------          --------- ----------- --------------------   ------------
DB01\MBP01                              Dismounted      0         0                                  Failed     
DB02\MBP01                              Failed          0         0           11/12/2010 11:50:57 AM Failed     
DB03\MBP01                              Failed          0         0           11/12/2010 11:47:36 AM Failed     
DB07\MBP01                              Failed          0         0           11/12/2010 11:47:23 AM Failed     

Remove DB
Get-MailboxDatabaseCopyStatus *\MBP01 | Remove-MailboxDatabaseCopy

Get-MailboxDatabaseCopyStatus *\MBP03 | Remove-MailboxDatabaseCopy

Get-MailboxDatabaseCopyStatus *\MBP04 | Remove-MailboxDatabaseCopy

Get-MailboxDatabaseCopyStatus *\MBP02 | Remove-MailboxDatabaseCopy

Example log:

The database "DB01" is currently hosted on server "MBP01". Use Move-ActiveMailboxDatabase to move the active copy of the database to a different server. You can use the Remove-MailboxDatabase task if this is the only copy.
Confirm

Remove Server
Remove-DatabaseAvailabilityGroupServer -Identity DAG01 -ConfigurationOnly:$TRUE -MailboxServer MBP01

Remove-DatabaseAvailabilityGroupServer -Identity DAG01 -ConfigurationOnly:$TRUE -MailboxServer MBP02

Remove-DatabaseAvailabilityGroupServer -Identity DAG01 -ConfigurationOnly:$TRUE -MailboxServer MBP03

Remove-DatabaseAvailabilityGroupServer -Identity DAG01 -ConfigurationOnly:$TRUE -MailboxServer MBP04

Confirm
Are you sure you want to perform this action?
Removing Mailbox server "MBP01" from database availability group "DAG01".
[Y] Yes  [A] Yes to All  [N] No  [L] No to All  [?] Help (default is "Y"): a
Clean up cluster
On each node:
cluster node /force
Attempting to clean up node '' ...
Clean up successfully completed.
Disable DAG computer account
In AD computer OU:
·         Disable the DAG01 computer object
·         Grant “Exchange Trusted Subsystem” Full Control on the object


Remove failover cluster role

On each node:

servermanagercmd -r failover-clustering

Servermanagercmd.exe is deprecated, and is not guaranteed to be supported in future releases of Windows. We recommend that you use the Windows PowerShell cmdlets that are available for Server Manager.

Start Removal...
[Removal] Succeeded: [Failover Clustering] Failover Clustering.

Success: Removal succeeded.
Add servers back to DAG
Add-DatabaseAvailabilityGroupServer -Identity DAG01 -MailboxServer MBP01
Add-DatabaseAvailabilityGroupServer -Identity DAG01 -MailboxServer MBP02
Add-DatabaseAvailabilityGroupServer -Identity DAG01 -MailboxServer MBP03
Add-DatabaseAvailabilityGroupServer -Identity DAG01 -MailboxServer MBP04

Test Cluster
Move the cluster resource group to each node in turn:
cluster group "cluster group"  /moveto:mbp02

Group                Node            Status
-------------------- --------------- ------
cluster group        MBP02        Online


Get info
Get-MailboxDatabaseCopyStatus *
Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State      
----                                          ------          --------- ----------- --------------------   ------------
DB03\MBP02                              Mounted         0         0                                  Healthy    
DB04\MBP02                              Dismounted      0         0                                  Failed     
DB07\MBP04                              Mounted         0         0                                  Healthy    
DB08\MBP04                              Dismounted      0         0                                  Failed     
DB02\MBP04                              Dismounted      0         0                                  Failed     
DB06\MBP04                              Dismounted      0         0                                  Failed     
DB05\MBP03                              Mounted         0         0                                  Healthy    
DB01\MBP01                              Dismounted      0         0                                  Failed     
DR_DB08\UMP01                           Mounted         0         0                                  Healthy    
drdb01\UMP01                            Mounted         0         0                                  Healthy    
Mailbox Database 1150652460\UMP01          Dismounted      0         0                                  Failed     
dr_db02\UMP01                           Mounted         0         0                                  Healthy    
dr_db03\UMP01                           Mounted         0         0                                  Healthy    
dr_db04\UMP01                           Mounted         0         0                                  Healthy    
DR_DB07\UMP01                           Mounted         0         0                                  Healthy    
dr_db05\UMP01                           Mounted         0         0                                  Healthy    
DR_DB06\UMP01                           Mounted         0         0                                  Healthy    

Remove old database
·         Clean up all move requests
Get-MoveRequest | Remove-MoveRequest
·         Move users to the DR database
Get-MailboxDatabase | Mount-Database
Get-Mailbox -Database DB08 | New-MoveRequest -TargetDatabase dr_db08
Get-MoveRequestStatistics
Get-MoveRequest | Remove-MoveRequest
·         Mount the public folder database:
Get-PublicFolderDatabase | Mount-Database
·         Format all the database drives on each node
Create DB
Create the “Logs” folder in each database folder, then run these commands:


new-mailboxdatabase -Server 'MBP01' -Name 'DB02' -EdbFilePath 'D:\MPS\DB02\DB02.EDB' -LogFolderPath 'D:\MPS\DB02\LOGS'

new-mailboxdatabase -Server 'MBP02' -Name 'DB03' -EdbFilePath 'D:\MPS\DB03\DB03.EDB' -LogFolderPath 'D:\MPS\DB03\LOGS'

new-mailboxdatabase -Server 'MBP02' -Name 'DB04' -EdbFilePath 'D:\MPS\DB04\DB04.EDB' -LogFolderPath 'D:\MPS\DB04\LOGS'

new-mailboxdatabase -Server 'MBP03' -Name 'DB05' -EdbFilePath 'D:\MPS\DB05\DB05.EDB' -LogFolderPath 'D:\MPS\DB05\LOGS'

new-mailboxdatabase -Server 'MBP03' -Name 'DB06' -EdbFilePath 'D:\MPS\DB06\DB06.EDB' -LogFolderPath 'D:\MPS\DB06\LOGS'

new-mailboxdatabase -Server 'MBP04' -Name 'DB07' -EdbFilePath 'D:\MPS\DB07\DB07.EDB' -LogFolderPath 'D:\MPS\DB07\LOGS'

new-mailboxdatabase -Server 'MBP04' -Name 'DB08' -EdbFilePath 'D:\MPS\DB08\DB08.EDB' -LogFolderPath 'D:\MPS\DB08\LOGS'

Mount DB
Get-MailboxDatabase | Mount-Database -Force

Add DB copy
Add-MailboxDatabaseCopy DB01 -MailboxServer mbp02
Add-MailboxDatabaseCopy DB02 -MailboxServer mbp04
Add-MailboxDatabaseCopy DB03 -MailboxServer mbp01
Add-MailboxDatabaseCopy DB04 -MailboxServer mbp03
Add-MailboxDatabaseCopy DB05 -MailboxServer mbp02
Add-MailboxDatabaseCopy DB07 -MailboxServer mbp01 -DomainController dc04
Troubleshooting
Get-MailboxDatabaseCopyStatus * | where { $_.Status -like "fail*" } | Suspend-MailboxDatabaseCopy

Get-MailboxDatabaseCopyStatus * | Suspend-MailboxDatabaseCopy
Get-MailboxDatabaseCopyStatus * | Update-MailboxDatabaseCopy -DeleteExistingFiles

Exchange 2010 DAG DR test: what happened


We have four mailbox servers at two sites: MBP01/03 at site 1 and MBP02/04 at site 2. For the DAG, the witness server HUP01 is at site 1 and the alternate witness server HUP02 is at site 2.


The DR test started at 3:18 PM. Here are the steps and logs from the start:


3:18
Power off mbp01/mbp03
Cluster node 'MBP03' was removed
Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          11/15/2010 3:18:57 PM
Event ID:      1135
Task Category: Node Mgr
Level:         Critical
Keywords:     
User:          SYSTEM
Computer:      mbp04.na.Domain.corp
Description:
Cluster node 'MBP03' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster.
File Share Witness failed (74 times, 3:21:03-3:22:33)
Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          11/15/2010 3:21:03 PM
Event ID:      1069
Task Category: Resource Control Manager
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      MBP02.na.Domain.corp
Description:
Cluster resource 'File Share Witness (\\hup01.na.Domain.corp\DAG01.na.Domain.corp)' in clustered service or application 'Cluster Group' failed.
The Cluster service is shutting down
Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          11/15/2010 3:22:33 PM
Event ID:      1177
Task Category: Quorum Manager
Level:         Critical
Keywords:     
User:          SYSTEM
Computer:      mbp04.na.Domain.corp
Description:
The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.


Log Name:      System
Source:        Service Control Manager
Date:          11/15/2010 3:22:34 PM
Event ID:      7024
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      MBP02.na.Domain.corp
Description:
The Cluster Service service terminated with service-specific error A quorum of cluster nodes was not present to form a cluster..
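
The quorum loss in event 1177 follows from simple vote counting. The sketch below is a simplified model, not the cluster service's actual algorithm. It assumes a Node and File Share Majority configuration in which each of the four nodes and the witness holds one vote, and that the witness share on HUP01 at site 1 was unreachable along with MBP01/MBP03 (which the repeated 1069 events suggest). The function name has_quorum is illustrative only.

```python
def has_quorum(total_nodes: int, nodes_up: int, witness_up: bool) -> bool:
    """Simplified Node and File Share Majority check.

    Assumes one vote per node plus one vote for the file share witness.
    Quorum requires a strict majority of all configured votes.
    """
    total_votes = total_nodes + 1
    votes_present = nodes_up + (1 if witness_up else 0)
    return votes_present > total_votes // 2


# Before the test: 4 nodes and the witness, 5 of 5 votes.
print(has_quorum(4, 4, True))    # True
# Site 1 powered off and the witness on HUP01 unreachable: 2 of 5 votes.
print(has_quorum(4, 2, False))   # False
# Same node loss, but with a reachable witness: 3 of 5 votes.
print(has_quorum(4, 2, True))    # True
```

Under these assumptions MBP02 and MBP04 hold only two of five votes, which is consistent with the Cluster service stopping on both of them.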
Run Stop-DAG
[2010-11-15T20:30:50] mbp01 added to stopped list
[2010-11-15T20:30:50] mbp01 removed from started list
[2010-11-15T20:53:58] mbp03 added to stopped list
[2010-11-15T20:53:58] mbp03 removed from started list

[2010-11-15T21:02:29] updated the started servers list in AD:
[2010-11-15T21:02:29]     mbp04
[2010-11-15T21:02:29]     mbp02
[2010-11-15T21:02:29] updated the stopped servers list in AD:
[2010-11-15T21:02:29]     mbp01
[2010-11-15T21:02:29]     mbp03
Enable trace
Extra.exe





The .etl file is a binary file. It is normally converted with tools available to Microsoft support engineers; alternatively, tracerpt can convert the .etl file to an XML file:
tracerpt logfile1.etl -o -report
4:10 Restore DAG
[2010-11-15T21:10:15] Successfully resolved the servers based on the stopped servers list.
[2010-11-15T21:10:15] The following servers are in the StartedServers list (The list is all of the servers that are not in the StoppedServers list.):
[2010-11-15T21:10:15]     mbp04
[2010-11-15T21:10:15]     mbp02
[2010-11-15T21:10:15] The following servers are in the StoppedServers list:
[2010-11-15T21:10:15]     mbp01
[2010-11-15T21:10:15]     mbp03
[2010-11-15T21:10:15] Checking if clussvc is running on MBP04...
[2010-11-15T21:10:15] Clussvc is Stopped on MBP04.
[2010-11-15T21:10:15] Checking if clussvc is running on MBP02...
[2010-11-15T21:10:15] Clussvc is Stopped on MBP02.
[2010-11-15T21:10:15] None of the 2 servers in the started list had clussvc running, so it is safe to continue with restore-dag, but forcequorum will be necessary.

[2010-11-15T21:10:17] ForceQuorumIfNecessary: Checking if clussvc is running on MBP04...
[2010-11-15T21:10:17] ForceQuorum: clussvc is stopped on node MBP04. Starting it.
[2010-11-15T21:10:17] Starting cluster service with /fq (force quorum).
[2010-11-15T21:10:17] Waiting up to 00:01:00 for clussvc to be in the Running state.
[2010-11-15T21:10:17] ForceQuorumIfNecessary: Checking if clussvc is running on MBP02...
[2010-11-15T21:10:17] ForceQuorum: clussvc is stopped on node MBP02. Starting it.
[2010-11-15T21:10:17] Starting cluster service in normal mode.
[2010-11-15T21:10:17] Waiting up to 00:01:00 for clussvc to be in the Running state.

Cluster Info:
Nodes
[2010-11-15T21:10:20]     node: MBP01.na.Domain.corp [ state = Down ]
[2010-11-15T21:10:20]     node: MBP02.na.Domain.corp [ state = Joining ]
[2010-11-15T21:10:20]     node: MBP03.na.Domain.corp [ state = Down ]
[2010-11-15T21:10:20]     node: MBP04.na.Domain.corp [ state = Up ]


eviction operation:
[2010-11-15T21:10:21] Running the eviction operation by issuing an RPC to the replay service on 'MBP04.na.Domain.corp'...
[2010-11-15T21:10:43] Server 'mbp03' is still a node in the cluster, and will have to be evicted.
[2010-11-15T21:10:43] Updated Progress 'Evicting MBP03.' 55%.
[2010-11-15T21:10:43] Working
[2010-11-15T21:10:43] Running the eviction operation by issuing an RPC to the replay service on 'MBP04.na.Domain.corp'...
[2010-11-15T21:11:06] The following log entry comes from a different process that's running on machine 'MBP04.na.Domain.corp'.

Check File Share Witness:
[2010-11-15T21:11:06] Checking that the file share witness server (hup02.na.Domain.corp) isn't one of the servers in the DAG.
[2010-11-15T21:11:06] Checking if the FSW server can be queried with WMI.
[2010-11-15T21:11:06] The boot time of the FSW was '11/15/2010 7:30:40 PM' (MaxValue if the call failed).
[2010-11-15T21:11:06] Creating the file share of the FSW...
[2010-11-15T21:11:06] There are 2 started servers in the cluster, which is an even number. That requires a file share witness!
[2010-11-15T21:11:06] Working
[2010-11-15T21:11:11] CreateFileShareWitnessQuorum: There is already a FSW, but the current share path (\\hup01.na.Domain.corp\DAG01.na.Domain.corp) is not what's desired (\\hup02.na.Domain.corp\DAG01.na.Domain.corp). Will try to fix it.
[2010-11-15T21:11:11] The fsw resource is now in state Online.
[2010-11-15T21:11:11] The FSW resource is now in state Online.
[2010-11-15T21:11:11] The current quorum resource is 'File Share Witness (\\hup01.na.Domain.corp\DAG01.na.Domain.corp)'. About to set it to the FSW.
[2010-11-15T21:11:11] The quorum resource is now 'File Share Witness (\\hup01.na.Domain.corp\DAG01.na.Domain.corp)'.


For DAGs with only two members, DAC mode also uses the boot time of the DAG's witness server to determine whether it can mount databases on startup. The boot time of the witness server is compared to the time when the DACP bit was set to 1.
·         If the time the DACP bit was set is earlier than the boot time of the witness server, the system assumes that the DAG member and witness server were rebooted at the same time (perhaps because of power loss in the primary datacenter), and the DAG member isn't permitted to mount databases.
·         If the time that the DACP bit was set is more recent than the boot time of the witness server, the system assumes that the DAG member was rebooted for some other reason (perhaps a scheduled outage in which maintenance was performed or perhaps a system crash or power loss isolated to the DAG member), and the DAG member is permitted to mount databases.
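
The two rules above amount to a single timestamp comparison. The sketch below only illustrates that comparison and is not how Exchange implements it; the function name may_mount_on_startup and the DACP timestamps are hypothetical, while the witness boot time is the value reported in the FSW check log. The rules do not say what happens when the two times are equal, so this sketch assumes mounting is not permitted in that case.

```python
from datetime import datetime


def may_mount_on_startup(dacp_set_time: datetime, witness_boot_time: datetime) -> bool:
    """Apply the two-member DAC rule described above.

    Mounting is permitted only if the DACP bit was set after the witness
    server booted. Equal timestamps are treated as "not permitted"
    (an assumption, since the rule does not cover that case).
    """
    return dacp_set_time > witness_boot_time


# Witness boot time taken from the log: 11/15/2010 7:30:40 PM.
witness_boot = datetime(2010, 11, 15, 19, 30, 40)

# Hypothetical: DACP bit set before the witness booted, so both were
# probably restarted together and mounting is blocked.
print(may_mount_on_startup(datetime(2010, 11, 15, 19, 0, 0), witness_boot))   # False

# Hypothetical: DACP bit set after the witness booted, so only the DAG
# member was restarted and mounting is allowed.
print(may_mount_on_startup(datetime(2010, 11, 15, 20, 0, 0), witness_boot))   # True
```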





[2010-11-15T21:11:11] Bringing the quorum resource online...
MBP01/03 back online
[2010-11-15T21:25:31] commandline:         $scriptCmd = {& $wrappedCmd @PSBoundParameters }
[2010-11-15T21:25:31] Option 'Identity' = 'dag01'.
[2010-11-15T21:25:31] Option 'MailboxServer' = 'mbp01'.
[2010-11-15T21:25:39] Updated Progress 'Joined server 'mbp01' to the cluster
[2010-11-15T21:26:17] Updated Progress 'Joining mbp03
Force Witness
During the datacenter switchover process, the DAG was configured to use an alternate witness server. The DAG must be reconfigured to use a witness server in the primary datacenter. If you are using the same witness server and witness directory that was used prior to the primary datacenter outage, you can run the Set-DatabaseAvailabilityGroup -Identity DAGName command. If you plan on using a witness server or witness directory that is different from the original witness server and directory, use the Set-DatabaseAvailabilityGroup command to configure the witness server and witness directory parameters with the appropriate values.