Decommissioning Exchange 2010 DAG

Posted on February 5, 2013 by Michel de Rooij

I received a question on if it was possible to decommission a DAG, so that the Exchange 2010 servers would become stand-alone Exchange servers and the databases remain available on one server, freeing up other mailbox servers. I assume the customer has valid reasons for wanting to do so, like downsizing without requirements justifying the DAG. To answer that question: of course that is possible. Now, while many blogs are happy to tell you how to create a DAG there aren’t many on how to dismantle one, so here goes.

For this blog I use a small setup which consists of a single DAG (DAG1) with member servers L14EX1 and L14EX2 hosting two databases, MDB1 and MDB2; both servers host an active copy.

In this example we’re going to decommission DAG1, where in the end the L14EX1 will host both databases and L14EX2 is freed up.

Before we decommission the DAG, we’ll reorganize the active databases so when removing database copies we’re removing passive copies. We’ll start by checking if the health status of the DAG:

Get-MailboxDatabaseCopyStatus *

We see databases are mounted and copies are in a healthy state. Next, we’ll active the copies on the L14EX1, because we’ll be freeing up the L14EX2:

Move-ActiveMailboxDatabase –Server L14EX2 –ActivateOnServer L14EX1 –Confirm:$false

Verify the databases are now properly mounted on the L14EX1:

Next, we’ll be removing the passive copies hosted on the L14EX2. Use Get-MailboxDatabaseCopyStatus instead of Get-MailboxDatabase because Remove-MailboxDatabaseCopy needs the database name specified together with the server name hosting the copy, e.g. “SERVER\DATABASE”. Note that after removing the copy, the files are still present on the file system which you need to clean up manually:

Get-MailboxDatabaseCopyStatus –Server L14EX2 | Remove-MailboxDatabaseCopy –Confirm:$false

With all passive database copies removed, we can now remove the L14EX2 from the DAG. Note that when removing a non-last member server, the node will also be evicted from the cluster and the quorum will be adjusted when necessary.

Remove-DatabaseAvailabilityGroupServer –Identity DAG1 –MailboxServer L14EX2

Next, do the same thing for the remaining node, the L14EX1. Note that this server still hosts (active) database copies which is ok; the cmdlet will detect this is the last member server of the DAG and will also remove the cluster object.

After the last member server has been removed from the DAG, we now have an empty DAG object which we can remove:

Remove-DatabaseAvailabilityGroup –Identity DAG1 –Confirm:$false

Et voila, L14EX1 now hosts both databases and the L14EX2 is freed up and you can uninstall Exchange from that server if required.

Kindly leave your comments if you have any questions.

TechEd North America 2011 sessions

Posted on May 20, 2011 by Michel de Rooij

With the end of TechEd NA 2011, so ends a week of interesting sessions. Here’s a quick overview of recorded Exchange-related sessions for your enjoyment:

Virtualized Exchange 2010 SP1 UM, DAGs & Live Migration support

Posted on May 14, 2011 by Michel de Rooij

Today, just before TechNet North America 2011, Microsoft published a whitepaper on virtualizing Exchange 2010, “Best Practices for Virtualizing Exchange Server 2010 with Windows Server® 2008 R2 Hyper V“.

There are some interesting statements in this document which I’d like to share with you, also after the Exchange team published an article on supported scenarios regarding virtualized shortly shortly after this paper was published.

First, as Exchange fellow Steve Goodman blogged about, a virtualized Exchange 2010 SP1 UM server role is now supported, albeit under certain conditions. More information on this at Steve’s blog here.

The second thing is that live migration, or any form of live migration offered by products validated through the Windows Server Virtualization Program (SVVP) program, is now supported for Exchange 2010 SP1 Database Availability Groups. Until recently, the support statement for DAGs and virtualization was:

“Microsoft does not support combining Exchange high availability (DAGs) with hypervisor-based clustering, high availability, or migration solutions that will move or automatically failover mailbox servers that are members of a DAG between clustered root servers. DAGs are supported in hardware virtualization environments provided that the virtualization environment doesn’t employ clustered root servers, or the clustered root servers have been configured to never failover or automatically move mailbox servers that are members of a DAG to another root server.”

The Microsoft document on virtualizing Exchange Server 2010 states the following on page 29:

“Exchange server virtual machines, including Exchange Mailbox virtual machines that are part of a Database Availability Group (DAG), can be combined with host-based failover clustering and migration technology as long as the virtual machines are configured such that they will not save and restore state on disk when moved or taken offline. All failover activity must result in a cold start when the virtual machine is activated on the target node. All planned migration must either result in shut down and a cold start or an online migration that utilizes a technology such as Hyper-V live migration.”

The first option, shutdown and cold start, is what Microsoft used to recommend for DAGs in VMWare HA/DRS configurations, i.e. perform an “online migration” (e.g. vMotion) of a shut down virtual machine. I blogged about this some weeks ago here since VMWare wasn’t always clear about this. Depending on your configuration, this might not be a satisfying solution when availability is a concern.

The online migration statement is new as well as the host-based fail-over clustering. In addition, though the paper is aimed at virtualization solutions based on Hyper-V R2, the Exchange Team article is more clear on supported scenarios for Exchange 2010 SP1 with regards to 3rd party products (VMware HA); if the product is supported through the SVVP program, usage of Exchange DAGs are supported. Great news for environments running or considering virtualizing their Exchange components.

Be advised that in addition to the Exchange team article, the paper states the following additional requirements and recommendations as best practice:

Exchange Server 2010 SP1;
Use Cluster Shared Volumes (CSV) to minimize offline time;
The DAG node will be evicted when offline time exceeds 5 seconds. If required, increase the heartbeat timeout to maximum 10 seconds;
Implementation of latest patches for the hypervisor;
For live migration network:
– Enable jumbo frames and make sure network components support it;
– Change receive buffers to 8192;
– Maximize bandwidth.

Note that on May 17th the DAG support statement for Exchange 2010 SP1 on TechNet was updated to reflect this. However, the last two sentences might restart those “are we supported” discussions again:

“Hypervisor migration of virtual machines is supported by the hypervisor vendor; therefore, you must ensure that your hypervisor vendor has tested and supports migration of Exchange virtual machines. Microsoft supports Hyper-V Live Migration of these virtual machines.”

So, if vendor A, e.g. VMWare, has tested and supports vMotioning DAGs with their hypervisor X, Microsoft will support Live Migration for virtual machines on hypervizor X using Hyper-V? Now what kind of statement is that?

(Updates: May 16th – statements from EHLO blog, May 17th – mention updated TechNet article)

VMWare HA/DRS and Exchange DAG support

Posted on April 19, 2011 by Michel de Rooij

Last year an (online) discussion took place between VMWare and Microsoft on the supportability of Exchange 2010 Database Availability Groups in combination with VMWare’s High Availability options. Start of this discussion were the Exchange 2010 on VMWare Best Practices Guide and Availability and Recovery Options documents published by VMWare. In the Options document, VMWare used VMware HA with DAG as an example and contains a small note on the support issue. In the Best Practices Guide, you have to turn to page 64 to read in a side note, “VMware does not currently support VMware VMotion or VMware DRS for Microsoft Cluster nodes; however, a cold migration is possible after the guest OS is shut down properly.” Much confusion rose; was Exchange 2010 DAG supported in combination with those VMWare options or not?

In a reaction, Microsoft clarified their support stance on the situation by this post on the Exchange Team blog. This post reads, “Microsoft does not support combining Exchange high availability (DAGs) with hypervisor-based clustering, high availability, or migration solutions that will move or automatically failover mailbox servers that are members of a DAG between clustered root servers.” This meant you were on your own when you performed fail/switch-overs in an Exchange 2010 DAG in combination with VMWare VMotion or DRS.

You might think VMWare would be more careful when publing these kinds of support statements. Well, to my surprise VMWare published a support article 1037959 this week on “Microsoft Clustering on VMware vSphere: Guidelines for Supported Configurations”. The support table states a “Yes” (i.e. is supported) for Exchange 2010 DAG in combination with VMWare HA and DRS. No word on the restrictions which apply to those combination, despite the reference to the Best Practices Guide. Only a footnote for HA, which refers to the ability to group guests together on a VMWare host.

I wonder how many people just look at that table, skip those guides (or overlook the small notes on the support issue) and think they will run a supported configuration.

Exchange 2010 Replication & SP1 Block Mode

Posted on November 3, 2010 by Michel de Rooij

As of Exchange 2007, replication – or to be exact, continuous replication – is used to create database copies to offer high availability and resilience options. This form of replication uses log shipping, meaning each log file is filled with transaction information up until the log file size limit of 1 MB. Then, the log file is shipped by the Exchange Replication Service to the passive copies where it is inspected and replayed against the passive copy.

For example, in the diagram below we have a DAG with 2 members. There’s an active database copy, DB(A) and a passive database copy, DB(P). Log files are generated on the node hosting DB(A) which are copied to the 2nd member where they are replayed against the database DB(P). The first three log files (EX*1-3) were copied to the 2nd node, the first two log files (EX*1-2) were inspected and replayed and the 3rd (EX*3) is still being processed. Meanwhile, new transactions are being stored in a new log file (EX*4).

You’ll see that because the Exchange Replication Service will only replicate 100% filled log files before shipping them, there’s a potential risk of information loss.

With the introduction of Exchange 2010 SP1 a new mode is added to the replication engine, namely continuous replication block mode. To prevent confusion, as of SP1, the existing mode is referred to as continuous replication file mode.

In block mode, each transaction is shipped directly to the passive copies where it will be buffered as well. When the log file size limit is reached, the host with the passive copy will generate it’s own log file (and inspect it), so the process of generating, inspecting and replaying log files remains unchanged.

The benefit of this mechanism is that there’s less chance of losing information and chance of losing less information, because buffered, unlogged transactions are also stored – in parallel – in buffers on passive copies. During a fail-over, when in block mode, the buffered information will be processed as part of the recovery process. A new log file will be generated using the (partial) information from the buffer, after which the regular recovery process takes place.

On the downside the Exchange Replication Service becomes more chatty on the network as each transaction is shipped individually instead of bundling them together, which is more efficient. That’s however a small price to pay for near-instant replication.

The process of switching to or from block mode is automatic. Initially, the replication is in file mode. When passive copies are current, it switches to block mode. It’ll automatically switch back to file mode when the replication process falls too far behind, i.e. the copy queue contains too many log files.

If you want check if replication is in file or block mode, there’s a BlockReplication section in the Eventlog. Unfortunately, it remains empty, even after setting the logging level of MSExchange*\* to Expert level (and restarting MsExchangeRepl and MSExchangeIS).

There’s a TechNet article here which mentions you can monitor the performance counter “MSExchange Replication\Continuous replication – block mode Active” using Performance Monitor or Get-Counter. For example, to check if block mode is active use the following:

Get-Counter -ComputerName <DAGID> -Counter “\MSExchange Replication(*)\Continuous replication – block mode Active”

Curious is the behaviour to activate block mode is controllable, I used Sysinternal’s procmon to investigate which registry keys were accessed. It turns out that when starting MsExchangeRepl, there are some interesting registry accesses regarding block mode, when looking for the word “granular”:

That “DisableGranularReplication” setting might imply there’s a way to prevent block mode. Note that all the keys shown above are not present in registry and I can’t find any information on them. I guess Microsoft doesn’t want people to fiddle with these settings, which makes sense since you are likely to break or negatively influence the process. And the last thing you want is a unreliable, lagging replication process because someone tried “tuning” things.

Decommisioning a DAG member

Posted on September 3, 2010 by Michel de Rooij

While I was at it, I thought I might as well blog about it. In the situation you need to dismantle an Exchange 2010 DAG member, proceed as follows:

Start up Exchange Management Console (ofcourse you can do this from the Exchange Management Shell as well, but in this example I’ll use EMC);
Go to Organization Configuration > Mailbox > Database Management;
Select the database where ” Mounted on Server” reads the server you’re decommissioning;
Select Move Active Mailbox Database;
Select the Mailbox server to host the mailbox database copy and select Move;
When the move has finished, select the database copy hosted on the server you want to decommission in the lower pane. There, select Remove.
When finished you’re greeted with a warning message. That is because the copy has been removed, but the files are still present. Remove them manually when required;
Next, we need to remove the server from the DAG. Select tab Database Availability Groups;
Select the DAG the server is a member of and select Manage Database Availability Group Membership;
Select the server and click the red cross to remove it from the list. Click Manage to proceed with the actual removal;
When finished the mailbox server is no longer member of the DAG.

You can then proceed with uninstalling Exchange. Note that you can only remove DAG members of a healthy DAG.

DAC: Changes in Exchange 2010 SP1 (Part 3)

Posted on September 1, 2010 by Michel de Rooij

Part 1: Active Manager, Activate!
Part 2: Datacenter Activation Coordination Mode

In the first two articles I discussed on Exchange 2010 Active Manager and Datacenter Activation Coordination (DAC) mode in Exchange 2010 RTM. But Exchange 2010 Service Pack 1 (SP1) introduces changes related to DAC, which I’ll discuss in this post.

Supported Configurations
To start with, DAC mode support has been extended in Exchange 2010 SP1 to support all 2 DAG configurations with 2 or more members. This is great, since you can now enable DAC mode for 2-member DAGs. Like I explained in the 2^nd part, split brain syndrome isn’t unlikely, all the more with 2 nodes given the 50/50 situation. Implementing SP1 enables you to leverage DAC mode for the simplest form of mailbox database resilience, using DAGs with 2 members over 2 sites configurations. When required, DAC in SP1 will use the Witness Server to provide the necessary arbitration.

Another thing is that SP1 doesn’t have the requirement of being enabled for DAGs in at least 2 Active Directory sites. This is good news for customers who have their Active Directory organized in a single site located over multiple locations, e.g. stretched VLANs.

Planning
When implementing SP1 on DAG members, you must implement SP1 on all DAG members as soon as possible. Reason is that DAG members running Exchange 2010 RTM can move their databases to a DAG member running Exchange 2010 SP1, but not vice versa. So, do not postpone implementation of SP1 on additional DAG members after implementing it on the first, as it impacts your failover and switchover options. Worst case when not doing so, is ending up in the situation where you cannot activate databases on a server because it doesn’t contain SP1.

Alternate Witness Server
In SP1 you can configure the Alternate Witness Server and Directory using the Exchange Management Console. This location can be used to preconfigure the Alternate Witness Server used during switchover or failover to the secondary datacenter. The configured value will be picked up automatically using the Restore-DatabaseAvailabilityGroup cmdlet during a datacenter switchover, when not explicitly specifying AlternateWitnessServer and AlternateWitnessDirectory there.

Note that this location could already be configured in Exchange 2010 RTM using the Set-DatabaseAvailabilityGroup using the AlternateWitnessDirectory and AlternateWitnessServer options.

Conclusion
DAC is a useful option that each administrator running DAGs on Exchange should consider enabling. But be aware of the caveats, like the requirement of all nodes to be able to communicate with each other during start up. All in all, DAC is a helpful option as it not only prevents issues like split brain syndrome, but it also makes the process of switching datacenters easier and less prone to error. Exchange 2010 SP1 extends the number of possible configuration in which to implement DAC, making DAC an option for the masses.

I hope you found this 3-part post useful, if you still got questions do not hesitate asking me.

Part 1: Active Manager, Activate!
Part 2: Datacenter Activation Coordination

In the first two articles I discussed on Exchange 2010 Active Manager and Datacenter Activation Coordination (DAC) mode in Exchange 2010 RTM. But Exchange 2010 Service Pack 1 (SP1) adds some nice features to DAC mode, which I’ll discuss in this 3^rd article.

Note that this location could already be configured in Exchange 2010 RTM using the Set-DatabaseAvailabilityGroup using the AlternateWitnessDirectory and AlternateWitnessServer options.

DAC: Datacenter Activation Coordination Mode (Part 2)

Posted on August 30, 2010 by Michel de Rooij

Part 1: Active Manager, Activate!
Part 3: DAC and Exchange 2010 SP1

In an earlier article I elaborated on Exchange 2010’s Active Manager, what role it plays in the Database Availability Groups concept and how this role is played. In this article I want to discuss the Datacenter Activation Coordination (DAC) mode, what it is, when to use it and when not.

Note that the following information is based on Exchange 2010 RTM behavior. A separate Exchange 2010 SP1 follow-up will be posted describing changes found in Exchange 2010 SP1.

To understand the requirement for Datacenter Activation Coordination, imagine an organization running Exchange 2010. For the purpose of high availability and resilience they have implemented a DAG running on four Mailbox Servers, stretched over 2 sites running in separate data centers, as depicted in the following diagram:

Types of Failure
Before digging into Datacenter Coordination Mode, I first want to name certain types of failures. This is important, because DAC’s goal is to address situations caused by a certain type of failure. You should distinguish between the following types of failure:

Singe Server Failure – A single server fails. The server needs recovery (availability, fail over automatic);
Multiple Server Failure – Multiple servers fail. Each server needs recovery (availability, automatic);
Site Failure – All components in a site (datacenter) fail. Site recovery needs to be initiated (resilience, manual).

What you need to remember of this list is that each type of failure is different, from the level of impact to the actions required for recovery.

Quorum
With an odd number of DAG members, the Node Majority Set (NMS) model is used, which means a number of (n/2)+1 voters (DAG members) is required to obtain quorum, rounded downward when it’s not a whole number. Obtaining quorum is important because that determines which Active Manager gets promoted to PAM and the PAM can give the green light to activate databases.

With an even number of DAG members, the Node and File Share Majority Set (NMS+FSW) model is used. This means an additional voter is introduced in the form of a File Share Witness (FSW) located on a so called Witness Server. This File Share Witness is used for quorum arbitration. Regarding the location of this File Share Witness, best practice is to put it on a Hub Transport server in the same site as the primary mailbox servers. When combining roles, e.g. Mailbox + Hub Transport, put the FSW on another (preferably e-mail related) server.

So, given this information and knowing how quorum is obtained, we can construct the following table regarding quorum voting. As we can see, when using 4 nodes as in our example scenario, we require a File Share Witness and a minimum of 3 voters to obtain quorum.

DAG members	Model	Voters Required
2	NMS+FSW	2
3	NMS	2
4	NMS+FSW	3
5	NMS	3
10	NMS+FSW	6
15	NMS	8

Site Resilience
Consider our example with the primary datacenter failing. Damage is substantial and recovery takes a significant amount of time and you decide to fall back on the secondary datacenter (site resilience). That would at least require reconfiguring the DAG, because the remaining DAG members can’t obtain quorum on their own since they form a minority.

So you remove the failed primary datacenter components from the DAG, force quorum for the secondary datacenter and reconfigure cluster mode or Witness Server (depending on the number of remaining DAG members). After reconfiguring, the remaining DAG members can obtain quorum because they can now form a majority. And, because the DAG members in de secondary datacenter can obtain quorum, the Active Manager on the quorum owner becomes Primary Active Manager and the process of best copy selection, attempt copy last logs and activation starts.

Split Brain Syndrome
Consider your secondary datacenter is up and running and you start recovering the primary datacenter. You recover the server hosting the File Share Witness and both servers; network connection is still down. A problem may arise, because the two recovered servers together with the File Share Witness form a majority according to their knowledge. So, because they have quorum they are free to mount databases resulting in divergence from the secondary datacenter, the current state.

This situation is called split brain syndrome, because both DAG members in each datacenter can’t communicate with DAG members in the other datacenter. Both groups of DAG members may determine they have a majority. Split brain syndrome can also occur because of network or power outages, depending on the configuration and how the failure manifests.

Datacenter Activation Coordination
To prevent these situations, Exchange has a special DAG mode called Datacenter Activation Coordination mode. DAC adds an additional requirement for DAG members during startup, being the ability to communicate with all known DAG members or contact a DAG member which states it’s OK to mount databases.

In order to achieve this, a protocol was devised called Datacenter Activation Coordination Protocol (DACP). The way this protocol works is shown in the following diagram:

During startup of a DAG member, the local Active Manager determines if the DAG is running in DAC mode or not;
If running in DAC mode, an in-memory DACP flag is set to 0. This tells Active Manager not to mount its databases;
If the DACP flag is set to 0, Active Manager queries the DACP flags of all other DAG members it has knowledge of. If one of those DAG members responds with 1, the local Active Manager sets the local DACP flag to 1 as well;
If the Active Manager determines it can communicate with all DAG members it has knowledge of it sets the local DACP flag to 1;
If the DACP flag is set to 1, Active Manager may mount its databases.

Note:

So, assume we enabled DAC for our example configuration and we recover the servers in the primary datacenter with the network connection still down. Those servers are still under the assumption that the FSW is located in the primary datacenter so – according to knowledge of the original configuration – they have majority. When starting up, their DACP flag is set to 0. However, they can’t reach a DAG member with a DACP flag set to 1 nor can they contact all DAG members they know about. Therefore, the DAG members in the primary site will not mount any databases, not causing split brain syndrome nor divergence.

If the recovered servers in the primary datacenter come online and the network is already up, the nodes will also not mount their databases because part of the procedure for switching datacenters is removing the primary datacenter DAG members from the DAG configuration. So, the DAG members in the primary datacenter contain invalid information and will be denied by the DAG members in the secondary datacenter.

Implementing DAC
Datacenter Activation Mode is disabled by default. To enable DAC, use the Set-DatabaseAvailabilityGroup cmdlet using the DataCenterActivationMode parameter, e.g.

Set-DatabaseAvailabilityGroup –Identity <DAGID> –DatacenterActivationMode DagOnly

Note that DagOnly and Off are the only options for the DatacenterActivationMode parameter.

Monitoring

If you’ve configured the DAG for DAC mode, and LogLevel is sufficient, you can monitor the DAG startup process using the EventLog. The Active Manager holding quorum check status every 10 seconds. It is responsible for keeping track of the status of the other DAG members. When sufficient DAC members are registered online, it will promote itself to PAM (like in non-DAC mode), which functions as the “green light” for the other Active Managers.

The Active Manager on the other DAG members will periodically check if consensus has been reached:

If the Active Manager holding quorum has promoted itself to PAM, the Active Manager on the other nodes will become SAM. After this the activation and mounting procedure will start.

Limitations
Unfortunately, it’s not an all good news show. DAC mode in Exchange 2010 RTM can only be enabled when using a DAG with 3 or more DAG members distributed over at least 2 Active Directory sites. This means DAC can’t be used in situations where you have 2 DAG members or when all DAG members are located in the same site. This makes sense for the following reasons:

In Exchange 2010 RTM, DAC only looks at the DACP flag querying DAG members. The FSW plays no part in it;
DAC is meant to prevent split brain syndrome which normally only can occur between multiple sites.

When you try to enable DAC using a 2 DAG member configuration, you’ll encounter the following message:

Database Availability Group <DAGID> cannot be set into datacenter activation mode, since it contains fewer than three mailbox servers.

When you try to enable DAC using a single site, the following error message will show up:

Database availability group <DAGID> cannot be set into datacenter activation mode, since datacenter activation mode requires servers to be in two or more sites.

Note that this message will also show up if you didn’t define sites in Active Directory Sites and Services at all, so make sure you define them properly.

But there is also good news: Exchange 2010 SP1 supports all DAG configurations. I’ll discuss this and other changes in Exchange 2010 SP1 DAC mode in a follow up article.

Additional reading
More information on Datacenter switchovers and the procedure to activate a second datacenter using DAGs in non-DAC as well as DAC mode can be read in this TechNet article. Make sure you compare the actions to perform for DAC and non-DAC setups and see that DAC makes life of the administrator much easier and the procedure less prone to error.

Blocking automatic activation in DAGs

Posted on August 24, 2010 by Michel de Rooij

After the post on Exchange 2010’s Active Manager I received a question on the possibilities to block automatic activation of database copies in a DAG. There could be legitimate reasons for wanting this, like when you want to prevent remote database copies in a secondary data center from being activated automatic.

The blockade can be created on two levels:

Server – this prevents automatic activation for any database copy hosted on that server;
Database Copy – this prevents automatic activation for a specific database copy hosted on a specific server.

To block all database copies on DAG member <ServerID> from becoming activated automatically, use:

Set-MailboxServer –Identity <ServerID> – DatabaseCopyAutoActivationPolicy Blocked

To enable all database copies on DAG member <ServerID> for automatic activation again, use:

Set-MailboxServer –identity <ServerID> –DatabaseCopyAutoActivationPolicy Unrestricted

To block automatic activation on the database copy level, use the Suspend-mailboxDatabaseCopy. For example, to block the database copy of DatabaseID from automatic activation on ServerID, use:

Suspend-MailboxDatabaseCopy –identity <DatabaseID>\<ServerID> –ActivationOnly

To enable automatic activation again for this database copy on the specified server, use Resume-MailboxDatabaseCopy, like:

Resume-MailboxDatabaseCopy –identity <DatabaseID>\<ServerID>

Be advised that contrary to what the name of the cmdlet might suggest, using Suspend in conjunction with ActivationOnly and Resuming an activation blocked database copy does not affect the replication process for that database copy.

DAC: Active Manager, Activate! (Part 1)

Posted on August 17, 2010 by Michel de Rooij

Part 2: Datacenter Activation Coordination Mode
Part 3: DAC and Exchange 2010 SP1

On the list for potential writing subjects is Datacenter Activation Coordination mode found in Exchange 2010. But to understand Datacenter Activation Coordination, you need to know a thing or two about the Active Manager. So consider this a multi-part article, where I’ll first elaborate on the Active Manager.

Roles
The Active Manager is the director of High Availability for Exchange 2010 and runs on every mailbox server. An Active Manager running on non-DAG mailbox servers is called a Standalone Active Manager. When running in Standalone mode, the Active Manager checks Active Directory every 30 seconds for any topology changes. When it detects that the server has been added to a DAG, the Active Manager will switch to DAG mode.

Active Manager running on mailbox servers within a Database Availability Group (DAG) run in DAG mode and can have two different roles, Standby Active Manager (SAM) and Primary Active Manager (PAM). All mailbox servers within a DAG have the SAM role except for one mailbox server which has the PAM role. Which mailbox server runs PAM is determined by the quorum; PAM always follows the quorum ownership, which means if a server fails, the server which obtains quorum seizes the PAM role. Active Manager mode or role selection changes are logged in the event log:

Active Manager changed from ‘[PAM|SAM|Standalone] ‘ to ‘[PAM|SAM|Standalone]’ (EventID 111);
Active manager configuration change detected. (PreviousRole=’..’, CurrentRole=’..'”, …) (EventID 227);
Active manager on this server is running as [PAM|SAM|Standalone] for duration xx:xx:xx.xx (AuthoritativeServer=…) (EventID 325).

You can also check which mailbox server currently holds the PAM role using the Get-DatabaseAvailabilityGroup cmdlet from the Exchange Management Shell in conjunction with the Status parameter. The value you should be looking for is the PrimaryActiveManager which contains the mailbox server currently holding the PAM role:

All Active Managers, both PAM and SAMs, are responsible for monitoring the health of databases mounted on the local server, from an information store as well as an Extensible Storage Engine (ESE) perspective. The actual monitoring is performed by the Exchange Replication Service which reports failures to the local Active Manager. When a failure is detected, the Active Manager notifies the PAM to initiate a fail-over. When the server with the PAM and having quorum fails, the quorum will move to another server and the PAM role will be seized by that server. Because PAM stores DAG state information in the cluster database (located on the quorum), the PAM role can move without consequences.

Activation Preference and Blocking
All this tracking of active and passive databases and coordinating database activation during switch-overs or fail-overs happens automatically. However, administrators can have some influence on the selection process. First they can configure the preferred activation order by using the Exchange Management Console.

The Activation Preference is configurable from the Exchange Management Shell by using the Set-MailboxDatabaseCopy cmdlet by specifying the ActivationPreference parameter:

Set-MailboxDatabaseCopy -Identity DB1\MBX1 -ActivationPreference 2

Another option that administrators have is to block activation at the server level using the Set-MailboxServer cmdlet in conjunction with the DatabaseCopyAutoActivationPolicy parameter:

Set-MailboxServer –Identity MBX3 – DatabaseCopyAutoActivationPolicy Blocked

The DatabaseCopyAutoActivationPolicy parameter can take 3 possible values:

Blocked
Prevent automatic database activation on selected mailbox server;
IntrasiteOnly
Only allow automatic activation within the same Active Directory site to prevent cross-site failover;
Unrestricted
No restrictions.

The process PAM follows to recover after being notified or detection of failure are:

Run Best Copy Selection (BCS) process;
Run Attempt Copy Last Logs (ACLL) process;
Issue mount request to appropriate MSExchangeIS process:
1. If the mount is successful, make the database available to clients. Note that the Exchange Replication service will recover any lost messages from Transport Dumpster for the activated database;
2. If the mounts fails, repeats steps 2-4 for the next best copy.

I’ll discuss these steps in the next paragraphs. Be advised that besides conducting the DAG, PAM is also responsible for providing information to other Exchange components, like the RPC Client Access service and Hub Transport Server for directing their client-side components to the appropriate mailbox server hosting the active copy of a database.

Best Copy Selection
The Best Copy Selection process starts by creating a list of available copies, ignoring unreachable or blocked entries. It then sorts the list on the amount of data loss which is based on the Copy Queue Length. It then uses a set of 10 criteria to try to determine which database copy to activate. When multiple mailbox servers have an equal copy queue length number, the Activation Preference is used as secondary sort key to order these.

This is the ordered list of criteria used by the BCS to determine the best copy. The conditions in both Copy Queue Length and Replay Queue Length are related to the number of log files:

Criteria Nr.	Copy Queue Length	Replay Queue Length	Content Index Status	Database Status
1	<10	<50	Healthy	Healthy, DisconnectedAndHealthy, DisconnectedAndResynchronizing or SeedingSource
2	<10	<50	Crawling	Healthy, DisconnectedAndHealthy, DisconnectedAndResynchronizing or SeedingSource
3		<50	Healthy	Healthy, DisconnectedAndHealthy, DisconnectedAndResynchronizing or SeedingSource
4		<50	Crawling	Healthy, DisconnectedAndHealthy, DisconnectedAndResynchronizing or SeedingSource
5		<50		Healthy, DisconnectedAndHealthy, DisconnectedAndResynchronizing or SeedingSource
6	<10		Healthy	Healthy, DisconnectedAndHealthy, DisconnectedAndResynchronizing or SeedingSource
7	<10		Crawling	Healthy, DisconnectedAndHealthy, DisconnectedAndResynchronizing or SeedingSource
8			Healthy	Healthy, DisconnectedAndHealthy, DisconnectedAndResynchronizing or SeedingSource
9			Crawling	Healthy, DisconnectedAndHealthy, DisconnectedAndResynchronizing or SeedingSource
10				Healthy, DisconnectedAndHealthy, DisconnectedAndResynchronizing or SeedingSource

Note that there are two special situations which might occur after PAM determined the best copy:

When multiple copies turn out to be elegible, the activation order setting, configured by the Activation Preference setting, will be decisive. For example, when copy A with an Activation Preference of 1 and copy B with an Activation Preference of 4 both end up ex aequo, copy A will become active because of its lower Activation Preference setting;
When no database copy is found to be elegible for activation, the administrator has to fix the issue by bringing one of the database copies to a state where it satisfies BCS criteria or by activating a database accepting potential significant data loss.

The Best Copy Selection results logs the following event in the eventlog:

Got Logs?
Next, the Attempt Copy Last Logs (ACLL) process is initiated. This process is performed by the Exchange Replication Service on the server selected for the active copy. ACLL will try to copy any missing log files from the best source. It does this by querying their availability and health status but also the LastLogInspected values (which is a log generation number, the LastInpectedLogTime is the timestamp generation took place). The server with the highest LastLogInspected value (i.e. latest LastInspectedLogTime) is considered the best source for copying log files from.

You can also check the timestamp values by opening the Status tab of the database Properties view after selecting the copy you wish to inspect:

AutoDatabaseMountDial
After determining the best source, ACLL tries to replicate the log files from this source. If all missing log files could be replicated from this source, the database is mounted without loss. If for some reason replication was unsuccessful or the set of log files on the source is incomplete (missing logs), an amount of data is missing. Here is where the AutoDatabaseMountDial parameter becomes important. This setting, which is defined on server level using the Set-MailboxServer cmdlet, defines what number of missing logs ACLL finds acceptable before it will mount a database.

The AutoDatabaseMountDial can have the following settings:

BestAvailability (default)
Mount the database if the copy queue length ≤12. Those logs are replicated and the database is mounted;
GoodAvailability
Mount the database if the copy queue length ≤6. Those logs are replicated and the database is mounted;
Lossless
Only mount the database if the copy queue length is 0, meaning all logs on the original active copy have been replicated. In that case the database is mounted.

If the amount of missing logs is outside the range of the configured minimum, the administrator must take action by either recovering the missing logs or by forcing a database mount.

Finally, if ACLL reports success to PAM, PAM will try to mount the database by contacting the MSExchangeIS process on the selected server with the instruction to mount a certain database as active. An exception to this is when the maximum number of active databases for that server has been reached. This setting can be configured using Set-MailboxServer by specifying a numeric value between and for MaximumActiveDatabases:

Set-MailboxServer -Identity MBX2 -MaximumActiveDatabases 50

When the maximum number of active databases has been reached, PAM will move on to the next best copy.

Example
All this information calls for an example. In this example we have a DAG with four members, server MBX1-4. The DAG consists of 1 mailbox database, for which the copy on MBX1 is currently active. MBX2-4 contain a passive copy; the copy on MBX4 is lagged.

Given the information that all of the database copies are healthy, the copy on the MBX4 is blocked for activation (which is a best practice for lagged copies) and AutoDatabaseMountDial is left at BestAvailability, the current state of the DAG is as follows:

Database	Activation Preference	Copy Queue Length	Replay Queue Length	Content Index Status
MBX2\DB1	2	5	50	Healthy
MBX3\DB1	3	50	25	Crawling
MBX4\DB1	4	25	2500	Healthy

MBX1 experiences a failure. The PAM , after detecting failure, will start with the Best Copy Selection process:

The potential copies are sorted based on their copy queue length. The list PAM will use is {MBX2, MBX3}. MBX4 is skipped since it is blocked for automatic activation. Since there is no mailbox servers with identical copy queue lengths, the Activation Preference isn’t consulted.
The criteria are held against the candidate list. MBX2 matches criteria no.6 and MBX3 matches criteria no.4. Since PAM finds a match on MBX3 first (MBX3 matches the highest criteria, i.e. 4<6), MBX3 is selected for fail-over.

The ACLL process is now asked to try copying missing log files from the original source to MBX3. Since MBX1 is still failing, it cannot, so the missing number of log files remains at 50. Because the AutoDatabaseMountDial is set to BestAvailability, which means it requires the missing log files to be 12 or less, ACLL will fail and notifies PAM.

PAM will now try the next best copy. The next best copy is MBX2, which matched criteria 6. Again, ACLL is asked to copy the missing log files to MBX2. MBX1 is still failing, so it the missing log files remain at 5. However, this number satisfies the AutoDatabaseMountDial requirement of 12 or less, so ACLL reports success to PAM. In turn, PAM will activate the database on MBX2.

The story doesn’t end here because the Hub Transport Servers will be asked to resubmit any messages from a certain point in time. But since this article is about Active Manager, I’ll dedicate another article to that some time.

Manual Override
You read that circumstances might require the administrator to take action to get a database mounted. For example, this can happen with unhealthy databases or with an AutoDatabaseMountDial setting of Lossless. The administrator can fix the issue, with potential data loss, by overriding the Best Copy Selection criteria or AutoDatabaseMountDial setting.

The cmdlet to activate a database on another server is Move-ActiveMailboxDatabase and specifying the ActivateOnServer parameter. Used without additional parameters, the cmdlet is still subject to the BCS criteria and AutoDatabaseMountDial settings. To override these, use one or more of the following parameters:

SkipLagChecks
Overrides the criteria for copy and replay queue lengths;
MountDialOverride
Overrides the AutoDatabaseMountDial setting. Possible options are GoodAvailability (max. 6 logs missing), BestAvailability (max. 12 missing), Lossless (no logs lost), None (use AutoDatabaseMountDial setting) and BestEffort. BestEffort is only available with manual activation and accepts ANY number of missing logs. Note that by default Lossless will be used when nothing is specified; If you want to use the AutoDatabaseMountDial setting of the database, specify MountDialOverride:None;
SkipClientExperience
Skip checking the Content Index health state. This means you may activate a mailbox database copy with an incomplete or damaged index so you it might need recrawling;
SkipHealthChecks
Skip database health checks.

Note that you can also perform manual activation from the Exchange Management Console, but there you can only specify the MountDialOverride parameter:

Divergence
When the original mailbox server which held the original active copy comes back online, the copy will be passive because another mailbox server holds the PAM role. Obviously, the copy will be outdated and needs fixing. This state is called divergence.

When divergence is detected by the Exchange Replication Service, an incremental reseed will take place. To bring the copy up to date, it first will need to determine the divergence point by comparing log files. It will then ask the mailbox server holding the active copy for pages containing data that has changed since the divergence point. These pages are replayed in the database until the passive copy is in sync again; the other mailbox servers will ignore those pages (which are transported using log files). When in sync, the copy will remain passive.

Note that divergence can also be caused when the administrator take a (passive copy of) a database offline.

Summary
The Active Manager manages database copies, coordinates activation and determining actions when a fail-over needs to take place. The Active Manager can have three roles, depending on if a DAG is used and if its responsible for managing that DAG. Automatic activation of databases after fail-over is determined by a set of rules; the behaviour can be influenced or overridden by the administrator who must accept potential consequences like data loss.

You normally don’t have to worry about it these details, but to understand the high availability concepts introduced with Exchange Server 2010 properly a thing or two must be known about the technologies involved. In one of the next articles I want to discuss Datacenter Activation Mode, which relies on the Active Manager.