Exchange and VMWare Guest Introspection

In this long overdue article, I would like to share an experience where a customer was upgrading from Exchange 2010 to Exchange 2013. Note that this could also apply to customers migrating from Exchange 2007, or migrating to Exchange 2016. The Exchange 2013 servers were hosted on VMWare vSphere 5.5U2; the Exchange 2010 servers ran on an earlier vSphere release.

The customer saw a negative impact on the end-user experience of Outlook 2010 users, especially those working in Online Mode. Other web-based services, like Exchange Web Services (EWS), were affected as well; the OWA experience remained good.

Symptoms
After migrating end-user mailboxes from Exchange 2010 to Exchange 2013 (but as indicated, this applies to Exchange 2016 as well), end users reported delays in Outlook responses, where Outlook sometimes seemed to ‘hang’ when performing certain actions like accessing a Shared Mailbox. Also, when opening the meeting planner in order to schedule a room using the Scheduling Assistant, it could take a significant amount of time (minutes, even) before the schedule of all the rooms was displayed.

The end users’ primary mailbox was configured to use Cached Mode, except for VDI users, who used their primary mailbox in Online Mode. Shared Mailboxes were used in Online Mode due to their size (Outlook 2010, so no sync slider).

Analysis
First, the overall health of the Exchange environment was checked to exclude it as a potential cause. Exchange performance metrics were monitored, as well as Managed Availability status and events, logs such as the RPC Client Access (RCA) logs, and VMWare CPU Ready % to check for potential vCPU allocation issues (read: oversubscription). None of these gave any reason for concern.

To simplify troubleshooting, we reconfigured the HOSTS file to bypass the load balancer and direct traffic to a single Exchange server, but the symptoms remained. We then checked:

  • TCP/IP optimization settings, e.g. RSS, Chimney, etc.
  • VMWare VMXNet3 offloading, e.g. Large Send Offload, TCP Checksum Offloading
  • VMWare VMXNet3 buffer settings

All of these settings were also found to be at their recommended values (a quick way to verify some of them from within the guest is sketched below).
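For reference, some of the guest-side TCP settings can be checked quickly from an elevated prompt using netsh. The snippet below is a minimal Python sketch that just shells out to netsh int tcp show global and prints the lines of interest; the VMXNet3 offload and buffer settings themselves live in the adapter’s Advanced properties and are not covered by it.

    # Minimal sketch: print the guest's global TCP settings (RSS, Chimney, etc.)
    # by shelling out to netsh. Run from an elevated prompt on the Windows guest.
    import subprocess

    def show_tcp_globals():
        output = subprocess.check_output(
            ["netsh", "int", "tcp", "show", "global"], text=True)
        for line in output.splitlines():
            if any(k in line for k in ("Receive-Side Scaling", "Chimney", "Auto-Tuning")):
                print(line.strip())

    if __name__ == "__main__":
        show_tcp_globals()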

Next, we started digging in from the client’s perspective and used Wireshark to see what was going on on the wire. After filtering on the Exchange host (e.g. a display filter like ip.addr == &lt;Exchange server IP&gt;), we saw the following pattern:

[Wireshark capture of Outlook/Exchange traffic on port 80, showing a consistent 200 ms gap after the client receives its response, e.g. packets 106 and 110]

Note that this customer used SSL offloading, so mailbox access (RPC over HTTP) took place on port 80 instead of 443.

As you might notice, there is a consistent 200 ms delay after the client receives its response (e.g. packets 106 and 110). When searching around for ‘200 ms’ and ‘delay’, you will likely end up with articles describing the interplay between the Nagle algorithm and TCP Delayed ACK. Nagle is meant to reduce chatter on the wire by coalescing small packets, and Delayed ACK holds acknowledgements back for up to 200 ms; the combination can have a negative effect on near real-time, request/response style communications, especially with small packets. And while 200 ms might seem small, given the number of packets exchanged between Outlook and Exchange, it adds up quickly. Most of these articles also describe a workaround: configure the registry value TcpAckFrequency and set it to 1 (the default is 2), which effectively disables Delayed ACK. For testing purposes, we configured this value, and after the mandatory reboot the end-user Outlook experience was snappy. However, this setting impacts all client TCP communications (physical as well as VDI clients), so it is not a recommended long-term solution due to its side effects on the network.
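For completeness, this is roughly how such a test change could be scripted. The snippet below is a sketch (Python, using winreg) that sets TcpAckFrequency to 1 on every TCP/IP interface of the guest, per the documented registry location in KB 328890. It is meant for testing only; run it elevated, reboot afterwards, and delete the value again to revert to the default behaviour.

    # Test-only sketch: disable Delayed ACK by setting TcpAckFrequency=1 on every
    # TCP/IP interface of this Windows guest (registry location per KB 328890).
    # Run elevated; a reboot is required for the change to take effect.
    import winreg

    IFACES = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces"

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, IFACES) as root:
        subkey_count = winreg.QueryInfoKey(root)[0]      # number of interface GUIDs
        for i in range(subkey_count):
            guid = winreg.EnumKey(root, i)
            with winreg.OpenKey(root, guid, 0, winreg.KEY_SET_VALUE) as iface:
                winreg.SetValueEx(iface, "TcpAckFrequency", 0, winreg.REG_DWORD, 1)
                print(f"TcpAckFrequency=1 set on interface {guid}")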

After removing the registry key, the investigation continued. Since there was no issue with Exchange 2010, we started to suspect an issue with VMWare, or some form of network optimization or packet inspection. After all, the elements that changed during the migration were the VMWare vSphere version, the physical vSphere hosts and, last but not least, the client protocol: this customer didn’t use Outlook Anywhere, so RPC over HTTP was not enabled on Exchange 2010 prior to migration and clients connected using plain MAPI (RPC). Some more digging turned up potentially related articles in the VMWare knowledge base about latency issues with certain VMWare Tools versions (the VMWare guest driver set), where downgrading these to the 5.1 level would have the same effect as configuring TcpAckFrequency. Unfortunately, this wasn’t an option, as the virtual hardware level of the VMWare guests had already been upgraded past that point.

Remediation
When installing VMWare Tools, the package comes with some system-level drivers that handle communication between the guest and the host or other guests. One of these is the VMWare Guest Introspection driver set (also known as the VMCI drivers, formerly the vShield drivers). Its presence in a guest can be identified by the system drivers vnetflt and vsepflt, and it accommodates agentless antivirus solutions like McAfee MOVE. However, it also seems to interfere with certain workloads, negatively impacting near real-time communications. I wasn’t able to test whether the change from MAPI to RPC over HTTP (or, later, MAPI over HTTP) also contributed to this effect; the Introspection driver may not inspect MAPI RPC packets at all, in which case it introduces no overhead for those clients.
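If you want to check whether these components are present in a guest, the drivers are registered as kernel services and can be queried with sc.exe; the snippet below is a small Python sketch doing just that, using the driver names mentioned above.

    # Minimal sketch: check for the Guest Introspection drivers in a Windows guest
    # by querying the vnetflt (network) and vsepflt (file system) kernel services.
    import subprocess

    for driver in ("vnetflt", "vsepflt"):
        result = subprocess.run(["sc", "query", driver],
                                capture_output=True, text=True)
        state = "present" if result.returncode == 0 else "not installed"
        print(f"{driver}: {state}")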

Needless to say, disabling the Guest Introspection component might be undesirable for some organizations. In those cases, when you experience this issue, I suggest contacting your VMWare representative after verifying that your VMWare Tools version is on the list of recommended versions.

In the end, Guest Introspection was disabled in this environment and a file-level scanner was introduced (with the required exclusions, of course). Performance when accessing Online Mode mailboxes was optimal, and Exchange web services such as the Scheduling Assistant showed room planning in seconds rather than minutes.

Note that, unfortunately, recent versions of vSphere running virtualized Exchange workloads also have this issue. On the plus side, they allow for separate (de)installation of the file system driver (NSX File Introspection Driver) and the network driver (NSX Network Introspection Driver). I am pretty sure removing only the network driver would suffice, which might be a viable solution for some folks as well.

If you have any insights to share, please leave them in the comments.

VMWare HA/DRS and Exchange DAG support

Last year an (online) discussion took place between VMWare and Microsoft on the supportability of Exchange 2010 Database Availability Groups in combination with VMWare’s high availability options. The discussion started with the Exchange 2010 on VMWare Best Practices Guide and the Availability and Recovery Options documents published by VMWare. The Options document uses VMware HA with a DAG as an example and contains only a small note on the support issue. In the Best Practices Guide, you have to turn to page 64 to find, in a side note, “VMware does not currently support VMware VMotion or VMware DRS for Microsoft Cluster nodes; however, a cold migration is possible after the guest OS is shut down properly.” Much confusion arose: was an Exchange 2010 DAG supported in combination with those VMWare options or not?

In response, Microsoft clarified its support stance in a post on the Exchange Team blog, which reads, “Microsoft does not support combining Exchange high availability (DAGs) with hypervisor-based clustering, high availability, or migration solutions that will move or automatically failover mailbox servers that are members of a DAG between clustered root servers.” This meant you were on your own when performing fail-overs or switch-overs in an Exchange 2010 DAG in combination with VMWare VMotion or DRS.

You might think VMWare would be more careful when publishing these kinds of support statements. Well, to my surprise VMWare published support article 1037959 this week, “Microsoft Clustering on VMware vSphere: Guidelines for Supported Configurations”. The support table states “Yes” (i.e. supported) for Exchange 2010 DAG in combination with VMWare HA and DRS. There is no word on the restrictions that apply to those combinations, despite the reference to the Best Practices Guide; only a footnote for HA, which refers to the ability to group guests together on a VMWare host.

I wonder how many people just look at that table, skip those guides (or overlook the small notes on the support issue) and think they will run a supported configuration.

Connecting StorCenter to ESXi using iSCSI

Recently I got myself an Iomega IX2-200 StorCenter, a nice little device that will do nicely for my lab. While playing around with it, I wanted to connect it to my ESXi 4 servers using iSCSI. Yes, I’m running VMWare ESXi; the main reason is that one of my guests runs BSD, and Hyper-V doesn’t do BSD.

Below are the steps I used to utilize the StorCenter as an ESXi datastore. I’ll be using the ESXi iSCSI Software Adapter with CHAP (couldn’t get Mutual CHAP to work; anyone?), and assume networking has been properly configured. Also, in this example we’ll be using VMFS volumes for VMDK storage, not Raw Device Mappings.

Note that while taking the screenshots I discovered that a 1 GB test iSCSI target was too small (ESXi complained in the Add Storage / Select Block Size dialog), so I increased it to 16 GB using the StorCenter dashboard.

First enable iSCSI on the StorCenter. In the dashboard, select the Settings tab, click iSCSI and check Enable iSCSI. Leave iSNS discovery unchecked as ESXi doesn’t support it. Leave the option Enable two-way authentication (Mutual CHAP) unchecked.

Next, I’m going to add an iSCSI target. Select the Shared Storage tab and click Add. Change Shared Storage Type to iSCSI Drive and give it a name, e.g. esxtest. Then specify the initial size, e.g. 16 GB (you can increase this size later when required). Leave Enable security checked. Click Next. Leave all User Access set to None; access will be assigned in the next step. Click Apply.

Because I’m going to use CHAP, I need to create an account on the StorCenter for the iSCSI initiator (i.e. ESXi) to authenticate itself. Select the Users tab and click Add. Specify a username and a password; the password MUST be between 12 and 16 characters. Uncheck Administrator and Add a secured folder for this user. Click Next; when asked about group memberships, click Next again. Now specify which users have access to which folders and iSCSI drives: check the Read/Write option for the user created earlier and click Next.


In the VI client, select the host’s Configuration tab and select Storage Adapters.


Select the iSCSI Software Adapter, e.g. vmhba34, and click Properties. Click Configure and make sure iSCSI is enabled (enabling it may require a restart). Click OK to close this dialog. Before connecting to the iSCSI target, I’m going to specify the credentials first; I’ll use the global settings, so new connections will inherit them by default. To start configuring authentication, in the iSCSI Initiator Properties dialog on the General tab, click CHAP.
In the CHAP Credentials dialog, set CHAP to Use CHAP and specify the Name and Secret (i.e. password) of the user created on the StorCenter. Since I’m not using Mutual CHAP, leave that setting at Do not use CHAP. Click OK.
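If you prefer scripting these host-side steps over clicking through the VI client, the same configuration can be pushed with esxcli. The sketch below is a rough Python wrapper that runs esxcli over SSH; note that the syntax shown is the ESXi 5.x-and-later esxcli namespace (the ESXi 4.x vCLI equivalents, esxcfg-swiscsi and vicfg-iscsi, differ), and the host, adapter and credential values are placeholders, so verify the exact option names with --help on your build.

    # Hedged sketch: enable the software iSCSI initiator and configure one-way CHAP
    # by running esxcli over SSH (ESXi 5.x-style syntax; placeholder names/values).
    import subprocess

    HOST    = "root@esxi01.lab.local"          # placeholder ESXi host
    ADAPTER = "vmhba34"                        # software iSCSI adapter
    CHAP_USER, CHAP_SECRET = "esxinit", "aSecretOf12to16chars"  # StorCenter user

    def esxcli(args):
        # Run a single esxcli command on the ESXi host over SSH
        subprocess.run(["ssh", HOST, "esxcli"] + args, check=True)

    # Enable the software iSCSI initiator
    esxcli(["iscsi", "software", "set", "--enabled=true"])

    # One-way (unidirectional) CHAP on the adapter; verify option names with
    # `esxcli iscsi adapter auth chap set --help` on your build
    esxcli(["iscsi", "adapter", "auth", "chap", "set",
            "--adapter=" + ADAPTER, "--direction=uni", "--level=required",
            "--authname=" + CHAP_USER, "--secret=" + CHAP_SECRET])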


Now I’m going to connect to the iSCSI target. Being lazy, select the Dynamic Discovery tab and click Add. Specify the address of the StorCenter and click OK when done. The iSCSI server you just specified will now be added to the list of Send Targets. Click Close; when asked about rescanning the HBA, select Yes. The iSCSI target will now be listed in the View section.
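The dynamic discovery and rescan steps can be scripted in the same hedged fashion; again, esxcli 5.x-style syntax over SSH, with placeholder addresses and names.

    # Hedged sketch: add the StorCenter as a Send Targets (dynamic discovery)
    # address and rescan the adapter so the new LUN shows up. Placeholder values.
    import subprocess

    HOST       = "root@esxi01.lab.local"   # placeholder ESXi host
    ADAPTER    = "vmhba34"
    STORCENTER = "192.168.1.20:3260"       # placeholder StorCenter address:port

    def esxcli(args):
        subprocess.run(["ssh", HOST, "esxcli"] + args, check=True)

    # Add the Send Targets (dynamic discovery) address
    esxcli(["iscsi", "adapter", "discovery", "sendtarget", "add",
            "--adapter=" + ADAPTER, "--address=" + STORCENTER])

    # Rescan the adapter so the new target and LUN are detected
    esxcli(["storage", "core", "adapter", "rescan", "--adapter=" + ADAPTER])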
In the VI client, select Storage and click Add Storage. Select the Disk/LUN storage type and click Next. Select the newly added iSCSI target and click Next, then Next again. Specify the name of the datastore and click Next. Specify the block size and required capacity, and click Next.

When done, click Finish. Presto! One iSCSI VMFS datastore at your disposal.