Exchange and Potential Packet Loss on VMware


Yesterday, I noticed a VMware knowledge base article, updated on November 14th, that is worth noting if you’re running Exchange (or any other application) in a virtualized environment based on VMware technology.

VMware’s KB article 2039495 mentions that in VMware ESXi 4.x and 5.x, very high traffic bursts may cause the VMXNET3 driver to start dropping packets in the guest OS. This has been observed on Windows Server 2008 R2 running Exchange 2010 with, as VMware puts it, a high number of Exchange users. What the article fails to mention is the configuration used by the customers experiencing the issue. It would, for example, be valuable to know whether a DAG was used, whether the traffic (MAPI, replication) was split over multiple NICs, or whether the issue occurred with iSCSI storage. I wouldn’t be surprised if the issue occurs in other high-traffic situations as well, e.g. seeding. Luckily, Exchange is capable of handling certain hiccups, so customers might not even be aware of the issue.

After some more digging I found another article, KB 1010071, which mentions a packet drop issue with VMware guests that has been known since ESX 3. This article explains a bit more about why the issue occurs in the first place: the network driver runs out of receive buffers, causing packets to be dropped between the virtual switch and the guest OS driver.
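To make the buffer exhaustion concrete, here is a minimal sketch of a fixed-size receive ring that is filled by bursts and drained at a steady rate. It is a toy model, not how the VMXNET3 driver is implemented; the function name, the tick-based timing, and all numbers are assumptions chosen for illustration.

```python
def simulate_ring(arrivals, ring_size, drain_per_tick):
    """Count packets dropped by a fixed-size receive ring.

    arrivals        -- packets arriving in each tick (a burst is a large value)
    ring_size       -- number of receive buffers available (hypothetical)
    drain_per_tick  -- packets the guest can process per tick (hypothetical)
    """
    queued = 0
    dropped = 0
    for burst in arrivals:
        free = ring_size - queued
        accepted = min(burst, free)
        dropped += burst - accepted      # no free buffer: packet is lost
        queued += accepted
        queued -= min(queued, drain_per_tick)  # guest empties part of the ring
    return dropped


if __name__ == "__main__":
    traffic = [100, 3000, 100, 100]      # one large burst in the second tick
    for size in (1024, 4096):
        drops = simulate_ring(traffic, ring_size=size, drain_per_tick=500)
        print(f"ring size {size}: {drops} packets dropped")
```

In this model the average load is well within what the guest can process, yet the small ring still loses packets during the burst, while the larger ring absorbs it.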

One could argue about the impact of a few lost packets. However, as traffic increases, so does the potential number of lost packets. Each lost packet results in retransmission of unacknowledged packets, which reduces overall throughput and increases latency.
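For a rough feel of how loss translates into throughput, the sketch below uses the well-known simplified approximation by Mathis et al. for steady-state TCP throughput, rate ≈ (MSS / RTT) × (C / √p). This model does not come from either KB article, it only applies to classic loss-based congestion control, and the MSS, round-trip time, and loss rates are assumed values, so treat the output as an order-of-magnitude indication only.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_seconds, loss_rate):
    """Approximate upper bound on steady-state TCP throughput in bits/s."""
    c = math.sqrt(3.0 / 2.0)  # constant of the simplified model, about 1.22
    return (mss_bytes * 8 / rtt_seconds) * (c / math.sqrt(loss_rate))


if __name__ == "__main__":
    # Assumed values: 1460-byte MSS, 1 ms round-trip time inside a datacenter.
    for loss in (0.0001, 0.001, 0.01):
        mbps = mathis_throughput_bps(1460, 0.001, loss) / 1e6
        print(f"loss rate {loss:.2%}: roughly {mbps:,.0f} Mbit/s per connection")
```

Because the bound scales with 1/√p, a hundredfold increase in the loss rate cuts the achievable rate per connection by a factor of ten in this model.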

VMware’s temporary solution to this problem is:

  1. Open up the Windows guest;
  2. Open the properties of the VMXNET3 NIC;
  3. On the Advanced tab, increase the Small Rx Buffers or Rx Ring #1 Size;
  4. What KB1010071 mentions and KB2039495 doesn’t is that when using jumbo frames (not uncommon, e.g. for replication) you might also need to adjust the Rx Ring #2 Size and Large Rx Buffers values.
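The decision logic in these steps can be sketched as a small planning helper. This is a hypothetical illustration only: it does not read or change any adapter setting, and the maximum values in `ASSUMED_MAXIMUMS` are assumptions that should be checked against what the Advanced tab of your driver version actually offers.

```python
# Assumed upper limits per setting; verify them against your VMXNET3 driver.
ASSUMED_MAXIMUMS = {
    "Small Rx Buffers": 8192,
    "Rx Ring #1 Size": 4096,
    "Large Rx Buffers": 8192,
    "Rx Ring #2 Size": 4096,
}

# Settings that only matter when jumbo frames are in use (per KB1010071).
JUMBO_ONLY = {"Large Rx Buffers", "Rx Ring #2 Size"}


def plan_buffer_changes(current, jumbo_frames=False, maximums=ASSUMED_MAXIMUMS):
    """Return {setting: (old_value, new_value)} for settings below the maximum."""
    changes = {}
    for name, maximum in maximums.items():
        if name in JUMBO_ONLY and not jumbo_frames:
            continue                     # leave jumbo-only settings untouched
        old = current.get(name)
        if old is None or old >= maximum:
            continue                     # unknown or already at the maximum
        changes[name] = (old, maximum)
    return changes


if __name__ == "__main__":
    # Example input values are made up for the demonstration.
    nic = {"Small Rx Buffers": 512, "Rx Ring #1 Size": 1024,
           "Large Rx Buffers": 768, "Rx Ring #2 Size": 32}
    for setting, (old, new) in plan_buffer_changes(nic, jumbo_frames=True).items():
        print(f"{setting}: {old} -> {new}")
```

Keeping the limits in one dictionary makes it easy to substitute the values you settle on after experimenting, as KB1010071 suggests.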

Now I say temporary, because VMware’s solution of course isn’t a real solution; it’s only meant to, in their own words, reduce packet drops. Also, the KB1010071 article states you should “determine an appropriate setting by experimenting with different buffer sizes”. That doesn’t sound like a permanent, reassuring solution for a virtualization environment running business-critical applications now, does it?

All things considered, I’d recommend configuring these parameters to their maximum setting, preferably at installation time, unless anyone knows of a reason not to. In addition, this is another case for the best practice to split MAPI and replication traffic on Exchange using multiple NICs.

Finally, I have already learnt of two other applications experiencing the issue. Therefore, I think the problem is not specific to Exchange 2010, as KB2039495 might imply. If you have similar experiences, or have seen differences between GbE and 10GbE, please use the comments to share.

Review: Exchange Data Center Switchover Tool (Updated)


Last week, the Exchange team released what they called the “Exchange 2010 datacenter switchover tool” (note that the title mentions troubleshooter). The tool could prove helpful to some and can be insightful to others.

While I applaud any effort put in to minimize risks and the possibility of human error, especially in stressful situations like data center switchovers, I do have some suggestions for improvement.

First, the name. A “tool” might imply it’s something to aid in the switchover process, while in fact it’s more of an interactive decision maker or guide that walks you through the process and can be used to practice dry runs or test formalized procedures.

That brings me to my second point, which is the format. A process like a data center switchover with all its decision moments is perhaps better translated to a flow chart rather than an interactive PowerPoint slide deck, which looks good on screen but can’t be printed. Also, a PDF or XPS might be more convenient; not everyone has PowerPoint at hand all the time, especially when working remotely on servers.

Finally, the content is taken almost directly from the original TechNet data center switchover article here, with the same questions and steps. It could perhaps be turned into a more valuable tool if it could read the environment and tailor questions based on what it discovers.

You can check out the “troubleshooter” yourself by downloading it here. Of course, this is only the first version; I suggest you leave feedback and suggestions on how to improve the tool in the accompanying article on the Exchange Team blog here.

Update October 24th: UC Architects fellow Serkan Varoglu created an Exchange Data Center Switchover workflow diagram; you can download it here.