Virtually Insane?: Optimal VM Placement

Thursday, 9 July 2009

Optimal VM Placement

UPDATE: Re-added the diagram due to a link issue; apologies to RSS subscribers.

This isn't an educational post on how many VMs to place per VMFS volume or how to plan your VMFS LUNs. It's a thought piece in response to a question Steve Chambers raised about possible ways to script the optimal placement of VMs on VMFS/RDM storage LUNs for the best performance. It got me thinking (highly dangerous, yes) about some possible answers. The main question for debate in this post is: "Should VMDK placement on LUNs really be decided by scripting logic, or evenly balanced by an algorithm, as VMware DRS does for compute?"

Storage virtualisation and the natural decoupling within the ESX architecture mean that you can build a virtual machine whose less demanding virtual disks, such as the OS disk or a partition for flat-file copies, are hosted on lower-end SATA or networked storage, while the same VM's higher-IO log and database volumes run on more capable Fibre Channel or EFD (Enterprise Flash Drive) media. This lets you obtain the dedicated IOPS the virtualised workload needs and, more importantly, lets organisations reduce cost by not using high-end storage for low-end storage demands. The diagram below hopefully provides a simplified view of this.

Optimal VMDK Placement
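To make the per-disk tiering idea concrete, here is a minimal Python sketch that assigns each virtual disk of one VM to the cheapest tier able to serve its expected IOPS. The tier names follow the text, but the relative costs, the per-VMDK IOPS ceilings and the disk figures are all hypothetical; real tier capability depends on spindle count, RAID level and array cache rather than a single number.

```python
from dataclasses import dataclass


# Illustrative storage tiers, ordered cheapest first. Relative costs and
# per-VMDK IOPS ceilings are assumptions, not vendor figures.
@dataclass(frozen=True)
class Tier:
    name: str
    relative_cost_per_gb: float
    max_iops_per_vmdk: int


TIERS = [
    Tier("SATA", 1.0, 200),
    Tier("FC", 3.0, 2000),
    Tier("EFD", 10.0, 20000),
]


@dataclass(frozen=True)
class Vmdk:
    name: str
    size_gb: int
    expected_iops: int


def cheapest_tier(disk: Vmdk, tiers: list[Tier] = TIERS) -> Tier:
    """Return the cheapest tier whose IOPS ceiling covers the disk's expected load."""
    for tier in sorted(tiers, key=lambda t: t.relative_cost_per_gb):
        if disk.expected_iops <= tier.max_iops_per_vmdk:
            return tier
    raise ValueError(f"no tier can serve {disk.name} at {disk.expected_iops} IOPS")


# One VM with three virtual disks that have very different IO profiles
# (hypothetical numbers).
vm_disks = [
    Vmdk("os.vmdk", 40, 50),
    Vmdk("db.vmdk", 200, 1500),
    Vmdk("log.vmdk", 50, 3000),
]

placement = {d.name: cheapest_tier(d).name for d in vm_disks}
tiered_cost = sum(d.size_gb * cheapest_tier(d).relative_cost_per_gb for d in vm_disks)
# Cost if every disk simply went on the most expensive (last) tier.
all_top_tier_cost = sum(d.size_gb * TIERS[-1].relative_cost_per_gb for d in vm_disks)

print(placement)  # {'os.vmdk': 'SATA', 'db.vmdk': 'FC', 'log.vmdk': 'EFD'}
print(tiered_cost, all_top_tier_cost)  # 1140.0 2900.0
```

On these made-up numbers the tiered layout costs 1140 relative units against 2900 for putting every disk on EFD, which is the cost argument above in miniature.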

It is important to place VMs optimally on storage volumes and to plan ahead for the expected workload; it is equally important to ensure that the spindle count and RAID level suit the running workload. In the longer term these factors are probably more important than right-sizing your virtual hardware. In early virtualisation projects you could happily run most VMs on a single RAID 5 set of 5 or 7 disks, mainly because those workloads were low-hanging fruit that had been heavily underutilised on their original physical platforms. More recently, the major scalability improvements in vSphere allow VMs to scale to large amounts of vCPU and RAM, which means you will almost certainly want to exploit this capability to virtualise Tier 1 applications and databases; you'd be silly not to.

Engagement work is needed at the architectural planning and design stage to obtain, jointly from your application teams and ISVs, a predicted indication of the requirements the workload will have, based on the business requirements of that application. Storage-relevant statistics, such as how many IOPS the workload needs and its expected disk read/write characteristics, are paramount when deciding where to host VMs. In most projects, however, there is an inherent problem: most virtualisation/server ops teams struggle to obtain this information from application owners or support teams unless it is readily available, as it is for off-the-shelf applications and databases such as MS Exchange or SQL Server. Bespoke applications and web services are a further issue, as the application developers or the ISV often lack the technical resources or performance information you need to factor into your design.
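Once the IOPS figure and read/write mix are known, the spindle count and RAID level can be sanity-checked with the widely used write-penalty rule of thumb: every front-end write costs 2 back-end IOs on RAID 10, 4 on RAID 5 and 6 on RAID 6. The sketch below assumes that rule, ignores array cache, and uses 180 IOPS per 15k rpm FC disk purely as an illustrative figure; the 1200 IOPS database is hypothetical.

```python
import math

# Commonly quoted RAID write penalties: back-end disk IOs per front-end write.
WRITE_PENALTY = {"RAID10": 2, "RAID5": 4, "RAID6": 6}


def backend_iops(frontend_iops: float, read_ratio: float, raid: str) -> float:
    """Back-end disk IOPS generated by a front-end workload with the given read ratio."""
    write_ratio = 1.0 - read_ratio
    return frontend_iops * (read_ratio + write_ratio * WRITE_PENALTY[raid])


def usable_iops(disks: int, per_disk_iops: float, read_ratio: float, raid: str) -> float:
    """Front-end IOPS a RAID set can sustain for a given read/write mix."""
    # backend_iops(1.0, ...) is the back-end cost of a single front-end IO.
    return disks * per_disk_iops / backend_iops(1.0, read_ratio, raid)


def spindles_needed(frontend_iops: float, per_disk_iops: float,
                    read_ratio: float, raid: str) -> int:
    """Minimum number of disks needed to serve the workload's front-end IOPS."""
    return math.ceil(backend_iops(frontend_iops, read_ratio, raid) / per_disk_iops)


FC_15K_IOPS = 180  # rule-of-thumb figure for one 15k rpm FC disk (assumption)

# The "RAID 5 set with 5 or 7 disks" from the text, for a 70% read workload.
for disks in (5, 7):
    print(disks, round(usable_iops(disks, FC_15K_IOPS, 0.7, "RAID5")))  # 5 474, then 7 663

# A hypothetical Tier 1 database: 1200 front-end IOPS at 60% read.
print(spindles_needed(1200, FC_15K_IOPS, 0.6, "RAID5"))   # 15
print(spindles_needed(1200, FC_15K_IOPS, 0.6, "RAID10"))  # 10
```

The same workload needs 15 disks on RAID 5 but 10 on RAID 10 because 40% writes are expensive on parity RAID, and a 5 or 7 disk RAID 5 set simply doesn't have the back-end IOPS for it.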

The problem is...

In most provisioning scenarios, IT Operations create VMFS LUNs and present them to ESX hosts in preparation for VM requirements. At deployment time, the virtualisation admin either puts the new VM/VMDK on a LUN that is still under an agreed maximum number of VMs per LUN (a limit set to avoid excessive HBA path thrashing), or simply puts it on whichever LUN has spare space. Unfortunately, when it comes to accommodating high-IO workloads, both approaches will at some point likely lead to performance problems caused by poor placement. Current VMware best practices and guidelines work effectively, so with prior planning of placement at both the SAN and virtualisation layers you can avoid hot spots on LUNs to a degree. However, the more you attempt to virtualise Tier 1 workloads that demand constant amounts of compute resource, the harder it will be to operate them effectively without constraining resources within the virtual estate.
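The two placement habits described above can be compared with an IO-aware alternative in a few lines of Python. This is a hypothetical sketch: the VM-per-LUN cap, the LUN figures and, above all, the assumption that you know each LUN's sustainable IOPS and each VM's expected IOPS are illustrative, and the latter is exactly the information the previous section says is hard to obtain.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Lun:
    name: str
    free_gb: int
    vm_count: int
    iops_capacity: int   # what the backing spindles can sustain (assumed known)
    iops_committed: int  # expected IOPS of the VMs already on the LUN

    @property
    def iops_headroom(self) -> int:
        return self.iops_capacity - self.iops_committed


MAX_VMS_PER_LUN = 12  # hypothetical site limit


def by_free_space(luns: list[Lun], size_gb: int, iops: int) -> Optional[Lun]:
    """Habit 1: the LUN with the most free space (ignores IO entirely)."""
    fits = [lun for lun in luns if lun.free_gb >= size_gb]
    return max(fits, key=lambda lun: lun.free_gb, default=None)


def by_vm_count(luns: list[Lun], size_gb: int, iops: int) -> Optional[Lun]:
    """Habit 2: the least populated LUN under the VM-per-LUN cap (also ignores IO)."""
    fits = [lun for lun in luns
            if lun.free_gb >= size_gb and lun.vm_count < MAX_VMS_PER_LUN]
    return min(fits, key=lambda lun: lun.vm_count, default=None)


def by_io_headroom(luns: list[Lun], size_gb: int, iops: int) -> Optional[Lun]:
    """IO-aware variant: the LUN must fit both the capacity and the expected IOPS."""
    fits = [lun for lun in luns
            if lun.free_gb >= size_gb and lun.iops_headroom >= iops]
    return max(fits, key=lambda lun: lun.iops_headroom, default=None)


luns = [
    Lun("sata_lun01", free_gb=2000, vm_count=3, iops_capacity=400, iops_committed=350),
    Lun("fc_lun07", free_gb=300, vm_count=8, iops_capacity=1500, iops_committed=500),
]

# A new VM needing 100 GB and an expected 300 IOPS.
for strategy in (by_free_space, by_vm_count, by_io_headroom):
    choice = strategy(luns, 100, 300)
    print(strategy.__name__, choice.name if choice else None)
```

Both naive habits pick the roomy but nearly saturated SATA LUN; only the IO-aware check (50 IOPS of headroom against the 300 needed) steers the VM to the FC LUN.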

FAST for VMware

One possible future route to the panacea of automated balancing is a feature of EMC's new Symmetrix V-Max high-end array called FAST (Fully Automated Storage Tiering). In a nutshell, FAST works by monitoring the storage LUNs and migrating workloads to more suitable tiers. I will spare the full detail of how the EMC solution works, as Barry Burke covers it on his EMC blog at http://tinyurl.com/cmawre. Technology similar to what the initial release of FAST will offer is also available today on the DMX as Symm Optimiser, which reports on hot and cold spots on your LUNs and rebalances them so that some workloads are prioritised over others.
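The general idea behind automated tiering, moving busy data up a tier and idle data down, can be sketched as a classification by IO density (IOPS per GB). This is not EMC's FAST or Symm Optimiser algorithm, whose internals aren't described here; the thresholds, the one-tier-per-cycle rule and the sample statistics are assumptions for illustration only.

```python
from dataclasses import dataclass

TIER_ORDER = ["SATA", "FC", "EFD"]  # slowest to fastest (illustrative)


@dataclass(frozen=True)
class LunStats:
    name: str
    tier: str
    size_gb: int
    avg_iops: float  # averaged over the monitoring window


def recommend_moves(stats, hot_iops_per_gb=1.0, cold_iops_per_gb=0.05):
    """Suggest one-tier promotions for hot LUNs and demotions for cold ones."""
    moves = []
    for s in stats:
        density = s.avg_iops / s.size_gb  # IO density: IOPS per GB
        idx = TIER_ORDER.index(s.tier)
        if density >= hot_iops_per_gb and idx < len(TIER_ORDER) - 1:
            moves.append((s.name, s.tier, TIER_ORDER[idx + 1]))
        elif density <= cold_iops_per_gb and idx > 0:
            moves.append((s.name, s.tier, TIER_ORDER[idx - 1]))
    return moves


stats = [
    LunStats("lun_db", "FC", 500, 900),       # 1.8 IOPS/GB: hot
    LunStats("lun_archive", "FC", 2000, 40),  # 0.02 IOPS/GB: cold
    LunStats("lun_web", "SATA", 1000, 300),   # 0.3 IOPS/GB: leave alone
    LunStats("lun_batch", "SATA", 200, 400),  # 2.0 IOPS/GB: hot
]

for move in recommend_moves(stats):
    print(move)
```

With these inputs it recommends promoting lun_db to EFD and lun_batch to FC, and demoting lun_archive to SATA, while lun_web stays put.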

Automated tiering that monitors utilised and underutilised VMDKs at the VMFS layer, rather than monitoring the whole LUN, would be seriously cool. Imagine your SAN array receiving a trap from the ESX host saying that a VMDK is IO-constrained and needs to be migrated onto a VMFS volume able to accommodate it, such as one on solid state disk. Or say you define a policy for a regular monthly migration that moves a VM from SATA to Fibre Channel for certain periods, such as when payroll or a batch job runs, and moves it back to SATA once the job is complete.
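The payroll example is essentially a calendar policy. A minimal sketch, assuming a hypothetical monthly window expressed as days of the month, could decide whether a VM is due to move; actually performing the migration (array-side or via Storage VMotion) is deliberately left out.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class TierPolicy:
    vm: str
    home_tier: str
    burst_tier: str
    first_day: int  # day of month the burst window opens (inclusive)
    last_day: int   # day of month the burst window closes (inclusive)


def target_tier(policy: TierPolicy, today: date) -> str:
    """Tier the VM should sit on today under its monthly policy."""
    if policy.first_day <= today.day <= policy.last_day:
        return policy.burst_tier
    return policy.home_tier


def planned_move(policy: TierPolicy, current_tier: str, today: date) -> Optional[tuple]:
    """Return (vm, from_tier, to_tier) if a migration is due, else None."""
    wanted = target_tier(policy, today)
    return None if wanted == current_tier else (policy.vm, current_tier, wanted)


# Hypothetical payroll VM: lives on SATA, runs on FC from the 25th to the 28th.
payroll = TierPolicy("payroll01", home_tier="SATA", burst_tier="FC",
                     first_day=25, last_day=28)

print(planned_move(payroll, "SATA", date(2009, 7, 10)))  # None
print(planned_move(payroll, "SATA", date(2009, 7, 25)))  # ('payroll01', 'SATA', 'FC')
print(planned_move(payroll, "FC", date(2009, 7, 29)))    # ('payroll01', 'FC', 'SATA')
```

Windows that wrap past the end of the month (for example the 30th to the 2nd) are not handled by this simple comparison.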

The Goal

The premise of automated tiering at the array level is to remove dependency on human activity within the operations teams, who today perform excessive numbers of live Storage VMotions; on a grand scale these are reactive, point solutions to problems. Another benefit of automated tiering is fewer cold migrations of VMDKs to correct placement decisions that were best-case "guesstimates". Cold means downtime, which unfortunately costs businesses of any size money and pain.

By using intelligent monitoring across both the ESX host and the storage array, and offloading resource balancing from the virtualisation stack to the array, the beefy storage array can control optimal placement. This in turn reduces the overhead imposed on the ESX host, leaving more compute resources available to the running VMs so that you can virtualise more or larger workloads.

Other options

Using Storage VMotion (SVMotion) in response to storage thresholds reported in vCenter is another option, as an interim plan until new wacky ideas and technology like automated tiering become mainstream. Within vCenter you can now set alarm thresholds for:

VM Disk Usage (KBps)
Total Disk Latency (ms)
VM Disk Aborts
VM Disk Resets

When these thresholds are met, they can all trigger alternative actions. Before you think of automatic migrations via an invoked script, it is strongly recommended that for now you Storage VMotion constrained VMs manually; you need to seriously consider the load this type of activity imposes on the VM, the ESX host and the storage array. I am no scripter, but I am sure it would be feasible to output lists of VMs that are constrained on disk IO, assess what is utilised and underutilised, and then choose which to migrate with minimal impact.
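As a starting point for that kind of report, here is a hypothetical Python sketch that takes per-VM disk statistics (assumed to have been exported from vCenter performance data by whatever means you already use), lists IO-constrained and underutilised VMs, and pairs each constrained VM with the datastore showing the lowest average latency. The thresholds and sample data are made up and, in keeping with the advice above, it only prints candidates for a human to review; it never migrates anything.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class VmDiskStats:
    vm: str
    datastore: str
    avg_latency_ms: float
    avg_kbps: float


LATENCY_CONSTRAINED_MS = 20.0  # hypothetical threshold
KBPS_UNDERUSED = 500.0         # hypothetical threshold


def classify(stats):
    """Split VMs into IO-constrained (worst first) and underutilised (quietest first)."""
    constrained = sorted(
        (s for s in stats if s.avg_latency_ms >= LATENCY_CONSTRAINED_MS),
        key=lambda s: s.avg_latency_ms, reverse=True)
    underused = sorted(
        (s for s in stats
         if s.avg_kbps <= KBPS_UNDERUSED and s.avg_latency_ms < LATENCY_CONSTRAINED_MS),
        key=lambda s: s.avg_kbps)
    return [s.vm for s in constrained], [s.vm for s in underused]


def candidate_moves(stats):
    """Pair each constrained VM with the datastore showing the lowest mean latency.

    The result is a list for a human to review; nothing is migrated, and free
    space and the extra load of the move itself are not considered.
    """
    by_ds = defaultdict(list)
    for s in stats:
        by_ds[s.datastore].append(s.avg_latency_ms)
    ds_latency = {ds: mean(values) for ds, values in by_ds.items()}
    home = {s.vm: s.datastore for s in stats}

    constrained, _ = classify(stats)
    moves = []
    for vm in constrained:
        others = {ds: lat for ds, lat in ds_latency.items() if ds != home[vm]}
        if others:
            moves.append((vm, home[vm], min(others, key=others.get)))
    return moves


stats = [
    VmDiskStats("sql01", "fc_ds01", 42.0, 9000.0),
    VmDiskStats("exch01", "fc_ds01", 25.0, 6000.0),
    VmDiskStats("file01", "sata_ds02", 8.0, 300.0),
    VmDiskStats("web01", "sata_ds02", 6.0, 1200.0),
    VmDiskStats("app01", "fc_ds03", 5.0, 2000.0),
]

print(classify(stats))         # (['sql01', 'exch01'], ['file01'])
print(candidate_moves(stats))  # [('sql01', 'fc_ds01', 'fc_ds03'), ('exch01', 'fc_ds01', 'fc_ds03')]
```

Both constrained VMs end up pointed at the same datastore; a real report would also need to weigh free space, the target's own headroom and the load of the Storage VMotion itself before anyone acts on it.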

Summary

Fail to prepare, prepare to fail is the motto for this post: you seriously need to plan and design storage for virtualised environments, for a variety of workloads, before you implement anything into production. Planning for workload requirements needs full scoping and detailed workshops with the application bods, ISVs and SIs. This may be impossible with some bespoke applications, but someone, whether a "one man and his dog" developer shop or Microsoft, will know what the workload's characteristics are. If they don't know, seriously consider the consequences of running the application within your environment and highlight the risks before it gives virtualisation a bad name!

# posted by Daniel Eason : 12:06

Comments:


this is right on the money!

# posted by Anonymous : 17 September 2009 at 15:23

Thanks...good to see someone agrees :)

After going to VMworld, I am pleased that VMware are looking at IO DRS quite seriously; I believe we will certainly need it to ensure that any workload can be virtualised.

# posted by Daniel Eason : 17 September 2009 at 16:58
