Virtualizing SQL Server with VMware: Architecting for Performance: Storage

Date: Aug 14, 2014

This chapter from Virtualizing SQL Server with VMware: Doing IT Right first covers the key aspects of storage architecture relevant to both physical and virtual environments as well as the differences you need to understand when architecting storage, specifically for virtualized SQL Server Databases.

All aspects of architecting your SQL Server Database for performance are important. Storage is more important than most when compared to the other members of the IT Food Group family we introduced in Chapter 5, “Architecting for Performance: Design,” which consists of Disk, CPU, Memory, and Network. Our experience has shown us, and data from VMware Support validates this belief, that more than 80% of performance problems in database environments, and especially virtualized environments, are directly related to storage. Understanding the storage architecture in a virtualized environment and getting your storage architecture right will have a major impact on your database performance and the success of your SQL Server virtualization project. Bear in mind as you work through your storage architecture and this chapter that virtualization is bound by the laws of physics—it won’t fix bad code or bad database queries. However, if you have bad code and bad queries, we will make them run as fast as possible.

This chapter first covers the key aspects of storage architecture relevant to both physical and virtual environments as well as the differences you need to understand when architecting storage, specifically for virtualized SQL Server Databases. Many of the concepts we discuss will be valid for past versions of SQL Server and even the newest release, SQL Server 2014.

We provide guidance on what our experience has taught us are important database storage design principles. We present a top-down approach covering SQL Server Database and Guest OS Design, Virtual Machine Template Design, followed by VMware vSphere Hypervisor Storage Design and then down to the physical storage layers, including using server-side flash acceleration technology to increase performance and provide greater return on investment. We conclude the chapter by covering one of the biggest IT trends and its impact on SQL Server. Throughout this chapter, we give you architecture examples based on real-world projects that you can adapt for your purposes.

When designing your storage architecture for SQL Server, you need to clearly understand the requirements and have quantitative rather than subjective metrics. Our experience has taught us to make decisions based on fact and not gut feeling. You will need to benchmark and baseline your storage performance to clearly understand what is achievable from your design. Benchmarking and baselining performance are critical to your success, so we’ve dedicated an entire chapter (Chapter 10, “How to Baseline Your Physical SQL Server System”) to those topics. In this chapter, we discuss some of the important storage system component performance aspects that will feed into your benchmarking and baselining activities.

The Five Key Principles of Database Storage Design

When architecting storage for SQL Server, it’s important to understand a few important principles. These will help guide your design decisions and help you achieve acceptable performance both now and in the future. These principles are important because over the past decade, CPU performance has increased at a much faster pace than storage performance, even while capacity has exploded.

Principle 1: Your database is just an extension of your storage

The first principle is highlighted in Figure 6.1: that your database is just an extension of your storage. A database is designed to efficiently and quickly organize, retrieve, and process large quantities of data to and from storage. So increasing the parallelism of access to storage resources at low latency will be an important goal. Later in this chapter, we cover how to optimize the architecture of your database to maximize its storage performance and parallelism. When you understand this principle, it’s easy to understand why getting your storage design and performance is so critical to the success of your SQL Server Database virtualization project.

Figure 6.1 Quote from Michael Webster, VMworld 2012

Principle 2: Performance is more than underlying storage devices

The next key principle is that storage performance is more than just about underlying storage devices and spindles, although they are very important too. SQL Server storage performance is multidimensional and is tightly coupled with a number of different system components, such as the number of data files allocated to the database, the number of allocated vCPUs, and the amount of memory allocated to the database. This is why we like to use the term “IT Food Groups,” because it is so important to feed your database the right balance of these critical resources. This interplay between resources such as CPU, Memory, and Network and their impact on storage architecture and performance will be covered in subsequent sections of this chapter.

Principle 3: Size for performance before capacity

Figure 6.2 is loosely based on the eighteenth-century quote “The bitterness of poor quality remains long after the sweetness of low price is forgotten,” by Benjamin Franklin. Both quotes are extremely relevant to SQL Server database and storage performance.

Figure 6.2 Quote from Michael Webster, VMworld 2013

This brings us to the next key principle. In order to prevent poor performance from being a factor in your SQL Server virtualization project (refer to Figure 6.2), you should design storage for performance first (IOPS and latency), then capacity will take care of itself. Capacity is the easy part. We will show you later in this chapter how compromising on certain storage configurations on the surface can actually cost you a lot more by causing unusable capacity due to poor performance.

CAUTION

A lesson from the field: We were working with a customer, and they wanted to design and run a database on vSphere that could support sustained 20,000 IOPS. After we worked with the customer’s vSphere, SAN, Network, and DBA teams, the customer decided to move forward with the project. The customer then called in a panic saying, “In our load test, we achieved 1,000 IOPS. We are 19,000 short of where we need to be.” Trust me, this is a phone call you don’t want to get. Playing the odds, we started with the disk subsystem. We quickly identified some issues. The main issue was the customer purchased for capacity, not performance. They had to reorder the right disk. Once the new (right) disk arrived and was configured, the customer exceeded the 20,000 IOPS requirement.

Principle 4: Virtualize, but without compromise

The next principle is that virtualizing business-critical SQL Server databases is all about reducing risk and not compromising on SLAs. Virtualize, but without compromise. There is no need to compromise on predictability of performance, quality of service, availability, manageability, or response times. Your storage architecture plays a big part in ensuring your SQL databases will perform as expected. As we said earlier, your database is just an extension of your storage. We will show you how to optimize your storage design for manageability without compromising its performance.

Believe it or not, as big of advocates as we are about virtualizing SQL Server, we have told customers in meetings that now is not the right time for this database to be virtualized. This has nothing to do with the capability of vSphere or virtualization, but more to do with the ability of the organization to properly operate critical SQL systems and virtualize them successfully, or because they are not able or willing to invest appropriately to make the project a success. If you aren’t willing to take a methodical and careful approach to virtualization projects for business-critical applications, in a way that increases the chances of success, then it’s not worth doing. Understand, document, and ensure requirements can be met through good design and followed by testing and validation. It is worth doing, and it is worth “Doing It Right!”

Principle 5: Keep it standardized and simple (KISS)

This brings us to the final principle. Having a standardized and simplified design will allow your environment and databases to be more manageable as the numbers scale while maintaining acceptable performance (see Principle 4). If you have a small number of standardized templates that fit the majority of your database requirements and follow a building-block approach, this is very easy to scale and easy for your database administrators to manage. We’ll use the KISS principle (Keep It Standardized and Simple) throughout this chapter, even as we dive into the details. Once you’ve made a design decision, you should standardize on that decision across all your VM templates. Then when you build from those templates, you’ll know that the settings will always be applied.

SQL Server Database and Guest OS Storage Design

The starting point for any storage architecture for SQL Server Databases is actually with our last design principle: KISS (Keep It Standardized and Simple). But all of the principles apply. We will determine the smallest number of templates that are required to virtualize the majority (95%) of database systems, and anything that falls outside this will be handled as an exception.

Your first step is to analyze the inventory of the SQL Server Databases that will be virtualized as part of your project (refer to Chapter 4, “Virtualizing SQL Server 2012: Doing It Right”). From this inventory, you will now put each database and server into a group with similar-sized databases that have similar requirements. The storage requirements for all of these existing and new databases, based on their grouping, will be used to define the storage layouts and architecture for each of the SQL Server Databases, Guest OS, and VM template.

TIP

If you are virtualizing existing databases, you might consider using a tool such as VMware Capacity Planner, VMware Application Dependency Planner, Microsoft System Center, or Microsoft Assessment and Planning Toolkit to produce the inventory. VMware Capacity Planner and Application Dependency Planner are available from VMware Professional Services or your preferred VMware partner. When you’re baselining a SQL Server database, a lot can happen in a minute. We recommend your sample period for CPU, Memory, and Disk be 15 seconds or less. We recommend you sample T-SQL every minute.

SQL Server Database File Layout

Database file layout provides an important component of database storage performance. If you have existing databases that will be virtualized, you or your DBAs will likely have already developed some practices around the number of database files, the size of database files, and the database file layout on the file system. If you don’t have these practices already in place, here we provide you with some guidelines to start with that have proven successful.

Your SQL Server database has three primary types of files you need to consider when architecting your storage to ensure optimal performance: data files, transaction log files, and Temp DB files. Temp DB is a special system database used in certain key operations, and has a big performance impact on your overall system. The file extensions you’ll see are .mdf (master data file), .ndf (for secondary data files), and .ldf for transaction log files. We will go over all of these different file types later in this chapter.

Number of Database Files

First, we need to determine the number of database files. There are two main drivers for the number of files you will specify. The first driver is the number of vCPUs allocated to the database, and the second is the total capacity required for the database now and in the future.

Two design principles come into play here: The parallelism of access to storage should be maximized by having multiple database files, and storage performance is more than just the underlying devices. In the case of data files and Temp DB files, they are related to the number of CPU cores allocated to your database. Table 6.1 provides recommendations from Microsoft and the authors in relation to file type.

Table 6.1 Number of Data Files and Temp DB Files Per CPU

File Type	Microsoft Recommended Setting	Author Recommended Setting
Temp DB Data File	1 per CPU core	< 8 vCPU, 1 per vCPU > 8 vCPU, 8 total (increase number of files in increments of four at a time if required) Max 32
Database Data File	0.25 to 1.0 per file group, per CPU core	Min 1 per vCPU, max 32
Database Transaction Log File	1	1^*
Temp DB Transaction Log File	1	1^*

Microsoft recommends as a best practice that you should configure one Temp DB data file per CPU core and 0.25 to 1 data file (per file group) per CPU core. Based on our experience, our recommendation is slightly different.

If your database is allocated eight or fewer vCPUs as a starting point, we recommend you should configure at least one Temp DB file per vCPU. If your database is allocated more than eight vCPUs, we recommend you start with eight Temp DB files and increase by lots of four in the case of performance bottlenecks or capacity dictates.

We recommend in all cases you configure at least one data file (per file group) per vCPU. We recommend a maximum of 32 files for Temp DB or per file group for database files because you’ll start to see diminishing performance returns with large numbers of database files over and above 16 files. Insufficient number of data files can lead to many writer processes queuing to update GAM pages. This is known as GAM page contention. The Global Allocation Map (GAM) tracks which extents have been allocated in each file. GAM contention would manifest in high PageLatch wait times. For extremely large databases into the many tens of TB, 32 files of each type should be sufficient.

Updates to GAM pages must be serialized to preserve consistency; therefore, the optimal way to scale and avoid GAM page contention is to design sufficient data files and ensure all data files are the same size and have the same amount of data. This ensures that GAM page updates are equally balanced across data files. Generally, 16 data files for tempdb and user databases is sufficient. For Very Large Database (VLDB) scenarios, up to 32 can be considered. See http://blogs.msdn.com/b/sqlserverstorageengine/archive/2009/01/04/what-is-allocation-bottleneck.aspx.

If you expect your database to grow significantly long term, we would recommend that you consider configuring more data files up front. The reason we specify at least one file per CPU is to increase the parallelism of access from CPU to data files, which will reduce any unnecessary data access bottlenecks and lower latency. This also allows for even data growth, which will reduce IO hotspots.

CAUTION

Having too few or too many Temp DB files can impact the overall performance of your database. Our guidance is conservative and aimed to meet the requirements for the majority of SQL systems. If you start to see performance problems such as higher than normal query response times or excessive database waits in PAGELATCH_XX, then you have contention in memory and may need to increase the number of Temp DB files further and/or implement trace flag 1118 (which we recommend), which prevents single page allocations. If you see waits in PAGEIOLATCH_XX, then the contention is at the IO subsystem level. Refer to http://www.sqlskills.com/blogs/paul/a-sql-server-dba-myth-a-day-1230-Temp DB-should-always-have-one-data-file-per-processor-core/ and Microsoft KB 328551 (http://support.microsoft.com/kb/328551).

Figure 6.3 shows an example of data files, Temp DB files, and transaction log files allocated to a SQL Server 2012 Database on a sample system with four vCPU and 32GB RAM.

Figure 6.3 SQL Database data file allocation.

Size of Database Files

Let’s start the discussion on data file sizes with some fundamentals that are important to understand. Data files, unlike transaction log files, are accessed in parallel and the IO pattern is more random. Temp DB files are accessed in parallel in a round-robin fashion. This is why having more database files improves the parallelism of IO access to storage. In effect, the IO is striped across the data files.

It is important to configure your database files to be equal size to start with. SQL Server will write data to the available files evenly if they are the same size, as it uses a proportional fill algorithm that favors allocations in files with more free space. If the files are the same size, then each file will have the same free space. Having equally sized files ensures even growth and more predictable performance.

The next important point is that you should preallocate all your data files and transaction log files. This will eliminate the need for the database to constantly grow the files and resize them, which will degrade performance and put more stress on your storage platform. The files can’t be accessed for the period of time they are being extended, and this will introduce avoidable latency.

It is a Microsoft best practice and our recommendation to manually and proactively manage file sizes. Because you are presizing and proactively managing your database files, you shouldn’t need to rely on Auto Grow as much. Even though it may not be needed, we recommend that Auto Grow be left active as a safety net.

TIP

Auto Grow should be set to grow at the same or a multiple of the underlying storage system block size. In VMware environments, the block size on data stores will be between 1MB and 8MB. Your Database Auto Grow size should be set similarly, or at a multiple of this. Auto Grow should not be configured for unrestricted growth; it should be limited to less than the size of the underlying file system, taking into consideration the size of any other files on the file system. See VMware KB 1003565.

If you are unsure what your underlying block size is, set Auto Grow to a multiple of 1MB. To prevent Auto Grow from being active too often, consider configuring it to grow at around 10% of your initial database size rounded up to the nearest 1MB (or block size), up to a maximum of 4GB. In most cases, an Auto Grow amount of 256MB to 512MB should be sufficient. This will ensure the grow operation doesn’t take too long and is aligned to the underlying storage subsystem.

CAUTION

Because Auto Grow will by default zero out all the blocks and prevent access to the files during that period, you don’t want the operation to take too long. You also don’t want these operations to happen too frequently. Therefore, the Auto Grow size needs to be small enough that it completes in a reasonable time but not too small as to require constant growth. The database file sizing guidelines need to be adjusted based on the performance in terms of throughput of your storage and the workload behavior of your database. If you are proactively managing the size of your database files, then Auto Grow should not be kicking in at all and this shouldn’t be a concern.

TIP

By default, Auto Grow operations will expand one file at a time. This will impact the proportional fill algorithm and could result in degraded performance and storage hot spots. To avoid this behavior, you can use trace flag 1117 by specifying startup option –T1117 or by using the DBCC TRACEON command. By using this trace flag, you will ensure that each file is grown by the same amount at the same time. This trace flag is set by default when installing SAP on SQL Server 2012. Refer to SAP Note 1238993 and http://www.ciosummits.com/media/pdf/solution_spotlight/SQL%20Server%202012%20Technologies%20for%20SAP%20Solutions.pdf.

Now that we’ve covered the fundamentals, we can calculate the initial size of the database files. The initial files sizes are fairly easy to determine if you’re migrating an existing system—in which case, we recommend you preset your files to be the same size as the system that is being migrated, which would be the case if you are doing a standard physical-to-virtual migration. If this is a new database being virtualized, you will need to estimate the database files’ initial size.

Data File Sizing

For data files, the preset size you should use is based on the estimated or actual size of your database. You should allow for reasonable estimated growth (three to six months). Once you have the total estimated size of your database, including growth, divide that by the number of files to get the size of each file. For example, if you had a database 200GB in size with four vCPUs configured, you would have four data files, assuming one file per vCPU, with a preset size of 50GB each. Each data file should always be of equal size and be extended at the same rate.

Temp DB File Sizing

The size of your Temp DB files should be based on the high watermark usage you estimate for your queries and the overall size of your database. This can be hard to estimate without knowledge of your workload because different queries will impact your Temp DB usage in different ways. The best way to determine the appropriate size will be to monitor Temp DB usage during a proof of concept test, or benchmarking and baselining activities.

As a starting point, we recommend you consider sizing Temp DB to 1% the size of your database. Each file would then be equal to Total size of Temp DB divided by the number of files. For example, if you had a 100GB database with four vCPUs configured, you would have an initial total Temp DB size of 1GB, and each Temp DB data file would be 250MB in size. If you see significantly more Temp DB use during ongoing operations, you should adjust the preset size of your files.

Transaction Log File Sizing

The total size that your database transaction log file should be preset to will primarily be based on the actual or estimated high water mark of transaction storage required before the next backup or transaction log truncation. We are assuming for the purposes of this section that you care about data protection and preventing data loss of your database and are therefore using the full recovery model. Data loss is a risk when using the other available recovery models.

If you are doing daily backups, you will need to ensure that your log file is sufficiently sized to allow up to at least a day’s worth of transactions. This will allow you to recover back to the point in time your database goes down by using the last backup and replaying the transaction logs. In some large database systems, you will need to back up the transaction logs much more frequently than every day.

When it comes to storage performance and sizing of your transaction log, the total size and how fast you can write transactions to it are important but are not the only considerations. You must also consider the performance of file growth, DB restart, and backup and recovery operations. With this in mind, it is critical that not only is the total size of your transaction log appropriate, but also how you grow your transaction log to that size. The reason this is so critical is that in SQL Server, even though your transaction log may be one physical file, it’s not one physical transaction log.

Your one physical transaction log is actually made up of a number of smaller units called Virtual Log Files (VLFs). VLFs are written to sequentially, and when one VLF is filled, SQL Server will begin writing to the next. They play a critical part in the performance of database backup and recovery operations.

The number of VLFs is determined at the time a file is created or extended by the initial size allocated to the transaction log and the growth amount “chunk” each time it is increased in size. If you leave the default settings with a large database, you can quickly find yourself with tens if not hundreds of thousands of VLFs, and this will cause a negative performance impact. This is why the process of preallocating the transaction log file and growing it by the right amount is so important.

If the VLFs are too small, your maintenance, reboots, and database recovery operations will be excruciatingly slow. If your VLFs are too big, your log backups and clearing inactive logs will be excruciatingly slow and may impact production performance. The reason for the former is that SQL Server must load the list of VLFs into memory and determine the state of each, either active or inactive, when doing a DB restart or recovery. The latter is because a VLF can’t be cleared until the SQL Server moves onto the next one.

As you can see from Table 6.2, if you create or grow a transaction log file by 64MB or less at a time, you will get four VLFs each time. If you need 200GB of transaction log, and it is created or grown by this amount, you end up with 12,800 VLFs, with each VLF being 16MB. At or before this point, you’d start to notice performance problems.

Table 6.2 Number of VLFs Allocated per Chunk Size

Chunk Size	Number of VLFs
<= 64MB	4
> 64MB and <=1GB	8
> 1GB	16

Let’s take a look at another example: Suppose you have a 128GB log file created as 128GB to begin with. This file will have 16 VLFs, and each VLF will be 8GB. This means that each VLF can only be cleared at more than 8GB and when completely inactive. The process of clearing the log segment will likely have a direct impact on the performance of the database.

To avoid the performance problems covered previously, you should ensure your VLF size is between 256MB and 512MB. This will guarantee that even if your transaction log were to reach the maximum size of 2TB, it will not contain more than 10,000 VLFs. To achieve this, you can preset your log file to either 4GB or 8GB and grow it (either manually or with Auto Grow) by the same amount each time. If we take the example of the 128GB transaction log, you would initially create a 8GB log file and then grow it by 8GB fifteen times. This will leave you with the 128GB log file and 256 VLFs within that log file, at 512MB each. You should set your transaction log file Auto Grow size to be the same as whatever growth increment you have decided upon.

CAUTION

There is a bug when growing log files by multiples of exactly 4GB that affects SQL Server 2012. If you attempt to grow the log by a multiple of 4GB, the first attempt will fail to extend the file by the amount specified (you might see 1MB added), but will create more VLFs. The second or subsequent attempt will succeed in growing the file by the specified amount. This bug is fixed in SQL Server 2012 SP1. As a workaround, if you are still using SQL Server 2012, you should grow in increments of 4,000MB or 8,000MB rather than 4GB or 8GB. See http://www.sqlskills.com/blogs/paul/bug-log-file-growth-broken-for-multiples-of-4gb/.

Even if your database were relatively small, we would recommend that you start with a 4GB or 8GB (4,000MB or 8,000MB) transaction log file size. You should proactively and manually manage the size of your transaction log. Proactive management will avoid Auto Grow kicking in during production periods, which will impact performance. This is especially important when considering the transaction log will be growing at 4GB or 8GB at a time and having all those blocks zeroed out. However, just as with data files and Temp DB files, you should have Auto Grow enabled as a safety net and set it to either 4GB or 8GB, depending on the growth size you have selected.

Instant File Initialization

When a database file is created or extended, SQL Server will by default zero out the newly created file space. This will cause performance to degrade if it occurs during periods of intense database write activity, which is most likely if database files are not proactively managed and Auto Grow is extending the files. There is also the length of time required to write zeros to all the new blocks during which access to the file is blocked, as mentioned previously. To greatly improve the speed and reduce performance impacts of file growth operations, you can configure SQL Server to instantly initialize the database files without zeroing them out.

To allow your database instance to leverage Instant File Initialization (IFI), you need to add the SQL Server Service account to the Perform Volume Maintenance Tasks security policy (using Group Policy Editor gpedit.msc or Local Group Policy Editor secpol.msc), as shown in Figure 6.4. We strongly recommend this setting be applied in group policy, especially when AlwaysOn Availability Groups are used or when there are multiple databases that will have this setting enabled.

Figure 6.4 Enabling the Perform Volumes Maintenance Tasks security policy.

After you have made this change, you will need to restart your SQL Server services for it to take effect. We recommend you make this setting a standard for all your SQL Server databases and include it in your base template.

NOTE

Instant File Initialization (IFI) is only used for data files and Temp DB files. Even when IFI is configured and active, it will not be used for transaction log files. Transaction log files will continue to zero out every block when they are created or extended. This is due to the internal structure of the transaction log file and how it is used for data protection and recovery operations. This makes it even more important for you to proactively manage your transaction log files to prevent any avoidable performance impacts.

CAUTION

Instant File Initialization is not available when you’re using Transparent Data Encryption (TDE) or when trace flag 1806 is set (which disables Instant File Initialization). Although IFI has a positive impact on performance, there are security considerations. For this reason, using IFI may not be suitable for all environments. See http://technet.microsoft.com/en-us/library/ms175935(v=sql.105).aspx. Based on our experience, in most environments, the highlighted security considerations can be addressed by proper controls and good system administration practice.

SQL Server File System Layout

We have covered how you determine how many files your databases need and how big each file should be preallocated. We now need to assign these files to the appropriate locations on the file system and configure the file system within the Windows operating system to achieve the best possible performance. The file system layout may differ slightly between database instances that are standalone, that use AlwaysOn Failover Clustering, or that use AlwaysOn Availability Groups. We will discuss the possible differences and give you recommendations based on our experience to help you achieve performance that can be built in to your standardized base templates.

OS, Application Binaries, and Page File

The OS, application binaries, and page file should be separated from the data, Temp DB, and log files in a SQL Server database. These components of the system generally produce little IO, but we don’t want any IO interfering with productive IO from the database. For this reason, we recommend that OS, application binaries, and the page file be on a separate drive (or drives) and IO controller from data files, log files, and Temp DB.

From a database storage performance perspective, any paging is bad and should be avoided. Details of the page file and SQL Server memory configuration will be covered in Chapter 7, “Architecting for Performance: Memory.” Chapter 7 will show you how to avoid paging and optimize performance from the memory configuration of your SQL Server.

File System Layout for Data Files, Log Files, and Temp DB

When considering the design of the file system layout for data files, log files, and Temp DB, our objectives are as follows:

Optimize parallelism of IO (Principle 1).
Isolate different types of IO from each other that may otherwise cause a bottleneck or additional latency, such as OS and page file IO from database IO, or sequential log file IO from random data file IO.
Minimize management overheads by using the minimum number of drive letters or mount points required to achieve acceptable performance (Principle 5).

In order to achieve objectives 1 and 2, we recommend splitting out data files and Temp DB files from log files onto separate drive letters or mount points. This has the effect of killing two birds with one stone. By separating log files into their own drive or mount point, you maintain the sequential nature of their IO access pattern and can optimize this further at the hypervisor and physical storage layer later if necessary. If the log files share a drive or mount point, the access pattern of that device will instantly become random. Random IO is generally harder for storage devices to service. At the same time, you are able to increase the parallelism needed for the IO patterns of the data files and Temp DB files.

To achieve greater IO parallelism at the database and operating system layer, you need to allocate more drive letters or mount points. The reason for this is that each storage device (mount point or drive) in Windows has a certain queue depth, depending on the underlying IO controller type being used. Optimizing the total number of queues available to the database by using multiple drives or mount points allows more commands to be issued to the underlying storage devices in parallel. We will discuss the different IO controllers and queue depths in detail later.

As a starting point for standalone database instances, we recommend that you configure a drive letter or mount point per two data files and one Temp DB file. This recommendation assumes each file will not require the maximum performance capability of the storage device at the same time. The actual number of drive letters or mount points you need will be driven by your actual database workload. But by having fewer drives and mount points will simplify your design and make it easier to manage. The more users, connections, and queries, the higher the IO requirements will be, and the higher the queue depth and parallelism requirements will be, and the more drive letters and mount points you will need.

The example in Figure 6.5 illustrates how you might arrange your database files for a standalone instance. If you start to see IO contention and your database is growing (or is expected to grow) very large or makes a lot of use of Temp DB, then you may wish to separate out Temp DB files onto their own drive letters or mount points. This would remove the chance of Temp DB IO activity impacting the IO activity of your other data files and allow you to put Temp DB onto a separate IO controller (point 2 of our file system layout objectives).

Figure 6.5 Sample SQL Server file system layout—Temp DB with data files.

Having a single Temp DB file on the same drive with two data files in general will balance the IO activity patterns and achieve acceptable performance without an excessive number of drives to manage. The reason for this layout is more likely on a standalone instance instead of with a clustered instance, which will become clear on the next page.

TIP

You should size each drive letter or mount point so that the preallocated database files on it consume no more than 80% of the available capacity. When you need to grow the capacity of your database, you have the option of either extending the existing drives or mount points or adding in more. These operations can be done online without any disruption to your running database. Auto Grow should be configured so that in the worst-case scenario, the maximum growth of all the files on the drive or mount point combined will never exceed the total capacity.

In the example in Figure 6.6, we have split out the Temp DB files onto separate drive letters from the data files of the production database. If you have a very large database or your database will have heavy IO demands on Temp DB, it makes sense to split it out onto its own drives and a separate IO controller.

Figure 6.6 Sample SQL Server file system layout—data files separate from Temp DB.

In databases that make extremely heavy use of Temp DB, such as peaking at more than 50% of total database size, it might make sense for each Temp DB file to be on its own drive or mount point to allow each file access to more parallel IO resources. This assumes that the underlying storage infrastructure can deliver more IO in parallel, which we will cover later in this chapter.

In an AlwaysOn Failover Cluster Instance, an additional reason to separate Temp DB onto different drives or mount points from other data files is that it can be hosted locally to the cluster node, rather than on the shared storage. This makes a lot of sense given that the Temp DB data doesn’t survive instance restarts. This allows you to optimize the performance of Temp DB without impacting the data files and log files that are shared between cluster nodes. If you have extreme Temp DB IO requirements, you could consider locating it on local flash storage, but consider that this would prevent the guest restarting in a VMware HA event. In this case, the cluster node would be unavailable if the local flash storage failed, which would trigger a failover to another node. This is a new feature available with SQL Server 2012 AlwaysOn that wasn’t previously available (see http://technet.microsoft.com/en-us/sqlserver/gg508768.aspx). More details about AlwaysOn Availability Groups and Failover Cluster Instances are provided in Chapter 9, “Architecting for Availability: Choosing the Right Solutions.”

NTFS File System Allocation Unit Size

Now that we have covered the SQL Server database layout on the file system, we need to cover another important aspect of the database file system design: the NTFS Allocation Unit Size (also known as Cluster Size). When you format a drive or mount point in Windows, you have the option of choosing a different NTFS Allocation Unit Size from the default (4KB in most cases). The NTFS Allocation Unit Size is important because it’s the smallest amount of disk space that can be used to hold a file. If a file doesn’t use the entire Allocation Unit, additional space will be consumed.

Having a small (default) Allocation Unit Size means there are many more times the number of blocks at the file system level that need to be managed by the operating system. For file systems that hold thousands or millions of small files, this is fine because there is a lot of space savings by having a smaller Allocation Unit in this scenario. But for a SQL Server database that consists of very few, very large files, having a much larger Allocation Unit is much more efficient from a file system, operating system management, and performance perspective.

For the OS and Application Binary drive, keeping the default of 4KB Allocation Unit is recommended. There is no benefit in changing from the default. If your page file is on a separate drive from the OS, you should use a 64KB Allocation Unit size. For all SQL Server database drives and mount points (data files, log files, and Temp DB files), we recommend you use 64KB as your Allocation Unit Size setting (see Figure 6.7).

Figure 6.7 NTFS Allocation Unit Size.

Partition Alignment

Each storage device reads and writes data at different underlying block sizes. A block on a storage device is the least amount of data that is read from or written to with each storage option. If your file system partition is not aligned to the underlying blocks on the storage device, you get a situation called Split IO in which multiple storage operations are required to service a single operation from your application and operating system. Split IOs reduce the available storage performance for productive IO operations, and this gets even worse when RAID is involved, due to the penalty of certain operations, which we’ll cover later in this chapter.

Figure 6.8 shows what would be considered a worst-case scenario, where the file system partition and the VMware vSphere VMFS partition are misaligned. In this case, for every three backend IOs, you get one productive IO. This could have the effect of causing each IO operation 3X latency, which is like getting 30% performance from your 100% storage investment. Fortunately, with Windows 2008 and above and with VMFS volumes that are created through VMware vCenter, this problem is much less likely.

Figure 6.8 File system and storage that is not correctly aligned.

Starting with Windows 2008, all partitions are aligned to the 1MB boundary. This means in almost all cases, they will be aligned correctly. The same is true with VMFS5 partitions created through VMware vCenter. They will align to the 1MB boundary. However, if you have an environment that has been upgraded over time, you may still have volumes that are not correctly aligned. The easiest way to check is to monitor for Split IOs in both ESXTOP or in Windows Performance Monitor.

Figure 6.9 shows reading of one frontend block will require only one backend IO operation, thus providing lower latency and higher IO performance.

Figure 6.9 File system and storage that is aligned.

SQL Server Buffer Pool Impact on Storage Performance

The Buffer Pool is a critical region of memory used in SQL Server, and it has a large impact on storage performance. The important thing to note from a storage performance perspective is that a larger Buffer Pool produces less read IO on your storage and lower transaction latency at your database. The Buffer Pool is a big read cache for your database. If you size it incorrectly or if the Buffer Pool is paged out by the operating system, you will start to experience performance degradations and a large amount of additional read IO hitting your storage. How the Buffer Pool is covered in detail in Chapter 7, including how to avoid Windows paging out the Buffer Pool when virtualizing business critical databases.

TIP

There is a direct tradeoff between allocating more memory to the SQL Server Database and its Buffer Pool to reduce read IO and allocating less memory and having more read IO. For your design, you need to consider which resource will be more of a constraint and a cost. In some situations, more memory for your database and for your vSphere hosts could be cheaper than purchasing more performance via your storage systems. However, server-side flash, which could be thought of as cheap memory rather than expensive storage, combined with smart software is impacting the economics of this equitation. We will discuss in more detail later in this chapter how using flash storage local to the server can allow you to consolidate more databases per host with less memory per database without degrading performance to an unacceptable level.

Updating Database Statistics

The SQL Server Query Plan Optimizer uses statistics compiled from tables to try and estimate the lowest cost execution path for a given query. By default, statistics are updated automatically at defined thresholds (refer to http://msdn.microsoft.com/en-us/library/dd535534%28v=sql.100%29.aspx), such as when 20% of a table changes since statistics were last gathered.

The Query Optimizer’s cost-based algorithm takes into account system resources such as CPU and IO to calculate the most efficient query, and overall table size and distribution of data. For example, it is better to join a three-row table to a million-row table, than to join a million-row table to a three-row table.

The cost to performance if the statistics are outdated and the impact on your storage can be high. Outdated statistics cause suboptimal query execution paths that can result in many more full table scans and therefore higher IO than would otherwise be required. For large databases that have hundreds of millions or billions of rows in a particular table, which can be common with SAP systems, the impact can be very severe. Therefore, it is important that you have up-to-date statistics.

CAUTION

When you upgrade an existing database to SQL Server 2012, the statistics may become out of date and result in degraded performance. To avoid this, we recommend you update statistics immediately after upgrading your database. To do this manually, you can execute sp_updatestats. Refer to http://www.confio.com/logicalread/sql-server-post-upgrade-poor-query-performance-w02/, which contains an excerpt from Professional Microsoft SQL Server 2012 Administration, published by John Wiley & Sons.

There are two primary methods to deal with the problem of outdated statistics impacting your database and storage IO performance.

Trace Flag 2371—Dynamic Threshold for Automatic Statistics Update

The first method involves using trace flag 2371 by setting startup option –T2371 or DBCC TRACEON (2371, -1). This is documented in Microsoft KB 2754171 (http://support.microsoft.com/kb/2754171). This trace flag tells SQL Server to dynamically change the percentage a table needs to change before the statistics are automatically updated. In very large tables, an automatic update of statistics can now be triggered by a change of less than 1%. Using this option could result in significantly improved performance for situations where you have very large tables.

CAUTION

Database statistics are complied against each table in your database. When SQL Server updates statistics, this information is recompiled. Automatic statistics update and trace flag 2371 may cause statistics to be updated more frequently than necessary. So there is a tradeoff between the performance benefit of doing statistics updates regularly and the cost of recompiling the statistics. The cost of doing this operation is not free, and in rare cases it can have a detrimental impact on performance. If you find any performance problems correlating to the periods of time where statistics are being updated, then you may wish to control when statistics updates occur. For the majority of customers we deal with, around 80% experience positive performance improvements and no detrimental impact by using the dynamic automatic updates for database statistics. Refer to http://technet.microsoft.com/en-us/library/ms187348.aspx.

Updating Database Statistics Using a Maintenance Plan

The second method for addressing out-of-date statistics is by using a maintenance plan. If you need more control over when database statistics updates occur, you can schedule a maintenance plan task for your databases. Ideally, the maintenance plan would be scheduled to happen when it would have the least impact on the database, and run only as frequently as needed. To determine when and how often it should run requires you to know your database workload patterns and to monitor query plan execution efficiency. Depending on your database, you may wish to schedule it to initially happen daily and adjust the schedule based on observed performance. Figure 6.10 shows the Update Statistics option in the Maintenance Plan Wizard. A full step-by-step example is provided in Chapter 11.

Figure 6.10 Maintenance Plan Wizard’s Statistics Update option.

Data Compression and Column Storage

Data Compression and Column Storage (also known as xVelocity memory optimized column store indexes) are features available only with SQL Server Enterprise Edition. They are not available in other editions. (See http://technet.microsoft.com/en-us/library/cc645993.aspx for a list of which features are supported in which SQL Server editions.) If you are licensed and using SQL Server Enterprise Edition, we would recommend you make use of these features where appropriate.

Data Compression

Data Compression was originally introduced in SQL Server 2008 and has improved markedly in 2012. One of the most important things to understand about Data Compression is that it’s not just about space savings, although the savings can be significant. Using Data Compression can also have a very positive impact on storage performance and Buffer Pool usage by reducing the number of IOPS and allowing the database to store more pages in memory in the Buffer Pool in compressed form. Using compression can also dramatically reduce query execution time, as fewer pages need to be read from cache or disk and analyzed for a given query.

In SQL Server 2012, you can choose to compress either a table or index using row or page compression. By default, when you choose page compression, it automatically does row compression at the same time. Based on our experience, space savings and performance improvements of up to 75% with SQL Server 2012 can be achieved in many cases. Data Compression can be used with both OLTP and OLAP type workloads, including Data Warehouse and Batch.

CAUTION

Data Compression introduces a CPU overhead and may increase CPU utilization on your database. In most cases, this overhead is outweighed by the benefit in performance you receive. In most virtualized environments, CPU performance will not be your constraint; memory and storage IO are usually the bottleneck. However, it won’t benefit every workload and is not likely suitable for small tables that change very often. The best workloads for data compression consist of large tables that are predominately read oriented. Also Data Compression can’t be enabled for system tables. Refer to http://technet.microsoft.com/en-us/library/cc280449.aspx and http://msdn.microsoft.com/en-us/library/dd894051(SQL.100).aspx.

Column Storage

Column Storage, also known as xVelocity memory optimized column store index, is a new feature of SQL Server 2012 aimed at data warehouse workloads and batch processing. Column Storage is much more space and memory efficient at storing and aggregating massive amounts of data. Leveraging this feature can greatly improve the performance of data warehouse queries. However, to use it you must make some tradeoffs.

When using Column Storage, you will not be able to use Large Pages and Lock Pages in Memory (trace flag 834) because this will increase the work the translation look-aside buffer (TLB, see Chapter 7) has to do. Also, the tables using the column store index will be read-only. Any time you need to write data to the table, you need to drop and re-create the column store index, but this can easily be done with scheduled batch jobs. For the types of workloads that Column Storage is well suited to, these tradeoffs are normally worth the benefits.

The benefits of Column Storage as documented in the link in the following tip include:

Index compression—Column Storage indexes are far smaller than their B-Tree counterparts.
Parallelism—The query algorithms are built from the ground up for parallel execution.
Optimized and smaller memory structures

From a storage perspective, the benefits of Column Storage are far less storage capacity and performance being required to achieve the desired query performance. The improvement in query performance ranges from 3X to 6X on average, up to 50X. See http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-components-postattach-ments/00-10-36-36-43/SQL_5F00_Server_5F00_2012_5F00_Column_2D00_Store.pdf.

Database Availability Design Impacts on Storage Performance

The database availability design you choose will have a direct impact on your storage performance. The choice between the different availability types varies. In this book, we are focusing mainly on standalone instances using VMware vSphere HA, SQL Server 2012 AlwaysOn Availability Groups (AAG), and AlwaysOn Failover Cluster Instances (FCIs). Standalone instances and FCI have relatively the same storage capacity requirements (unless local Temp DB is used in the FCI case) and have the same storage performance requirements. AlwaysOn Availability Groups, which has some advantages from an availability and data protection standpoint, at least doubles the total capacity requirements as well as adds additional IO overhead, dependent on the workload, and specifies how many inserts, updates, and deletes there are, as each database change must be replicated.

Volume Managers and Storage Spaces

When you set up storage within Windows, you have the option of using the Windows Volume Manager with Basic or Dynamic Disks or using Storage Spaces (Windows 2012 onwards). Dynamic Disks and Storage Spaces provide options that include spanning volumes, striping volumes, and fault-tolerant volumes inside the guest operating system. Managing spanned, striped, or fault-tolerant volumes inside Windows adds an unnecessary IO overhead when you are virtualizing your SQL Server—especially as you are being provided with these services through your underlying storage devices and your virtual disks can be expanded online without disruption. Because of the way that SQL Server manages its data files, and effectively accesses them and stripes the files anyway, there is no need to add any additional layers of striping for performance inside Windows. We recommend the use of basic disks in Windows and the GPT (GUID Partition Table) partition format for all SQL Server partitions. Using GPT, you will be able to expand the partitions beyond 2TB in the future (vSphere 5.5 or above required) if the need arises. As a best practice, you should configure only one partition per virtual disk for your database.

SQL Server Virtual Machine Storage Design

We have now covered how to optimize storage performance for SQL Server and Windows at the operating system level. Now we will look at how to optimize storage performance with your virtual machine template and discuss the different configuration options available to you. In this section, we cover different virtual machine hardware versions, virtual IO controllers, types of virtual disk, and how to size and deploy your virtual disks onto your storage devices. In this section, we start to look further at IO device queues and how they impact virtual machine storage performance.

Virtual Machine Hardware Version

The virtual machine hardware version you choose will impact the type of virtual storage IO controllers available to your virtual machine. The type of virtual storage IO controller you choose will have an impact on your storage performance, as we will cover in the next section. Table 6.3 shows the different storage IO controller options based on the different virtual hardware versions.

Table 6.3 Supported Virtual Machine Storage IO Controllers

Feature	ESXi 5.5 and Later	ESXi 5.1 and Later	ESXi 5.0 and Later	ESXi 4.x and Later	ESXi 3.5 and Later
Hardware Version	10	9	8	7	4
Maximum SCSI Adapters	4	4	4	4	4
Supported SCSI Adapters	BusLogic LSI Parallel LSI SAS PVSCSI	BusLogic LSI Parallel LSI SAS PVSCSI	BusLogic LSI Parallel LSI SAS PVSCSI	BusLogic LSI Parallel LSI SAS PVSCSI	BusLogic LSI Parallel
SATA (AHCI) Controllers	4	N	N	N	N

Each virtual SCSI controller (vSCSI) allows up to 15 disks to be connected, for a total of 60 vSCSI disks per virtual machine. With hardware version 10 in ESXi 5.5, VMware has introduced a new SATA (AHCI) controller. Each SATA controller allows up to 30 disks to be connected, for a total of 120 vSATA disks per virtual machine. vSCSI and vSATA can be combined on the same virtual machine for a maximum of 180 virtual disks per VM.

BusLogic and LSI Parallel are legacy controllers not suitable for SQL Server 2012 virtual machines. If you are virtualizing on ESXi 5.5 and using virtual hardware version 10, SATA may have some use if you have a need for a particularly large number of virtual disks per VM. However, for almost all cases, you will choose either LSI Logic SAS or VMware PVSCI (Paravirtualized SCSI). The reason why will become clear as we look in more detail at each of these controllers.

Choosing the Right Virtual Storage Controller

A virtual storage controller is very much like a physical storage controller in terms of how Windows interacts with it. Choosing the right virtual storage controller can have a big impact on your performance. In this section, we cover the different controller options and the performance characteristics of each type. This will help you make the right virtual storage controller choice based on your requirements and constraints.

Remember back to Principle 1: Your database is just an extension of your storage? Our goal is to maximize the parallelism of IOs from SQL Server and Windows through to the underlying storage devices. We don’t want IOs to get held up unnecessarily in Windows itself. However, care needs to be taken because we don’t want to issue so many IOs that the backend storage devices get overloaded and cause additional latency. Overloading your backend storage will not just impact SQL Server, but could also impact all of your other VMs. We will discuss later how you can use features of VMware vSphere to ensure quality of service to your critical databases and other virtual machines to protect them from the impact of noisy neighbors.

Table 6.4 outlines some of the key attributes of the storage controller options for your SQL Server VMs.

Table 6.4 Supported Virtual Machine Storage IO Controllers

Feature	LSI Logic SAS	VMware PVSCSI	SATA AHCI^*
Maximum Disks per Controller	15	15	30
Default Adapter Queue Depth	128	256	N/A
Maximum Adapter Queue Depth	128	1,024	N/A
Default Virtual Disk Queue Depth	32	64	32
Maximum Virtual Disk Queue Depth	32	256	32
AlwaysOn Failover Cluster Instance Supported	Y	N	N
AlwaysOn Availability Group Supported	Y	Y	Y

VMware’s Paravirtualized SCSI controller (PVSCSI) is a high-performance SCSI adapter that allows the lowest possible latency and highest throughput with the lowest CPU overhead. In VMware vSphere 5.x, PVSCSI is the best choice, even if you don’t expect your database to be issuing large amounts of outstanding IO operations. However, like SATA, PVSCSI can’t be used with SQL Server AlwaysOn Failover Cluster Instances, which leverage shared disk clustering. When you are using AlwaysOn Failover Cluster Instances, your only option is LSI Logic SAS.

CAUTION

Changing the storage controller type after Windows is installed will make the disk and any other devices connected to the adapter inaccessible. Before you change the controller type or add a new controller, make sure that Windows contains the necessary drivers. On Windows, the driver must be installed and configured as the boot driver. Changing the storage controller type can leave your virtual machine in an unbootable state, and it may not be possible to recover without restoring from backup.

Choosing a virtual storage controller with a higher queue depth will allow SQL Server to issue more IOs concurrently through Windows and to the underlying storage devices (virtual disks). By having more virtual disks (more drives or mount points), you increase the amount of queues that SQL Server has access to. Balancing the number of data files to drive letters, to virtual disks, and to adapters allows you to maximize the IO efficiency of your database. This will reduce IO bottlenecks and lower latency.

Not all virtual disks will issue enough IOs to fill all of the available queue slots all of the time. This is why the adapter queue depths are lower than the aggregate total number of queues per device multiplied by the total number of devices per adapter. PVSCSI, for example, has 15 virtual disks, and each disk has a queue depth of 64 by default. The number of devices multiplied by their queue depth would be 960, even though the adapter default queue depth is only 256.

In most cases, the default queue depths are sufficient for even very high performance SQL Server systems—especially when you are able to add up to four vSCSI adapters and increase the number of virtual disks per adapter. With LSI Logic SAS, you have a maximum of 32 queue slots per disk and a maximum of 128 queue slots per adapter. Neither can be changed. In this case, your only option to scale IO concurrency is by adding virtual controllers and adding virtual disks. This is a key consideration when considering AlwaysOn Failover Cluster Instances, where LSI Logic SAS is the only vSCSI adapter option.

With PVSCSI, you can modify the disk queue depth and the adapter queue depth from their default settings. This is only required in very rare cases where your database needs to issue very large amounts of IO in parallel (>1,000). To keep things standardized and simple, we recommend leaving the default settings in your templates and only modify them if absolutely necessary. This assumes your underlying disk subsystems can support the parallelism required at low-enough latency.

Figure 6.11 shows an example of the registry entries configured to increase the maximum adapter and virtual disk queue depths for a VMware PVSCSI adapter, as documented in VMware KB 2053145.

Figure 6.11 PVSCSI advanced registry parameters.

If you choose not adjust the queue depth or are unable to adjust the queue depth of a particular storage device or adapter, Windows will queue any additional IOs. Windows will hold up to 255 IOs per device before issuing them to the adapter driver, regardless of the devices underlying queue depth. By holding the IOs in the Windows OS before issuing them to the adapter driver and the underlying storage, you will see increased IO latency. To learn more about the Windows storage driver architecture (storport), we recommend you read the article “Using Storage Drivers for Storage Subsystem Performance” at Windows Dev Center [http://msdn.microsoft.com/en-us/library/windows/hardware/dn567641 and http://msdn.microsoft.com/en-us/library/windows/hardware/ff567565(v=vs.85).aspx].

Figure 6.12 shows the difference in IOPS and latency between PVSCSI, LSI Logic SAS, and SATA AHCI. These tests were conducted using a single drive at a time on a single VM. The VM was configured with two vCPUs and 8GB RAM. Each virtual disk was placed on the same VMFS5 data store on top of a Fusion-io ioDrive2 1.2TB PCIe flash card. IOMeter was used to drive the IO load and measure the results.

Figure 6.12 Virtual storage adapter performance.

As you can see from the graph in Figure 6.12 (published at http://longwhiteclouds.com/2014/01/13/vmware-vsphere-5-5-virtual-storage-adapter-performance/), both SATA and LSI Logic SAS have no significant performance advantage going from 32 outstanding IO operations (OIO) to 64 due to their maximum device queue depth being 32. PVSCSI, however, sees a 15% improvement in IOPS between 32 OIOs and 64, based on a single Fusion-io ioDrive2 card as the underlying storage. A storage array of multiple backend devices will potentially show a much greater improvement when queue depth is increased. This assumes the HBAs and storage processors are configured to accept a higher queue depth and not overloaded.

Table 6.5 displays the IOMeter performance results for each virtual storage adapter, including throughput and CPU utilization for the 8KB IO size. The IO pattern used was 100% random read, with a single worker thread and single virtual disk from the test virtual machine. As you can see from the results, PVSCSI shows significantly better IO performance at lower latency and lower CPU utilization compared to the other adapter types.

Table 6.5 Virtual Storage Adapter Performance (32 OIOs)

Virtual Storage Adapter	IOPS	Avg. Latency (ms)	Throughput (MB/s)	CPU Utilization
SATA	16581.65	1.93	129.54	52%
LSI Logic SAS	45212.96	0.71	353.23	45%
PVSCSI	54382.29	0.59	424.86	33%

Table 6.6 displays the IOMeter performance results of increasing the outstanding IOs from 32 to 64 to issue more IOs in parallel using otherwise similar test parameters. As was the case with the previous test, PVSCSI shows significantly improved performance compared to the other adapter types.

Table 6.6 Virtual Storage Adapter Performance (64 OIOs)

Virtual Storage Adapter	IOPS	Avg. Latency (ms)	Throughput (MB/s)	CPU Utilization
SATA	17395.53	3.65	135.90	54%
LSI Logic SAS	44933.46	1.42	351.04	43%
PVSCSI	62845.71	1.02	490.98	34%

This test also shows that a single virtual machine on a vSphere 5.5 host with a single virtual disk can provide good performance and throughput, provided the underlying storage system can support it. Using PVSCSI with vSphere 5.1 and above, a single virtual machine can support up to one million IOPS at 4KB IO size (see http://blogs.vmware.com/performance/2012/08/1millioniops-on-1vm.html).

CAUTION

The IOMeter performance results included in this section were created only to show the relative difference in performance capability between the different virtual storage adapter types. Your results will be different. These tests did not use real-world workload patterns and should not be relied upon for sizing or capacity planning of your SQL Server databases. You should conduct your own tests to validate your environment. See Chapters 10 and 11 for details of how to validate and baseline your environment.

Choosing the Right Virtual Disk Device

You have a small number of options when choosing a virtual disk type for SQL Server, and the choice you make may have an impact on performance. However, modern storage systems and advancements in hypervisor technology have equalized the performance aspects of different virtual disk types in a lot of cases. Today, the type of storage you’re using and your availability design will largely drive your virtual disk selection. Table 6.7 outlines the different virtual disk options.

Table 6.7 Virtual Disk Types

Virtual Disk	Zero on First Write	Size Limit	SQL Failover Cluster Instance Across Boxes Supported
Thin	Y	2TB–512B 62TB (ESXi 5.5)	N
Thick – Lazy Zero	Y		N
Thick – Eager Zero	N		N^*
Virtual Raw Device Map (vRDM)	N		N
Physical Raw Device Map (pRDM)	N	64TB	Y
PassThrough PCI or VM DirectPath IO	N	Device Limit	Y^**

Thin Versus Thick Lazy Zero for SQL Server

The major difference between Thin and Thick Lazy Zero disks is that Thin disks are not preallocated and start small and grow on demand, whereas Thick Lazy Zero disks are preallocated. The unit of growth of a Thin disk is the VMFS block size, which is usually 1MB for VMFS5, unless the data store was upgraded form VMFS3. On a very active VMFS data store, there is the possibility that as the Thin disk grows, the blocks allocated will not be contiguous, but in most cases, this will not negatively impact performance.

There is a myth that the performance of Thick Provisioned disks, even if using Lazy Zero, is much better than a Thin Provisioned virtual disk (VMDK). This is not the case. Thin Provisioned disks and Thick Provisioned Lazy Zero disks have similar performance characteristics. This is because each time a block of data is initially written to either a Thin or Thick Lazy Zero disk, the block must first be zeroed out. This magnifies the write IO impact of blocks that have not previously been written because two write IOs will be issued by the ESXi host for each block. This may have a noticeable negative impact on write IO latency, depending on your underlying storage.

The reason to choose between Thin or Thick Lazy Zero therefore has little to do with performance and more to do with manageability and efficiency of storage consumption. There are tradeoffs to each choice. Your choice needs to be based on your requirements and circumstances.

Using thin provisioning may allow for higher utilization of storage capacity as each VMDK, data store, and underlying storage device will have a percentage of free space unused as a safety buffer. At the same time, it will add additional management overheads to the environment because administrators have to ensure they do not excessively over-provision real storage resources in terms of both capacity and performance. You need to be aware of possible growth storms and keep on top of your storage demands.

CAUTION

Thin Provisioned VMDK growth operations on VMFS data stores generate metadata updates. Each metadata update requires a lock for a brief period of time on the VMFS data store. On some older storage arrays that do not support VMware’s API for Array Integration (VAAI) and where there is an excessive number of Thin VMDKs or VMs per data store, this can cause SCSI reservation conflicts, which may result in degraded performance (additional latency). VMFS5 volumes newly created on arrays that support VAAI will use Atomic Test and Set Locking (ATS) Only. ATS addresses the problems that used to be caused by SCSI reservation conflicts. When selecting a storage array for use with VMware vSphere 5.x and SQL Server, you should ensure it supports VAAI. VMFS5 volumes that were upgraded from VMFS3 may fall back to using SCSI reservations in certain cases. See VMware KB 1021976 and http://blogs.vmware.com/vsphere/2012/05/vmfs-locking-uncovered.html.

The capacity savings from thin provisioning may well be enough to justify the management overheads because you are able to purchase more on demand instead of up front, and this could save a considerable amount of money. But you need to make sure you can get the performance you need from the capacity that has been provisioned and is used. Sizing for performance may necessitate much more capacity is provisioned on the backend storage devices and therefore diminishes any savings that may have been had when saving capacity through thin provisioning.

CAUTION

Restoring files from backup or copying files between VMs that have Thin Provisioned VMDKs will cause those disks to expand. Once the disks are expanded, they do not shrink automatically when the files are deleted. Also, since Windows 2008, if you do a Full Format on a Thin Provisioned VMDK, it will cause the disk to inflate, as a full format will zero out each block. If you use Thin Provisioned disks, you should select the quick format option when partitioning a disk in Windows. We strongly recommend that you don’t over-provision storage resources to the point an out-of-space (OOS) condition could result from unexpected VMDK growth. See VMware KB 1005418 and Microsoft KB 941961.

If you don’t use Instant File Initialization, then SQL Server will zero out its data files whenever they are created or extended. This will ensure you get optimal performance from the data files regardless of the underlying virtual disk format. But this comes at the cost of the time taken to zero out the file and the resulting impact in terms of storage IO to the underlying storage. As previously discussed, using Instant File Initialization allows SQL Server to act as part of Windows and not write a zero to a block before data is written to it. In certain cases, there could be substantial storage efficiencies (IO Performance and Capacity) by combining the use of Instant File Initialization, thin provisioning, and SQL Server compression. This may be especially advantageous to development and test environments. There can be a significant performance penalty if you use a non-VAAI array without using SQL Instant File Initialization on Thin and Thick Lazy Zero disks. VAAI allows the zeroing operation to be offloaded to the array and performed more efficiently, thus saving vSphere resources for executing VMs. If you use Thin Provisioned or Lazy Thick VMDKs without a VAAI-compatible array, the entire zeroing operation has to be handled solely by vSphere.

If your SQL Server and environment meets the following requirements, you may want to consider using Thin Provisioned VMDKs with Instant File Initialization and SQL Data Compression:

The SQL Server workload will be largely read biased.
Performance from your storage during times that blocks are initially written to and zeroed out is sufficient to meet your database SLAs.
Performance is sufficient from the capacity required when thin provisioning is used.
You are not planning to use Transparent Data Encryption.
You wish to minimize the amount of storage provisioned up front and only purchase and provision storage on demand.

When you are using Thick Provisioning Lazy Zero (the default), the VMDK’s space is allocated up front by vSphere, although like with thin provisioning, it is not zeroed out until it’s written to for the first time (or you select full format in Windows when partitioning the disks). When you look at the data store, you may get a more accurate view of free space and there may be less variance between provisioned space and usage. The reason we say you may get a more accurate view of free space is that many modern arrays will tell vSphere the storage is allocated or consumed but won’t actually do so until data is written to it, although it most likely will be reserved.

If you were considering using Thin or Thick Lazy Zero VMDKs for SQL Server, we would recommend you choose the default of Thick Lazy Zero to minimize management overheads. We would recommend using Thin where there are requirements that would benefit from it and justify the management overheads. However, before you decide on Thick Lazy Zero, you should consider Thick Eager Zero, which we cover next.

Using Thick Eager Zero Disks for SQL Server

The major difference between Thick Eager Zero and Thick Lazy Zero or thin provisioning is when the blocks on the VMDK are zeroed out. As we’ve covered with Lazy Zero and Thin VMDKs, blocks are zeroed on first write. With Eager Zero, the blocks are zeroed when the VMDK is created as part of the provisioning process. This means all blocks are pre-zeroed before Windows or SQL goes to access them. By doing this, you are eliminating a first write penalty in the situations where that would occur. This ensures there is no double write IO required to the VMDK after it is provisioned.

As you can imagine, it can take quite a bit longer to provision Thick Eager Zeroed disks. Additionally, provisioning and zeroing out the blocks may impact the performance of other VMs when using shared storage devices. The impact to your environment will be dependent upon the type and utilization of your backend storage devices. Some storage arrays will just throw away or ignore the zeros, and in these cases, the provisioning operations will complete relatively quickly and have minimal impact on performance.

In aggregate, over the life of a VMDK there is normally little difference in the amount of IO generated when using Thin, Thick Lazy Zero, or Thick Eager Zero VMDKs. The difference is all about the timing of when the IO is generated, either up front (in the case of Eager Zero) or on demand (first write) with Thick Lazy Zero and Thin. Once a block has been written to with Thick Lazy Zero or Thin, it has exactly the same performance characteristics as if it were Eager Zeroed. However, with Eager Zero, even if a block is never used, you have zeroed it out at the cost of a write IO operation.

TIP

When provisioning VMDKs for data files and transaction log files, we recommend you size them to allow 20% free space, which allows for any unforeseen required growth. There should be sufficient capacity for the predicted workload over at least a three-to-six-month period. By right-sizing VMDKs and holding data files and transaction log files for a reasonable period, you reduce the management and administrative burden while at the same time optimize overall performance and capacity consumption.

If you are proactively managing SQL Server data and transaction log files, and not using Instant File Initialization, then the performance of your virtual machine will be the same regardless of the virtual disk type you select. This is because SQL Server is zeroing out the blocks first before they are used. If you enable IFI, then Eager Zero will give better performance in terms of lower latency compared to Thick Lazy Zero or Thin, but only when the block is first written to. All subsequent writes or access to the same block will have exactly the same performance characteristics.

Although the aggregate amount of IO may be similar between the different virtual disk options, Eager Zero generally provides the more predictable response times because IOs will not be impacted by the additional write operation when data is written to a new block. This predictability of IO response and generally lower latency is why Eager Zero is required for the non-shared disks of a SQL Server Failover Cluster Instance. Increased latency or poor IO performance can cause unnecessary cluster failovers between nodes.

With the advent of VMware’s VAAI and modern arrays that support it, the impact to the environment of zeroing operations is reduced and therefore the performance impact of using Eager Zero Thick disks is also reduced during initial provisioning. If you were previously thinking of using Thick Lazy Zero VMDKs and you have a VAAI-capable array that supports the Write Same primitive, we would recommend you use Thick Eager Zero instead. This provides lower management overheads and optimal performance. Regardless of whether you are using IFI or not, and in spite of the possible overhead of having written zeros to a block that may not be used, we feel this is justified for the decreased latency and increased predictability of IO responses that are provided to SQL Server. This is especially important for business-critical production databases. It is fine to use Thin or Thick Lazy Zero for your Windows OS disk, while using Eager Zero Thick for your database drives (data files, Temp DB, and transaction logs). When using SQL AlwaysOn Failover Cluster Instance, it is recommended that you configure Windows OS disks as Eager Zero Thick; shared LUNs will in this case be configured as physical RDMs.

Figure 6.13 shows a sample configuration of a virtual disk with the selection of Thick Provision Eager Zeroed.

Figure 6.13 New virtual disk—Thick Provision Eager Zeroed.

Using Raw Device Maps for SQL Server

A Raw Device Map (RDM), as the name suggests, is a direct mapping of a physical LUN to a virtual machine. The main reason to choose RDM is SQL Server Failover Cluster Instances (FCI). SQL FCI uses Windows Failover Clustering (previously known as Microsoft Cluster Services), shared SCSI bus, shared disks between nodes, and requires persistent SCSI reservations. To allow the persistent SCSI reservations and the cluster to function correctly, Physical Mode or Pass-through RDM (pRDM) are required. Another reason to consider using RDMs for SQL Server is if you are leveraging physical SAN capabilities such as snapshots that you wish to present to a physical server for a purpose such as LAN-free backups, if you are not using a backup solution integrated with the VMware APIs for Data Protection. However, there are no noticeable performance differences between RDMs and virtual disks on a VMFS file system, as Figure 6.14 illustrates.

Figure 6.14 VMFS and RDM performance comparisons: IOPS vs. IO size.

Figure 6.14 illustrates the performance comparison between VMFS and RDM using a random 50/50 mixed read-write workload pattern and the different IO sizes based on data published at http://www.vmware.com/files/pdf/performance_char_vmfs_rdm.pdf.

Although there are no noticeable performance differences between a single VMDK on a VMFS data store and an RDM, there are important performance considerations and constraints with using RDMs that need to be considered, such as:

An RDM maps a single LUN to a virtual machine, so each VM will likely consume multiple LUNs and there will be more LUNs to manage.
More LUNs are required, which may constrain the number of VMs possible as the maximum number of LUNs per host is currently 256.
It is not possible to perform storage IO quality of service on a pRDM; therefore, a VM configured to use a pRDM could negatively impact the performance of other VMs using the same underlying shared storage array.
Can’t leverage vSphere features such as Storage vMotion, so it can be more difficult to balance capacity and performance when using pRDMs and more difficult to resolve any storage hot spots.

Due to the management overheads, constraints, and VMware feature limitations of using RDMs, we recommend their use only when absolutely necessary, such as to deploy SQL FCI; in all other cases, VMDKs should be used. Using VMDKs future proofs your environment and allows it to benefit from any further advancements that VMware releases that pertain to VMDKs.

The IO Blender Effect

When you virtualize SQL and consolidate many SQL VMs onto fewer hosts, the amount of IO per host increases. In addition to the increase in IO per host, in most cases the IO patterns will also change. Unless you have completely dedicated storage for each SQL VM, which is not optimal from a cost or performance perspective in most cases, all IO will at some point become random.

Any time you share storage and have multiple VMs and different IO workloads, the combined IO pattern is random. Random IO, especially when write biased, can have a significant impact on storage performance, particularly when RAID (Redundant Array of Inexpensive or Independent Disks) is used. Grouping similar workloads together can help improve the blended IO pattern and reduce the burden on storage. Figure 6.15 shows the impact of combining different IO workload patterns.

Figure 6.15 The IO Blender Effect.

This is an important concept to understand because you will need to size your storage to be able to handle the required number of IOPS with a completely random IO pattern. Random IO has a higher overhead than sequential IO in most cases, with the exception of some flash-based storage systems. Subsequent sections of this chapter will discuss IO workload performance impacts of different physical storage systems in more detail.

SQL Virtual Machine Storage Layout

Now that we have covered the various storage IO controller and virtual disk device choices, we can put it all together and discuss the design of a logical virtual machine storage layout. This layout, in turn, supports our SQL Server and Windows design and will drive the design of our underlying storage devices. We want to take our five key principles and apply these so our virtual machine storage layout meets the requirements of our database workloads in the simplest way possible, without compromising SLAs.

The example in Figure 6.16 shows a simple storage layout for a SQL Server VM that has all of its VMDKs supported by a single underlying data store. You could also have a number of SQL Server VMs and their VMDKs on the same data store. For development and test VMs, and where SQL FCI is not used, this may be a suitable design choice. It would also be suitable for the storage of your SQL Server template VM. However, it is unlikely to be a suitable choice for high-performance business-critical production SQL Server databases. The Windows C: drive, application binaries, and page file may be on the same data store or hosted on another data store.

Figure 6.16 Multiple VMDK to a single data store.

The performance of SQL in this example will be limited to the performance of a single data store, and it will have access to the queue depth of a single data store, even though the individual virtual disks may be trying to issue many IOs in parallel. This example is the simplest from a management perspective, though, because there is only a single data store to manage. This sample layout assumes that not all of the virtual disks will be issuing IOs at the same time and that the aggregate amount of outstanding IOs will not exceed the available queue depth of the data store. If the available queue depth of the data store and the underlying storage device is exceeded, the result will be additional IO latency in the hypervisor and slower response times for your database. Another impact of this choice is that all IOs from SQL will be blended together and become completely random, as we show in the “IO Blender Effect.”

The example in Figure 6.17 shows two VMDKs per data store. This layout may be suitable for production SQL databases, provided the underlying data store could support the performance requirements of the VMDKs. This assumes that the data store has sufficient queue depth for the peak number of outstanding or parallel IOs from the VMDKs; otherwise, additional latency will result and response times will be degraded. SQL will benefit from the combined IO performance of multiple data stores and the queue depth available from multiple data stores to allow many IOs to be serviced in parallel.

Figure 6.17 Multiple VMDK and multiple data stores.

This sample layout provides a balance between manageability and performance, because there are a relatively small number of data stores to manage per SQL VM, less than would be required for a physical SQL system or where RDMs are used. This is quite a common layout for SQL systems that have reasonable but not extreme performance requirements. The data store that holds the transaction logs would blend the otherwise sequential IO patterns and make them random. If this was a concern, the transaction log VMDKs could be separated onto their own data stores.

TIP

We recommend the VMDK for the Windows OS C: drive, application binaries, and Windows page file be hosted on a separate data store. Because the IO from the Windows OS C: drive, application binaries, and page file should be minimal, you may be able to host a number of them on a single data store, while keeping data files and transaction log disks and their data stores separate. You should take into account your availability requirements and risks, as the loss of access to a single data store in this case could impact multiple SQL systems. Backup disks can be shared with the same IO controller as the OS, and we recommend they are on their own VMDK and data store if their size and performance requirements justify it.

The example in Figure 6.18 shows each VMDK mapped to a dedicated data store. This layout is suitable for SQL systems that need extreme IO performance and scalability. It allows IO to be spread across more storage devices, and each VMDK has access to the maximum possible amount of parallel IO. The increased number of data stores and therefore LUNs will limit the total number of VMs that can be supported per host. You will have many more data stores to manage per VM, which will increase your management overheads.

Figure 6.18 Single VMDK per data store.

If each SQL VM has 10 data stores, you could be limited to just 25 VMs per cluster, as each data store should be zoned to each host in the cluster to allow VMware HA and DRS to function correctly. It is likely that if you need this layout for storage performance, your SQL VMs will also have very large compute footprints in terms of memory and CPU. However, if this is required to meet your performance requirements, you may find that you need to design for a smaller number of hosts per cluster, and potentially have more clusters. This layout assumes that each VMDK will use the full queue depth of each data store, which is often not the case. You may find that you need to reduce the queue depth per LUN to avoid overloading your backend storage ports, which defeats the purpose of having more LUNs in the first place.

Often the need for extreme performance is driven by many database instances or schemas running on a single VM, and in these cases it may be a better design choice to split up those instances into multiple VMs. Because VMDKs (not RDMs) are used, it is possible to start with the example in Figure 6.19 and increase the number of data stores if required at a later time. You can migrate the VMDKs without any downtime by using VMware Storage vMotion.

Figure 6.19 Virtual machines sharing data stores.

Up until now we have provided examples where the storage is dedicated to each SQL Server. This is a very traditional approach to SQL storage architecture. When you have a very good baseline and understanding of your inventory and workload characteristics, it is a good approach, but it has a couple of potential drawbacks. The first drawback is manageability. You must have a number of data stores supporting each VM, which produces more data stores to manage, and may not balance performance and capacity efficiently between many SQL Server VMs. You may end up with many different data store sizes for each of the different databases, which provides little opportunity for standardization. This may be more of a problem in a smaller environment because there may be fewer SQL VMs of similar size; in large-scale environments (hundreds of SQL VMs), this is generally less of a problem.

The next potential drawback is that although you may have isolated the storage logically to each VM, if you share the same storage under the covers, each VM could impact the performance of the others. When a single VM is using a storage device, you can’t make use of VMware vSphere features such as Storage IO Control (SIOC) to ensure quality of service and fair IO performance between different VMs. This may place an additional burden on storage administrators to try and isolate performance at the physical storage level, which can often lead to limited and suboptimal overall performance.

Finally, the isolation approach doesn’t lend itself easily to automation and policy-based administration. It is also not possible to dedicate storage devices to SQL Server VMs in this manner in most Cloud or Infrastructure as a Service environments. To make automation and policy-based administration possible, you need standardization and you need to share multiple data stores among many VMs. This then allows you to leverage the features of VMware vSphere to ensure quality of service and fairness of IO performance between the many SQL VMs if there is any contention.

The example in Figure 6.19 shows two SQL Server VMs sharing the same data stores for the different types of Windows OS and SQL disks. This layout allows the SQL VM’s performance to be balanced with a standardized data store size and allows for easier automation and policy-drive provisioning and load balancing. Because the data stores are shared, VMware Storage IO Control can ensure fairness of IO and quality of service for IO performance between the multiple SQL VMs.

SQL Failover Cluster Instance Storage Layout

In this section we have shown how you can efficiently lay out your virtual machine storage for SQL and use fewer LUNs than you have VMDKs, while balancing performance requirements. This is possible when using standalone instances or when using AlwaysOn Availability Groups. However, when using SQL AlwaysOn Failover Cluster Instances, you must use pRDMs and therefore bypass the VMFS data store and the ability to share LUNs, as Figure 6.20 illustrates.

Figure 6.20 SQL AlwaysOn Failover Cluster Instance storage layout.

For this reason and for reduced management overheads and complexity, we recommend the use of AlwaysOn Availability Groups over Failover Cluster Instances where possible.

Expanding SQL Virtual Machine Storage

When designing your virtual machine storage architecture for SQL Server, you need to consider how the allocated storage will be expanded in the future as your databases grow. We previously discussed in “SQL Server File System Layout” the ability to expand partitions and virtual disks online without disruption. This is one way of expanding the storage available to SQL Server. An alternative approach would be to hot-add additional virtual disks to SQL Server and then balance the data files across the additional disks.

If you hot-add new disks and need to create new data files, SQL Server will stripe the data to the newly created data files as they have the more free space. For this reason, we recommend you add more than one virtual disk and data file to try and spread the IO load. This will help avoid creating hot spots. The number of VMDKs and data files you need to create will depend on your SQL workload profile.

Jumbo VMDK Implications for SQL Server

vSphere 5.5 introduced the ability to provision 62TB Jumbo VMDKs and Virtual Mode RDMs (vRDM) with a VM. Physical Mode RDMs (pRDM) are capable of being provisioned up to 64TB, as of vSphere 5.0. The VMware maximum VMFS data store size is 64TB, as it was in vSphere 5.0. This allows truly massive storage footprints to a single VM.

With Virtual Hardware Version 10, we now have the ability to provision a single VM with maximum storage capacity (see Table 6.8).

Table 6.8 Maximum Virtual Machine Storage

Virtual Disk Controller	Max # Controllers	VMDK/ Controller	Total VMDKs	Capacity VMDK/vRDM	Capacity pRDM
vSCSI Controller	4	15	60	3,720TB	3,840TB
SATA Controller	4	30	120	7,440TB	7,680TB
Totals	8		180	11,160TB (11PB)	11,520TB (11.5PB)

Just because the size of the virtual disk increases doesn’t mean the performance of the virtual disk increases. With each virtual disk, the queue depth is still the same regardless of the size. This limits the parallelism of IOs to a single virtual disk, and it will also limit the throughput unless SQL is issuing incredibly large IO sizes. For this reason, the maximum capacity is largely theoretical because you would not be able to get the necessary performance.

Although having lots of 62TB virtual disks is unrealistic, having a few virtual disks > 2TB is possible and potentially desirable for large SQL Servers. You can use a single virtual disk for your transaction logs (max 2TB per transaction log file), and you would be able to use a single virtual disk for your backup drive. Both transaction logs and backups are sequential in nature and could benefit from the capacity of a larger > 2TB VMDK without the performance drawbacks that would be likely for data files. Your underlying storage platform would need to support a VMFS data store of a LUN size big enough to support all of these large VMDKs. You should also consider your restore times when using large VMDKs. If you can’t restore a large VMDK within your SLAs, it is not a good choice. Just because you can use Jumbo VMDKs doesn’t mean you always should.

CAUTION

You can’t extend virtual disks > 2TB online. You must shut down your virtual machine first, and extend the virtual disk offline through the vSphere Web Client. This is due to the disk needing to be in the GPT format. Once a virtual disk has been extended to > 2TB, each time you need to extend it further, you must shut down the VM. Alternatively, you can hot-add a new virtual disk to the VM online at any time and the new virtual disk can be > 2TB. Jumbo VMDKs can only be managed through the vSphere Web Client because the traditional VI Client (VMware C# Client) only supports VMware vSphere 5.0 features. All newer features are only available through the Web Client. We recommend you create all SQL data file, Temp DB file, transaction log, and backup drives using the GPT format.

VMFS Heap Size Considerations with Monster VMs and Jumbo VMDKs

ESXi 4.x and 5.x prior to 5.5 used a VMFS Heap value to control how much memory was consumed to manage the VMFS file system and for open or active VMDK capacity on a single ESXi host. This limit was not documented in the vSphere Maximum’s product document, and by default with a 1MB block size on ESXi 5.0 GA, it would limit a host to being able to open 8TB of total VMDKs before errors could occur. The maximum on ESXi 5.0 GA was 25TB with a 1MB block size, which required adjusting the advanced parameter VMFS3.MaxHeapSizeMB. This was later increased to 60TB by default on ESXi 5.0 by applying the latest patches and in ESXi 5.1 Update 1. The only downside of this was 640MB of RAM was consumed for the VMFS Heap.

In vSphere 5.5, the whole VMFS Heap size problem has been addressed. The VMFS Heap is now irrelevant as a measure of how much open and active VMDK capacity a single ESXi 5.5 host can handle. This is due to major improvements in the way the VMFS Heap and pointer blocks are managed.

VMFS pointer blocks are a pointer to a VMFS block on disk. When a VMDK is opened on an ESXi 5.5 host, all of the VMFS “pointer” blocks are cached in the Pointer Block Cache, which is not part of the main VMFS Heap (where the pointer blocks were previously stored in prior versions of ESXi). This allows the open VMFS “pointer” blocks to be addressed or accessed and managed as fast as possible without having to access metadata from the VMFS file system directly. The pointer blocks will remain in use so long as a VMDK or other file is open. However, many blocks in any individual VMDK are not often active. It’s usually only a percentage of the blocks that are actively used (say, 20%). The images shown in Figures 6.21 and 6.22 display how the pointer blocks are used to refer to data blocks on the VMFS file system. Each pointer block that is active is stored in the pointer block cache to ensure the fastest possible access to the most frequently used blocks.

Figure 6.21 VMFS pointer block indirection—memory address mapping to physical VMFS blocks. *¹

Figure 6.22 VMFS pointer block double indirection. Used for mapping very large VMFS data sets.*

Pointer Block Eviction Process

This is where the new Pointer Block Eviction Process introduced in ESXi 5.5 comes in. If the number of open and active VMFS blocks reaches 80% of the capacity of the Pointer Block Cache, a Pointer Block Eviction Process will commence. This basically means the pointer blocks that are not active, or least active, will be evicted from memory and only the active blocks will remain in the cache. This new process greatly reduces the amount of ESXi host memory consumed to manage VMFS file systems and the open VMDKs capacity per host. The VMFS Heap itself in ESXi 5.5 consumes 256MB of host RAM (down from 640MB), and the Pointer Block Cache by default consumes 128MB of host RAM. You no longer have to worry about adjusting the size of the VMFS Heap at all. A new advanced parameter has been introduced to control the size of the Pointer Block Cache, MaxAddressableSpaceTB.

As with all advanced parameters, you should not change MaxAddressableSpaceTB without a good justification, and in most cases, it will not be necessary. MaxAddressableSpaceTB by default is set to 32, with a maximum of 128. This controls the amount of host RAM the Pointer Block Cache consumes. With the default setting at 32, it will consume 128MB of host RAM (as mentioned previously), and with the maximum setting of 128, it will consume 512MB of host RAM. However, it’s important to note that this does not limit the capacity of open VMDKs on the ESXi 5.5 Host, just how many of the pointer blocks can stay cached in RAM. If only 20% of all VMDK blocks are active, you could conceivably be able to have 640TB or more of open VMDK capacity on the host, while still having the active pointer blocks cached without much, if any, performance penalty.

The way this new Pointer Block Eviction Process works gives you a sense of having an almost unlimited amount of open VMDK capacity per ESXi 5.5 host. But it’s not quite unlimited; there is a tradeoff as the amount of active VMDK capacity on an ESXi 5.5 host increases. The tradeoff is possible Pointer Block Cache Thrashing, which may impact performance.

With the default setting of MaxAddressableSpaceTB=32, the Pointer Block Eviction Process won’t kick in until the amount of open VMDKs exceeds 25.6TB. So if you aren’t expecting the VMs on your hosts to routinely exceed 25TB of open and active VMDK blocks, there is probably no need to even look at adjusting MaxAddressableSpaceTB; this saves you some host RAM that can be used for other things. In most cases, you would only have to adjust MaxAddressableSpaceTB if the active part of all open VMDKs on a host exceeds 25TB. If active VMDK blocks exceed the capacity of the Pointer Block Cache, then thrashing could result from constantly evicting and reloading pointer blocks, which may have a performance penalty.

You will see signs of Pointer Block Eviction in the VMKernel logs on your hosts if it is occurring. Syslog, vCenter Log Insight, or Splunk will help you spot this type of activity. If you start to notice any sort of performance impact, such as additional storage latency visible in KAVG in ESXTOP, and a correlation to Pointer Block Eviction, then that would be a sign you should consider adjusting MaxAddressableSpaceTB. If you’re planning to have 100TB of open VMDKs per host routinely, as in the case of large SQL Servers, we recommend setting MaxAddressableSpaceTB = 64 and adjusting upwards if necessary. If you’re not concerned about the amount of RAM the Pointer Block Cache will consume, you could consider setting it to the maximum of 128.

Increasing MaxAddressableSpaceTB may consume host RAM unnecessarily and so should be considered along with the total RAM per host and the RAM that is likely to be consumed by all VMs. 512MB of RAM consumed for Pointer Block Cache on a host with 512GB of RAM or more is not significant enough to worry about, but could be worth considering carefully if your hosts only have 32GB of RAM.

vSphere Storage Design for Maximum SQL Performance

We have so far covered SQL Server VM storage architecture from the database down to the data store. We are now ready to dive into VMware vSphere storage design and physical storage design to achieve maximum performance. This section will build on what we’ve covered already and help you to design an underlying storage architecture that supports your high-performance SQL Server systems on top of it. We will cover the impacts of number of data stores, data store queues, storage performance quality of service (QoS), storage device multipathing, RAID, and storage array features such as auto-tiering.

Number of Data Stores and Data Store Queues

The number of data stores you specify for your SQL Servers has a direct impact on the number of VMs and hosts that you can support in a vSphere cluster. The maximum number of data stores per host is 256, and all data stores should be visible to all hosts in a single cluster to ensure features such as VMware HA and DRS function correctly. For SQL Servers that will have a low IO requirement, you may be able to host a number of them on a single data store. This is one of the great benefits of using VMFS data stores over RDMs. Ultimately the number of data stores you need depends on how many IOPS you can get from a single data store, the combined IOPS and queue depth (QD) requirement of the VMs, and the queue depth you have configured per LUN on each vSphere host. For example, if each SQL Server consumes six LUNs or data stores and you can support four SQL Servers per host, your vSphere cluster would be limited to 10 hosts, plus one host for failure.

The IOPS for a particular data store is usually measured and specified in terms of IOPS per TB. This makes it very easy to explain to application owners what performance they should expect from their storage related back to the capacity. However, the calculation can become a little more complicated when features such as array auto-tiering, compression, and de-duplication are used. As part of designing your storage environment, we recommend you specify an SLA for each type of data store that is backed by a different class of storage (or different storage policy). As part of the SLA, calculate the IOPS per TB achievable and make this known to the application owners. Knowing the IOPS per TB achievable and required will also help if you are looking to host any SQL servers in a cloud environment. Whatever the IOPS per TB is for a particular data store, it will potentially be divided by the number of hosts sharing the data store, so you will most likely not be able to run a single host to the limit, unless there is only one VM on the data store.

In many cases, you can reduce the number of data stores you need to manage by increasing the queue depth per HBA LUN on each vSphere host. This allows you to place additional virtual disks on the data store, but without sacrificing the aggregate number of available storage IO queues. We recommend you do not increase the aggregate queue depth to the storage processors. By this we mean that by reducing the number of LUNs and increasing the queue depth per LUN, the total queue depth to the storage processor ports should be the same.

CAUTION

Be aware that if your storage is under-configured or already overloaded, increasing the queue depths won’t help you. You need to be aware of any queue depth limits on your storage array and processor ports and make sure that you don’t exceed them. If you overload a traditional storage processor and get a QFULL SCSI sense code, the storage controller (HBA) will drop the queue depth to 1 and slowly increase it over time. Your performance during this period will suffer significantly (like falling off a cliff). We recommend that you consult with your storage team, storage vendor, and storage documentation to find out the relevant limits for your storage system before changing any queue depths. This will help avoid any possible negative performance consequences that would result from overloading your storage. Some storage arrays have a global queue per storage port, and some have a queue per LUN. Whether your storage is Fibre Channel, FCoE, or iSCSI, you need to understand the limits before you make any changes.

TIP

The default queue depth on a QLogic HBA changed from vSphere 4.x to 5.x from 32 to 64. Emulex queue depth is still 32 by default (two reserved, leaving 30 for IO), and Brocade is 32. If you didn’t know this and simply upgraded, you could suffer some overloading on your backend storage processors. If your array is supporting vSphere hosts and non-vSphere hosts on the same storage processors, it is possible in some cases for the vSphere hosts to impact the performance of other systems connected to the same array. For more information and instructions on how to modify your HBA queue depth, see VMware KB 1267 and http://longwhiteclouds.com/2013/04/25/important-default-hba-device-queue-depth-changes-between-vsphere-versions/.

In Table 6.9, where the data store maximum number of VMs per host is 1, the maximum VMs on a given data store is effectively the maximum number of hosts that can be supported in a cluster. To increase the aggregate amount of active IOs per VM, you need to increase the number of LUNs and ensure VMs sharing those LUNs are split across hosts.

Table 6.9 Calculating Load on a VMFS Volume for Sample Configurations

Max Outstanding IO per LUN (n)	Avg. Active IO per VM (a)	LUN Queue Depth (d)	Max VM per Host m=(d/a)	Max VM on Data Store (n/a)
256	4	32	8	64
256	4	64	16	64
1,024	4	64	16	256
256	32	32	1	8
1,024	32	32	1	32
1,024	1	32	32	1,024
256	64	64	1	4
1,024	64	64	1	16

Table data sourced from http://www.vmware.com/files/pdf/scalable_storage_performance.pdf, with additional scenarios added.

You don’t just have to worry about your maximum LUN queue depths. You also have to consider the queue depths of your HBA. Many HBAs have a queue depth of 4,096, which means you’d only be able to support 64 LUNs per host at a queue depth of 64, assuming all queues were being used. Fortunately, this is rarely the case, and overcommitting queues at the host level has less drastic consequences than overcommitting queues at the storage array level. Any IOs that can’t be placed into the HBA queue will be queued within your vSphere host, and the consequence is increased IO latency, the amount of which will depend on your IO service times from your storage. Queuing inside your vSphere host can be determined by monitoring the QUED value and KAVG in ESXTOP. Recommended thresholds for average and maximum values can be found in Chapter 10.

The LUN queue depth isn’t the only value that you may need to modify in order to increase performance from your data store. The LUN queue setting goes hand in hand with the VMware vSphere advanced parameter Disk.SchedNumReqOutstanding (DSNRO). DSNRO is used to control the queue maximum depth per VM when there are multiple VMs per data store. The goal of this setting is to ensure fairness of IO access between different VMs. When there is only one VM per VMFS data store, the LUN queue depth will always be used. In vSphere, Disk.SchedNumReqOutstanding is a global value up until vSphere 5.5. In vSphere 5.5, Disk.SchedNumReqOutstanding is specified on a per-device basis. This setting is modified dynamically, as is the LUN queue depth when Storage IO Control is enabled on a data store with multiple VMs that is experiencing performance constraints.

TIP

You can set the per-device number of requests outstanding in vSphere 5.5 by using the command

esxcli storage core device set –d naa.xxx --sched-num-req-outstanding=<value>

where naa.xxx is the device name and <value> is a value from 1 to 256.

To list the storage devices on the system, use the following command:

esxcli storage core device list

By specifying the –d naa.xx option, you can confirm the setting has been changed as you expected. Also see VMware KB 1268 for further information.

Figure 6.23 shows the different queues at each level of the vSphere storage architecture. The two values that are usually worth monitoring as a vSphere admin are the AQLEN and the DQLEN. DQLEN can be adjusted up or down, depending on your requirements. For high-IO SQL Server systems where PVSCSI is used on VMDKs, we suggest you set the DQLEN to 64 as a starting point, while taking into account our previous recommendations when modifying queue depths.

Figure 6.23 VMware vSphere storage queues.

CAUTION

If you have an under-configured storage array and insufficient individual spindles or disks to service the aggregate IO requirements, then increasing the queue depth will not improve performance. On an under-configured array, increasing queue depth will just result in the queues becoming full and increased IO latency or service times. Virtualization doesn’t get around the laws of physics; you may need more disks. Our goal is to ensure the path from the guest to the underlying storage is not the bottleneck in software, so that you can get the most out of your physical storage investments and get the highest performance possible.

Figure 6.24 shows the different areas where storage IO latency is measured and the relevant values inside vSphere. DAVG, which is the device latency, will indicate if you have a bottleneck in your storage array, which may mean you need to add more disks or reduce the load on that device. If you start to see KAVG constantly above 0.1ms, this means the vSphere kernel is queuing IOs and you may need to increase device queue depth, especially if the DAVG is still reasonable (< 10ms).

Figure 6.24 VMware vSphere storage latency.

We want to optimize the queues through the IO stack so that the disk devices are the constraint, and not the software or queues higher in the stack. Periodic spikes in DAVG and KAVG are acceptable, provided the averages are not consistently high. Brief spikes in DAVG and KAVG are acceptable; however, high average values are a sign of a performance problem. Suggested thresholds are listed in Chapter 10.

TIP

When consolidating multiple SQL servers onto fewer hosts, there is usually an implicit assumption that SQL was not previously making full or optimal use of all of the system resources. This includes CPU and RAM, but also storage IO and HBA queues. It’s your job as the architect or admin of the environment to ensure your destination vSphere platform and each host has in aggregate sufficient resources to service the blended peak IO workloads of all of the databases on a single host. Once you know what the likely blended peak is, you can design your host platforms accordingly.

“Every millisecond of storage IO latency is potentially a millisecond that SQL can’t respond to an application request.” —Michael Webster

Number of Virtual Disks per Data Store

This section is only relevant if you’re building standalone SQL Server or using AlwaysOn Availability Groups with virtual disks (VMDKs). SQL FCI requires RDMs, and therefore each drive is mapped to a LUN and you can’t share the LUN with multiple VMDKs. You can, however, share the LUN and drive with more data files and achieve the balance of outstanding IOs to queue depth that way.

The number of VMDKs per data store will be limited by the performance characteristics of the data store and the performance requirements of the VMs and their VMDKs. Our primary goal when we decide on the number of VMDKs per data store is to try and balance the average number of active outstanding IOs per host with the queue depth of the data store. In most cases, not all VMDKs will use all their available queue depth all of the time, and not all VMs will use their available queue depth all the time either, but they may have peaks. We need to be able to handle these peaks within a reasonable time in terms of the IO latency or service time.

The example in Figure 6.25 shows a configuration where two VMDKs are on each data store. Each VMDK has a queue depth of 64, resulting in an over-commitment in queue depth of 2:1 from the VMDKs to the data store. On average, each VMDK will be able to issue 32 outstanding IOs (assuming they’re on the same vSphere host) before any additional queuing occurs in the vSphere kernel. If one VMDK is idle, the other VMDK can issue the maximum number of outstanding IOs to the data store and make use of the full queue depth. This may seem to be a rather conservative number of VMDKs per data store, but for very-high-performance systems this (or even 1:1 VMDK to data store) may be necessary to achieve the performance requirements.

Figure 6.25 Two VMDK per data store.

The example in Figure 6.26 shows a queue depth over-commitment of 4:1, assuming all VMDKs from a single VM on the single vSphere host. Each VMDK would be able to issue on average 16 outstanding IOs, while if the other VMDKs are idle an individual VMDK will be able to fill the entire queue.

Figure 6.26 Four VMDK per data store.

This is quite possibly fine for a single host and a single VM for this data store. But a data store is shared between all hosts in the cluster. If we only host a single VM on the data store and only on a single host, we are not able to utilize all of the queue depth that is usually available at the storage array. This assumes that the physical LUN configuration can support a higher aggregate queue depth and higher IOPS at the storage array level. If your backend storage is already performance constrained by its configuration, adding more queue depth and more VMs and VMDKs to the data store will only serve to increase latencies and IO service times.

Figure 6.27 shows two SQL VMs on two different ESXi hosts accessing the same data store. In this scenario, because each host has a LUN queue depth of 64, the combined queue depth to the LUN at the storage array could be up to 128. Provided the LUN can support the additional queue depth and IOPS without increasing latency, this would allow us to extract more performance from the same LUN, while reducing the number of LUNs that need to be managed. For this reason, sharing data stores between multiple VMs and VMDKs across multiple hosts can produce more optimal performance than alternatives. But it is important to make sure that each VM gets a fair share of the performance resources of the data store.

Figure 6.27 Multiple VMs on different ESXi hosts per data store.

TIP

To ensure that two VMs that are sharing the same data store do not reside on the same vSphere host, you can use vSphere DRS Rules to keep the VMs separated. This will reduce the chance of queue contention between the two SQL servers that might occur if they were on the same host. Having too many DRS Rules can impact the effectiveness of vSphere DRS and increase management complexity, so it’s use should be kept to a minimum. If you get your performance calculations slightly wrong and you discover one of the VMDKs is busier than you expected, you could easily migrate it to another data store using Storage vMotion. This can be done online and is nondisruptive to SQL. Some additional IO latency may be seen during the migration process.

Storage IO Control—Eliminating the Noisy Neighbor

One of the potential impacts of working in a shared storage environment is having one VM monopolize storage performance resources to the detriment of other VMs. We call this the Noisy Neighbor Effect. If one VM suddenly starts issuing a lot more IO than all the other VMs, it could potentially slow down other VMs on the same data store, or on the same array. To combat this problem, VMware introduced Storage IO Control (SIOC) in vSphere 4.1 and has made enhancements to it in vSphere 5.x.

Where there is more than one VM sharing a data store and SIOC is enabled, if the latency exceeds a threshold (default 30ms), vSphere will take action to reduce the latency. The way it reduces the latency is by dynamically modifying the device queue depth of each of the hosts sharing the data store. What it is doing is in effect trading off throughput for latency. The result is, individual VMs may see higher latency from storage but they each get their fair share of the storage performance resources.

SIOC should be activated only to deal with unexpected peaks of IO activity and should not be stepping in all the time. SIOC should be seen as more of a last resort. If you observe higher latency in your VMs and SIOC working constantly, this is an indication that your data store or storage platform can’t support the required IO workload. You may need to add more physical disks to your storage platform or reorganize some of the LUNs to reduce hot spots.

As shown in Figure 6.28, if one VM or one host begins to monopolize the available performance resources, the other VMs sharing the same data store or storage array suffer.

Figure 6.28 The Noisy Neighbor Effect.

In some cases, it’s not just that other VM’s performance suffers, but other more important VMs sharing the same data store don’t get the IO resources they are entitled to.

Figure 6.29 provides an example where three VMs share the same data store. One important VM and a less important VM share a vSphere host, while another less important VM is on another vSphere host. The relative importance is defined by the shares value, which uses a proportional share algorithm to carve up the performance resources. Because this doesn’t work across hosts, the less important VM on its own host has full access to the available queue depth and therefore is getting more than its fair share of IO performance resources.

Figure 6.29 Storage congestion without SIOC.

With Storage IO Control activated, the proportional share of resources and fairness are enforced across all hosts and for all VMs that share the same data store. In the example shown in Figure 6.30, SIOC takes action to reduce the queue depth that the less important VM has access to and to ensure that the most important VM gets its full entitlement to the available IO resources of the data store. Because Storage IO Control is only going to become active when there is congestion on the data stores, it is perfectly safe to use with array auto-tiering. SIOC will simply balance out the latency while the array moves blocks around if the operations cause any latency spikes.

Figure 6.30 Storage controlled with SIOC.

In vSphere 5.5, Storage IO Control uses an injector that periodically tests the performance capability of a given data store and can dynamically change the thresholds it uses to determine data store congestion. If you prefer to use the traditional method of a static latency threshold, you still can. The static latency threshold will be preferable if your storage array is using sub-LUN auto-tiering, where blocks of data may be migrated to different types of storage dynamically based on the performance profile requirements. If you used the injector method to determine congestion in conjunction with an auto-tiering array, there is a high probability the injector would get inaccurate data, because sometimes it would hit high-performance blocks and sometimes it would hit low-performance blocks.

We recommend you enable SIOC as a standard on all of your data stores when using traditional block-based storage arrays, regardless of whether or not they are hosting more than one VM. This will ensure if things change in the future you know that your VMs will always receive their fair share of the storage IO performance resources available. If you have an auto-tiering array, we would recommend using the traditional default values of 30ms for the static latency threshold and not using the injector with vSphere 5.5.

NOTE

The recommendations to use SIOC assume traditional block-based shared storage architecture is being used. Some modern storage systems don’t suffer from the problems that caused a need to have SIOC in the first place, and therefore there is no need to use SIOC on these systems. An example is the Nutanix Virtual Computing Platform, where data access is localized per host, although it provides a distributed shared storage environment. In this case, disk shares on each host ensure fairness of IO performance. The Nutanix platform doesn’t suffer from the problems that SIOC addresses, and therefore SIOC is not required.

Figure 6.31 shows the vSphere 5.5 Storage IO Control Settings dialog box. By setting SIOC to Manual, you effectively disable the injector, which is the preferred setting when using auto-tiering arrays, or storage platforms where the injector is likely to get inaccurate data.

Figure 6.31 vSphere 5.5 Storage IO Control settings.

vSphere Storage Policies and Storage DRS

With vSphere 5.x, we use Storage Policies and Storage DRS not only to reduce management overheads in a vSphere environment but also to improve performance. By using vSphere Storage Policies, you can take some of the guesswork out of provisioning your SQL Servers. By creating policies that align to the IOPS per TB and protection or availability requirements of your databases, it becomes very easy to provision new databases to the correct storage to achieve their requirements. You can manually assign storage capabilities to data stores and then create policies for those capabilities. Alternatively, you can use a storage vendor provider that leverages the vSphere API for Storage Awareness (VASA) to automatically provide visibility of the capabilities to vSphere. With VASA, when LUNs are provisioned at the physical array, the capabilities will flow through to vCenter. Storage Vendor Providers and storage capabilities are then visible when creating data stores. This allows vSphere Storage Administrators to easily include the correct storage into the correct data store, and this can later be used to create data store clusters.

Figure 6.32 shows a virtual data center where there are three possible storage policies that could be used based on the requirements of the SQL Server. You might choose Gold or Silver for different production or test database systems and you might choose Bronze for development databases. Your policies would be based on your particular requirements. To make it easy to architect for storage performance, the IOPS per TB should be known for each storage policy, and this should be communicated to all the key stakeholders until it is clear what they are getting when they provision VMs.

Figure 6.32 vSphere storage policies.

By pooling multiple (up to 32) similar data stores into data store clusters and using Storage DRS, you can ensure that initial placement of virtual disks to the best data store is automated, and this reduces the number of individual elements you need to actively manage. Storage DRS can be configured to load balance based on capacity, IO performance, or both, and can be set to simply make recommendations (manual) or be fully automated. If your array does not include automated storage block tiering, you can use Storage DRS to load balance data stores for IO performance, in addition to simply load balancing for capacity. When IO Load Balancing is enabled, Storage DRS works cooperatively with Storage IO Control and will collect IO metrics from the data stores and uses the IO injector to determine performance capabilities. The data is then analyzed periodically (by default, every 8 hours) to make IO load-balancing decisions. Importantly, the cost of any storage migrations is taken into consideration when making IO load-balancing decisions. Load balancing based on capacity or IO is achieved by performing Storage vMotion migrations between the source and destination data stores within a data store cluster.

The example shown in Figure 6.33 is of the standard storage DRS options, including the Storage DRS Automation Level, configured for Fully Automated, and the I/O metrics settings, which are disabled. You may wish to set Storage DRS to No Automation (Manual Mode) for a period of time during operational verification testing or if you are unfamiliar with Storage DRS and data store clusters, until you are familiar and comfortable with the recommendations it makes.

Figure 6.33 vSphere Storage DRS options.

CAUTION

Care should be taken when implementing Storage DRS on backend storage that is thin provisioned if it doesn’t include data de-duplication capabilities. Traditional thin provisioned backend storage capacity could become full if a storage migration takes place between one thin provisioned data store and another if the space is not reclaimed. Because the IO injector is used to determine performance capabilities when IO metric collection is enabled, it should not be used with auto-tiering arrays because the data it gathers will be inaccurate and your array is already managing the performance of each LUN. In the case of auto-tiering arrays, you should only use Storage DRS for initial placement and capacity-based load balancing.

The example in Figure 6.34 shows the Storage DRS Advanced Options expanded. Here, you can set whether to keep VMDKs together by default and other settings. These parameters will influence how much of an imbalance there needs to be before Storage DRS will consider taking action. The most relevant settings for SQL Server are “Keep VMDKs together by default” and the advanced option shown in this figure, “IgnoreAffinityRulesForMaintenance.”

Figure 6.34 vSphere Storage DRS advanced options.

The default option for Storage DRS will keep all VMDKs from a VM on the same data store. For a high-performance database, this is not what you would want. You will want to leverage the available data stores and queue depth to get the best performance while Storage IO Control sorts out any bumps in the road and ensures quality of service. Our recommendation for SQL Server environments is to have Keep VMDKs Together unchecked. This will cause Storage DRS to spread out the VMDKs among the available data stores. If you have large numbers of SQL Servers, it may be preferable to run them in a dedicated data store cluster, because this could limit the impact they have on other workloads, and vice versa.

If at a later stage you want to add data store performance as well as capacity, you can simply add more data stores to the data store cluster and they will be used for load-balancing operations per VMDK as well as during initial placement. Separating the VMDKs among the data stores will ensure quality of service and access to performance of all the databases added to the data store cluster while making administration and management significantly easier. We would recommend you leave the IgnoreAffinityRulesForMaintenance advanced setting at 0, unless you are willing to compromise your affinity rules and performance during data store maintenance operations.

In Figure 6.35, we have combined storage policies with multiple data store clusters. With the different virtual disks of each VM configured with a storage policy based on the required capabilities, the storage policy then maps to a particular data store cluster. Whenever a new VM is provisioned, its virtual disks will be provisioned in the correct data store cluster. The advantage of this method is that you can have the different VMDKs of a VM on a different class of storage—for example, where you want backup on a lower tier, or the OS on a lower tier, while the database files and transaction logs files are on a higher tier.

Figure 6.35 vSphere storage policies with data store clusters.

Having the flexibility of provisioning the VM to multiple storage clusters and different classes of storage sounds okay at a distance, but it also introduces additional management overheads. In storage platforms that already do automated block tiering, there is limited benefit to this approach. This approach is also difficult in Infrastructure as a Service (IaaS) environments or Cloud environments (including VMware vCloud Director or vCloud Automation Center), in which case a single VM may only be associated with a single storage profile, and automated tiering is used to manage the performance of the particular VM within the defined physical storage policy.

The sample diagram in Figure 6.36 shows multiple SQL Server VMs entirely within a single data store cluster, which would be backed by a single class of storage or single physical storage policy. Each VM’s individual VMDKs would be split among the data stores of the data store cluster. Storage Policies on each VM would dictate which data store cluster the SQL Server is assigned, but an individual VM is not split between multiple data store clusters, as was the case in Figure 6.35. This is the recommended approach in environments that support automated block tiering at the storage array.

Figure 6.36 Multiple SQL Servers—single vSphere storage cluster.

This design ensures simplified management and operations while ensuring the appropriate performance of the group of SQL Servers. It is also compatible with IaaS environments and use with Cloud environments, such as VMware vCloud Automation Center and VMware vCloud Director. You may still support multiple storage policies and service levels for the storage, each being a different data store cluster. But the VMs that map to those policies would be entirely contained within the relevant data store cluster.

vSphere Storage Multipathing

Each block storage device (FC, FCoE, iSCSI, and so on) on the VMware Hardware Compatibility List (HCL, http://www.vmware.com/go/hcl) leverages VMware Native Multipathing (NMP) and will have a Storage Array Type Plugin (SATP) and a default Path Selection Policy (PSP). The default SATP and PSP for your storage device will depend on the vendor, and in some cases it will use a VMware generic SATP, such as VMW_DEFAULT_AA. The PSPs that are part of the built-in VMware NMP are referred to as initiator-side load-balancing or path selection policies. This is because all path selection decisions are made from the host only.

There are three built-in PSPs to choose from: VMW_PSP_MRU, VMW_PSP_FIXED, and VMW_PSP_RR. To get the best performance out of your storage and provide the highest performance and lowest latency to SQL Server, we recommend you use the VMware Round Robin PSP (VMW_PSP_RR) where possible. Your storage vendor may have a particular best practice with regard to advanced options when using Round Robin that you should follow.

VMware has designed vSphere’s storage multipathing to be flexible and to allow storage vendors to write their own multipathing plugins. The advantage of many of the third-party vSphere multipathing plugins, such as EMC’s PowerPath/VE, is that they use target-side load balancing. This is where the load on the storage array’s paths, storage processors, and individual queue depths may be taken into consideration when choosing the best path for a particular IO operation. This can produce greatly improved performance and lower latency. Many vendors offer their own plugins, so you should check with your storage vendor to see if they have a plugin and what advantages it might have for your environment. Most of these plugins come at an additional cost, but in our experience it can usually be justified based on the additional performance.

The VMware vSphere Native Multipathing modules eliminate a lot of the problems and complications traditionally associated with in-guest multipathing drivers. To simplify your environment further, you could choose to put your VMDKs onto NFS data stores mounted to vSphere. When using NFS, your load balancing will most likely be done on the array, or by using the correct network teaming. NFS as a data store instead of VMFS is a great solution, provided it is designed and deployed correctly to meet the performance needs of your SQL Servers. The protocol itself will not be your limiting factor for performance, especially on 10GB Ethernet. Whichever storage option or protocol you choose, you just need to design it to meet your performance requirements and verify through testing that it does. There are many situations where NFS could be a valid option, and some of the benefits are covered in the section “SQL Server on Hyperconverged Infrastructure.”

vSphere 5.5 Failover Clustering Enhancements

In response to customer demands for increasing levels of database availability over and above the 99.9% easily obtainable with vSphere HA, VMware has provided a number of enhancements to the support of Windows Failover Clustering over the years. From vSphere 5.1, VMware supported five-node Windows Failover Clusters, where it previously supported only two nodes. In vSphere 5.5, VMware has again enhanced the Windows Failover Clustering support, and this is particularly relevant to high-performance SQL server databases that wish to make use of AlwaysOn Failover Cluster Instances.

Figure 6.37 shows the enhancements available when using AlwaysOn Failover Cluster Instances on vSphere 5.5.

Figure 6.37 vSphere 5.5 failover clustering enhancements.

The new failover clustering support in vSphere 5.5 means you can use the Round Robin multipathing policy to load-balance multiple active storage paths, Windows Server 2012 clusters are fully supported (up to five nodes when using RDMs), and FCoE and iSCSI protocols in addition to FC are supported for the RDMs.

RAID Penalties and Economics

Most storage arrays in use today use RAID (Redundant Array of Independent Disks) as a way to protect data from physical disk failures. Even though many newer storage arrays are starting to use different techniques for data protection, it’s still important to understand RAID. Using the right RAID levels can have a big impact on performance and also on cost of your SQL environments and virtual infrastructures. This section more than any other will clearly demonstrate how designing for performance will take care of capacity, at least where using RAID is involved, especially as you reach for even higher performance from your SQL databases. Before we discuss RAID penalties, we will cover some IO characteristics that have a direct performance impact when used with RAID.

Randomness of IO Pattern

The randomness of IO is a very important consideration in storage design and has a direct impact on IO latency and throughput when using spinning disks. Most virtualization environments will generate a completely random IO pattern, even with sequential IO from individual VMs, as we covered previously in “The IO Blender Effect.” This is because the underlying VMFS data stores are shared between multiple VMs in most cases. With SQL Server, you will have cases where VMs should still share some common VMFS data stores, as we have covered, in order to get maximum performance utilization from your arrays.

The reason that random IOs have such an impact is because the disk heads have to move between different sectors and the disk has to spin around to the correct location for a block to be read from or written to. For this reason, the average seek time and rotational speed of the disks are very important. On average, the disk heads will need to wait for 50% of the disk to spin past it prior to performing a read or write operation. Each operation is then multiplied by the RAID penalties of that operation.

The impact of randomness on reads can be worse than the randomness for writes. In most storage systems, writes will be cached (backed by battery or some other persistent form), ordered, and then written to physical disk in a way that reduces the overall impact. For reads, however, the chances of getting a cache hit in your array when the randomness increases are very low; therefore, most reads may have to come from spinning disks. The alternative would be to assign very large read cache on the array, but that is not efficient or cost effective in most cases, and still may not result in significant cache hits. The end result is that many more disks may be needed to get the best read latency and throughput for your database.

Fortunately, SQL is very good at caching, and this is why the buffer pool in a SQL Database is so big. This is also the reason there is a direct tradeoff between assigning SQL Server RAM and using it in the buffer pool and read IO from disk. This becomes especially important when things fail, such as disks in your RAID groups, which causes additional delays and additional latency.

Read/Write Bias

Just because your applications drive SQL to generate a read-biased workload doesn’t mean the underlying storage system will see a read-biased IO pattern. The reason for this is the SQL buffer pool is likely to mask a lot of read IO if you have sized your VM correctly. This will mean your IO patterns may be very write biased. Writes will be going to your data files, Temp DB files, and your transaction log all at the same time. You will need to make sure you have sufficient array write cache so you don’t get into a position of a force flush and a subsequent instance of the cache going write through, which will significantly degrade performance. You must have sufficient numbers of disks in the array to handle the cache flushes easily.

CAUTION

Be very careful when using 7.2K RPM SATA or NL-SAS disks on a traditional RAID array, even with automated storage tiering. Overloading a SATA or NL-SAS LUN can cause forced flush and significant periods of array cache write through (instead of the friendly cache write back), to the point where the storage processors may appear to freeze. Also, you may find LUNs being trespassed on active/passive arrays, or just lots of path flip flops on active/active arrays. With modern storage systems, including SSDs to host the active working set data and acting as a second cache area, the chances of forced flushes may be reduced. But you will need to ensure that your active working set doesn’t increase to the point where it overflows the caches and SSD and causes writes directly to slow tiers.

Plan Your Performance for Failure

Your storage system at some point will experience a failure. You need to ensure that your critical SQL systems will perform at the minimum acceptable level during these failure operations. During a disk failure in certain RAID configurations, you will have significantly slower performance for both read and write operations; this is due to parity calculations and the performance required for rebuilding data on replacement disks. Disk rebuild can take a significant amount of time, and during rebuild situations you may have a risk of multiple disk failure. The bigger and slower the disk, the longer the rebuild.

RAID Penalties

Random IO patterns, read/write bias, and failure events have a big impact on performance due to the overheads and penalties for read and write operations associated with using RAID. This is especially so with spinning disks. Storage array vendors have come up with many ways to try and work around some of the limitations with RAID, including the smart use of read and write caches. In your storage design, though, we recommend you plan your performance based on the physical characteristics of the underlying disks and plan for the rest to be a bonus. Table 6.10 displays the IO penalties during normal operations for each of the common RAID schemes.

Table 6.10 RAID IO Penalties During Normal Operations

RAID Level	RAID Write Penalty	Read IOPS (15K RPM)	Write IOPS (15K RPM)	Read IOPS (7.2K RPM)	Write IOPS (7.2K RPM)
RAID 0	1	175–210	175–210	75–100	75–100
RAID 1	2	175–210	88–105	75–100	38–50
RAID 5	4	175–210	44–52	75–100	18–25
RAID 6	6	175–210	30–35	75–100	12–16
RAID 1+0 (10)	2	175–210	88–105	75–100	38–50
RAID DP	2	175–210	88–105	75–100	38–50

IOPS listed in Table 6.10 are per disk. RAID 0 is included for illustrative purposes only and is not recommended, as it is simple disk striping with no data protection.

As you can see from Table 6.10, if you have a very write-biased workload, you could get very low effective IOPS from your RAID disks. This is the primary reason why arrays have write cache—and in some cases, lots of it. This allows the array to offset much of the penalty associated with writes to RAID groups of disks. But the arrays assume there will be some quiet time in order to flush the cache; otherwise, there will be an impact to performance. The calculation for write IOPS is as follows:

Write IOPS = Disk IOPS / RAID Write Penalty

However, this only works when things are going well. If you fill your cache by having too much write IO on slow spindles, or just from general overloading, your array will stop caching writes and bypass the cache altogether (go write through). In this case, you’ll get at best the raw performance of the RAID groups. This problem can be made worse when there is a disk failure and a group of RAID disks needs to be rebuilt. Depending on the type of disks, this can take many hours and severely impact performance during the rebuild operation.

Let’s take the RAID penalties a bit further and look at an example where we are sizing for performance. In this example, we will look at the requirements of a SQL data store that needs to be able to deliver 5,000 IOPS. We will assume that the workload is 70% read and 30% write, which is typical for some OLTP systems.

First, we need to calculate the effective number of IOPS required. This takes the 5,000 IOPS of a 70/30 read/write workload and adjusts for the RAID penalty as follows:

Required Array IOPS =
(Required IOPS * Read %) + RAID Write Penalty * (Required IOPS * Write %)

Example RAID 5 Required IOPS = (5000 * 70%) + 4 * (5000 * 30%) = 9500

You can see from the example in Table 6.11 that to achieve 5,000 IOPS for a 70% read-biased SQL workload, we need 9,500 IOPS at RAID 5 from the array. Now that we know the required array IOPS, we can calculate the number of disks required to achieve this performance at each of the RAID levels. To do this, we divide the number of IOPS by the number of IOPS per disk. RAID penalties have already been taken into consideration due to the previous calculations.

Table 6.11 Array IOPS Required at Different RAID Levels to Achieve 5,000 SQL IOPS

RAID Level	RAID WritePenalty	Array IOPS (70% Read)	Array IOPS (50% Read)	Array IOPS (30% Read)
RAID 0	1	5,000	5,000	5,000
RAID 1	2	6,500	7,500	8,500
RAID 5	4	9,500	12,500	15,500
RAID 6	6	12,500	17,500	22,500
RAID 1+0 (10)	2	6,500	7,500	8,500
RAID DP	2	6,500	7,500	8,500

To calculate the number of disks required to meet the required IOPS of a workload, we use the following formula:

Required Disks for Required RAID IOPS = Required Array IOPS / IOPS per Disk

Example RAID 5 Disks = 9500 Array IOPS / 210 IOPS per 15K Disk = 45 Disks

As Table 6.12 demonstrates, to achieve 5,000 SQL IOPS 70% read at RAID 5 on 15K RPM disks requires 45 disks, whereas it only requires 31 disks at RAID 1, RAID 10, or RAID DP—a saving of 14 disks. If the workload is only 30% read, then we would require 74 15K RPM disks at RAID 5 and only 40 15K RPM disks at RAID 1, RAID 10, or RAID DP. This would be a saving of 34 disks to achieve the same performance. This assumes each disk can achieve the high end of the IOPS for that device. The less number of IOPS per disk, the more disks in total will be needed. In this example, we’ve used the high-end IOPS of each disk for the calculations. Be sure to check with your storage vendor on their recommendations for IOPS per disk when doing any calculations.

Table 6.12 Min Disks Required at Different RAID Levels to Achieve 5,000 SQL IOPS

RAID Level	15K RPM Disk 70% Read	15K RPM Disk 30% Read	7.2K RPM Disk 70% Read	7.2K RPM Disk 30% Read
RAID 0	24	24	50	50
RAID 1	31	40	65	85
RAID 5	45	74	95	155
RAID 6	60	107	125	225
RAID 1+0 (10)	31	40	65	85
RAID DP	31	40	65	85

To achieve 5,000 IOPS at RAID 6 70% read on 7.2K RPM disks, we’d need 125 disks in total. At RAID 10 on 7.2K RPM disks, the required disks falls to 65, a saving of 60 disks. The difference is even more pronounced when the workload is only 30% read. At RAID 6, we would require 225 disks, whereas at RAID 10, we would only require 85 disks—a saving of a whopping 140 disks.

Those of you who know RAID will be thinking at this point that some of the numbers in Table 6.12 are wrong, and you’d be right. How do you get 31 disks in RAID 1 or 10, or 225 disks in RAID 6? The answer is, you don’t. The numbers in Table 6.12 have not been adjusted for the minimum required for a complete RAID group, or the likely size of each RAID group that would be required to make up an entire aggregate or volume to be created from. You would need to increase the numbers of disks to be able to build complete RAID groups. For example, in RAID 5, it’s common to build RAID groups consisting of 7 data disks +1 parity disk (8 total), and in RAID 6, it is common to build 8+2 or 10+2 RAID groups. RAID5 7+1 or RAID6 10+2 may be terms you’ve heard before when talking to storage administrators.

Now that we’ve adjusted the figures in Table 6.13 for the RAID groups, you can see that RAID 1 and 10 are even more efficient than RAID 5 and 6 in terms of the number of disks to achieve the same performance. This is important to understand because it also has a direct impact on the amount of capacity that will be provisioned to reach the desired performance level.

Table 6.13 Min Disks per RAID Group Adjusted to Achieve 5,000 SQL IOPS

RAID Level	15K RPM Disk 70% Read	15K RPM Disk 30% Read	7.2K RPM Disk 70% Read	7.2K RPM Disk 30% Read
RAID 0	24	24	50	50
RAID 1	32 (1+1)	40 (1+1)	66 (1+1)	86 (1+1)
RAID 5	48 (7+1)	80 (7+1)	96 (7+1)	160 (7+1)
RAID 6	60 (8+2)	110 (8+2)	130 (8+2)	230 (8+2)
RAID 1+0 (10)	32 (1+1)	40 (1+1)	66 (1+1)	86 (1+1)
RAID DP	36 (10+2)	48 (10+2)	72 (10+2)	96 (10+2)

For this part of the example, we’ll imagine that our SQL database that needs 5,000 IOPS will be 2TB in size. There will be an additional 200GB for transaction logs, 200GB for Temp DB, and another 100GB for the OS, page file, and so on. In totally, the capacity required is approximately 2.5TB.

From Table 6.14, you can see the usable capacity after taking into consideration the redundant or parity disks of the various RAID types needed to achieve 5,000 IOPS based on the previous examples. The 2.5TB usable capacity requirement for our sample SQL Server can easily be met by any of the selected RAID levels based on the number of disks required to achieve 5,000 IOPS. In fact, all of the RAID levels provide a lot more capacity than is actually required—some in the extreme.

Table 6.14 Usable Capacity Deployed to Achieve 5,000 SQL IOPS

RAID Level	300GB 15K RPM Disk 70% Read	300GB 15K RPM Disk 30% Read	1TB 7.2K RPM Disk 70% Read	1TB 7.2K RPM Disk 30% Read
RAID 0	7.2TB	7.2TB	50TB	50TB
RAID 1	4.8TB	6TB	33TB	43TB
RAID 5	12.6TB	21TB	84TB	140TB
RAID 6	14.4TB	26.4TB	104TB	184TB
RAID 1+0 (10)	4.8TB	6TB	33TB	43TB
RAID DP	9TB	12TB	60TB	80TB

Table 6.14 shows that a large amount of the deployed usable capacity is actually unusable from a performance perspective. Or to put it another way, you have way too much capacity at the end of your performance. This clearly demonstrates Principle 3 of sizing for performance, and in doing so, capacity will usually take care of itself.

Now that we have calculated the usable capacity that needs to be provisioned to achieve the 5,000 SQL IOPS, we can calculate the IOPS per TB. As mentioned, previously using IOPS per TB is a good way to communicate with applications teams how much performance they should expect for each TB of data based on the different available storage policies available. For this example, we will take a conservative approach so that the application teams are planning on a worst-case scenario and their performance surprises will be positive. You’ll remember the quote from Principle 3: “The bitterness of poor performance lasts long after the sweetness of a cheap price is forgotten.”

To illustrate this, we will define three tiers of storage or storage policies:

Gold—RAID10 300GB 15K RPM disks
Silver—RAID5 7+1 300GB 15K RPM disks
Wood—RAID6 8+2 1TB 7.2K RPM disks

We will base the IOPS per TB calculation on a 30% read-biased IO pattern. This will mean our DBAs and applications teams will likely get better performance than the defined service level.

Based on our example in Table 6.15, we could set an SLA for Gold at 800 IOPS per TB, Silver at 200 IOPS per TB, and Wood at 20 IOPS per TB. We have rounded down to take a conservative approach and ensure the SLA can always be met.

Table 6.15 IOPS per TB Based on Example 30% Read Workload at 5000 IOPS

Storage Policy	Disks Required for 5000 IOPS	Usable Capacity	IOPS per TB	Recommended SLA IOPS/TB
Gold (RAID10)	40	6TB	833 (5000 / 6)	800
Silver (RAID5 7+1)	80	21TB	238 (5000 / 21)	200
Wood (RAID6 8+2)	230	184TB	27 (5000 /184)	20

The Economics of RAID Performance

You have seen how performance requirements can drive storage design, and how many spinning disks are required when using different RAID levels to meet performance requirements. In our example, we used a SQL Server requiring 5,000 IOPS and 2.5TB capacity. Now we will look at the economics of different RAID choices and using solid state disks (SSDs) or enterprise flash disks (EFDs) instead of spinning disks.

From Table 6.15, in order to meet a 30% read 5,000 IOPS requirement and a 2.5TB capacity requirement, the Gold storage policy is the most cost effective. It would use half the number of disks to deliver the performance required, and more than covers the capacity requirements. It would be half the cost of the Silver storage policy for this workload. Now let’s take a look at how this might change if EFDs were used instead of spinning disks.

Table 6.16 shows the effective read and write IOPS after accounting for RAID penalties associated with using EFD disks with an assumed 5,000 IOPS per disk.

Table 6.16 RAID IO Penalties During Normal Operations of Enterprise Flash Disk

RAID Level	RAID Write Penalty	EFD Read IOPS	EFD Write IOPS
RAID 10	2	5,000	2,500
RAID 5	4	5,000	1250
RAID 6	6	5,000	833

Table 6.16 assumes a performance level of 5,000 IOPS for a single EFD disk. Depending on the type of EFD or SSD, these numbers could be very different. You should check with your particular vendor for their latest numbers. Also, it’s quite common for the read and write performance to be different even without the RAID penalties.

Table 6.17 shows the number of EFD disks required at different RAID levels to meet the IOPS as well as the capacity requirements of our sample SQL database workload.

Table 6.17 EFDs at Different RAID Levels Required for Example SQL DB

RAID Level	RAID Write Penalty	Array IOPS (30% Read)	Required # of EFD for IOPS	# of 400GB EFD for 2.5TB Capacity
RAID 10	2	8500	2	14
RAID 5	4	15500	4	8
RAID 6	6	22500	5	9

Table 6.17 illustrates the number of EFDs required to meet both the performance and capacity requirements of our sample SQL DB. In this example, the RAID 5 option is the most cost effective from a performance and capacity perspective.

Comparing the number of 400GB EFDs required to meet the SQL requirements against the most cost effective options for spinning disks (Gold Policy RAID 10), we can see that we need five times less EFDs. For this workload, the eight EFDs may be the best option if their combined cost is less than the 40 spinning disks. In many cases, the EFDs will be less cost, especially when the reduced space, power consumption, and cooling of EFDs is considered.

Let’s add a Platinum storage policy in addition to the previous defined policies and calculate the effective IOPS per TB based on our 400GB EFD example.

With the new Platinum storage policy in Table 6.18, we can easily meet the performance requirement of 5000 IOPS, but we need additional disks to meet the capacity requirement. Table 6.15 shows us that we need eight EFDs at 400GB in order to achieve the required 2.5TB. Based on provisioning 2.8TB of usable capacity, we can calculate that our achievable IOPS from that capacity at a conservative 4000 IOPS per TB at RAID5 with write penalty of 4 is 11,200 IOPS. At this point, it’s likely that we’d run out of capacity well before running out of performance.

Table 6.18 IOPS per TB Based on Example 30% Read 5,000 IOPS and 2.5TB Capacity

Storage Policy	Disks Required for 5000 IOPS	Usable Capacity	IOPS per TB	Recommended SLA IOPS / TB
Platinum (400GB EFD RAID5)	4	1.2TB	4,166 (5000 / 1.2)	4000
Gold (300GB 15K RPM RAID10)	40	6TB	833 (5000 / 6)	800
Silver (300GB 15K RPM RAID5 7+1)	80	21TB	238 (5000 / 21)	200
Wood (1TB 7.2K RPM RAID6 8+2)	230	184TB	27 (5000 /184)	20

NOTE

There are many new storage platforms that include only flash as part of their architecture, meaning the entire array may become your primary tier. Some of these platforms claim to offer economics similar to spinning disks, by using advanced compression and data de-duplication techniques. These platforms are normally aimed at the highest performance workloads, such as critical SQL databases. These types of storage platforms are unsurprisingly known as “All Flash Arrays,” and come from the likes of EMC, NetApp, HP, PureStorage, Violin Memory, and others.

At this point, you might consider doubling the size of each EFD to 800GB. This would halve the number of disks required to meet the capacity requirements. Assuming that each individual 800GB EFD has the same IOPS performance as the 400GB versions, you could achieve a better balance of performance and capacity. The larger EFDs would have half the IOPS per TB—in this case, to around 2,000. Five EFDs would be required to reach the required capacity. This would mean 3.2TB of usable capacity is deployed. The achievable IOPS from the deployed usable capacity would drop to 6,400. This is still a more performance than required. Also, although we are only using 5 × 800GB EFDs instead of 8 × 400GB EFDs, because they are double the capacity, they are also likely to be double or more the cost.

An EFD might be marketed at 400GB or 800GB in size, but to protect against wear of the NAND flash cells, the disk will usually have more physical capacity. This is to provide more endurance and a longer service life. This may vary between different vendors and individual SSDs, and we recommend you check with your storage vendor.

TIP

EFDs and SSDs are dollars per GB but cents per IOP, whereas spinning disks are cents per GB and dollars per IOP. In order to achieve the best balance, you need some of each. This is why many types of storage array include automatic storage tiering. Automatic storage tiering is most effective when done at the block level because individual blocks can be moved between the EFD and spinning disk storage as performance and capacity needs change. Where available, we recommend you use automatic storage tiering and seek advice from your storage vendor to ensure effective implementation and operations.

To make calculating performance and capacity based on different types of disk, numbers of disks, and RAID types easy, see the calculator at http://www.wmarow.com/strcalc/.

NOTE

There are many new types of enterprise storage systems and converged architectures on the market today that have moved away from using RAID as the main means of data protection and instead have their own methods. Often these alternative methods can achieve the same reliability and data protection levels as RAID, but without all of the complication and performance penalties. If you are using a system that doesn’t rely on RAID for data protection, you can safely ignore this section. You should seek advice from your vendor with regard to sizing for capacity and performance based on their data protection methods and overheads.

SQL Performance with Server-Side Flash Acceleration

There is one storage technology that is currently sweeping the IT industry and revolutionizing performance, and that is NAND flash, in the form of SSDs, EFDs, and PCIe devices. When it comes to SQL performance, we think the lyrics of the Queen song “Flash Gordon” are very appropriate (see Figure 6.38). I wonder if they could see the future of enterprise and web-scale data centers when they wrote that song? Either way, as the previous section illustrated with the discussion around SSD and EFD in your storage array (including All Flash Arrays), it liberates performance for SQL from the tyranny of slow spinning disks that may no longer be economic.

Figure 6.38 Flash acceleration and lyrics from the classic Queen song “Flash Gordon.”

But flash in an array has some limitations, and there is another location where we can use flash SSDs, EFDs, and PCIe that can greatly improve SQL performance, directly in the VMware ESXi servers hosting SQL. This is where server-side flash and associated acceleration solutions come in. Server-side flash when used as part of an IO acceleration solution can be thought of as cheap memory, rather than expensive disk. It is definitely cents per IOP and dollars per GB, but the returns on investment and performance can be substantial. Especially when it is not possible to add more RAM to the buffer cache, which would be the fastest possible storage from a performance perspective.

By using server-side flash acceleration, you can normally consolidate more SQL VMs per ESXi host, with less memory directly assigned to each SQL VM, and without sacrificing performance and user response times. Read or write IOs are offloaded to the local server flash device, and this acts as a very large cache. It can also greatly reduce the load on the back-end storage, which allows the array to improve its efficiency.

Because the flash devices are local to the server, the latencies can be microseconds (us) instead of milliseconds (ms) and eliminate some traffic that would normally have gone over the storage network. By reducing the storage IO latencies, not only are user response times improved, but overall server utilization is improved. You may see increased CPU utilization, as you are able to get more useful work done by reducing system bottlenecks.

In this section, we cover three different server-side flash acceleration solutions that are supported with VMware vSphere and can greatly improve the performance of your SQL databases. The solutions we cover are VMware vSphere Flash Read Cache (vFRC), which is included with vSphere 5.5, Fusion-io ioTurbine (IOT), and PernixData Flash Virtualization Platform (FVP). The first two solutions act as a read cache only, as all writes go directly to the backend storage while being cached and are therefore write through. PernixData FVP, on the other hand, offers a full write back cache, where both read IO and write IO can be accelerated.

VMware vSphere Flash Read Cache (vFRC)

vSphere 5.5 introduces vSphere Flash Read Cache, or vFRC, which is an infrastructure layer that aggregates flash devices into a unified flash resource pool. vFRC supports locally connected flash devices such as SAS/SATA SSDs and PCIe. The flash resource can be used to cache read IOs and is configured on a per-VMDK basis. The vFRC write policy is write through, which means that all writes go to persistent storage and are cached in vFRC simultaneously. To prevent pollution of the cache, large sequential writes are filtered out. Each VMDK flash resource allocation can be tuned based on the workload. For SQL, it’s recommended that data file VMDKs and Temp DB VMDKs be configured for vFRC when used, whereas transaction log will usually have little benefit.

Figure 6.39 shows a high-level overview of the VMware vSphere Flash Read Cache architecture.

Figure 6.39 vFRC architecture overview.

The types of SQL workloads that will benefit from vFRC are read-dominated OLTP-type systems and read-dominated data warehouse queries. The ideal workload has a high repeated access of data—for example, 20% active working set that is referred to 80% of the time.

The major determinants of performance are the cache size, the cache block size, and the type of flash device used (SSD vs. PCIe). In terms of cache sizing, it is important to ensure that the cache is big enough to cover the active working set without being too big that you’re wasting the valuable flash resource. The cache block size should be equal to the dominant IO size of the VMDK; for SQL, this will be predominantly between 8KB and 64KB. If you are unsure of the main IO size for your database, you can use vscsiStats for a period of time to record the IO profile. To learn more about vscsiStats, see http://cormachogan.com/2013/07/10/getting-started-with-vscsistats/.

The type of flash device used will have an impact on the overall IOPS and latencies you can achieve. Although SATA and SAS SSDs are cheaper, they do not offer the same performance as PCIe. The right device for your environment will depend on your workload, performance, and budgetary requirements.

Having a cache block size that is too big can cause fragmentation in the cache and poor utilization. This may cause a substantial portion of the cache resource to be unutilized and therefore wasted. Figure 6.40 illustrates the impact of vFRC block fragmentation.

Figure 6.40 vFRC block fragmentation.

In Figure 6.40, the vFRC block is set to a much larger size than the predominant IO size—in this case, 128KB or 512KB versus the actual IO size of 8KB. As a result, a large proportion of the blocks configured is wasted.

The cache size and block size are manually set when you enable vFRC on a VM, and they can be changed at runtime without disruption. Having the cache too small will cause increased cache misses, and having it too big is not just wasteful, it will impact your vMotion times. By default, when vFRC is configured, the cache of a VM will be migrated when the VM is vMotioned. If it’s set too big, this will increase the vMotion times and network bandwidth requirements. You can, if desired, select the cache to be dropped during a vMotion, but this will have an impact on SQL performance when the VM reaches its destination while the cache is being populated again.

Fusion-io ioTurbine

ioTurbine is caching software from Fusion-io that leverages the Fusion-io ioMemory range of high-performance flash devices, such as the SLC- and MLC-based ioDrive and ioScale PCIe cards. ioTurbine creates a dynamic shared flash pool on each ESXi server that can be divided up between cache-enabled VMs based on proportional share algorithm. By default, each VM is assigned the same shares and thus get an equal proportion of the available flash cache resource pool.

Like VMware’s vFRC, ioTurbine is a read cache, and all writes are sent through to persistent storage while simultaneously being cached. Unlike vFRC, there are no manual parameters to set on a per-VM basis to size the cache or the blocks that are cached. This automatic and dynamic sizing of the flash cache of each VM is useful where you have lots of VMs that can benefit from caching or where you have flash devices of different sizes on different hosts. It reduces the management overhead.

Figure 6.41 displays a high-level overview of the ioTurbine architecture, including Fusion-io’s Virtual Storage Layer (VSL) driver. As of ioTurbine 2.1.3, which supports vSphere 5.5, the VSL SCSI driver is used by default instead of the VSL block driver. This can provide improved performance and better resiliency.

Figure 6.41 ioTurbine architecture overview.

In addition to being able to cache a VM, ioTurbine is capable of caching disks, files, and entire volumes. With the optional in-guest agent, the caching becomes data and application aware. This means particular files within the OS can be cached while others are filtered out. This is very useful for SQL where we only want the data files and Temp DB files cached while the transaction logs are not cached.

ioTurbine is fully compatible with VMware features such as DRS, HA, and vMotion. ioTurbine also works in environments where not all ESXi hosts contain a flash device, in which case the flash cache of a server would be set to 0.

In the example in Figure 6.42, if one of the VMs in the left ESXi host is migrated to the right ESXi host, all VMs will be allocated one third of the flash cache capacity of each host because there will be three cached VMs on each host.

Figure 6.42 ioTurbine dynamic and automatic allocation of flash cache.

Table 6.19 was obtained from Fusion-io performance test results published at http://www.fusionio.com/blog/performance-of-a-virtualized-ms-sql-server-poor-ioturbine-to-the-rescue. The results demonstrated that by offloading reads to the ioTurbine flash cache, write performance also increased by just over 20%. This test was based on TPC-E workload. This demonstrates that read caching can also improve write performance to a certain extent.

Table 6.19 ioTurbine SQL Server Performance Example (TPC-E)

	ioTurbine Off	ioTurbine On	Improvement
Avg. Duration (us)	146,861	29,800	400%
Avg. CPU Time Consumed	22	22	None
Total Reads	95,337,525	127,605,137	34%
Total Writes	34,901	43018	23%

PernixData FVP

PernixData FVP is different from the other two solutions already discussed in that it aggregates server-side flash devices across an entire enterprise to create a scale-out data tier for the acceleration of primary storage. PernixData FVP optimizes both reads and writes at the host level, reducing application latency from milliseconds to microseconds. The write cache policy in this case can be write back, not just write through. When the write back cache policy is used, the writes are replicated simultaneously to an alternate host to ensure persistence and redundancy in the case of a flash device or host failure.

Application performance improvements are achieved completely independent of storage capacity. This gives virtual administrators greater control over how they manage application performance. Performance acceleration is possible in a seamless manner without requiring any changes to applications, workflows, or storage infrastructure.

Figure 6.43 shows a high-level overview of the PernixData Flash Virtualization Platform architecture.

Figure 6.43 PernixData FVP architecture overview.

The flash devices in each ESXi host are virtualized by FVP, abstracted and pooled across the entire flash cluster. As a result, you can have flash devices of differing types and sizes in different hosts. Ideally though, you will have a homogenous configuration to produce more uniform performance acceleration. Hosts that don’t have local flash devices can still participate in the flash cluster and benefit from read IO acceleration. This is termed a “non-uniform configuration,” when some hosts have local flash devices and some don’t.

In the case of a non-uniform flash cluster configuration, when a VM on a host without a flash device issues a read operation of data already present in the flash cluster, FVP will fetch the data from the previous source host and send it to the virtual machine. Because there is no local flash resource present, it cannot store it locally; however, FVP will continue to fetch data from the flash cluster to keep the latency to a minimum while reducing the overall stress and load on the storage array.

With PernixData FVP, it may be possible to delay the need for costly forklift upgrades of existing primary storage investments that have reached the end of their performance, well before the end of their capacity. As we’ve seen with our RAID calculations, this can be common for high-performance workloads. FVP can provide much more efficient use of the deployed capacity and may allow the breathing space required for you to determine the best next steps for your future storage and virtualization strategies.

The examples in Figures 6.44 and 6.45 show a SQL 2012 database driving around 7,000 IOPS consistently and the resulting latency both at the data store and at the VM level. The total effective latency is what the virtual machine sees, even though the data store itself is experiencing drastically higher latency. In this case, in spite the latency of the data store being upwards of 25ms, the SQL VM response times are less than 1ms.

Figure 6.44 PernixData FVP acceleration for SQL Server 2012 IOPS.

Figure 6.45 PernixData FVP acceleration for SQL Server 2012 latency.

When FVP cannot flush the uncommitted data to primary persistent storage fast enough—that is, when more hot data is coming in than there is flash space available—FVP will actively control the flow of the new data. This means that FVP will artificially increase the latency, ultimately controlling the rate at which the application can send, until the flash cluster has sufficient capacity and returns to normal. FVP does not transition to write through, even when it is under heavy load. Applications normally spike and are not continuously hammering the data path 100% all time, so FVP flow control helps smooth out the “spikey” times, while providing the most optimized performance possible.

CAUTION

Migrating a VM in an FVP flash cluster, in certain network failure scenarios, or when the local or replica flash device fails, FVP will automatically change the write back policy to write through. This ensures data protection, while degrading write performance. However, reads may still be accelerated by requests being serviced from the remainder of the flash cluster. When the issue is resolved the policy will be automatically returned to write back. For more information, see the “Fault Tolerant Write Acceleration” white paper on http://pernixdata.com and http://frankdenneman.nl/2013/11/05/fault-tolerant-write-acceleration/. This is a standard part of the FVP Fault Tolerant Write Acceleration Framework.

SQL Server on Hyperconverged Infrastructure

If there is one technology trend that is revolutionizing the enterprise data center more than just flash alone, it is hyperconvergence. This is where storage and compute (CPU and RAM) are provided in a single package and connected by standard Ethernet networks. By far the leader in this sector of the market is Nutanix, with its Virtual Computing Platform. This section covers key aspects of SQL Server performance and architecture of the Nutanix Virtual Computing Platform.

The Nutanix Virtual Computing Platform is built for virtualization and cloud environments, with the idea of brining the benefits and economics of web-scale architectures from companies such as Google, Facebook, and Amazon to the masses. The Nutanix solution includes storage and server compute (CPU and Memory) in a platform building block. Each building block is 2 RU and based on standard x86 server technology. The platform architecture is designed to deliver a unified, scale-out, shared-nothing cluster with no single point of failure (SPOF). Hyperconverged platforms don’t require SAN or NAS storage, or fibre channel networks, but can sit along side existing environments.

A general aspect of hyperconverged platforms and Nutanix in particular is a reduction in the number of components that need to be managed and a reduction in the overall solution complexity. The reduction in complexity and increased simplicity translates into ease of deployment and operations, such as when dynamically increasing a cluster’s size, and ease of designing and architecting successful solutions, even for business-critical applications such as SQL Server.

For designing a SQL Server environment, a Nutanix platform is arguably simpler because there are no LUNs, no RAID, no FC switches, no zoning, no masking, no registered state change notifications (RSCN), and no storage multipathing required. All management is VM and VMDK centric. An advantage of being VM and VMDK centric is that storage IO from a VMDK is seen as what it is: sequential is sequential and random is random. This allows the platform to optimize for that IO pattern without the impact of the IO Blender Effect.

This doesn’t mean you have to throw away the assets you’ve already got and that still have a book value. You can use a hyperconverged platform to offload some capacity and performance from your existing systems. This can improve your overall performance and reduce management complexity.

With Nutanix, you have one pool of storage across a distributed file system cluster called the Nutanix Distributed File System (NDFS), which includes SSDs for high performance and low latency and HDDs for cheap capacity. The different types of storage devices in the storage pool are automatically tiered using an intelligent information life cycle management (ILM) engine to ensure the most frequently used data is available in memory or in flash cache. This assumes you have sufficient capacity in your high-performance tier for the most active working set of your VMs. If you are deploying SQL Server on Nutanix, the sections of this chapter you need to follow closely are “SQL Server Database and Guest OS Design” and “Virtual Machine Storage Design,” in addition to “The Five Key Principles of Database Storage Design,” which appears at the start of this chapter.

Nutanix has a small number of model options available to try and make it easy to choose the right one and to make it easy to support. Depending on the model of platform selected, a single 2U building block can include up to four nodes, combining up to 80 CPU cores (two sockets, 10 cores each per node), 2TB RAM (512GB per node), and 8TB of high-performance storage. These building blocks can be scaled out without any artificial limits and provide linear performance as you add more VMs. If more capacity is required per node, a different building block type with up to 16TB–20TB per 2RU can be mixed and matched into a single NDFS cluster to balance both compute capacity and storage capacity and performance. Typical performance from a 2RU building block is up to a combined 100K 4KB Random Read IOs, up to 50K 4KB Random Write IOs, 1.4GBps sequential write throughput, and 3GBps sequential read throughput across four NDFS nodes. These numbers were produced using the built-in Nutanix Diagnostics Tool; actual application performance with mixed workloads will vary. You should benchmark your particular applications and seek advice from Nutanix on your particular virtualization scenarios. It should be noted that SQL Database predominant IO size will be 64KB or above if you have followed the guidance so far in this chapter.

Figure 6.46 shows an overview of the Nutanix Virtual Computing Platform Architecture, including each hypervisor host (VMware ESXi), SQL VMs (User VMs), Storage Controller VM (Controller VM), and its local disks. Each Controller VM is directly connected to the local storage controller and the connected disks using VMware DirectPath/IO. By using local storage controllers on each ESXi host access to the NDFS file system, the data access path is localized and doesn’t always require transport over the network, thereby reducing network traffic and potentially improving performance, predominantly for read operations. NDFS ensures that writes are replicated and distributes data within the cluster for data protection. The local storage controller on each host ensures that storage performance as well as storage capacity increase when additional nodes are added to a Nutanix NDFS cluster.

Figure 6.46 Nutanix Virtual Computing Platform Architecture overview.

Figure 6.47 shows an overview of a single Nutanix NDFS cluster combining many different workloads, including SQL Server VMs, into different VMware vSphere clusters.

Figure 6.47 SQL Server on the Nutanix Virtual Computing Platform.

Although the storage is local to each node, NDFS makes it appear to the hypervisor as shared storage and therefore integrates with VMware DRS, HA, and fault tolerance. The combination of SSD and HDD local storage in addition to automated tiering is aimed at balancing both cost and performance. Also, NDFS data protection techniques remove some of the performance penalties associated with RAID. The localization of data allows for performance and quality of service to be provided per host, so noisy VMs can’t greatly impact the performance of their neighbors. This allows for large mixed workload vSphere clusters that may be more efficient from a capacity and performance standpoint, while being resilient to failure.

TIP

Nutanix has a “SQL Server Best Practices” white paper and reference architecture available at http://go.nutanix.com/rs/nutanix/images/sql-on-nutanix-bp.pdf. For detailed information on the entire Nutanix architecture, see the Nutanix Bible by Steven Poitras at http://stevenpoitras.com/the-nutanix-bible/. The “VMware vSphere on Nutanix Best Practices” white paper (available at www.nutanix.com) covers in detail each vSphere feature and how it should be designed and configured in a Nutanix environment.

Due to the simplified nature of the Nutanix storage architecture and NDFS, we can simplify the storage layout for SQL Server. Figure 6.48 includes a sample layout, which is standard in a Nutanix environment, consisting of a single NFS data store and single storage pool. We do not need to configure multiple LUNs or calculate LUN queue depths.

Figure 6.48 SQL Server VM disk layout on the Nutanix.

For high-performance, critical databases we would recommend you include 4 × PVSCSI controllers and split up the data files, Temp DB, and transaction logs similarly to that described in the section on SQL VM storage layout. With the four PVSCSI adapters available, we recommend that you start with two VMDKs per controller and expand the number of virtual disks per controller as evenly as necessary.

The simplified storage layout potentially provides a number of benefits to each type of SQL Database. Table 6.20 outlines some of the benefits you may be able to expect.

Table 6.20 Nutanix Benefits for OLTP and OLAP SQL Databases

	Nutanix Benefits
SQL OLTP Transactional Database	Localized I/O for low-latency operations SSD for indexes and key database files Ability to handle random and sequential workloads without the impact of the IO Blender Effect
SQL OLAP Analytical Database	Local read I/O for high-performance queries and reporting High sequential write and read throughput Scalable performance and capacity

Nutanix Benefits

SQL OLTP

Transactional Database

Localized I/O for low-latency operations

SSD for indexes and key database files

Ability to handle random and sequential workloads without the impact of the IO Blender Effect

SQL OLAP

Analytical Database

Local read I/O for high-performance queries and reporting

High sequential write and read throughput

Scalable performance and capacity

To demonstrate the capability of the Nutanix platform for SQL Server, a number of SQLIO benchmarks were performed as part of the “SQL on Nutanix Best Practices” white paper (http://go.nutanix.com/TechGuide-Nutanix-SQLBestPractices_Asset.html), reproduced here with permission. Figures 6.49 through 6.52 resulted from the benchmarks.

Figure 6.49 SQL Server SQLIO single VM random IOPS by block size.

Figure 6.50 SQL Server SQLIO single VM throughput by block size.

Figure 6.51 SQL Server SQLIO multiple VM IOPS scalability.

Figure 6.52 SQL Server SQLIO multiple VM throughput scalability.

Figures 6.49 through 6.52 show different performance profiles of the Nutanix Virtual Computing Platform for SQL Server VMs based on the “Nutanix SQL Best Practices” white paper, which includes the detailed configuration and testing details as well as individual IO pattern scenarios. Because most environments consist of mixed IO workloads, you should baseline your environment and consider the impact of IO mix and different IO sizes. The Nutanix platform can coexist with existing storage investments and offload workloads from existing storage platforms, thus freeing up both capacity and performance. It is a valid consideration for SQL Databases that fit within the performance envelope of the scale-out platform.

Summary

Throughout this chapter, we have provided architecture examples based on real-world projects that you can adapt for your purposes. We’ve tried to explain all the relevant considerations and best practices you need to worry about when architecting your environment for high-performance and critical SQL Server databases. We covered the key aspects of SQL Server storage architecture for all environments as well as the differences you need to understand when architecting storage specifically for virtual SQL Server databases, such as the IO Blender Effect and the way IO queues work across hosts on the same data store.

We provided guidance on important database storage design principles and a top-down approach covering SQL Server Database and Guest OS design, Virtual Machine Storage design, VMware vSphere Storage Design, and then down to the physical storage layers, including RAID and using server-side flash acceleration technology to increase performance and provide greater return on investment. We concluded the chapter by covering off one of the biggest IT trends and its impact on SQL Server: hyperconvergence and scale-out, shared-nothing architectures.

Let’s briefly recap the key SQL design principles:

Your database is just an extension of your storage. Make sure you optimize all the IO paths from your database to storage as much as possible and allow for parallel IO execution.
Performance is more than just the underlying storage devices. SQL Buffer Cache has a direct impact on read IO, whereas virtual IO controller device queues and LUN, HBA, and Storage Processor queues can all impact performance and concurrency of IO before anything touches a physical storage device.
Size for performance before capacity. If you size for performance, capacity will generally take care of itself. Much of this is due to the overheads associated with RAID storage needed to provide enterprise-grade data protection and resiliency. Use flash storage and automatic tiering to balance the performance and capacity requirements to get a more cost-effective solution overall.
Virtualize, but without compromise. This involves reducing risk by assessing current performance, designing for performance even during failure scenarios, validating your design and its achievable performance, and ensuring storage quality of service, such as Storage IO Control. These all contribute to a successful SQL virtualization project. Make sure project stakeholders understand what performance to expect by having SLAs aligned to achievable IOPS per TB.
Keep it standard and simple. Whatever design decisions you make for your environment, keep them as consistent as possible and have defined standards. Design for as few options as possible in your service catalog that cover the majority of system requirements. Only deviate from defaults where required.

We have covered storage performance in depth, as it is one of the most critical resources for a SQL Database. The next chapter will drill into how SQL memory allocation impacts the performance of your database, and how SQL and memory might change in the future.