[Originally posted on Turbonomic.com in January 2021.]
Storage is one of the top cloud services globally. AWS and Azure offer various types of storage services, such as AWS object storage, Azure Blobs, Azure Queues, and more. This blog will focus on block-level storage services, which are used with AWS EC2 instances and Azure Virtual Machine instances and provide boot and data volumes.
Cloud performance often suffers due to storage configuration, and detecting and alleviating these issues requires dedicated monitoring and analysis of storage metrics.
On average, there are 3-4 volumes attached to each enterprise VM, meaning there are 3-4x the chances for performance issues or underutilized storage devices. While individual disks are often cheaper, there are many cases where a single volume is more expensive than the entire VM. Optimization at the individual volume layer is often overlooked. The complexity of large cloud estates is far beyond human scale to understand on an ongoing basis, with millions of configuration options and storage metrics to consider in meeting application demand. This leads cloud operations teams to dramatically overprovision their storage to avoid performance problems, an approach that doesn’t always work and, to make matters worse, adds significant unnecessary expense. Turbonomic is here to help.
Validating IOPS and Throughput at the VM layer often does not expose storage performance issues that are happening at the individual volume layer. You might have IOPS congestion or Throughput congestion at the individual volume layer, while the VM specific metrics look fine, transforming root cause analysis investigations into a wild goose chase. Looking at each volume individually can potentially reveal these issues, but requires significantly more time and effort, especially if attempted against thousands of volumes.
Throughput limitations are not caught by observing and scaling for IOPS, and IOPS limitations are not always solved by scaling to higher storage tiers.
The correct way to solve storage issues is to consider the available storage tiers and the ability to acquire higher IOPS or Throughput within the same tier, either by adjusting storage capacity or by adjusting the IOPS directly if the storage tier supports it. This allows us to arrive at a perfectly acceptable storage configuration without defaulting to massively expensive high-end tier storage.
Volume Performance Measurement 101
IOPS and IO Throughput are two important factors to measure volume performance. AWS and Azure offer the ability to monitor IOPS and IO Throughput values per volume.
IOPS is the number of operations (read/write) per second.
IO Throughput is the amount of data transferred (read/write) per second, which has a linear relationship with IOPS and with I/O size. The equation is:
Throughput (MiB/s) = I/O size (MiB) x IOPS (/s)
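That relationship is easy to sanity-check in code; this is a minimal sketch (the function name is ours):

```python
def throughput_mib_s(io_size_mib: float, iops: float) -> float:
    """Throughput (MiB/s) = I/O size (MiB) x IOPS (/s)."""
    return io_size_mib * iops

# A workload issuing 16 KiB (16/1024 MiB) I/Os at 3,000 IOPS:
print(throughput_mib_s(16 / 1024, 3000))  # 46.875 MiB/s
```

The same operation count moves far more data at a larger I/O size, which is why a volume can hit its throughput limit long before its IOPS limit.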
Block Storage tiers offered by AWS and Azure also differ in IOPS and Throughput performance characteristics. We will cover volume tiers’ performance in the following sections.
AWS and Azure Block Storage Tiers Offerings
Amazon EBS volume types offer different volume size ranges, IOPS, and Throughput limitations. In most cases, pricing is based on provisioned resources.
In December 2020, AWS announced the availability of the gp3 tier, which offers cost-effective storage ideal for a broad range of workloads. EBS gp3 delivers a consistent baseline of 3,000 IOPS and 125 MiB/s included in the price of the storage, and offers independent scaling of IOPS and throughput. From a pricing perspective, gp3 provides up to 20% lower price per GB than existing gp2 volumes. Turbonomic was part of the private preview program with AWS, which accelerated support for the gp3 storage tier; that support will be available in January 2021. Stay tuned for an announcement on this topic.
Azure offers two types of disks: managed disks and unmanaged disks. Unmanaged disks, including unmanaged Premium SSD and unmanaged Standard HDD, require users to take care of storage accounts, encryption, data recovery plans, etc. For Azure Managed Disks, Microsoft provides the following four types.
The above table displays the maximum performance offered by managed disks within an individual tier. Within each managed disk tier, Azure also defines performance granularities based on disk size. For example, based on the table below, a 1000 GiB managed Premium SSD volume falls in the P30 range: it will be charged $135.17 per month, and its provisioned IOPS and Throughput will be capped at 5,000 IOPS and 200 MB/s, respectively.
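For illustration, that size-to-tier lookup can be sketched with the handful of Premium SSD tiers cited in this post (the dictionary and function are ours; see the Azure documentation for the complete tier table):

```python
# A subset of Azure managed Premium SSD tiers mentioned in this post:
# tier name -> (max size GiB, provisioned IOPS, throughput MB/s).
PREMIUM_SSD_TIERS = {
    "P10": (128, 500, 100),
    "P15": (256, 1100, 125),
    "P30": (1024, 5000, 200),
}

def premium_tier_for(size_gib: int) -> str:
    """Smallest P-tier whose size covers the requested size (billing rounds up)."""
    for name, (max_size, _iops, _tput) in sorted(PREMIUM_SSD_TIERS.items(),
                                                 key=lambda kv: kv[1][0]):
        if size_gib <= max_size:
            return name
    raise ValueError("size exceeds the tiers listed in this sketch")

print(premium_tier_for(1000))  # P30 -- a 1000 GiB disk bills at the P30 rate
```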
Volumes Performance and Modification Characteristics
Cloud block storage volumes share common characteristics, but different tier types express their performance capabilities differently. To choose an optimal configuration for a volume, we need to understand these characteristics before making any decisions.
- Volume elasticity
For most tier types, a user can only specify volume size as needed. Based on the tier constraint, a given size provides a certain amount of IOPS or Throughput. For example, Amazon EBS gp2 baseline IOPS performance scales linearly at 3 IOPS per GiB of volume size. Similarly, for Azure managed Premium SSD, P10 (128 GiB) provides 500 provisioned IOPS and 100 MB/s throughput, while P15 (256 GiB) provides 1,100 provisioned IOPS and 125 MB/s throughput. This characteristic gives volumes the opportunity to acquire higher IOPS or Throughput capability simply by increasing volume size.
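The gp2 scaling rule just mentioned can be captured in a few lines (the 100 IOPS floor and 16,000 IOPS cap come from AWS documentation; burst behavior for small volumes is not modeled here):

```python
def gp2_baseline_iops(size_gib: int) -> int:
    """EBS gp2 baseline: 3 IOPS per GiB, floored at 100 and capped at 16,000."""
    return max(100, min(16_000, 3 * size_gib))

print(gp2_baseline_iops(334))   # 1002  -- grow the volume to grow baseline IOPS
print(gp2_baseline_iops(10))    # 100   -- the minimum baseline
print(gp2_baseline_iops(6000))  # 16000 -- the cap is reached
```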
Amazon EBS io1 and io2 provide elasticity for both volume size and IOPS, and the configured IOPS guarantees sustained IOPS performance. In addition to size and IOPS, a user can also configure throughput capability with the AWS gp3 and Azure Ultra SSD tiers. However, the IOPS and throughput values cannot be chosen arbitrarily: each tier defines internal limits tied to volume size.
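Those coupled limits can be checked programmatically. Below is a minimal sketch for gp3, using the limits AWS documented at the time of writing (3,000-16,000 IOPS, 125-1,000 MiB/s, at most 500 IOPS per GiB, and at most 0.25 MiB/s of throughput per provisioned IOPS); the function name is ours, and current limits should be verified against the EBS documentation:

```python
def validate_gp3(size_gib, iops, throughput_mib_s):
    """Check a gp3 configuration against AWS-documented limits (as of this writing)."""
    errors = []
    if not 3_000 <= iops <= 16_000:
        errors.append("gp3 IOPS must be between 3,000 and 16,000")
    if iops > 500 * size_gib:
        errors.append("gp3 allows at most 500 IOPS per GiB")
    if not 125 <= throughput_mib_s <= 1_000:
        errors.append("gp3 throughput must be between 125 and 1,000 MiB/s")
    if throughput_mib_s > 0.25 * iops:
        errors.append("gp3 allows at most 0.25 MiB/s per provisioned IOPS")
    return errors

print(validate_gp3(100, 4000, 125))  # [] -- a valid configuration
print(validate_gp3(4, 3000, 125))    # flags the 500 IOPS-per-GiB ratio
```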
- SSD and HDD tiers
AWS and Azure offer SSD and HDD tiers for cloud volumes.
SSD tiers suit small I/O sizes where the dominant performance attribute is IOPS, while HDD tiers excel at large I/O sizes and sequential I/O. When choosing a tier, categorize the volume as an SSD or HDD candidate based on application demand.
A cloud volume is charged by the amount of provisioned resources until the storage is released. Depending on the tier type, there can be additional costs beyond provisioned resources. For example, the AWS standard tier also charges based on consumed I/O requests, and Azure Ultra charges a reservation cost if you enable Ultra Disk compatibility on a VM without attaching an Ultra Disk. For cloud storage costs in different regions, refer to the Amazon EBS pricing page and the Azure Managed Disks pricing page.
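To make the "charged by provisioned resources" model concrete, here is an illustrative gp3 cost sketch. The default rates are us-east-1 figures at the time of writing and vary by region, so treat them as assumptions and check the EBS pricing page before relying on them:

```python
def gp3_monthly_cost(size_gb, iops, throughput_mbps,
                     gb_rate=0.08, iops_rate=0.005, tput_rate=0.040):
    """Illustrative gp3 monthly cost: storage, plus IOPS above the 3,000
    baseline, plus throughput above the 125 MB/s baseline (assumed rates)."""
    return (size_gb * gb_rate
            + max(0, iops - 3_000) * iops_rate
            + max(0, throughput_mbps - 125) * tput_rate)

print(gp3_monthly_cost(500, 6000, 250))  # 40 + 15 + 5 = 60.0
```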
- Volume modification: Elasticity and Downtime
Once the volume has been provisioned, users can still make further modifications as they deem necessary. Before making any change to the volume, it is important to understand the nuances of elasticity and downtime related to volume modification.
Increasing volume size is irreversible in both AWS and Azure; a volume’s size can’t be decreased once configured. Users also need to extend the file system so the OS can use the newly allocated space.
| | Changes can be reverted to original configuration | Changes cannot be reverted to original configuration |
|---|---|---|
| AWS | Changing Storage Tiers, Changing provisioned IOPS and Throughput | Increasing Volume size |
| Azure | Changing Storage Tiers, Changing provisioned IOPS and Throughput | Increasing Volume size |
- For AWS EBS volumes, almost all actions do not require downtime; however, users need to wait at least 6 hours and ensure that the volume is in the ‘in-use’ or ‘available’ state before making further modifications.
- For Azure Managed Disks, changing size is executable only when volumes are unattached or the VM is deallocated (e.g., stopped), which usually requires detaching then reattaching the volume, or deallocating then starting the VM along with the modification.
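The AWS precondition above (state check plus the 6-hour wait) can be sketched as a simple guard; the function and parameter names are ours:

```python
from datetime import datetime, timedelta, timezone

COOLDOWN = timedelta(hours=6)  # AWS waiting period between EBS volume modifications

def can_modify_volume(state, last_modified, now=None):
    """The volume must be 'in-use' or 'available', and at least 6 hours
    must have passed since the previous modification."""
    now = now or datetime.now(timezone.utc)
    return state in ("in-use", "available") and now - last_modified >= COOLDOWN

t0 = datetime(2021, 1, 1, tzinfo=timezone.utc)
print(can_modify_volume("in-use", t0, now=t0 + timedelta(hours=7)))  # True
print(can_modify_volume("in-use", t0, now=t0 + timedelta(hours=3)))  # False
```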
| | Workload needs to be stopped in order to make changes (see the Azure note above) | Workload does not need to be stopped in order to make changes |
|---|---|---|
| AWS | None | Changing Storage Tiers, Increasing Volume size, Changing provisioned IOPS and Throughput |
| Azure | Changing Storage Tiers, Increasing Volume size | Changing provisioned IOPS and Throughput for Ultra Disks |
To help users understand action disruptiveness and reversibility, Turbonomic displays these two attributes for each cloud volume scale action.
Other important constraints
In addition to previous characteristics, there are some special constraints that users need to pay attention to. For example, AWS boot volumes can’t be on the ST1/SC1 tier; Azure Ultra disks can only be used as data disk; Azure premium SSDs can only be used with VM series that are premium storage-compatible.
Finding the Optimal Configuration for Cloud Volumes
Turbonomic’s analytics engine understands the details of every cloud storage tier, its performance characteristics, and its pricing. Turbonomic also records IOPS and Throughput utilization data for cloud volumes and uses techniques such as percentile values to assess volume performance based on historical data. From these elements, Turbonomic finds the desired configuration for each individual volume: one that satisfies its performance requirements and, in most cases, minimizes cost.
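As a sketch of what a percentile-based assessment looks like, here is a nearest-rank percentile over sample IOPS history. The convention and the sample data are ours; Turbonomic’s exact method is not described here:

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of
    observations at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hourly IOPS observations for a volume (made-up sample data):
history = [220, 180, 240, 3100, 260, 250, 210, 230, 2900, 240]
p95 = percentile(history, 95)
provisioned = 500
if p95 > provisioned:
    print(f"p95 IOPS {p95} exceeds provisioned {provisioned}: scale up")
```

Sizing to a high percentile rather than the average keeps the volume ahead of sustained peaks without chasing every momentary spike.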
For example, the below image shows a performance scale volume action generated by Turbonomic for an Amazon EBS io2 volume configured with 2,152 IOPS.
From the action details, we can easily tell that the volume is congested on IOPS while Throughput is underutilized, which drives the action to acquire higher IOPS capacity for the volume. Executing the action doesn’t require VM downtime and is reversible, and you can easily automate it through Turbonomic without logging in to the AWS console.
Based on historical and current utilization, it is recommended that the volume scale to the gp3 tier with 3,075 IOPS and 125 MB/s, which will solve the volume’s IOPS congestion and, at the same time, save $162 per month.
Let that sink in: a performance problem was eliminated, and it resulted in savings!
In the below example, you can see an idle Azure Managed Premium data disk without any read/write activity for a long time.
Turbonomic generated an efficiency action for the volume: scale down to the managed Standard HDD tier, saving $94 per month. Within the action details, it is easy to tell that this action is disruptive. Executing it will trigger a series of steps to deallocate the VM (if the VM is running), modify the volume, and start the VM (if its previous state was ‘running’). The action is also reversible, which means that if this volume becomes busy with read/write operations, Turbonomic will detect that and may scale the volume back to the managed Premium SSD tier, depending on the volume’s IOPS and Throughput demand.
The Importance of Elasticity and Downtime in Automation Workflow
Actions that are non-disruptive and reversible have the least impact on workloads and are most amenable to Automation. On the other hand, disruptive actions should be scheduled during a maintenance window to prevent downtime for end users.
Turbonomic allows users to define Automation policies for each combination of Elasticity and Downtime categories:
For each of the 4 categories, the Action Acceptance can be defined as Automatic, Manual or Recommend.
Furthermore, disruptive actions (which need to be executed during a maintenance window) can be executed using an Execution Schedule:
Maximize Savings vs. Better Reversibility
The Maximize Savings mode selects the most cost-effective configuration for a Volume to assure performance. Oftentimes, this involves increasing the Volume size to gain more IOPS or Throughput.
The Better Reversibility mode prioritizes reversible actions over irreversible changes. Increasing a volume’s size is an irreversible change. Actions to increase volume size can still be generated in this mode, if that is the only possible change to assure performance by gaining the desired IOPS or Throughput.
Storage Tier Inclusion/Exclusion Policies
Customers may have business requirements to keep Volumes on a certain set of Storage Tiers to assure a baseline performance or to avoid excessive cost. Both of these requirements can be achieved by selecting the Storage Tiers for Scaling in Volume Policy.
For the last few years, organizations have taken cloud scaling more seriously. While the focus has predominantly been on scaling PaaS services or rightsizing VMs, this is just the beginning of unlocking cloud elasticity. The ability to scale cloud resources at all layers of an application stack will further minimize performance risks and lower costs, and storage is a fundamental building block of our global IT infrastructure. Unlocking the benefits of cloud elasticity can be a challenge: knowing when to scale, how to scale, and doing it across thousands of devices can prove impossible. We love unlocking the cloud’s true potential here at Turbonomic, and we hope you’ll join us in exploring a new world of automated elasticity.