Adjusting storage related timeout settings in PowerVC

By Archive User posted Mon June 12, 2017 06:59 PM

Sometimes when you are using PowerVC, you may notice a deploy or specific storage operation fails after waiting...[drink coffee]... for the operation to complete. The request finally gives up and you see an error message. Why did that happen? Perhaps you are curious about what your timeout values are or you want to increase the wait time to give your storage array more time to complete its work. It turns out that there are a number of different settings that the underlying OpenStack infrastructure has available, and in this article we are going to dive into how to report and adjust those settings with PowerVC.

Background

In order to keep the PowerVC user interface streamlined, there are a variety of configuration options that are not exposed through the UI. However, PowerVC has a powerful and straightforward command line interface (CLI) to configure supported options that may be applicable to your environment. The command is called powervc-config and it gets enhanced each release with newly supported configuration settings.

I’ll begin by stating the caveat that PowerVC installs with configuration settings that are recommended for most customers. If PowerVC is working well for you, then it is generally best to leave the configuration settings alone. There are times, however, such as when working with non-IBM storage providers, when it may be appropriate to use powervc-config to adjust a storage configuration setting. This article covers the storage timeouts function of the command, which was first added in PowerVC v1.3.2.1.

To see the storage configuration options of the command, specify the help option (-h) like this:

powervc-config storage -h

This will list the sub-commands and descriptions. The one we are interested in is:

timeouts            Configure storage timeout and interval settings

Viewing the help information for the timeouts subcommand works the same way:

powervc-config storage timeouts -h

The help gives a lot of information that could be overwhelming, but this article will show you how simple it really is to use the timeouts function. Keep in mind that the command’s simplicity does not negate the runtime complexity of handling timeouts between different entities and for different operations. Let’s take the example of cloning an image volume during virtual machine deployment. Under-the-covers, the flow looks a bit like this:

[Figure: Flow: Clone image volume for deployment]

Steps in the diagrammed flow:

1. The nova compute host sends the clone request to Cinder.

2. The Cinder API service asks the scheduler for the appropriate volume driver to handle the request.

3. A new volume record is created in the database and the request is sent to the volume driver (e.g. EMC VMAX driver) on a new worker thread.

4. Cinder responds to Nova that the request has started. At this point, nova-compute starts polling for the completion of the clone operation. The poll isn’t satisfied until step 8.

5. The clone request is sent to the SAN array. If capable of a FlashCopy, the array can initiate the copy operation and return right away.

6. For other storage providers like EMC VMAX, the volume driver will poll for completion of the copy before it returns the result to the Cinder API service.

7. The volume driver responds to the request from step 3 that it is complete. The volume status is updated in the database to say that the clone is complete and the volume is created on the back end.

8. On the next poll attempt from the Nova compute service, it finds the volume created and it continues with the rest of the deploy tasks, including volume attachment.

Any of these communication channels can time out. For example, there could be a network issue, or there could be so many concurrent operations that no worker threads are available to process the request (step 3). But the most common cause of a timeout in this sequence, with a storage array like EMC VMAX, is that large image volumes take longer to finish the full copy operation on the array than the compute service is configured to keep polling before step 8 is reached. When this happens, the deploy fails and it may get rescheduled to a different compute host. In all likelihood, it will fail there too because the problem is not with the compute host, but with the configured storage timeout.
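To make the polling between steps 4 and 8 more concrete, here is a minimal Python sketch of a bounded poll loop. It is an illustration only, not the actual Nova code: the wait_for_volume function, the VolumeNotCreated exception, and the get_status callable are hypothetical names, and I am assuming the loop makes one initial check plus the configured number of retries.

```python
import time


class VolumeNotCreated(Exception):
    """Raised when the volume is still not available after the last attempt."""


def wait_for_volume(get_status, retries=600, interval=3, sleep=time.sleep):
    """Poll until the volume is 'available'; return the attempt that succeeded.

    get_status is a hypothetical callable standing in for the status query
    that the compute service makes against Cinder.
    """
    attempts = retries + 1  # assumption: one initial check plus the retries
    status = None
    for attempt in range(1, attempts + 1):
        status = get_status()
        if status == "available":
            return attempt
        sleep(interval)  # wait before polling again
    raise VolumeNotCreated(
        f"volume did not finish being created after {attempts} attempts; "
        f"status is {status}"
    )


# Simulate an array that needs five polls to finish, without really sleeping.
statuses = iter(["creating"] * 4 + ["available"])
succeeded_on = wait_for_volume(
    lambda: next(statuses), retries=10, sleep=lambda seconds: None
)
```

If the simulated status never leaves "creating", the loop gives up after retries + 1 attempts and raises, which corresponds to the failure case described above.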

An Example Scenario

Here is a typical storage message that you’ll see for the situation described earlier with a large image volume and no FlashCopy support:

Deploy of virtual machine Stellar_VM on host c387fffff-9117-MMB-SN2120FCA failed with exception: Build of instance 3c0bda9a-b60d-4de8-8da6-7aefb72250a4 was re-scheduled: Volume a3953591-f083-40c8-8759-a4a535b949dc did not finish being created even after we waited 1878 seconds or 601 attempts. And its status is creating.

So it failed after waiting just longer than 30 minutes. Let’s confirm the current configuration settings bear this out. Run the powervc-config storage timeouts command without any options to print out the current timeout configuration. In the terminal output below, I just show a snippet of the output for the deploy_clone type:

[~]# powervc-config storage timeouts
...
Current settings for 'deploy_clone' timeouts...
    For Compute defaults:
        block_device_allocate_retries_interval = 3 [default is 3]
        Effective Timeout In Minutes = 30 [default is 30]
    For host 824742L_2120FCA [display_name: Rossak]:
        block_device_allocate_retries_interval = 3 [default is 3]
        Effective Timeout In Minutes = 30 [default is 30]

So this confirms the 30 minute timeout. There is always some overhead to polling, so that is why the message above gives a time that is longer than 30 minutes. If the --verbose option is provided in the command, then it will list the underlying configuration option for each timeout “type”, if any:

[~]# powervc-config storage timeouts --verbose
...
Current settings for 'deploy_clone' timeouts...
    For Compute defaults,  Section = [DEFAULT]:
        block_device_allocate_retries = 600 [default is 600]
        block_device_allocate_retries_interval = 3 [default is 3]
        Effective Timeout In Minutes = 30 [default is 30]
    For host 824742L_2120FCA [display_name: Rossak],  Section = [DEFAULT]:
        block_device_allocate_retries = 600 [default is 600]
        block_device_allocate_retries_interval = 3 [default is 3]
        Effective Timeout In Minutes = 30 [default is 30]

This extra information explains how the "Effective Timeout In Minutes" was arrived at ((600 * 3) / 60 = 30 minutes) and why I saw “601 attempts” in the failure message: 600 poll attempts per the block_device_allocate_retries value, happening at three second intervals, will add up to 1800 seconds or 30 minutes (plus overhead). After this many attempts, the next retry (601) results in a timeout.
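The same arithmetic can be written out as a short Python sketch. The function name is my own, and the overhead figure assumes that the sleep intervals account for exactly 600 × 3 seconds of the 1878 seconds reported in the failure message.

```python
def effective_timeout_minutes(retries, interval_seconds):
    """Effective timeout implied by a retry count and a polling interval."""
    return retries * interval_seconds / 60


retries = 600          # block_device_allocate_retries
interval_seconds = 3   # block_device_allocate_retries_interval

timeout_minutes = effective_timeout_minutes(retries, interval_seconds)

# Figures reported in the failure message.
waited_seconds = 1878
attempts = 601

# Time not explained by the sleep intervals, spread over all attempts.
overhead_seconds = waited_seconds - retries * interval_seconds
overhead_per_attempt = overhead_seconds / attempts

print(f"effective timeout: {timeout_minutes:.0f} minutes")
print(f"polling overhead: {overhead_seconds} s total, "
      f"{overhead_per_attempt:.2f} s per attempt")
```

Under that assumption, the polling overhead works out to a bit over a tenth of a second per attempt.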

From a separate error message in the UI, I can find the name of the volume, so I have a conversation with my storage administrator. They tell me that the volume did indeed get created and that it was 75% written out before it was deleted. Deleted? What the heck!? Yep, the deploy request failed (timed-out), so PowerVC rolled things back and cleaned up the volume.

We need to try increasing the timeout. Based on what my storage admin said, I’m going to try increasing the timeout to one hour as that should be adequate.

[~]# powervc-config storage timeouts --types deploy_clone --minutes 60 --restart
Changing the 'deploy_clone' timeouts...
    For Compute defaults:
        block_device_allocate_retries_interval is unchanged.
        Effective Timeout In Minutes = 60
    For host 824742L_2120FCA [display_name: Rossak]:
        block_device_allocate_retries_interval is unchanged.
        Effective Timeout In Minutes = 60
    For host 824742L_212143A [display_name: Hagal]:
        block_device_allocate_retries_interval is unchanged.
        Effective Timeout In Minutes = 60
Restart of nova services requested.
Finished.

The only type of timeout I care about here is deploy_clone. I give it the --minutes option and the --restart option. Use --restart to make the new settings take effect. The necessary services will be restarted. This works just fine with NovaLink-managed hosts as well. The configuration timeouts will be updated and the compute service will be restarted remotely on those hosts. I could have specified the --interval_seconds option on the command to increase the seconds from 3 to 20, for example, if I didn’t want it to poll so often. The command will compute and set the correct retries value for me (60 minutes is 3600 seconds divided by a 20 second interval = 180 retries in this example). But I don’t really care about the polling interval, and so I try the deploy again. It succeeds!
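If you want to check the retries arithmetic for other combinations, a minimal Python sketch follows. The retries_for function is a hypothetical helper, and rounding up when the interval does not divide evenly is my assumption; the command may round differently.

```python
import math


def retries_for(minutes, interval_seconds=3):
    """Retry count needed to wait `minutes` at the given polling interval."""
    # Assumption: round up so the effective timeout is never shorter
    # than the requested number of minutes.
    return math.ceil(minutes * 60 / interval_seconds)


retries_default_interval = retries_for(60)     # 3 second interval
retries_slow_polling = retries_for(60, 20)     # 20 second interval

print(retries_default_interval, retries_slow_polling)
```

With the default 3 second interval, a 60 minute timeout needs 1200 retries; with a 20 second interval it needs the 180 retries mentioned above.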

The General Case

What about when you just suspect a timeout problem, but you don’t know where it's happening and you don’t want to go digging in the logs for clues? That’s fine. You can use the CLI to increase all the supported timeouts to one consistent value and see how things behave. It’s not going to hurt anything. It’s just going to take longer for things to time out in the event that they truly get hung up somewhere. Just use the command’s simple form to update all the timeout settings (i.e. don’t provide the --types option or use the special “all” value for --types):

powervc-config storage timeouts --minutes 120 --restart

In this example I’m telling PowerVC to update all the effective timeouts to two hours. After the services restart, give them a minute or two to get their bearings, and then try the operation again. Monitor the messages in PowerVC and the time stamps on the messages will tell you how long things took. If the operation succeeds but takes significantly less time than the timeout value you provided, you can always dial the timeout back.

The next question that comes up is what about setting the minutes timeout value to some extremely high number, like a week’s worth of minutes? First, that’s not really a good idea from the standpoint of how these options are intended to be used. Second, the entire deploy process is gated by the instance_build_timeout configuration setting. By default, PowerVC sets this to 2 hours. All deployment operations have to complete in under this timeout for a successful deploy, so there is really no point in setting the storage timeouts to be greater than 120 minutes because in this case, the overall timeout is in play.
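As a small illustration of that ceiling, the sketch below caps a requested storage timeout at the build timeout. The helper name is hypothetical, and the 120 minute default is simply the value stated above.

```python
INSTANCE_BUILD_TIMEOUT_MINUTES = 120  # PowerVC default, per the text above


def useful_storage_timeout(requested_minutes,
                           build_timeout_minutes=INSTANCE_BUILD_TIMEOUT_MINUTES):
    """Largest storage timeout that can still matter for a deploy."""
    return min(requested_minutes, build_timeout_minutes)


one_week = 7 * 24 * 60                      # a week's worth of minutes
capped = useful_storage_timeout(one_week)   # limited by the build timeout
unchanged = useful_storage_timeout(60)      # already under the build timeout

print(one_week, capped, unchanged)
```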

If you’ve used the command to modify the timeout settings, but now want to return the timeouts to their default values, just run the command again without options to list the current settings and note the default values that are also listed. For instance, the example output in the previous section shows Effective Timeout In Minutes = 30 [default is 30]. Then run the command and specify the default values for --minutes (e.g. --minutes 30) and for --interval_seconds if that was changed.
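If the listing is long, a few lines of Python can pick out the settings that differ from their defaults. This is only a sketch that assumes the output lines look like the examples in this article; the sample text with a value of 60 is made up for illustration, and the parser ignores which host section a line belongs to.

```python
import re

# Matches lines such as: "Effective Timeout In Minutes = 30 [default is 30]"
SETTING = re.compile(
    r"^\s*(?P<name>.+?) = (?P<value>\d+) \[default is (?P<default>\d+)\]\s*$"
)


def changed_settings(listing):
    """Return {name: (current, default)} for settings that are not at default."""
    changed = {}
    for line in listing.splitlines():
        match = SETTING.match(line)
        if match and match["value"] != match["default"]:
            changed[match["name"]] = (int(match["value"]), int(match["default"]))
    return changed


# Hypothetical listing after the timeout was raised to 60 minutes.
sample = """\
Current settings for 'deploy_clone' timeouts...
    For Compute defaults:
        block_device_allocate_retries_interval = 3 [default is 3]
        Effective Timeout In Minutes = 60 [default is 30]
"""

result = changed_settings(sample)
print(result)
```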

Diving Deeper

If you really want to assign different timeout values to different timeout types, the command allows you to do that with the --types option, which takes a list of timeout types separated by spaces. This article won’t get into all the details of that, but I will provide the basic help information here for completeness as well as some example error messages that might indicate too small a timeout setting for that type.

Timeout type	Description	Possible related symptom
deploy_clone	Timeout for cloning a volume during deploy or when creating a new volume during attach.	“…did not finish being created even after we waited <seconds> seconds or <retries> attempts. And its status is creating.”
capture_clone	Timeout for cloning a volume during capture.	“The allocation of the image volume either failed or timed out: image volume <volume_id> tries=<tries>”
cinder_http	How long Nova compute waits for non-polled storage operations like attach, when talking to Cinder.	“Build of instance <UUID> was re-scheduled: Request to https://<REST_PATH>//volumes/<volume_uuid>/action timed out (HTTP 408)”
cinder_rpc	How long the Cinder API service waits for the volume driver to respond to certain requests.	Cinder api.log: “Timed out waiting for a reply to message ID”
vmax_smis	How long the VMAX driver waits for responses from the SMI-S Provider. The VMAX driver is special in how it polls for job completion.	Cinder volume driver log: “The job has not completed and is in a <state> state.” Or: “Synchronization (clone) for <identifier> timed out after…”
manage_volume*	Timeout for managing existing volumes. These timeouts apply to both available volumes and attached volumes when managing virtual machines.	After waiting for the configured timeout, you get a pop-up error in the GUI or message saying "Unable to retrieve available volumes on the storage provider."

* The support for the manage_volume type was added in PowerVC v1.3.3.0
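The symptom column lends itself to a simple lookup. The Python sketch below matches fixed fragments of the messages in the table against an error string and suggests which timeout types to look at. The function and table names are hypothetical, and real messages may vary by release, so treat a match as a hint rather than a diagnosis.

```python
# (message fragment, timeout type) pairs taken from the table above.
SYMPTOMS = [
    ("did not finish being created even after we waited", "deploy_clone"),
    ("The allocation of the image volume either failed or timed out",
     "capture_clone"),
    ("timed out (HTTP 408)", "cinder_http"),
    ("Timed out waiting for a reply to message ID", "cinder_rpc"),
    ("The job has not completed and is in a", "vmax_smis"),
    ("Synchronization (clone) for", "vmax_smis"),
    ("Unable to retrieve available volumes on the storage provider",
     "manage_volume"),
]


def suggest_timeout_types(message):
    """Return the sorted timeout types whose symptom fragment is in message."""
    return sorted({kind for fragment, kind in SYMPTOMS if fragment in message})


# The failure message from the example scenario earlier in this article.
message = (
    "Deploy of virtual machine Stellar_VM on host "
    "c387fffff-9117-MMB-SN2120FCA failed with exception: Build of instance "
    "3c0bda9a-b60d-4de8-8da6-7aefb72250a4 was re-scheduled: Volume "
    "a3953591-f083-40c8-8759-a4a535b949dc did not finish being created even "
    "after we waited 1878 seconds or 601 attempts. And its status is creating."
)

suggested = suggest_timeout_types(message)
print(suggested)
```

The suggested types can then be passed to the --types option of the command.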

As you can see, the powervc-config storage timeouts command is a powerful tool to automatically update and reload a variety of timeout settings at once. If you have any questions or comments, feel free to post them below. We’d love to hear from you! And don’t forget to follow us on Facebook, LinkedIn, and Twitter.
