AIX

Connect with fellow AIX users and experts to gain knowledge, share insights, and solve problems.


#Power
  • 1.  AIX 5.3 round_robin failure/reconfiguration not working?

    Posted Mon February 04, 2008 09:24 AM

    Originally posted by: SystemAdmin


    Hello all!

    I am currently running AIX 5.3 connected to DataDirect storage through Brocade 4900-series switches. We have round_robin working wonderfully over two HBAs through the switches to the storage, but here is what has me concerned. When we manually pull the fiber connection for one path (as a test; we also disable/reboot the Brocade as part of testing), the path goes offline as expected and I/O runs flawlessly over the second/backup path. However, once the port is re-enabled, the reboot finishes, or the fiber is plugged back in, I/O continues only over the second path, which stayed up, and never goes back to round_robin mode (read: traffic never goes back over the path we disabled/unplugged). Is there something we need to set or do to ensure that the round_robin feature comes back online? Thank you so much for your help!

    Cheers,
    Travis
    #AIX-Forum


  • 2.  Re: AIX 5.3 round_robin failure/reconfiguration not working?

    Posted Mon February 04, 2008 09:39 AM

    Originally posted by: tony.evans


    Is this using straight MPIO? (I'm not familiar with the underlying gear you're using.) If so, I believe you have to chpath the devices back online.

    However, we use SDDPCM / SDD everywhere so I'll admit I'm not 100% familiar with the technology in this instance.
    #AIX-Forum


  • 3.  Re: AIX 5.3 round_robin failure/reconfiguration not working?

    Posted Mon February 04, 2008 12:30 PM

    Originally posted by: SystemAdmin


    Tony:

    Hello and thank you very much for the reply! We did not try chpath, but did try cfgmgr (and that didn't work). Also, yes, we are using a vanilla MPIO setup on the server. Interestingly, the path did come back into the round_robin configuration after a reboot. I will be testing later this afternoon and we will give the chpath command a try and hopefully this is what we need to fix our issue. Thank you again for your help!!!

    Cheers,
    Travis
    #AIX-Forum


  • 4.  Re: AIX 5.3 round_robin failure/reconfiguration not working?

    Posted Mon February 04, 2008 02:23 PM

    Originally posted by: SystemAdmin


    Success! We tried using the chpath command, but were greeted each time with an error message indicating that the device was not in a known status (Error: 0514-029). So here is what we did:

    1. Ran a large amount of fresh I/O to confirm round_robin was working (we have dynamic tracking and fast fail enabled as well) through the two HBAs. Both were flashing like Christmas tree lights. I then removed the connection for FCS0/FSCSI0 from the Brocade. I/O paused for 15 seconds and then resumed over the one remaining available path (FCS1/FSCSI1). We waited 5 minutes and ran more I/O through the one remaining path. We then plugged FCS0/FSCSI0 back into the Brocade port and it just sat there. We waited a good 5 minutes and nothing. Then we tried the chpath command and got the error above (we tried command line and smitty to see if we could fix it). We then started a fresh set of heavy I/O over the SAN and ran the following commands:

    # rmdev -l fcs0 (the output from this command says that the device is defined and that is about it)
    # cfgmgr

    Within 15 seconds, FCS0/FSCSI0 was back online, pushing I/O through, and back in the round_robin rotation. We then let I/O complete and started a fresh push of data and sure enough, FCS0/FSCSI0 worked like new with no problems. We arrived at the solution based on work we did earlier today to get dynamic tracking and fast_fail to work. Each time we tried either of the following commands:

    # chdev -l fscsi0 -a dyntrk=yes
    # chdev -l fscsi1 -a fc_err_recov=fast_fail

    We got the following error message:

    Error: 0514-029 Cannot perform the requested function because a child device of the specified device is not in a correct state.

    After some research, the solution to our dynamic tracking and fast_fail problems was to run the following command:

    # rmdev -l fcs0

    And then follow that command with the chdev commands and it worked like a champ. I am new to AIX (SAN guy by trade) so I am not really sure what was actually going on behind the scenes, but it worked out great. Hope this helps someone at some point and thanks for the feedback!
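
The order of operations described above can be sketched as a dry-run helper. This is only an illustration of the sequence from this post, not a vetted procedure: the function name `fc_attr_plan` is hypothetical, it prints the commands instead of running them, and combining both attributes in a single chdev call is an assumption (chdev accepts repeated `-a` flags). Unconfiguring an adapter takes its paths offline, so I/O needs to be able to run over the other adapter while this is done.

```shell
#!/bin/sh
# fc_attr_plan: hypothetical dry-run helper.
# $1 is the adapter number (0 for fcs0/fscsi0, 1 for fcs1/fscsi1).
# Prints, in order, the commands from the post above; it runs nothing.
fc_attr_plan() {
    n=$1
    echo "rmdev -l fcs$n"      # the post reports the adapter then shows as Defined
    echo "chdev -l fscsi$n -a dyntrk=yes -a fc_err_recov=fast_fail"
    echo "cfgmgr"              # rediscover and reconfigure the adapter
}

# Example: fc_attr_plan 0
```

Two commonly used variations, both worth checking against your own AIX level: `rmdev -Rl fcs0` also unconfigures the child devices, and adding `-P` to chdev records the change in the ODM only, so it takes effect at the next reboot.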

    Cheers,
    Travis

    Message was edited by: JackTheRipper
    #AIX-Forum


  • 5.  Re: AIX 5.3 round_robin failure/reconfiguration not working?

    Posted Tue February 05, 2008 04:47 AM

    Originally posted by: tony.evans


    What chpath did you try?

    chpath -l hdiskx -p fscsi0 -s enable
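
For anyone following along, the chpath form above can be generated for each path that is not currently enabled. The following is a minimal sketch, assuming plain `lspath` prints one `status name parent` triple per line (for example `Failed hdisk2 fscsi0`); the function name `enable_cmds` and the sample device names are hypothetical. It only prints the commands rather than running them, and chpath may still refuse if the path is not in a state it accepts.

```shell
#!/bin/sh
# enable_cmds: hypothetical helper. Reads lspath-style lines
# ("<status> <disk> <parent>") on stdin and prints one chpath command
# for every path whose status is anything other than "Enabled".
enable_cmds() {
    awk 'NF >= 3 && $1 != "Enabled" {
        printf "chpath -l %s -p %s -s enable\n", $2, $3
    }'
}

# Intended use on the AIX host (nothing is executed by this sketch):
#   lspath | enable_cmds
```

Printing the commands first makes it easy to review exactly which disk/parent pairs would be touched before running anything.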
    #AIX-Forum


  • 6.  Re: AIX 5.3 round_robin failure/reconfiguration not working?

    Posted Tue February 05, 2008 08:53 AM

    Originally posted by: SystemAdmin


    Tony:

    Sorry for the delay in response (I am on the East coast and had left for the day). We tried two versions of the chpath command:

    # chpath -l hdiskx -p fscsi0 -s enable

    and

    # chpath -l fcs0 -p fscsi0 -s enable

    We got the error I posted above on the first one and a different syntax error on the second, as I don't think you can use fcs0 there (but we decided to give it a try anyway). We are set for more testing today (go figure :-) ) and before I do the rmdev I will give the chpath another go to make sure I wasn't missing something. Thank you again for the feedback!

    Cheers,
    Travis
    #AIX-Forum


  • 7.  Re: AIX 5.3 round_robin failure/reconfiguration not working?

    Posted Thu February 07, 2008 08:55 AM

    Originally posted by: tony.evans


    Ok, I don't have a test system with native MPIO and the only box I do have running it is in production so I can't play with the commands atm. I'm sure that chpath should work in theory.

    Having looked further, you should be able to get MPIO to bring the paths back online manually as well.

    smitty mpio
    MPIO Device Management
    Change/Show MPIO Device Characteristics

    pick a device
    HealthCheck Mode nonactive
    HealthCheck Interval 0-3600 sec [0]

    Set those to the relevant values for each device to enable path recovery, I think.
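
The same two fields can usually be set from the command line as well. A minimal sketch, assuming the smitty fields map to the disk attributes `hcheck_mode` and `hcheck_interval` (check with `lsattr -El hdiskX` on your system); the helper name `hcheck_cmd`, the disk name, and the 60-second interval in the example are illustrative only. The helper just validates the interval against the 0-3600 range shown above and prints the chdev command without running it.

```shell
#!/bin/sh
# hcheck_cmd: hypothetical helper. $1 = disk name, $2 = interval in seconds.
# Prints a chdev command that sets the health check attributes;
# returns non-zero if the interval is not a whole number in 0-3600.
hcheck_cmd() {
    disk=$1
    interval=$2
    case $interval in
        ''|*[!0-9]*)
            echo "interval must be a whole number" >&2
            return 1 ;;
    esac
    if [ "$interval" -gt 3600 ]; then
        echo "interval must be between 0 and 3600" >&2
        return 1
    fi
    echo "chdev -l $disk -a hcheck_mode=nonactive -a hcheck_interval=$interval"
}

# Example: hcheck_cmd hdisk2 60
```

If the disk is in use, chdev may refuse the change; in that case adding `-P` should defer it to the next reboot, though that is worth confirming on a test system first.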
    #AIX-Forum


  • 8.  Re: AIX 5.3 round_robin failure/reconfiguration not working?

    Posted Wed March 05, 2008 02:29 PM

    Originally posted by: SystemAdmin


    Tony, awesome! We were still not able to get chpath to work like we wanted it to, but we are up and running and the increase in throughput has been very noticeable. I will be going into the data center to check out the smitty approach to having the paths brought back online manually and thank you so much for your help!

    Cheers,
    Travis
    #AIX-Forum