Legacy HDS Forums

Temp disk errors in AIX cause oracle to hang and dlm commands to hang

Discussion created by Legacy HDS Forums on Jun 7, 2009
Latest reply on Jul 4, 2009 by Legacy HDS Forums

Originally posted by: PP BIJU KRISHNAN



I have been having sleepless nights for the past few months due to the following issues.

Common System specs

AIX 5.3 TL06 and TL08
HDLM 5.81.2.1

USP-V

SAN - Brocade 4100/4900/48K - FOS 6.1.0c

--------------------------------------------------------------------------------------------------------------
The following type of AIX error causes Oracle to hang:

DCB47997   0605201809 T H hdisk42        DISK OPERATION ERROR
DCB47997   0605201809 T H hdisk31        DISK OPERATION ERROR
DCB47997   0605201809 T H hdisk38        DISK OPERATION ERROR
DCB47997   0605201809 T H hdisk37        DISK OPERATION ERROR
DCB47997   0605201809 T H hdisk13        DISK OPERATION ERROR
DCB47997   0604070809 T H hdisk125       DISK OPERATION ERROR
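
To see which disks report errors in the same minute (which would point at a shared path or port rather than a single LUN), the summary lines above can be grouped by timestamp. This is only a rough sketch: it assumes the usual errpt summary layout (identifier, MMDDhhmmYY timestamp, type, class, resource, description) and that the two-digit year is in the 2000s. The function names are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime


def parse_errpt_summary(text):
    """Parse errpt summary lines into a list of dicts, skipping other lines."""
    entries = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue
        ident, stamp, err_type, err_class, resource, description = fields
        if len(stamp) != 10 or not stamp.isdigit():
            continue  # header row or unrelated line
        try:
            # Layout is MMDDhhmmYY; ASSUMPTION: the year is 2000 + YY.
            when = datetime(2000 + int(stamp[8:10]), int(stamp[0:2]),
                            int(stamp[2:4]), int(stamp[4:6]), int(stamp[6:8]))
        except ValueError:
            continue  # digits that do not form a valid date
        entries.append({
            "identifier": ident,
            "when": when,
            "type": err_type,
            "class": err_class,
            "resource": resource,
            "description": description.strip(),
        })
    return entries


def group_by_minute(entries):
    """Map each timestamp to the list of resources that logged an error then."""
    bursts = defaultdict(list)
    for entry in entries:
        bursts[entry["when"]].append(entry["resource"])
    return dict(bursts)


if __name__ == "__main__":
    import sys
    # Example: errpt | python this_script.py
    for when, disks in sorted(group_by_minute(parse_errpt_summary(sys.stdin.read())).items()):
        print(when.isoformat(), len(disks), " ".join(disks))
```

Several hdisks failing in the same minute, as in the 0605201809 entries, would be consistent with a fabric or port problem rather than a media defect on one disk.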

LABEL:          SC_DISK_ERR4
IDENTIFIER:     DCB47997

Date/Time:       Thu Jun  4 07:08:17 CUT 2009
Sequence Number: 1095955
Machine Id:      00CE6CDE4C00
Node Id:         pat2047
Class:           H
Type:            TEMP
Resource Name:   hdisk250
Resource Class:  disk
Resource Type:   Hitachi
Location:        U7311.D20.6555C3B-P1-C03-T1-W50060E800545321A-L4A000000000000
VPD:
        Manufacturer................HITACHI
        Machine Type and Model......OPEN-V
        Part Number.................
        ROS Level and ID............36303033
        Serial Number...............50 04532
        EC Level....................
        FRU Number..................
        Device Specific.(Z0)........00000332CF000002
        Device Specific.(Z1)........3D14 2L ....
        Device Specific.(Z2).........0==
        Device Specific.(Z3).........
        Device Specific.(Z4).........Z..
        Device Specific.(Z5)........
        Device Specific.(Z6)........

Description
DISK OPERATION ERROR

Probable Causes
MEDIA
DASD DEVICE

User Causes
MEDIA DEFECTIVE

        Recommended Actions
        FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
        PERFORM PROBLEM DETERMINATION PROCEDURES

Failure Causes
MEDIA
DISK DRIVE

        Recommended Actions
        FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
PATH ID
           0
SENSE DATA
0A00 2800 0D87 FD20 0000 1004 0000 0000 0000 0000 0000 0000 0200 0300 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0017 123C 0007 4500 0000 001D 0000 0000 0000 0000 0000 0000 0000
0000 0035 001D

The dlm commands also hang at the time these errors are reported.

HDS: Identified a storage port connection on the principal switch that appears to be slow draining, depleting credits, and possibly causing these temp errors. HDLM is behaving as designed.

Brocade: No problems seen with the SAN; further diagnosis might require an FC analyzer.

IBM: The errors are due to retries, possibly caused by problems in the SAN. The system may hang due to high I/O wait during that time. TL08 fixes I/O hang issues, and they need a crash dump to analyse further.

Oracle: Oracle is waiting for an I/O response that it never gets, hence it hangs and then dies after a few minutes.
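
One way to check the HDS theory about a single slow-draining storage port is to pull the target port out of each error's Location field and see whether the errors cluster on one WWPN. The sketch below assumes the Location code ends in `-W<16 hex digit target WWPN>-L<LUN id>`, as it does in the entry pasted above; the function name and the returned keys are hypothetical.

```python
import re

# ASSUMPTION: the Location code ends with -W<target WWPN>-L<LUN id>.
LOCATION_RE = re.compile(r"-W(?P<wwpn>[0-9A-Fa-f]{16})-L(?P<lun>[0-9A-Fa-f]+)$")


def split_fc_location(location):
    """Split a Location code into adapter port, target WWPN and LUN id.

    Returns None when the code does not end in the assumed -W...-L... form.
    """
    location = location.strip()
    match = LOCATION_RE.search(location)
    if match is None:
        return None
    return {
        "adapter_port": location[:match.start()],
        "target_wwpn": match.group("wwpn").upper(),
        "lun_id": match.group("lun").upper(),
    }


if __name__ == "__main__":
    print(split_fc_location(
        "U7311.D20.6555C3B-P1-C03-T1-W50060E800545321A-L4A000000000000"))
```

Counting errors per `target_wwpn` across all entries from `errpt -a` would show whether they concentrate on the port HDS identified or are spread across several storage ports.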

------------------------------------------------------------------------------------------------------------

Today a flapping ISL port caused an AIX system with 4 paths (4 HBAs) to throw ORA errors and die. This is the second time this has happened to an AIX system in my landscape.

I disabled the ISL port (one of 4 ISLs in a trunk; there are 2 trunk sets) and the host was rebooted. Things have been normal since then.

--------------------------------------------------------------------------------------------------------------

Questions:

1. Is anyone else seeing similar issues, with hosts and applications hanging in spite of redundancy?

2. Do you see an increased rate of such failures in a Brocade SAN with FOS 6.0 and above? I somehow feel such incidents have been on the rise since the 6.1.0c upgrade.

3. Any suggestions would be much appreciated, since I won't be able to sleep until I get a solution.

Thanks for your time and patience.

Regards,
Biju Krishnan
