AnsweredAssumed Answered

VSP disk I/O is hanging and linux cluster reboot

Question asked by Wagner Marroni on Oct 11, 2016
Latest reply on Jan 2, 2017 by Erwin van Londen

Dear friends,

 

We're facing a odd issue here.

 

We have a Hitachi VSP and several linux servers connected via brocade switches. We also have a Quantum Scalar with 10 tape drives Ultrium 5-SCSI. Each linux server maps all ten drives. For load balance and failover we're using linux multipath.

 

Say that, by third time now, our SCALAR having some operational problems, like:

 

SR Problem Summary: 652Q.GS00501; Control of Tape Drive at [1,1,1,6,1,1] has failed

SR Problem Code: Drives

SR Error Code: 09_09_01_00_00000000

SR Severity: 3

SR Notes: Failed configuration of [1, 1, 1, 6, 1, 1], primary port is disabled

 

The stranger thing is, when this error occur, some of Linux Servers running Oracle RAC reboots. Error is:

 

disk I/O is hanging for XXXXX msecs

 

And so, Oracle starts a node reboot to avoid "split brain" events.

 

We do not know why it happens, but these events are related (Scalar malfunction > disk hanging > Oracle reboots server)

 

After reboot, Linux servers stay running well (until new Scalar event).

 

Any clues? We are lost here. Our local Hitachi team were warned also.

 

Thanks

 

 

 

 

Outcomes