We're facing a odd issue here.
We have a Hitachi VSP and several linux servers connected via brocade switches. We also have a Quantum Scalar with 10 tape drives Ultrium 5-SCSI. Each linux server maps all ten drives. For load balance and failover we're using linux multipath.
Say that, by third time now, our SCALAR having some operational problems, like:
SR Problem Summary: 652Q.GS00501; Control of Tape Drive at [1,1,1,6,1,1] has failed
SR Problem Code: Drives
SR Error Code: 09_09_01_00_00000000
SR Severity: 3
SR Notes: Failed configuration of [1, 1, 1, 6, 1, 1], primary port is disabled
The stranger thing is, when this error occur, some of Linux Servers running Oracle RAC reboots. Error is:
disk I/O is hanging for XXXXX msecs
And so, Oracle starts a node reboot to avoid "split brain" events.
We do not know why it happens, but these events are related (Scalar malfunction > disk hanging > Oracle reboots server)
After reboot, Linux servers stay running well (until new Scalar event).
Any clues? We are lost here. Our local Hitachi team were warned also.