For several years we have been using HNAS systems here at the University: 4 HNAS 4080 nodes (clustered) and 2 HNAS 3080 nodes.
Over the last couple of weeks these systems have suffered from unexpected reboots.
HDS support investigated the root cause of these crashes and in most cases the root cause was a "transient error caused by a single event upset".
- HNAS uses field programmable gate arrays (FPGAs), which enable hardware performance with the ability to reprogram functionality.
- FPGAs are hardware chips that are programmed (on boot, every boot) to perform multiple operations in a single clock cycle.
- FPGAs store run-time code in a special area called configuration RAM (CRAM). CRAM is implemented as static RAM (SRAM) using complementary metal-oxide semiconductor (CMOS) technology, which holds electrical charge to store the FPGA code.
- As such, SRAM is susceptible to single bit errors, similar to other components that hold electrical charge, such as dynamic random-access memory (DRAM).
- The FPGA’s CRAM in HNAS is protected against single bit errors by a cyclic redundancy check (CRC) mechanism. This CRC is calculated on and checked by “frames of data” continuously flowing through the FPGA.
- When the CRC checker detects a single bit error, it generates a severe “assert” condition.
- HNAS initiates recovery and reboots the HNAS server in order to reprogram the FPGA and correct the single bit error.
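
To make the detection step in the list above concrete, here is a minimal sketch of how a CRC catches a single flipped bit. It is only an illustration: I don't know the CRC polynomial or frame layout HNAS actually uses, so `zlib.crc32` and the 64-byte frame are stand-ins, and the helper names `flip_bit` and `frame_is_intact` are made up for this example.

```python
import zlib


def flip_bit(frame: bytes, bit_index: int) -> bytes:
    """Return a copy of frame with one bit inverted (simulates a single event upset)."""
    corrupted = bytearray(frame)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def frame_is_intact(frame: bytes, expected_crc: int) -> bool:
    """Recompute the CRC over the frame and compare it with the stored value."""
    return zlib.crc32(frame) == expected_crc


# Hypothetical 64-byte "frame" standing in for a slice of configuration RAM.
frame = bytes(range(64))
golden_crc = zlib.crc32(frame)  # CRC recorded while the contents are known good

upset_frame = flip_bit(frame, 137)  # flip one arbitrary bit

print(frame_is_intact(frame, golden_crc))        # True: nothing changed
print(frame_is_intact(upset_frame, golden_crc))  # False: the flip is detected
```

Note that a CRC like this only detects the error, it does not say which bit flipped, which fits with the recovery being a reboot that reprograms the FPGA.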
Statistically, such an event could occur for each FPGA on the order of once every 6 years (as stated by HDS support).
We've had such transient errors and reboots before, but never with the frequency we saw in the last few weeks (and spread over several different HNAS systems across two datacenters).
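
As a rough sanity check of that figure, the sketch below estimates how many such events one would expect across all nodes in a short window. It assumes the upsets are independent and Poisson-distributed at the quoted rate. The number of FPGAs per node is NOT something I know, so `FPGAS_PER_NODE = 4` is purely a placeholder, and the function names are my own.

```python
import math


def expected_events(n_fpgas: int, rate_per_fpga_year: float, years: float) -> float:
    """Mean number of upsets across all FPGAs in the given time window."""
    return n_fpgas * rate_per_fpga_year * years


def prob_at_least(k: int, mean: float) -> float:
    """P(X >= k) for a Poisson-distributed count X with the given mean."""
    below_k = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return 1.0 - below_k


NODES = 6                 # 4 x HNAS 4080 + 2 x HNAS 3080, as described above
FPGAS_PER_NODE = 4        # ASSUMPTION: placeholder, real count unknown to me
RATE = 1 / 6              # events per FPGA per year, as quoted by HDS support
WINDOW_YEARS = 2 / 52     # "last couple of weeks"

mean = expected_events(NODES * FPGAS_PER_NODE, RATE, WINDOW_YEARS)
print(f"expected events in window: {mean:.3f}")
for k in (1, 2, 3):
    print(f"P(at least {k} events): {prob_at_least(k, mean):.4f}")
```

With these placeholder inputs the expected count is 24 × (1/6) × (2/52), roughly 0.15 events in two weeks, so the result depends heavily on the real FPGA count. If HDS can supply that number, the same calculation shows how unusual a cluster of reboots really is under their own stated rate.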
HDS engineering states that these events are within specification and do not require any further action.
My question to other users of HNAS systems: do you experience such "transient errors" and reboots, and if so, what frequency or pattern do you see?