Introduction to VSP 5200 and 5600 Architecture

By Sudipta Kumar Mohapatra posted 04-11-2022 19:46

Introduction

The recently upgraded Hitachi enterprise Virtual Storage Platform (VSP) 5000 series features industry-leading performance and availability. The VSP 5000 series scales up, scales out, and scales deep. It features a single, flash-optimized Storage Virtualization Operating System (SVOS) image running on as many as 240 processor cores, sharing a global cache of up to 6 TiB. The VSP 5000 controller blades are linked together by a highly reliable PCIe switched network, featuring interconnect engines with hardware-assisted Direct Memory Access (DMA). In addition, the VSP 5000 cache architecture has been streamlined, permitting read response times as low as 39 microseconds. Improvements in reliability and serviceability allow the VSP 5000 to achieve industry-leading 99.999999% availability. In this blog post, we’ll take a brief look at the highlights of the VSP 5000 series architecture.

SCALE UP, SCALE OUT, SCALE DEEP

The entry-level VSP 5200 offers up to 5.1 million IOPS from two controllers (40 cores), with 99.9999% availability. The VSP 5200 can be non-disruptively scaled up to the VSP 5600-2N (four controllers, 80 cores), offering industry-leading 99.999999% availability. The VSP 5600-2N scales out to the VSP 5600-4N and VSP 5600-6N, as shown in Figure 1. The VSP 5600-6N is capable of up to 33 million IOPS and can attain up to 310 GB/s of front-end bandwidth (and 150 GB/s of sequential throughput). All VSP 5000 models, including the earlier VSP 5100 and VSP 5500 and the newly launched VSP 5200 and VSP 5600, can scale deep by virtualizing external storage.

Figure 1: New VSP 5200 and 5600 offer flexible configuration options with industry-leading performance and availability

SINGLE SVOS IMAGE, GLOBAL CACHE

Similar to previous Hitachi enterprise products, all VSP 5000 processors run a single SVOS image and share a global cache. Dedicated cache boards have been eliminated, and cache is now distributed across individual controllers for fast memory access and better resiliency. However, cache remains a global resource and is accessible by each controller over a PCIe switched network using hardware-assisted DMA. The DMA hardware assist is performed by Field Programmable Gate Arrays (FPGAs) in each controller's HIE interface. The new DMA implementation reduces CPU overhead and improves the performance of inter-controller transfers. Figure 2 shows how the components of a VSP 5000 (a four-node system in this example) are connected. Each board (CHB, DKB, HIE) is connected to a controller using eight PCIe Gen 3 lanes, and therefore has 16 GB/s of available bandwidth (8 GB/s send and 8 GB/s receive). Each controller has up to 64 GB/s of front-end bandwidth (provided by four CHBs), 32 GB/s of back-end bandwidth (two DKBs), and 32 GB/s of interconnect bandwidth (two HIEs).

Figure 2: Example of a four-node and eight-CTL high-level VSP 5000 architecture
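The per-controller bandwidth figures quoted above follow from simple lane arithmetic. The short Python sketch below reproduces them; the value of roughly 1 GB/s per PCIe Gen 3 lane per direction is an approximation implied by the 8 GB/s-per-direction figure in the text, and all names in the code are illustrative rather than taken from any Hitachi tool.

```python
# Approximate usable PCIe Gen 3 bandwidth per lane, per direction, in GB/s.
# Assumption: the article's 8 GB/s per direction over 8 lanes implies ~1 GB/s.
GBS_PER_LANE_PER_DIRECTION = 1.0
LANES_PER_BOARD = 8

# Maximum number of boards of each type per controller, as stated in the text.
BOARDS_PER_CONTROLLER = {"CHB": 4, "DKB": 2, "HIE": 2}


def board_bandwidth_gbs(lanes=LANES_PER_BOARD):
    """Return the combined send + receive bandwidth of one board in GB/s."""
    return 2 * lanes * GBS_PER_LANE_PER_DIRECTION


def controller_bandwidth_gbs():
    """Return the aggregate bandwidth per board type for one controller."""
    per_board = board_bandwidth_gbs()
    return {name: count * per_board for name, count in BOARDS_PER_CONTROLLER.items()}


if __name__ == "__main__":
    for name, bandwidth in controller_bandwidth_gbs().items():
        print(f"{name}: {bandwidth:.0f} GB/s")
```

These are theoretical link totals; they say nothing about the throughput a given workload will actually achieve.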

HARDWARE COMPONENTS

The basic VSP 5200 and 5600 hardware components are unchanged when compared to the original VSP 5000 series. For example, CHBs, DKBs, HIE adapters, and drive boxes remain the same.

Up to four 4-port 8/16/32 Gb FC CHBs can be installed per controller.
Two DKBs must be installed per controller on systems with drive boxes.
Each SAS DKB has two ports that connect to drive boxes over 4 x 12 Gbps links per port.

However, several key components have been upgraded or given new capabilities as follows:

Each VSP 5200/5600 controller board is now powered by two advanced 10-core Intel Cascade Lake CPUs, operating at 2.2 GHz.
Two Compression Accelerator Modules per controller have been added, resulting in 3x more IOPS with improved adaptive data reduction (ADR) performance.
FC CHBs now support both SCSI and NVMe protocols.
iSCSI and FICON CHBs are available.

For more information on open systems CHBs, controller boards, and SAS DKBs, see the VSP Gxx0 and Fxx0 Architecture and Concepts Guide. First, let’s look at the components that have been improved in VSP 5200 and VSP 5600.

VSP 5200/5600 CONTROLLER

In the latest VSP 5000 models, two 10-core Cascade Lake CPUs function as a single 20-core MPU per controller. Each controller has eight DIMM slots into which 64 GB DDR4 DIMMs are installed, for a maximum of 512 GB cache per controller, as shown in Figure 3. Each controller has two HIE adapters that connect to a high-speed PCIe switched network as discussed in the Hitachi Accelerated Fabric blog. The Cascade Lake CPU in the new controller has the same number of cores and the same clock frequency as the original Broadwell CPU, as shown in Figure 4.

So why has GPSE testing attained significantly higher performance when using the upgraded controllers, even with non-ADR workloads in which the new compression accelerator is not a factor? The primary reason is the enhanced memory architecture of the Cascade Lake CPU, which incorporates two memory controllers per CPU instead of the one in the original Broadwell CPU. The extra memory controller reduces contention for memory access. Because all I/O goes through CPU memory (either cache or the DXBF transfer buffer), faster memory access in turn improves I/O performance. Another significant improvement is the increase from 40 PCI Express Gen3 lanes per Broadwell CPU to 48 lanes per Cascade Lake CPU. The additional eight lanes connect each Cascade Lake CPU to a Compression Accelerator Module, as shown in Figure 3.

Figure 3: Upgraded VSP 5000 controller

Figure 4: 10-Core CPU comparison
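The controller figures in this section roll up to the system-level numbers quoted in the introduction (240 cores and 6 TiB of global cache). The Python sketch below shows the arithmetic. The controller counts for the 4N and 6N models are inferred from the two-controllers-per-node pattern in Figure 2, and the DIMM size is treated as binary gigabytes so that the total lines up with 6 TiB; both are assumptions, not vendor specifications.

```python
CORES_PER_CPU = 10
CPUS_PER_CONTROLLER = 2
DIMM_SLOTS_PER_CONTROLLER = 8
DIMM_SIZE_GIB = 64  # assumption: the "64 GB" DIMMs are counted as GiB

# Controllers per model; the 4N and 6N counts are inferred, not quoted.
CONTROLLERS_PER_MODEL = {
    "VSP 5200": 2,
    "VSP 5600-2N": 4,
    "VSP 5600-4N": 8,
    "VSP 5600-6N": 12,
}


def system_totals(model):
    """Return (total cores, maximum cache in GiB) for the given model."""
    controllers = CONTROLLERS_PER_MODEL[model]
    cores = controllers * CPUS_PER_CONTROLLER * CORES_PER_CPU
    cache_gib = controllers * DIMM_SLOTS_PER_CONTROLLER * DIMM_SIZE_GIB
    return cores, cache_gib


if __name__ == "__main__":
    for model in CONTROLLERS_PER_MODEL:
        cores, cache_gib = system_totals(model)
        print(f"{model}: {cores} cores, {cache_gib / 1024:.1f} TiB max cache")
```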

COMPRESSION ACCELERATOR MODULE

Each VSP 5200/5600 controller features two compression accelerator modules. As observed in GPSE testing, the compression accelerator improves the ADR performance by as much as 2X, while also boosting capacity savings. The compression accelerator module allows the CPU to offload the work of compression to a Hitachi-designed Application-Specific Integrated Circuit (ASIC). The ASIC uses an efficient compression algorithm that is optimized for implementation in specialized hardware. The compression accelerator operates on the data in cache using DMA and does not require copy operations. Therefore, it can work with very low latency. The compression accelerator module is connected to the controller using eight PCI Express Gen3 lanes, as shown in Figure 5. Within the module, a PCIe switch connects four lanes to each of the two ASICs per compression accelerator module. The compression accelerator occupies unused space in the fan module (two per controller), so each controller gets four compression ASICs.

Figure 5: Compression Accelerator Module block diagram
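The PCIe lane budget described for the compression accelerator can be checked with a few lines of Python. This is only a bookkeeping sketch built from the numbers in the text; the constant names are illustrative.

```python
BROADWELL_PCIE_LANES = 40
CASCADE_LAKE_PCIE_LANES = 48
LANES_PER_MODULE = 8        # PCIe Gen3 lanes from a CPU to one module
ASICS_PER_MODULE = 2
MODULES_PER_CONTROLLER = 2  # one per fan module

# Lanes gained per CPU by moving from Broadwell to Cascade Lake.
extra_lanes_per_cpu = CASCADE_LAKE_PCIE_LANES - BROADWELL_PCIE_LANES

# The PCIe switch in each module splits its lanes evenly between the ASICs.
lanes_per_asic = LANES_PER_MODULE // ASICS_PER_MODULE

asics_per_controller = MODULES_PER_CONTROLLER * ASICS_PER_MODULE

if __name__ == "__main__":
    print(f"Extra lanes per CPU: {extra_lanes_per_cpu}")
    print(f"Lanes per compression ASIC: {lanes_per_asic}")
    print(f"Compression ASICs per controller: {asics_per_controller}")
```

The extra eight lanes per CPU exactly cover one module, which is why each of the two CPUs in a controller can drive its own module.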

FRONT END CONFIGURATION

The default mode for VSP 5000 FC ports is target only. This mode supports a command queue depth of 2,048 per port, for compatibility with VSP G1500 storage systems. The VSP 5000 also offers an optional bi-directional port mode (except for ports configured for NVMe-oF), under which a port can simultaneously function as a target and an initiator, with each function having a queue depth of 1,024 (see Figure 6). The highest-performing VSP 5000 front-end configuration uses 100% straight access, in which LUNs are always accessed on a CHB port that is connected to the controller that owns the LUN. Addressing a LUN through the non-owning controller (known as front-end cross I/O) incurs a small amount of additional overhead for each command. However, GPSE testing shows that front-end cross I/O does not have a significant performance impact under normal operating conditions (up to about 70% MP busy). Unless you require the highest possible levels of performance, we do not recommend configuring specifically to avoid front-end cross I/O, because such a configuration has lower availability than multi-path I/O. The CPK sizing tool estimates performance based on the assumption that 50% front-end cross I/O will occur.
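The distinction between straight and cross I/O can be illustrated with a small Python sketch. The port names, LUN names, and ownership maps below are entirely hypothetical, and the calculation assumes that I/O is spread evenly over all listed paths; it is not how SVOS or the CPK sizing tool is implemented.

```python
# Hypothetical topology: the controller each CHB port is attached to,
# and the controller that owns each LUN. All names are illustrative.
PORT_CONTROLLER = {"1A": "CTL0", "2A": "CTL1"}
LUN_OWNER = {"lun0": "CTL0", "lun1": "CTL1"}


def classify_io(port, lun):
    """Return 'straight' if the port's controller owns the LUN, else 'cross'."""
    return "straight" if PORT_CONTROLLER[port] == LUN_OWNER[lun] else "cross"


def cross_fraction(paths):
    """Return the share of cross I/O, assuming equal load on every path."""
    if not paths:
        raise ValueError("at least one path is required")
    cross = sum(1 for port, lun in paths if classify_io(port, lun) == "cross")
    return cross / len(paths)


# Multi-path layout: every LUN is reachable through a port on each controller.
multipath = [(port, lun) for lun in LUN_OWNER for port in PORT_CONTROLLER]

# Straight-only layout: every LUN is reachable only through its owner's port.
straight_only = [("1A", "lun0"), ("2A", "lun1")]

if __name__ == "__main__":
    print(f"multi-path cross fraction: {cross_fraction(multipath):.0%}")
    print(f"straight-only cross fraction: {cross_fraction(straight_only):.0%}")
```

With two controllers and evenly balanced paths, the multi-path layout yields half cross I/O, which is consistent with the 50% assumption mentioned above, while the straight-only layout leaves each LUN with a single path.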

NVMe-oFC

VSP 5000 CHBs now support the NVMe protocol over fibre channel networks. This allows you to take advantage of flash-optimized NVMe features using an existing fibre channel infrastructure. NVMe over fibre channel offers the following performance advantages:

The streamlined NVMe driver stack reduces host CPU system time per I/O, potentially reducing server costs and increasing the host I/O potential.
The NVMe multi-queueing architecture allows the host to initiate more concurrent I/O to each LUN (or namespace in NVMe nomenclature).
GPSE testing confirms that these two features increase the performance potential per LUN, allowing high throughput with fewer host-facing devices than would be required with the SCSI protocol.

Figure 6: VSP 5000 Bi-directional port functionality

BACK-END CONFIGURATION

For optimal performance, the first level of drive boxes connected to the DKBs must be either an SBX (4 x DBS2) or an FBX (4 x DBF3). The four SAS expanders in the DBS2 / DBF3 allow any type of drive chassis to be installed in levels 2 through n, as shown in Figure 7. With this configuration, any of the four controllers in the CBX pair can access any drive in the CBX pair without performance degradation from inter-CTL overhead.

Figure 7: Either an SBX (4 x DBS2) or an FBX (4 x DBF3) must be installed in level 1

In VSP 5600 storage systems (as in the VSP 5500) with more than one CBX pair (two CBX pairs in 4N systems and three CBX pairs in 6N systems), back-end cross I/O can occur. Back-end cross I/O occurs when the drive being accessed is in a different CBX pair from the controller that owns the target DP-VOL. Back-end cross I/O requires extra HIE/ISW transfers, and therefore may degrade performance. Fortunately, there is a new intelligent HDP algorithm that reduces the frequency of back-end cross I/O to the point where it has little or no effect on performance. Briefly, the new system allocates a DP-VOL page in the same CBX pair as the controller that owns it, as shown in Figure 8. The new algorithm applies only to flash drives, because avoiding cross I/O is not crucial for spinning disk performance.

Figure 8: HDP intelligent page allocation
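The locality idea behind the page allocation described above can be sketched in a few lines of Python. This is a toy model, not the HDP implementation: the function name, the free-page bookkeeping, and especially the fallback to another CBX pair when the local pair is full are assumptions made for illustration.

```python
def allocate_page(owner_cbx_pair, free_pages):
    """Pick the CBX pair that will hold a new DP-VOL page.

    owner_cbx_pair: CBX pair of the controller that owns the DP-VOL.
    free_pages: dict mapping CBX pair id -> number of free flash pages.
    The dict is updated in place and the chosen CBX pair id is returned.
    """
    if free_pages.get(owner_cbx_pair, 0) > 0:
        # Preferred case: keep the page local to avoid back-end cross I/O.
        chosen = owner_cbx_pair
    else:
        # Assumed fallback: use the remote CBX pair with the most free pages.
        candidates = [pair for pair, count in free_pages.items() if count > 0]
        if not candidates:
            raise RuntimeError("pool has no free pages")
        chosen = max(candidates, key=lambda pair: free_pages[pair])
    free_pages[chosen] -= 1
    return chosen


if __name__ == "__main__":
    pool = {0: 1, 1: 5}             # hypothetical free-page counts per CBX pair
    print(allocate_page(0, pool))   # local allocation while space remains
    print(allocate_page(0, pool))   # falls back once the local pair is full
    print(pool)
```

In this model, back-end cross I/O arises only for pages that had to be placed remotely, which is the effect the text attributes to the new algorithm.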

As mentioned previously, each SAS DKB has two ports that connect to drive boxes over 4 x 12 Gbps links per port. This yields a total of 64 x 12 Gbps SAS links per CBX pair. Therefore, each CBX pair has the same number of back-end paths as a single-chassis VSP G1500 with the high-performance back-end option. Because of the increase from 6 Gbps to 12 Gbps links, the theoretical SAS bandwidth per CBX pair is twice that of the VSP G1500 single chassis (768 Gbps vs. 384 Gbps). To maintain enterprise-level performance and availability, the VSP 5000 also requires fixed slot assignments for building parity groups, as shown in Figure 9. RAID levels 14D+2P, 6D+2P, 7D+1P, 3D+1P, and 2D+2D are supported.

Figure 9: Example of the parity group slot assignment system for highest performance and availability
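The SAS link count and bandwidth comparison earlier in this section can be reproduced with the following Python sketch. It uses only numbers stated in the text (four controllers per CBX pair, two DKBs per controller, two ports per DKB, four links per port); the variable and function names are illustrative.

```python
CONTROLLERS_PER_CBX_PAIR = 4
DKBS_PER_CONTROLLER = 2
PORTS_PER_DKB = 2
LINKS_PER_PORT = 4

VSP_5000_LINK_GBPS = 12
VSP_G1500_LINK_GBPS = 6


def sas_links_per_cbx_pair():
    """Return the number of SAS links in a fully populated CBX pair."""
    return (CONTROLLERS_PER_CBX_PAIR * DKBS_PER_CONTROLLER
            * PORTS_PER_DKB * LINKS_PER_PORT)


def theoretical_sas_gbps(links, link_speed_gbps):
    """Return the aggregate theoretical SAS bandwidth in Gbps."""
    return links * link_speed_gbps


if __name__ == "__main__":
    links = sas_links_per_cbx_pair()
    print(f"SAS links per CBX pair: {links}")
    print(f"VSP 5000 CBX pair: {theoretical_sas_gbps(links, VSP_5000_LINK_GBPS)} Gbps")
    print(f"VSP G1500 chassis: {theoretical_sas_gbps(links, VSP_G1500_LINK_GBPS)} Gbps")
```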

RESILIENCY ENHANCEMENTS

Resiliency enhancements allow the VSP 5000 to achieve industry-leading 99.999999% availability. These improvements include:

An 80% reduction in the rebuild time for flash drives.
A more robust cache directory system that allows latency-sensitive host applications to stay up by avoiding write through mode.
Reserved areas on separate power boundaries that allow quick recovery of a redundant copy of shared memory during a hardware failure or power outage.
Resilient interconnect featuring four independent PCIe switches. Up to seven X-path cables can fail without the risk of taking the system down, compared to two such failures on the VSP G1500.
100% access to all drives in a CBX pair when up to three DKBs (out of eight) fail. (The VSP G1500 can sustain two DKA failures.)

PERFORMANCE ENHANCEMENTS

Significant performance enhancements in the VSP 5000 include the following:

A streamlined cache directory system that eliminates redundant directory updates and improves the response time.
An increase to 24 dedupe stores and 24 DSDVOLs per pool for improved multiplicity of I/O when deduplication is enabled.
Smaller access size for adaptive data reduction (ADR) metadata reduces the overhead.
ADR access range for good metadata hit rate expanded to ~2 PB per CBX pair.
Inter-controller DMA transfers and data integrity checks offloaded to HIE to reduce CPU overhead.
Support for NVMe drives allows extremely low latency with up to 5X higher cache miss IOPS per drive.
Support for ADR in an HDT pool (recommended only for all-flash configurations with SCM and SSDs) enables capacity savings plus higher performance by accessing the busiest pages without ADR overhead.
Support for NVMe over fibre channel.

The VSP 5200/5600 also includes:

An upgraded Cascade Lake CPU.
Two Compression Accelerator Modules per controller, improving ADR IOPS performance by up to 3X.

MAINFRAME SUPPORT

The VSP 5000 supports mainframe architecture, similar to the previous generations of Hitachi enterprise storage. There are two notable differences in how the VSP 5000 handles mainframe I/O versus earlier enterprise products. First, command routing is done by the HTP protocol chip on the FICON CHB instead of the VSP G1500 LR ASIC. Second, the CKD-FBA format conversion is now offloaded to the HIE interconnect engine. Because all mainframe I/O must go through the HIE for format conversion, there is no performance difference between front-end straight and front-end cross I/O on mainframe.

We briefly reviewed the highlights of the VSP 5000 architecture, including the latest enhancements introduced in the VSP 5200 and 5600. For additional information on enterprise VSP storage systems, see the GPSE Resource Library and the VSP 5000 connect page.

#VSP5000

#EnterpriseStorage
#VSP5600
#HitachiVirtualStoragePlatformVSP

Comments

Stefano Serafini

01-10-2024 11:02

Great post, very helpful

Dipak Singh

11-06-2022 09:02

Informative!

Tanmoy Panja

06-15-2022 13:18

Very Helpful

Chayan Sarkar

05-02-2022 03:45

Great post
