The Virtual Storage Platform (VSP) E590 and E790 are the newest additions to Hitachi’s midsized enterprise product line, engineered to use NVMe to deliver industry-leading performance in a small, affordable form factor. The VSP E series (which also includes the VSP E990) features a single, flash-optimized Storage Virtualization Operating System (SVOS) image running on Intel Xeon multi-core, multi-threaded processors. Thanks primarily to SVOS optimizations, the new VSP E series models offer the highest performance available in a 2U form factor. In this blog, we will examine how Hitachi put industry-leading NVMe performance into a very small package.
The VSP E series arrays leverage the NVMe storage protocol, which was designed to take advantage of the fast response times and parallel I/O capabilities of flash. NVMe provides a streamlined command set for accessing flash media over a PCIe bus or storage fabric, allowing for up to 64K separate queues, each with a queue depth of 64K. Harnessing the full power of flash with the NVMe protocol requires an efficient operating system and advanced processors, both of which are included with the latest VSP E series arrays. The new technologies included in the VSP E series are complemented by Hitachi’s sophisticated, flash-optimized cache architecture, with its focus on data integrity, efficiency, and performance optimization. Table 1 presents the basic specifications of the VSP E590 and VSP E790.
Table 1. Selected VSP E590 and E790 Specifications
|Feature||VSP E590||VSP E790|
|Maximum 8 KiB IOPS¹||3.1 Million||6.8 Million|
|Maximum NVMe SSDs||24||24|
|Maximum Raw Internal Capacity||720 TB (24 x 30 TB SSD)||720 TB (24 x 30 TB SSD)|
|Maximum Raw External Capacity||144 PB||216 PB|
|Cache Capacity Options||384 GB or 768 GB||768 GB|
|Maximum Internal Cache Bandwidth||460 GB/s||460 GB/s|
|Maximum Number of Logical Devices||32,768||49,152|
|Maximum LUN Size||256 TB||256 TB|
|Maximum Host Ports||12 iSCSI, 24 FC||12 iSCSI, 24 FC|
|Data-at-Rest Encryption||Available||Available|
1. 100% Random Read Cache-Hit
Before covering the hardware in more detail, let’s begin with a review of two important SVOS features that are common to all Hitachi storage products.
Hitachi Dynamic Provisioning (HDP)
Hitachi Dynamic Provisioning provides a mechanism for grouping many physical devices (NVMe SSDs in the VSP E series) into a single pool. The pool mechanism creates a structure of 42 MB pool pages from each device within the pool. HDP then presents automatically managed, wide-striped block devices to one or more hosts. This is similar to a host-based logical volume manager (LVM) and its wide striping across all member LUNs in its “pool”. The LUNs presented by an HDP pool are called Dynamic Provisioning Volumes (DPVOLs or virtual volumes). DPVOLs have a user-specified logical size of up to 256 TB. The host accesses the DPVOL as if it were a normal volume (LUN) over one or more host ports. A major difference is that disk space is not physically allocated to a DPVOL from the pool until the host has written to the corresponding part of that DPVOL’s Logical Block Address (LBA) space. The entire logical size specified when creating that DPVOL could eventually become fully mapped to physical space using 42 MB pool pages from every device in the pool.
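The allocate-on-first-write behavior of a DPVOL can be illustrated with a minimal Python sketch. This is not SVOS code: the `Pool` and `ThinVolume` classes, the simple rotation across devices, and the binary interpretation of "42 MB" are all assumptions made for illustration only.

```python
PAGE_SIZE = 42 * 1024 * 1024  # 42 MB pool page (binary megabytes assumed)


class Pool:
    """Hypothetical pool that hands out pages, rotating across devices."""

    def __init__(self, device_count, pages_per_device):
        self.device_count = device_count
        self.free = [pages_per_device] * device_count
        self.next_device = 0

    def allocate_page(self):
        # Try each device once, starting where the last allocation left off.
        for _ in range(self.device_count):
            dev = self.next_device
            self.next_device = (dev + 1) % self.device_count
            if self.free[dev] > 0:
                self.free[dev] -= 1
                return dev
        raise RuntimeError("pool exhausted")


class ThinVolume:
    """Hypothetical DPVOL: physical pages are mapped only when written."""

    def __init__(self, pool, logical_size):
        self.pool = pool
        self.logical_size = logical_size
        self.page_map = {}  # virtual page index -> device holding the page

    def write(self, offset, length):
        if offset < 0 or length <= 0 or offset + length > self.logical_size:
            raise ValueError("write outside the volume's logical size")
        first = offset // PAGE_SIZE
        last = (offset + length - 1) // PAGE_SIZE
        for page in range(first, last + 1):
            if page not in self.page_map:  # first touch allocates the page
                self.page_map[page] = self.pool.allocate_page()

    def allocated_bytes(self):
        return len(self.page_map) * PAGE_SIZE


pool = Pool(device_count=8, pages_per_device=1000)
vol = ThinVolume(pool, logical_size=256 * 1024**4)  # 256 TB logical size
vol.write(0, 4096)           # touches virtual page 0 only
vol.write(PAGE_SIZE - 1, 2)  # straddles pages 0 and 1; only page 1 is new
print(len(vol.page_map), vol.allocated_bytes())
```

Even though the volume advertises 256 TB, only two pages (84 MB) are consumed from the pool after these writes, and they land on different devices.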
Adaptive Data Reduction (ADR)
Adaptive Data Reduction (ADR) adds controller-based compression and deduplication to HDP, greatly increasing the effective storage capacity presented by the array. A lossless compression algorithm is used to reduce the number of physical bits needed to represent the host-written data. Deduplication removes redundant copies of identical data segments and replaces them with pointers to a single instance of the data segment on disk. These capacity saving features are supported in conjunction with HDP, so that only DPVOLs can have either compression or deduplication plus compression enabled. (Deduplication without compression is not supported.) Each DPVOL has a capacity saving attribute, for which the settings are “Disabled”, “Compression”, or “Deduplication and Compression”. DPVOLs with either capacity saving setting enabled are referred to as Data Reduction Volumes (DRDVOLs). The deduplication scope is at the HDP pool level for all DRDVOLs with the “Deduplication and Compression” attribute set.
The data reduction engine uses a combination of inline and post-process methods to achieve capacity saving with the minimum amount of overhead to host I/O. Normally with HDP, each DPVOL is made up of multiple 42 MB physical pages allocated from the HDP pool. But with data reduction enabled, each DRDVOL is made up of 42 MB virtual pages. If a virtual page has not yet been processed for data reduction, it is identified as a non-DRD virtual page and is essentially a pointer to an entire physical page in the pool. After data reduction, the virtual page is identified as a DRD virtual page and it then contains pointers to 8 KB chunks stored in different physical pages in the HDP pool. The initial data reduction post-processing is done to non-DRD virtual pages that have not had write activity in at least five minutes. The non-DRD virtual page is processed in 8 KB chunks and compressed data are written in log-structured fashion to new locations in the pool, likely one or more new physical pages. If enabled for the DRDVOL, deduplication is then performed on the compressed data chunks, so that duplicate chunks are invalidated (after a hash match and bit-by-bit comparison) and replaced with a pointer to the location of the physical chunk. Garbage collection is done in the background to combat fragmentation over time by coalescing the pockets of free space resulting from invalidated data chunks. Subsequent rewrites to already compressed data are then handled purely inline for best performance.
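The post-process flow described above (split a page into 8 KB chunks, compress each chunk, append it to a log, and deduplicate after a hash match plus a full comparison) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the ADR implementation: `zlib` and SHA-256 are stand-ins because the algorithms are not named here, the `ReductionStore` class is hypothetical, and the idle-time check and garbage collection are omitted.

```python
import hashlib
import zlib

CHUNK_SIZE = 8 * 1024  # 8 KB chunks, per the description above


class ReductionStore:
    """Hypothetical log-structured store for compressed chunks."""

    def __init__(self, deduplicate=True):
        self.deduplicate = deduplicate
        self.log = []      # append-only list of compressed chunks
        self.by_hash = {}  # digest -> indexes of log entries with that digest

    def store_chunk(self, chunk):
        compressed = zlib.compress(chunk)  # stand-in lossless compressor
        if self.deduplicate:
            digest = hashlib.sha256(compressed).digest()
            for index in self.by_hash.get(digest, []):
                # A hash match alone is not trusted; compare the full data.
                if self.log[index] == compressed:
                    return index  # pointer to the existing chunk
            self.by_hash.setdefault(digest, []).append(len(self.log))
        self.log.append(compressed)
        return len(self.log) - 1

    def reduce_page(self, page):
        """Return one pointer (log index) per 8 KB chunk of the page."""
        return [self.store_chunk(page[i:i + CHUNK_SIZE])
                for i in range(0, len(page), CHUNK_SIZE)]

    def read_page(self, pointers):
        return b"".join(zlib.decompress(self.log[p]) for p in pointers)


store = ReductionStore(deduplicate=True)
page = b"A" * CHUNK_SIZE * 3 + b"B" * CHUNK_SIZE
pointers = store.reduce_page(page)
print(pointers)                          # [0, 0, 0, 1]
print(store.read_page(pointers) == page)  # True
```

Constructing the store with `deduplicate=False` models the “Compression” setting, in which every chunk is compressed and appended without a duplicate lookup.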
A major advantage of Hitachi’s approach to data reduction is the ability to customize settings for each LUN. For example, compression (but not deduplication) could be configured for LUNs on which a database’s unique checksums in each 8 KB block could make deduplication less effective. Both compression and deduplication could be enabled for LUNs hosting virtual desktops, which may contain multiple copies of the same data. And if an application encrypts or compresses its data on the host, making additional capacity savings impossible, then ADR can be completely disabled on the affected LUNs to avoid unnecessary overhead. The flexibility of Hitachi’s ADR allows capacity savings to be obtained on appropriate data, with the least possible impact to host I/O.
Front End Configuration
Two options for host connectivity are currently offered: Fibre Channel (FC) and iSCSI. The VSP E590 and VSP E790 support one to three pairs of channel boards (CHBs), installed in the rear of the chassis as shown in Figure 1. (Protocol types must be installed symmetrically between controller 1 and controller 2.)
Figure 1. VSP E590 and VSP E790 CHB locations
Channel boards must be installed in pairs. The FC CHBs support transfer rates up to 3,200 MB/s, and the iSCSI CHBs support transfer rates up to 1,000 MB/s. Additional details about the CHBs are shown in Table 2 below.
Table 2. Host Connectivity Options
|CHB||Transfer Rate||CHBs Per System||Ports Per CHB||Ports Per System|
|16 Gb Fibre Channel||400/800/1600 MB/s||2/4/6||4||8/16/24|
|32 Gb Fibre Channel||800/1600/3200 MB/s||2/4/6||4||8/16/24|
|Fibre iSCSI||1000 MB/s||2/4/6||2||4/8/12|
|Copper iSCSI||100/1000 MB/s||2/4/6||2||4/8/12|
VSP Ex90 FC ports operate in universal (also called bi-directional) mode. A bi-directional port can simultaneously function as a target (for host I/O or replication) and initiator (for external storage or replication), with each function having a queue depth of 1,024.
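The port counts in Table 2 follow directly from the number of CHBs installed. The short sketch below reproduces that arithmetic and adds the sum of the nominal per-port rates. The `front_end` helper is hypothetical, and the aggregate figure is only a theoretical ceiling derived from the table, not a measured result.

```python
# (ports per CHB, maximum nominal MB/s per port), taken from Table 2
CHB_TYPES = {
    "16 Gb Fibre Channel": (4, 1600),
    "32 Gb Fibre Channel": (4, 3200),
    "Fibre iSCSI": (2, 1000),
    "Copper iSCSI": (2, 1000),
}


def front_end(chb_type, chb_count):
    """Return (ports per system, sum of nominal port rates in MB/s)."""
    if chb_count not in (2, 4, 6):
        raise ValueError("CHBs are installed in pairs: 2, 4, or 6 per system")
    ports_per_chb, mb_per_s = CHB_TYPES[chb_type]
    ports = ports_per_chb * chb_count
    return ports, ports * mb_per_s


print(front_end("32 Gb Fibre Channel", 6))  # (24, 76800)
print(front_end("Fibre iSCSI", 6))          # (12, 12000)
```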
The VSP E590 and E790 Controllers
Processing power and high-speed connectivity are at the heart of the 2U VSP E series dual-controller system (as shown in Figure 2). Let’s begin with the controller’s connection to the channel boards. The “A” and “C” channel board slots each connect to the controller via 16 PCIe Gen3 lanes, and thus have 32 GB/s of available bandwidth (16 GB/s send and 16 GB/s receive). The “B” slot gets eight PCIe Gen3 lanes, and therefore has 16 GB/s of theoretical bandwidth (8 GB/s in each direction). Each controller has two multicore Intel Xeon CPUs, linked by two Ultra Path Interconnects (UPIs), each supporting up to 10.4 gigatransfers per second. The two CPUs operate as a single multi-processing unit (MPU) per controller. Like previous Hitachi enterprise products, all VSP E series processors run a single SVOS image and share a global cache. Cache is allocated across individual controllers for fast, efficient, and balanced memory access.
Figure 2. VSP E590 and VSP E790 Controller Block Diagram
Each CPU has six memory channels for DDR4 memory, providing as much as 115 GB/s per CPU of theoretical memory bandwidth (up to 230 GB/s per controller, and 460 GB/s per system). Data and command transfers between controllers are done over the two non-transparent bridge (NTB) connections, which together are allocated a total of 16 PCIe Gen3 lanes. All VSP E series arrays feature two NTB connections between controllers, thus avoiding any single point of failure for this critical component. Finally, each controller’s CPUs are connected via an embedded PCIe switch to the NVMe SSDs. Each controller can establish a point-to-point connection to each of up to twenty-four drives over a 2-lane, 4 GB/s PCIe Gen3 bus.
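The bandwidth figures quoted for the controller can be reproduced from first principles. PCIe Gen3 signals at 8 GT/s per lane with 128b/130b encoding, which gives just under 1 GB/s per lane in each direction. The DDR4 speed is not stated in this article; DDR4-2400 is assumed in the sketch below only because it is the speed grade that yields roughly 115 GB/s across six channels. Both helper functions are hypothetical.

```python
PCIE_GEN3_GT_PER_S = 8.0        # giga-transfers per second, per lane
PCIE_GEN3_ENCODING = 128 / 130  # 128b/130b line encoding efficiency


def pcie_gen3_gb_per_s(lanes):
    """Theoretical one-direction PCIe Gen3 bandwidth in GB/s."""
    return lanes * PCIE_GEN3_GT_PER_S * PCIE_GEN3_ENCODING / 8


def ddr4_gb_per_s(channels, mega_transfers_per_s=2400):
    """Theoretical DDR4 bandwidth in GB/s (64-bit channels, speed assumed)."""
    return channels * mega_transfers_per_s * 8 / 1000


for name, lanes in (("CHB slot A/C", 16), ("CHB slot B", 8), ("NVMe SSD", 2)):
    one_way = pcie_gen3_gb_per_s(lanes)
    print(f"{name}: {one_way:.2f} GB/s each way, {2 * one_way:.2f} GB/s total")

per_cpu = ddr4_gb_per_s(6)
print(f"memory: {per_cpu:.1f} GB/s per CPU, "
      f"{2 * per_cpu:.1f} per controller, {4 * per_cpu:.1f} per system")
```

The results (about 15.75, 7.88, and 1.97 GB/s per direction, and 115.2 GB/s of memory bandwidth per CPU) round to the 16, 8, 2, and 115 GB/s values used in the text.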
The primary difference between the VSP E790 and the VSP E590 is processing power. The VSP E790 comes equipped with four 16-core CPUs, while the VSP E590 has four 6-core processors. Our testing shows that the VSP E790 has enough processing power to approach the full IOPS potential of the fast NVMe drives. For example, on the VSP E790 we measured 3.64 million 8 KiB cache-miss random read IOPS from eight SSDs, with a response time of 0.51 milliseconds. On the VSP E590, the same test yielded 1.34 million IOPS with a 0.53 millisecond response time. With the same drive configuration as the E590, the E790 delivered about 2.7 times the IOPS because of its more powerful CPUs. Of course, the VSP E590’s 1.34 million cache-miss random read IOPS will be more than sufficient for many applications.
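A few derived numbers help put these measurements in context. The sketch below computes the E790-to-E590 ratio, the average IOPS per SSD, and the approximate number of I/Os in flight using Little's law (concurrency = throughput × response time). The concurrency values are estimates derived from the published figures, not test parameters reported in this article.

```python
def outstanding_ios(iops, response_time_s):
    """Little's law: average number of I/Os in flight."""
    return iops * response_time_s


E790_IOPS, E790_RT = 3.64e6, 0.51e-3  # measured values quoted above
E590_IOPS, E590_RT = 1.34e6, 0.53e-3
SSD_COUNT = 8

print(f"E790 / E590 IOPS ratio: {E790_IOPS / E590_IOPS:.2f}")        # 2.72
print(f"E790 IOPS per SSD:      {E790_IOPS / SSD_COUNT:,.0f}")       # 455,000
print(f"E590 IOPS per SSD:      {E590_IOPS / SSD_COUNT:,.0f}")       # 167,500
print(f"E790 I/Os in flight:    {outstanding_ios(E790_IOPS, E790_RT):.0f}")
print(f"E590 I/Os in flight:    {outstanding_ios(E590_IOPS, E590_RT):.0f}")
```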
Encrypting controllers (eCTLs) are optionally available for the VSP E790 and VSP E590. The eCTLs offload the work of encryption to Field Programmable Gate Arrays (FPGAs). The FPGAs are connected by a PCIe switch positioned between the controller CPUs and the flash drives. The FPGAs allow FIPS 140 level 2 encryption to be done with little or no performance impact.
Logical Devices and MPUs
The CPUs of the Ex90 controllers are logically organized into multi-processing units (MPUs). There is one MPU per controller, thus two MPUs per system. When logical devices (LDEVs) are provisioned, LDEV ownership is assigned (round-robin by default) to one of the two MPUs. The assigned MPU is responsible for handling I/O commands for the logical devices it owns. Any CPU core in the MPU can process I/O for an LDEV assigned to that MPU. Therefore, all of the array’s processing power can be leveraged by distributing a workload across a minimum of two LDEVs (one per MPU). However, SVOS multiprocessing can run a bit more efficiently with a larger quantity of logical devices as discussed in this HDP blog.
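Default round-robin ownership is easy to picture with a small sketch. The `assign_ldevs` function and the MPU labels are hypothetical names used for illustration; they are not SVOS identifiers.

```python
from itertools import cycle


def assign_ldevs(ldev_ids, mpus=("MPU-1", "MPU-2")):
    """Assign each LDEV to an MPU in round-robin order."""
    owner = cycle(mpus)
    return {ldev: next(owner) for ldev in ldev_ids}


print(assign_ldevs(range(4)))
# {0: 'MPU-1', 1: 'MPU-2', 2: 'MPU-1', 3: 'MPU-2'}
print(set(assign_ldevs([0]).values()))
# {'MPU-1'}  (a single LDEV leaves the second MPU without owned devices)
```

The second call shows why at least two LDEVs are needed before both controllers' CPUs can own part of the workload.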
The importance of storage array cache (i.e., Dynamic Random Access Memory or DRAM) may have diminished in the era of NVMe flash drives. However, I/O cache can still provide a significant performance boost for several reasons. First, DRAM is faster than flash by at least an order of magnitude. When a host requests cache-resident data (sometimes called a cache “hit”) the command can be completed with the lowest possible latency. For example, our 8 KiB random read hit testing on the VSP E590 measured response times as low as 66 microseconds (0.066 milliseconds). The fastest response time observed in the 8 KiB random read miss testing was about 4X higher at 250 microseconds. Reading data from cache is advantageous not only because DRAM is the fastest medium, but also because the request for data is satisfied with no additional address lookups or back end I/O commands. And when it comes to optimizing performance, Hitachi isn’t content to simply cache the most recently accessed data. I/O patterns on each LDEV are periodically analyzed to identify the data most likely to be accessed repeatedly. Cache is preferentially allocated to such blocks. Meanwhile, areas of each LUN identified as having the lowest probability of a cache hit will not have any cache allocated. Instead, such data are sent to the host through a transfer buffer, thereby saving the overhead of allocating a cache segment.
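To see how much the hit ratio matters, the two best-case latencies quoted above can be blended into an expected average read latency. This is a simple weighted-average model, assumed here for illustration; it uses the fastest observed response times and ignores queuing, so real averages under load will be higher.

```python
HIT_LATENCY_US = 66.0    # fastest observed 8 KiB random read hit
MISS_LATENCY_US = 250.0  # fastest observed 8 KiB random read miss


def average_read_latency_us(hit_ratio):
    """Weighted average of hit and miss latency for a given hit ratio."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * HIT_LATENCY_US + (1.0 - hit_ratio) * MISS_LATENCY_US


for ratio in (0.0, 0.5, 0.9, 1.0):
    print(f"hit ratio {ratio:.0%}: {average_read_latency_us(ratio):.1f} us")
```

Under this model a 50% hit ratio gives 158 microseconds and a 90% hit ratio gives about 84 microseconds.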
Cache also enhances write performance. After writes have been mirrored in both controllers’ DRAM to protect against data loss (but before data have been written to flash) the host is sent a write acknowledgment. The quick response to writes allows latency-sensitive applications to operate smoothly. Newly written data are held in cache for a while, to allow for related blocks to be aggregated and written to flash together in larger chunks. This “gathering write” algorithm reduces the need for parity operations, thereby bringing down controller and drive busy rates and improving overall response time.
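The benefit of gathering writes can be estimated with a textbook RAID-6 I/O model. This is an assumption for illustration and not a description of the SVOS algorithm: a small write that updates one data chunk is modeled as a read-modify-write of the data and both parity chunks, while a gathered full-stripe write needs no reads because parity is computed from the new data in cache.

```python
def backend_ios_separate(chunks, parity=2):
    """Back-end I/Os when each chunk is written on its own.

    Each small write reads the old data and old parity chunks, then
    writes the new data and new parity chunks.
    """
    return chunks * 2 * (1 + parity)


def backend_ios_full_stripe(data_drives, parity=2):
    """Back-end I/Os for one gathered full-stripe write (writes only)."""
    return data_drives + parity


# RAID-6 6D+2P: six data chunks written separately vs. as one full stripe
print(backend_ios_separate(6))     # 36
print(backend_ios_full_stripe(6))  # 8
```

In this simplified model, gathering six chunks into one stripe cuts back-end drive operations from 36 to 8.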
Back End Configuration
The VSP E590 and VSP E790 both have an all-NVMe back end integrated into the controller chassis, which makes configuration simple and straightforward. As noted earlier, up to twenty-four NVMe SSDs may be installed. Table 3 lists the supported drive types, and Table 4 shows the available RAID configurations.
Table 3. NVMe SSD Options
|19RVM||NVMe SSD 3DWPD||1.9 TB|
|38RVM||NVMe SSD 1DWPD||3.8 TB|
|76RVM||NVMe SSD 1DWPD||7.6 TB|
|15RVM||NVMe SSD 1DWPD||15 TB|
|30RVM||NVMe SSD 1DWPD||30 TB|
Table 4. Supported RAID Configurations
|RAID-10||2D+2D, 2D+2D Concatenation|
|RAID-5||3D+1P, 4D+1P, 6D+1P, 7D+1P, 7D+1P Concatenation|
|RAID-6||6D+2P, 12D+2P, 14D+2P|
Hitachi storage has often been configured with RAID-6 6D+2P, or perhaps RAID-6 14D+2P, for data protection and good capacity efficiency. However, if any spare drives are to be allocated in the 2U E series arrays, only a single 14D+2P group or two 6D+2P groups could be created. We therefore tested an asymmetrical configuration with I/O distributed across one 6D+2P parity group and one 12D+2P parity group in a single HDP pool. This configuration offers RAID-6 data protection and good capacity efficiency while leaving room for one or two spare drives. We found no difference in performance between the asymmetrical configuration that allows for spare drives and a symmetrical configuration with three RAID-6 6D+2P parity groups.
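The drive-slot arithmetic behind these layouts can be checked with a short sketch. The `layout_summary` helper is hypothetical; it simply counts drives against the 24 available slots and reports the fraction of used drives that hold data.

```python
def layout_summary(groups, total_slots=24):
    """Summarize a set of RAID groups given as (data, parity) tuples."""
    used = sum(data + parity for data, parity in groups)
    if used > total_slots:
        raise ValueError("layout needs more drives than the chassis holds")
    data_drives = sum(data for data, _ in groups)
    return {
        "drives_used": used,
        "slots_left_for_spares": total_slots - used,
        "data_drives": data_drives,
        "capacity_efficiency": round(data_drives / used, 3),
    }


print(layout_summary([(6, 2), (12, 2)]))        # asymmetrical, 2 slots left
print(layout_summary([(6, 2), (6, 2), (6, 2)])) # symmetrical, 0 slots left
print(layout_summary([(14, 2)]))                # single 14D+2P, 8 slots left
```

Both the asymmetrical layout and the three-group symmetrical layout contain 18 data drives, but the asymmetrical one uses 22 drives (about 82% efficiency) and leaves two slots free for spares, while the symmetrical one uses all 24 (75% efficiency).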
We have briefly introduced the architecture of the Virtual Storage Platform E590 and E790. For more information on Hitachi’s implementation of NVMe, and other exemplary features of the VSP E series, see Hu Yoshida’s recent blog entitled “Unique VSP Capabilities Which Were Not Noted in the Gartner Report.” Also see the VSP E Series page.