GPU Deep Learning Cluster Considerations

By David Pascuzzi posted 11-08-2019 15:23

  

Deep learning is hot today, and complex GPU cluster server solutions are springing up left and right. Everybody is offering deep learning clusters with tons of software added onto their environment and selling high-end, complex storage options. They are offering you a mini Nvidia Saturn V supercomputer. You need to ask yourself: is this what I need, and can I make use of it?
Do you have tens or hundreds of data scientists? Do you need hundreds of nodes? Are you using tens of terabytes of deep learning data? Are you trying to build a high-performance computing (HPC) cluster? Do you really need racks of GPU servers with each rack having its own storage subsystem? In almost every case the answer is no.
The most important requirement for a deep learning environment is that it helps data scientists and data engineers. Their jobs should become easier and faster, with more accurate results. That means an environment that helps them with data preparation and data access, is easy for them to use, and lets them develop their deep learning code with the tools they want to use.

Storage

One area to investigate is your storage. Data is the heart of DL. There are multiple designs of DL clusters with high-end servers that have all sorts of different ways to store your DL data.
The first thought that comes to mind is: wait a minute, I already have my data stored. You do, in lots of different places and formats. What you need is a spot for your data engineers or data scientists to pull the data together, merge it, scrub it, tag it, etc.
Why not use your existing storage? You could, and you should for permanent storage. Once the data is prepared, you should push it back out to your existing data lake. Running your deep learning against data on shared staging storage, with one way to access it, makes it simpler for the data scientists to work with.
Why not just move all the data to the new storage area? You could. In most cases these data sets are small in physical size, but they have lots of files and records. Look at the ImageNet data set: it has 1.2 million pictures, and each picture has a tag file. If you want to use just some of these images, how would you find them?
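One way to answer that question is to scan the tag files once and keep a label-to-image index in the staging area. The following is only a minimal sketch under stated assumptions: it uses a hypothetical layout in which every image `name.jpg` has a sidecar tag file `name.txt` with one label per line (ImageNet's real annotation format is different), and the function names are made up for illustration.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def build_label_index(root):
    """Map each label to the list of image paths that carry it.

    Assumes a hypothetical layout: every image <name>.jpg has a sidecar
    tag file <name>.txt with one label per line.
    """
    index = defaultdict(list)
    for tag_file in sorted(Path(root).rglob("*.txt")):
        image = tag_file.with_suffix(".jpg")
        if not image.exists():
            continue  # skip tag files that have no matching image
        for line in tag_file.read_text().splitlines():
            label = line.strip()
            if label:
                index[label].append(image)
    return dict(index)


def make_demo_dataset(root):
    """Create a tiny fake data set so the sketch can run anywhere."""
    samples = {
        "img001": ["cat"],
        "img002": ["dog"],
        "img003": ["cat", "outdoor"],
    }
    for name, labels in samples.items():
        (Path(root) / f"{name}.jpg").write_bytes(b"")  # empty placeholder image
        (Path(root) / f"{name}.txt").write_text("\n".join(labels))


with tempfile.TemporaryDirectory() as demo_root:
    make_demo_dataset(demo_root)
    label_index = build_label_index(demo_root)
    # Select only the images tagged "cat".
    cat_images = [path.name for path in label_index["cat"]]
    print(cat_images)
```

With over a million small files, the scan itself is the slow part, so it may be worth saving the index next to the staged data rather than rebuilding it for every experiment.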
If you are going to stage the data, there are three main ways to store it, each with its own characteristics:
Local Storage
  • You are going to have some local storage; at the very least, it should be used as a second-tier cache
  • Easiest to use
  • Fast
  • Good option for single node training
  • Data can’t easily be shared across nodes
NFS Server
  • Easy to share data
  • Can be set up to cache on the local machines
  • You probably already have one in your environment
High end clustered file system/Distributed file system
  • Complex to set up and maintain
  • Better choice for high-write environments
  • Can be very expensive, requiring multiple storage servers
  • Has a higher minimum cluster size
One option to improve performance is to have the deep learning framework cache or store the data locally. For example, TensorFlow can cache data to memory or disk. This works when getting your data from any source: files, Hadoop, streams, RDBMS, NoSQL, etc.
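In TensorFlow, this kind of caching is exposed through the `tf.data` API, where `Dataset.cache()` keeps data in memory by default or on disk when given a file name. The idea itself is independent of any framework, and the sketch below shows it with the Python standard library only. The function name `cached_path` and the file names are hypothetical, and the cache is keyed by file name alone, so it assumes names are unique across the staging area.

```python
import shutil
import tempfile
from pathlib import Path


def cached_path(source, cache_dir):
    """Return a local copy of `source`, copying it only on the first request."""
    source = Path(source)
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    local = cache_dir / source.name
    if not local.exists():
        # First access: pay for the remote read once.
        partial = local.with_name(local.name + ".part")
        shutil.copyfile(source, partial)
        partial.replace(local)  # rename last so readers never see a partial file
    return local


with tempfile.TemporaryDirectory() as staging, tempfile.TemporaryDirectory() as cache:
    record = Path(staging) / "batch-0001.bin"
    record.write_bytes(b"training data")

    first = cached_path(record, cache)   # copied from the staging area
    record.unlink()                      # simulate the staging area going away
    second = cached_path(record, cache)  # served from the local cache
    served_from_cache = second.read_bytes()
    print(first == second, served_from_cache)
```

A cache like this never notices when the staged file changes, which is acceptable for prepared, read-only training data but would need an invalidation rule otherwise.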

Network

If you are accessing the data with NFS or storing/caching the data locally, the network speed between any remote staging area and the deep learning nodes isn’t a critical factor. The data is read once. If the deep learning code is running for hours, days, or longer, a few extra minutes for the first read don’t matter. If your data sets are small enough to reside in memory, it doesn’t matter if you always read from your primary staging area.
Oftentimes, data is preprocessed and stored in the deep learning framework’s internal format. If the data isn’t preformatted, CPU processing time will have more impact on performance than the network speed.
In almost all cases, a 10 Gb or 25 Gb network will provide more throughput than you need. This doesn’t mean you shouldn’t have an ultrahigh-speed network for other uses. If you are using a GPU cluster with multiple-node training, you may want a 100 Gb network for GPUDirect communication. You will just want to make sure that you are able to make use of it.
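A quick back-of-the-envelope calculation shows why the first read rarely matters. The numbers below are assumptions chosen for illustration, not measurements: a 150 GB data set, an 8-hour training run, and links running at their full nominal rate (real throughput will be lower, which the `efficiency` parameter can model).

```python
def transfer_seconds(dataset_gb, link_gbps, efficiency=1.0):
    """Seconds to move `dataset_gb` gigabytes over a `link_gbps` gigabit/s link.

    `efficiency` is the assumed fraction of the nominal rate actually achieved.
    """
    gigabits = dataset_gb * 8  # 8 bits per byte
    return gigabits / (link_gbps * efficiency)


DATASET_GB = 150     # assumed data set size
TRAINING_HOURS = 8   # assumed length of one training run

for link_gbps in (10, 25, 100):
    seconds = transfer_seconds(DATASET_GB, link_gbps)
    share = seconds / (TRAINING_HOURS * 3600)
    print(f"{link_gbps:>3} Gb/s: {seconds:6.1f} s to read once "
          f"({share:.2%} of the training run)")
```

Under these assumptions, even the 10 Gb link spends about two minutes on the one-time read, well under one percent of the run.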

Deep Learning Coding

There are data scientists out there who are amazing coders and others who aren’t. Regardless of their programming skill level, the more complex the cluster is, the harder it is for them to use efficiently. You just can’t wave your magic keyboard and have the code scale efficiently when going from 1 to 8 GPUs and then continue to do so when scaling across multiple nodes.
To tout hardware scalability, hardware vendors run ResNet-50 with ImageNet data on TensorFlow. If you do a line count across all the files in the benchmark directories, you will find over 12,000 lines of code. Yes, this is a poor way of figuring out how complex the code is, but it does give you an idea of how large a project it is. This benchmark code has been tuned, optimized, and tweaked by many data scientists for years. It does a great job of scaling images per second with multiple GPUs and multiple nodes.
Images per second is a very interesting way to show scalability. On a single server, it can correlate to an equivalent reduction in training time. When working across multiple servers, you may not see close to the same gain. A gain in images per second doesn’t always translate one to one into a shorter time to train to a given accuracy.
These are some of the issues with scaling DL code that can be seen in other TensorFlow samples. In the worst cases, as you add more GPUs, the code sees little or no performance gain. The more GPUs you have, the more work you must do to make sure that you have enough data ready for them to process.
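The gap between throughput and time to accuracy can be made concrete with two small formulas. This is a sketch with made-up numbers, not benchmark results: it assumes a single GPU processes 400 images per second, eight GPUs process 2,800, and the eight-GPU run needs 100 epochs instead of 90 to reach the same accuracy. Needing more epochs at larger scale is itself an assumption here; whether it applies depends on the model and how it is tuned.

```python
def scaling_efficiency(base_rate, base_gpus, scaled_rate, scaled_gpus):
    """Fraction of ideal linear throughput scaling that was actually achieved."""
    ideal_rate = base_rate * scaled_gpus / base_gpus
    return scaled_rate / ideal_rate


def time_to_accuracy_speedup(base_rate, base_epochs, scaled_rate, scaled_epochs):
    """Wall-clock speedup to reach the target accuracy.

    Training time is proportional to epochs / images-per-second, so the
    speedup is the throughput ratio scaled by the epoch ratio.
    """
    return (scaled_rate / base_rate) * (base_epochs / scaled_epochs)


# Hypothetical measurements (assumptions, not benchmark results).
ONE_GPU_RATE, EIGHT_GPU_RATE = 400, 2800   # images per second
ONE_GPU_EPOCHS, EIGHT_GPU_EPOCHS = 90, 100  # epochs to reach target accuracy

efficiency = scaling_efficiency(ONE_GPU_RATE, 1, EIGHT_GPU_RATE, 8)
speedup = time_to_accuracy_speedup(
    ONE_GPU_RATE, ONE_GPU_EPOCHS, EIGHT_GPU_RATE, EIGHT_GPU_EPOCHS
)
print(f"throughput scaling efficiency: {efficiency:.1%}")
print(f"speedup in time to accuracy:   {speedup:.1f}x on 8 GPUs")
```

In this example, the throughput gain is 7x, but the time to reach the target accuracy improves by less than that, because the extra epochs eat part of the gain.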

How can we help

Hitachi Vantara can help meet your DL needs. Hitachi Solution for Analytics Infrastructure using Hitachi Unified Compute Platform RS V225G for Deep Learning allows us to provide a customized solution. One node, multiple nodes, external storage, internal storage: we have the options to do what’s best for you. Some of our options are:
  • 100 Gb network for GPUDirect support
  • Multiple local storage options to cache the data locally
  • Up to 1.5 TB of memory to provide even faster caching
  • External HNAS storage to provide a central repository for your scrubbed data and your machine learning code
  • The ability to run in a virtualized environment
Our Pentaho Platform enables organizations to access, prepare, and analyze all data from any source, in any environment. For deep learning, this means it can pull all your data together from multiple data sources, merge it, scrub it, and help label it. Because running the deep learning code takes hours, days, or even weeks, this data preparation is oftentimes overlooked, yet it can also take hours, days, weeks, or even years.
When Pentaho is added to the solution, you have access to its PMI plugin, which can be used to integrate Pentaho with machine learning frameworks. The drag-and-drop capability of Pentaho allows you to perform your machine learning without writing any code.
Hitachi Vantara has multiple solutions to support your analytics and artificial intelligence needs. To learn more about our solutions, visit our website.
To further help, Hitachi Vantara has deep learning experts who can jump-start your efforts.