During my recent Southern California road trip, I marveled at how easy it was to avoid the usual LA traffic bottlenecks and find the quickest way to the beach using Google Maps on my mobile phone. This was a lifesaver, especially with two restless kids in the backseat. Unfortunately, we don't have these types of conveniences to help steer us in the right direction when trying to find or get around bottlenecks in the data center. However, using the proper analytics tools, a category Gartner now refers to as Artificial Intelligence for IT Operations (AIOps), can certainly help to direct us onto the right data center path.
This is key when trying to optimize IT resources or quickly troubleshoot data center problems, which at times can feel like trying to find a needle in a haystack. Often it's the first question that is the hardest to answer: "Where do I start?"
A call might come in from the application owner, "Oracle is running slow today and it's impacting my users," or multiple alerts might be going off on your monitoring dashboard. But Oracle isn't deployed alone. It runs on top of the multi-vendor IT resources in the data center, including servers, hypervisors, storage and so on, which makes finding the right path to resolve a problem even more complicated.
To help with this analytics topic, I asked my colleague, Ojay Bahra, Global Product Manager for Hitachi Vantara, for some insights. Ojay has extensive experience with Hitachi analytics and storage products spanning 16 years. He has a wealth of knowledge on how analytics can direct you on the right path to optimize data center performance and diagnose IT resource problems.
Richard – Hello Ojay. What are some basic analytic tips one should follow when trying to properly monitor their data center environment?
Ojay – A very good question, Richard. To properly monitor data centers today, it is vital to start by reviewing and examining key performance indicators such as utilization, response time and throughput for key resources like servers, virtual machines or storage systems.
Richard – OK, how does one know whether they are having any performance issues or not?
Ojay – Today's modern IT monitoring tools provide some form of base measurement point we call a threshold. Thresholds can be applied to different resources along the data path, from your host server or VM to a shared storage resource. Either you already know what values to set the thresholds at, or your monitoring tool can tell you what the average values are. Although a threshold doesn't point out an issue per se, it can indicate that a resource has hit a particularly high value compared to its normal operations. In other words, observed averages help establish a performance baseline against which you can set appropriate thresholds. These in turn can trigger an alert when an anomaly is detected or a value exceeds its normal operating range. Threshold indicators can generally be set up like a traffic light system: green is good, amber is something of concern, while red indicates you have a problem.
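The baseline-and-threshold idea can be sketched in a few lines of code. This is a generic illustration, not HIAA's actual logic; the metric, multipliers and values are invented for the example.

```python
# Generic sketch of traffic-light thresholding (illustrative only; the
# amber/red multipliers and metric values are invented assumptions).

def classify(value, amber_threshold, red_threshold):
    """Map a metric reading to a traffic-light status."""
    if value >= red_threshold:
        return "red"
    if value >= amber_threshold:
        return "amber"
    return "green"

def baseline_thresholds(history, amber_factor=1.5, red_factor=2.0):
    """Derive thresholds from the observed average (the baseline)."""
    avg = sum(history) / len(history)
    return avg * amber_factor, avg * red_factor

# Example: a storage port's response time in milliseconds.
history = [2.0, 2.5, 3.0, 2.5]              # normal operating range
amber, red = baseline_thresholds(history)   # 3.75 ms and 5.0 ms
print(classify(2.8, amber, red))            # green
print(classify(6.1, amber, red))            # red
```

The point is simply that the tool observes normal behavior first, then flags readings relative to that baseline rather than against an arbitrary fixed number.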
For example, Hitachi Vantara offers a complete IT analytics solution, Hitachi Infrastructure Analytics Advisor (HIAA), providing end-to-end data center monitoring, analysis and troubleshooting. HIAA includes a centralized dashboard (see below) with a similar traffic light monitoring system as discussed, so you can easily view the status of your data center environment at a glance.
Richard - If you suspect there is an issue with a red status indicator, how do you verify you have a problem and what should you do about it?
Ojay - When trying to manage any performance problem, you have to narrow down the trouble spot, because there may be expectations, or possible finger pointing, that the problem is not in a particular area (application, server, storage, etc.). In other words, you often need to follow a process of elimination. Start by using analytics to get a high-level overview across data center operations. For example, when an application is reported as performing slowly, we need to determine all the associated IT resources (server, virtual machine, network ports, storage resources, etc.) tied to that application server. With an analytics tool like HIAA, it is very easy to check the full I/O data path end-to-end, from the application server to the shared storage resources. We can see the host server the application resides on, which SAN switch ports are in its I/O path, and which volumes are being accessed on the storage system.
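Conceptually, the end-to-end view is just a mapping from a host to every resource in its I/O path. The sketch below is a hypothetical illustration of that idea; the topology, host and volume names are invented, and HIAA discovers this mapping automatically rather than from a hand-built table.

```python
# Hypothetical data-path mapping from host server to storage volumes
# (all names are invented for illustration).

TOPOLOGY = {
    "app-host-01": {
        "switch_ports": ["sw1:p3", "sw1:p4"],
        "volumes": ["vsp:ldev-10", "vsp:ldev-11"],
    },
}

def data_path(host):
    """Return every resource in the host's end-to-end I/O path."""
    node = TOPOLOGY[host]
    return [host] + node["switch_ports"] + node["volumes"]

print(data_path("app-host-01"))
# ['app-host-01', 'sw1:p3', 'sw1:p4', 'vsp:ldev-10', 'vsp:ldev-11']
```

With this list in hand, the process of elimination becomes a walk along the path, checking each resource's status in turn.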
Richard – Once I can see the full data path for the application server, what should I be looking for?
Ojay – If we are still unclear where the issue is, we need to look in depth at each component on the data path from host server to storage system. For example, first check whether the storage system's performance is OK. Using the health and alert mechanisms in HIAA, it's easy to examine the storage system's health and see if any key performance values have exceeded their assigned thresholds. If the storage array does not appear to be a concern, then we apply a deeper-dive analysis to the host server and SAN switch components, checking the threshold values applied to these various data points. What we are trying to do is present a more accurate view of all points on the data path, to eliminate areas and shorten the analysis time.
Richard – I can see in the above view that the SAN switch looks OK, but my hosts/virtual machines (VMs) are red with alerts, as are the storage systems. What does this indicate?
Ojay – So immediately, we have identified that we can eliminate the switch as an area of concern. We should isolate the red indicators from the rest of the IT view and concentrate on looking at the information provided for both the Hosts/VMs and the storage systems.
Richard – How should you do a deep dive examination on these resources (Hosts/VMs, storage systems), excluding the SAN switch of course?
Ojay – Because HIAA provides different views for analysis, one can look at each particular resource area by isolating the views. We can verify any suspected bottleneck by reviewing each dependent resource further down the data path. We can take any resource that has been flagged red and compare its performance against similar resources. It is this resource comparison that lets us further identify anomalies and isolate the problem.
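The peer-comparison idea can be illustrated with a simple statistical check: a resource whose metric sits far outside its peer group's normal spread is a candidate outlier. HIAA's actual comparison logic is not described here, so this is only the general technique, with invented numbers.

```python
# Illustrative peer comparison: flag a reading that deviates from its
# peer group by more than `sigmas` standard deviations. A simplified
# stand-in for whatever comparison a real tool performs.
from statistics import mean, stdev

def is_outlier(value, peer_values, sigmas=2.0):
    """True if `value` is far outside the peer group's spread."""
    mu, sd = mean(peer_values), stdev(peer_values)
    return abs(value - mu) > sigmas * sd

peer_latency_ms = [10, 11, 9, 10]        # similar volumes, similar load
print(is_outlier(30, peer_latency_ms))   # True  - stands out from peers
print(is_outlier(10.5, peer_latency_ms)) # False - within normal spread
```

Comparing against peers, rather than against a fixed number, is what makes it possible to spot a resource that is misbehaving relative to otherwise identical neighbors.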
Richard – Can HIAA help further to determine the root cause of the problem and provide suggestions on how it could be fixed?
Ojay – Yes, HIAA offers multiple troubleshooting aids for analyzing a bottleneck resource and identifying its root cause. As we discussed earlier, HIAA provides various analytic views and performance charts to properly analyze the suspected bottleneck and all of its dependent resources.
Often the problem is related to resource contention, also known as the noisy neighbor problem. This is where a particular resource disrupts the balance of usage on a shared resource like a switch or storage port. Another common cause of performance impacts is a recent configuration change that you are not aware of. With these common scenarios, HIAA can help you verify, diagnose and determine the root cause of the problem while giving you suggested changes to correct it.
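A minimal sketch of the noisy neighbor idea: rank the hosts sharing a port by their consumption and surface the heaviest one. The host names and IOPS figures below are invented, and real tools weigh more signals than a single counter.

```python
# Hedged sketch: find the heaviest consumer on a shared storage port
# (names and IOPS values are invented for illustration).

def noisiest_neighbor(iops_by_host):
    """Return the host consuming the most IOPS on a shared port."""
    return max(iops_by_host, key=iops_by_host.get)

shared_port_iops = {
    "app-host-01": 1200,
    "app-host-02": 950,
    "batch-host-07": 48000,   # a batch job flooding the shared port
}
print(noisiest_neighbor(shared_port_iops))  # batch-host-07
```

Once the dominant consumer is identified, the fix is usually to rebalance it onto a less congested shared resource, which is exactly where automation comes in next.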
In addition, HIAA is integrated with Hitachi Automation Director (HAD) for data center management automation and orchestration, to streamline any required configuration changes. For example, if HIAA suggests moving a noisy neighbor host to another (less congested) shared storage port, you could have HIAA initiate the appropriate service template in HAD to automate this host configuration change and resolve the problem. This close integration of analytics and management automation can greatly accelerate fixes to similar problems moving forward.
Richard - Thanks, Ojay, for your great insights. This example shows that finding and troubleshooting data center problems can be a lot easier using a tool like HIAA. To recap, the easy troubleshooting steps to follow in the data center are:
- Monitor IT dashboards with thresholds for new alerts and anomalies.
- Isolate trouble spots by reviewing each data path resource from host to storage and all their dependent resources.
- Examine suspected problem resource(s) while comparing their performance against similar resources.
- Leverage built-in tool diagnostics to determine the root cause of the problem and obtain suggested fixes. Plus, utilize any integrated automation tools to quickly implement those fixes.
Even without interactive maps to help you avoid bottlenecks in the data center, using the right analytic tools can certainly help you get started and give you the necessary insights to find the right data center path.