
Roadmap to an Autonomous Data Center

Blog post by Nathan Moffitt, May 22, 2018

Reduce Risk, Improve Efficiency and Prepare for the Future with AI Operations

 

I like to write. Especially on topics I'm passionate about. Normally that means I write a page or two on a topic. Today though I'm trying something different. I'm digging in to provide a deeper look at how you can build a roadmap for an autonomous data center.

 

I'd call it a plan for an autonomous data center, but the fact is that we're just at the forefront of seeing solutions that enable autonomy. And unfortunately, most vendors are designing software that is too 'vendor specific' and narrow in scope. To get where we want to be, software offerings need to work together, integrating insight and action so the data center can manage itself. Only then will staff be truly free to focus on innovation.

 

 

Hitachi is pushing to make this happen faster. Pushing outside our normal comfort zone and looking at ways to accelerate change. But there's a lot to do. Hence the length of this post and the need for an infographic (hey, I used PowerPoint, don't hassle me). So read on!

 

[Infographic: roadmap.png]

 

 

Your Data Center. Simple in Silos.

 

When individual applications or infrastructure components are deployed, things often seem simple. Resources are delivered, monitoring is put in place and everything looks good. At a siloed, 'project level' this is true.

 

When you pull the lens back, though, and look at the data center as a whole, you see groups of systems, networks and software working together and sharing resources to perform tasks. You see a living organism, where an issue in one area can cause ripples in the data center fabric that impact uptime, performance, resource utilization and, ultimately, the customer experience as well as your budget or regulatory compliance.

 

AI Operations Enable an Autonomous Data Center

 

To increase confidence that operations will run smoothly, AI Operations software is needed. AI operations software collects analytics from across your data center to predict and prescribe adjustments that help the entire data center run more efficiently. It can (and should) also automate processes to accelerate action, so that you begin delivering an autonomous data center.
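
To make that loop concrete, here is a minimal Python sketch of the collect / analyze / act cycle AI operations software runs. Every name in it is hypothetical, and real offerings implement each stage with far more sophistication:

    from dataclasses import dataclass

    @dataclass
    class Finding:
        device: str       # which data center component is affected
        issue: str        # what the analytics engine detected or predicted
        action: str       # the prescribed remediation
        autonomous: bool  # whether policy allows acting without a human

    def collect_telemetry():
        # Stand-in for pulling metrics from systems, networks and software.
        return [{"device": "array-01", "latency_ms": 42, "threshold_ms": 25}]

    def analyze(samples):
        # Stand-in for the predictive / prescriptive model.
        for s in samples:
            if s["latency_ms"] > s["threshold_ms"]:
                yield Finding(s["device"], "latency above threshold",
                              "raise QoS priority", autonomous=True)

    def act(finding):
        # Stand-in for the automation engine; here we only log the action.
        mode = "executing" if finding.autonomous else "recommending"
        print(f"{mode}: {finding.action} on {finding.device} ({finding.issue})")

    for finding in analyze(collect_telemetry()):
        act(finding)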

 

But where do you start and how do you approach implementing AI operations to govern the systems, software and services that make up your data center? Feedback from our customers tells us that there are a few steps to consider:

 

  • Step 0: Set Near and Long Term Scope
  • Step 1: Automate Deployments
  • Step 2: Implement Data Center Analytics
  • Step 3: Combine Analytics with Automation
  • Step 4: Extend the Framework
  • Step 5: Enable Tactical and Strategic Autonomy

 

Note: At every stage, remember that AI-based analytics are only as good as the data they receive. Place special emphasis on the quality, granularity and history length of the data analyzed to ensure accuracy.

 

Step 0: Set Near and Long Term Scope

 

Before purchasing and implementing AI operations software for a data center, it is important to define what you want from the solution – near and long term. This should include definition of the following (a small configuration sketch follows the list):

 

  • Data center devices – systems, software and services – that AI will manage
  • Operations you will allow AI to handle autonomously
  • Operations you will allow AI to handle semi-autonomously
  • Data center devices you will want AI to manage long term
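
As a concrete (and entirely hypothetical) example of what such a scope definition might look like, the sketch below treats scope as plain data so it can be reviewed and extended over time:

    # Hypothetical scope definition for an AI operations rollout. Keeping
    # scope as data makes it reviewable and easy to extend over time.
    AI_OPS_SCOPE = {
        # Devices AI will manage on day one.
        "managed_now": ["storage-arrays", "hypervisors", "backup-service"],
        # Operations AI may perform with no human in the loop.
        "autonomous": ["snapshot-scheduling", "qos-tuning"],
        # Operations AI may prescribe but a human must approve.
        "semi_autonomous": ["capacity-expansion", "firmware-updates"],
        # Devices targeted for AI management in later phases.
        "managed_long_term": ["network-fabric", "facilities-power-cooling"],
    }

    def allowed_autonomously(operation: str) -> bool:
        return operation in AI_OPS_SCOPE["autonomous"]

    print(allowed_autonomously("qos-tuning"))         # True
    print(allowed_autonomously("capacity-expansion"))  # False: needs a human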

 

The last point is especially important because while AI offerings are rapidly evolving, their scope of coverage is still relatively narrow. Many offerings are vendor-specific with limited ‘line of sight’ to how their actions could affect other systems – positively or negatively.

 

To minimize the potential for “silos of AI operations” that interfere with each other, define a clear scope of what will be controlled, and how other systems will be affected if AI acts autonomously. It is also important to understand how devices will be added over time.

 

Note: API-driven offerings can help smooth the expansion of AI across your data center by providing a common interface for communication, particularly when existing management practices and processes must be integrated. This supports long-term agility and enables the creation of a collective AI that leverages the expanded set of analytics to make increasingly smarter decisions.
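
As an illustration of the common-interface idea, here is a small Python sketch. The class and method names are invented for this example and do not reflect any real product's API:

    # Sketch of a common interface over vendor-specific management APIs.
    from abc import ABC, abstractmethod

    class ManagedDevice(ABC):
        @abstractmethod
        def metrics(self) -> dict: ...          # normalized telemetry

        @abstractmethod
        def apply(self, change: dict) -> None: ...  # normalized actions

    class VendorAArray(ManagedDevice):
        def metrics(self) -> dict:
            # Translate vendor A's payload into the shared schema.
            return {"latency_ms": 18, "utilization": 0.62}

        def apply(self, change: dict) -> None:
            print(f"vendor-A array applying {change}")

    # A collective AI can reason over any device behind the interface.
    fleet = [VendorAArray()]  # in practice, one adapter per vendor
    for device in fleet:
        print(device.metrics())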

 

Step 1: Automate Deployments

 

Perhaps the best first step toward an autonomous data center is ensuring that best practices and associated policies are followed. When resources are deployed according to best practices, their behavior is predictable and the need for AI to identify complex or unseen issues is minimized.

 

Best practices alone, though, are not enough, especially if numerous configuration tasks must be executed during deployment. To prevent accidental errors, like a step being skipped or followed improperly, best practice processes must be automated. Automation software helps ensure the successful delivery of systems, software and related services like data protection by implementing the following (a minimal sketch follows the list):

 

  • A predefined catalog of best practices for deploying systems and software
  • Customizable best practices to support your specific data center resources, service level objectives and data management policies
  • An engine to automatically implement the catalog with minimal human interaction
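
Here is a minimal sketch of the catalog-plus-engine idea, with hypothetical step names. The point is that the engine refuses to continue when a step fails, so steps cannot be silently skipped or run out of order:

    # Minimal sketch of a best-practice catalog and automation engine.
    CATALOG = {
        "provision-volume": [
            ("validate-capacity", lambda ctx: ctx["size_gb"] <= ctx["free_gb"]),
            ("create-volume",     lambda ctx: True),  # placeholder for real work
            ("apply-qos-policy",  lambda ctx: True),
            ("enable-snapshots",  lambda ctx: True),  # data protection step
        ],
    }

    def run(task: str, ctx: dict) -> None:
        # Execute every step in order; halt if any step fails.
        for name, step in CATALOG[task]:
            if not step(ctx):
                raise RuntimeError(f"{task} halted at step '{name}'")
            print(f"{task}: completed {name}")

    run("provision-volume", {"size_gb": 500, "free_gb": 2048})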

 

These features enable your staff to provision and manage data center resources, greatly reducing the risk of downtime, data loss and sub-optimal performance. They also free your experts to focus on driving the business forward, not troubleshooting deployments.

 

AI CONSIDERATION: Automation engines can be designed to do more than follow a guided set of steps. AI can look at available resources and determine which are under-utilized or will provide the best ‘experience,’ increasing ROI. If an automation AI understands the data path and workloads, it can help prevent issues that impact application stability and end user experiences.
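
As a toy illustration of that kind of placement logic, the sketch below prefers the least-utilized resource that still meets a latency requirement. The pool data and scoring rule are invented for the example:

    # Illustrative placement scoring: among pools that meet the workload's
    # latency needs, pick the one with the most headroom. Purely hypothetical.
    pools = [
        {"name": "pool-a", "utilization": 0.81, "avg_latency_ms": 4},
        {"name": "pool-b", "utilization": 0.43, "avg_latency_ms": 6},
        {"name": "pool-c", "utilization": 0.55, "avg_latency_ms": 2},
    ]

    def best_pool(max_latency_ms: int):
        eligible = [p for p in pools if p["avg_latency_ms"] <= max_latency_ms]
        # Among eligible pools, prefer the least utilized.
        return min(eligible, key=lambda p: p["utilization"])

    print(best_pool(max_latency_ms=5))  # -> pool-c: meets the SLA with headroom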

 

You should also consider how automated actions will be tracked, so you have a history of the events and actions performed for ongoing analysis. Integration with ITSM tools is helpful here (a sketch of such an audit record follows).
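
A minimal sketch of such an audit record, in a shape an ITSM integration could ingest. The field names are assumptions for illustration, not any tool's spec:

    # Sketch of an audit trail entry for automated actions.
    import json
    from datetime import datetime, timezone

    def record_action(device: str, action: str, outcome: str) -> str:
        event = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "device": device,
            "action": action,
            "outcome": outcome,
            "actor": "automation-engine",  # separates AI from human changes
        }
        return json.dumps(event)  # hand this payload to your ITSM integration

    print(record_action("array-01", "apply-qos-policy", "success"))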

 

Step 2: Implement Data Center Analytics

 

Once resources are deployed, it is important to make sure they continue to perform as expected – individually and as part of a whole ecosystem. If environments are not regularly tuned as a complete system, they will never deliver maximum performance and stability. Only through ongoing monitoring and optimization can you prevent systems from degrading over time and impacting broader data center operations.

 

To keep operations running smoothly, data center analytics software incorporates AI and machine learning (ML) that look across your environment to determine what is happening – or has happened – and what to do next. This includes the following (an anomaly detection sketch follows the list):

 

  • Ecosystem Optimization
  • Budget Forecasting
  • Fault Prediction and Identification
  • Anomaly Detection
  • Root Cause Analysis and Prescribed Resolution
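
As a toy stand-in for the ML that real offerings apply, the sketch below flags anomalies in a latency series using a z-score against a trailing window:

    # Toy anomaly detection: flag points far from their recent history.
    from statistics import mean, stdev

    def anomalies(series, window=5, threshold=3.0):
        flagged = []
        for i in range(window, len(series)):
            history = series[i - window:i]
            mu, sigma = mean(history), stdev(history)
            if sigma and abs(series[i] - mu) / sigma > threshold:
                flagged.append((i, series[i]))  # (position, value)
        return flagged

    latency_ms = [12, 13, 12, 14, 13, 12, 13, 55, 13, 12]
    print(anomalies(latency_ms))  # the spike at index 7 stands out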

 

It is important to note that many analytics offerings are product-focused, not data center focused. This limits their ability to accurately forecast needs and identify fault resolutions. To achieve the best possible outcomes, dependencies along the data path must be considered before making a decision.

 

AI CONSIDERATION: Where and when AI decision making occurs is important. If it happens offsite, make sure your organization allows external transmission of system information. If data is only collected every few hours, understand how that will affect speed and quality of analysis.

 

Step 3: Combine Analytics with Automation

 

Analytics deliver powerful insights into how operations are performing and what changes should be made to improve or repair the environment. But if analytics only inform or prescribe changes, you are still responsible for executing the prescribed actions.

 

This may be appropriate for some actions, e.g., issuing a purchase order for more capacity, but in other instances it can delay issue resolution and create risk. As noted in Step 1, automation is critical to minimizing the potential for accidental errors. By linking automation with analytics, data center teams can significantly reduce the time to implement changes and ensure adherence to best practices. For instance (a small wiring sketch follows these examples):

 

  • Real-Time Configuration Adjustment: E.g., analytics AI identifies a performance issue and prescribes a change to QoS levels. It then executes the update via the automation engine.
  • Service Execution: E.g., analytics AI identifies that a data protection policy has not executed a snapshot recently. It then automatically runs the snapshot service.
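
A minimal sketch of this wiring, with invented names, might look like the following. Findings emitted by the analytics side are dispatched straight to automation handlers:

    # Sketch of analytics findings driving the automation engine directly.
    def analytics_findings():
        # Normally produced by the analytics AI; hard-coded for illustration.
        yield {"action": "set-qos", "device": "array-01", "level": "gold"}
        yield {"action": "run-snapshot", "device": "vol-db01"}

    AUTOMATION = {
        "set-qos":      lambda f: print(f"QoS -> {f['level']} on {f['device']}"),
        "run-snapshot": lambda f: print(f"snapshot taken of {f['device']}"),
    }

    for finding in analytics_findings():
        AUTOMATION[finding["action"]](finding)  # insight becomes action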

 

Some organizations may decide to start with a solution that combines insights and action. Others may implement these functions in discrete stages. In the latter case, it is critical to identify vendors that can offer upgrades or product / vendor integrations to combine offerings.

 

AI CONSIDERATION: Analytics and automation offerings can each have their own AI functions. In most offerings the analytics serve as the ‘brain’ and automation is the ‘engine,’ but it is still important to understand if they can work together to make smarter decisions. Over time, analytics and automation will likely become more tightly coupled to improve efficiency.

 

Step 4: Extend the Framework

 

For many, the journey to an autonomous data center will likely pause after Step 3. This allows teams to review predictive and prescriptive analytics to improve best practices as well as expand the scope of actions that are automated.

 

After this is accomplished, it is time to identify areas where the framework can be extended. There are multiple paths forward that organizations may consider, including the following (a brief illustration follows the list):

 

  • Deeper data path integration: E.g. Integrating application analytics to measure the impact of latency on transactions and use that information to more precisely define QoS levels or forecast when resources will be needed to meet performance SLAs.
  • Broader Service Management Control: E.g., integrating an infrastructure automation engine with an ITSM platform to enable better control over deployment and management of data center components for a more robust service management experience.
  • Facilities Analytics: E.g., blending in additional data sets, like power and cooling analytics, to make better decisions around energy and operations management.
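
As a toy illustration of these cross-domain decisions, the sketch below blends hypothetical application latency and facilities temperature data into two simple recommendations:

    # Illustrative blend of application and facilities data for Step 4.
    # All numbers, names and thresholds are invented for the example.
    apps = {"billing": {"p99_latency_ms": 180, "sla_ms": 200}}
    racks = {"rack-1": {"inlet_temp_c": 31}, "rack-2": {"inlet_temp_c": 24}}

    def plan():
        # Tighten QoS for apps trending toward their SLA ceiling...
        for name, a in apps.items():
            if a["p99_latency_ms"] > 0.8 * a["sla_ms"]:
                print(f"{name}: raise QoS before the SLA is breached")
        # ...and prefer cooler racks for the next deployment.
        coolest = min(racks, key=lambda r: racks[r]["inlet_temp_c"])
        print(f"place next workload in {coolest}")

    plan()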

 

How this stage is implemented will vary significantly based on organizational needs and may require professional services work, depending on the outcomes desired. It will be worthwhile, though, as the learnings here lay the groundwork for Step 5.

 

AI CONSIDERATION: A key factor in any AI implementation is establishing how and when AI will interact with human counterparts. During initial deployments, and especially as the framework is extended, it may be desirable to have machine-to-human communications occur before any action is taken. Over time though, as comfort levels increase, you may decide to allow greater AI autonomy and only receive notifications that actions have been taken.
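
One simple way to encode that graduated trust is an approval gate: act autonomously only for whitelisted operations with high model confidence, and queue everything else for a human. The threshold and operation names below are assumptions for illustration:

    # Sketch of a human-in-the-loop gate for AI-initiated actions.
    AUTONOMOUS_OPS = {"qos-tuning", "snapshot-scheduling"}

    def dispatch(operation: str, confidence: float):
        if operation in AUTONOMOUS_OPS and confidence >= 0.95:
            print(f"{operation}: executed; notifying humans after the fact")
        else:
            print(f"{operation}: queued for human approval")

    dispatch("qos-tuning", confidence=0.97)          # acts, then notifies
    dispatch("capacity-expansion", confidence=0.99)  # still needs a human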

 

Step 5: Enable Tactical and Strategic Autonomy

 

Up to this point we have focused on an overarching AI to manage data center operations. This focus is based on the idea that most systems and software may be able to govern themselves, but they do not have the ability to collaborate with other systems to make decisions.

 

Long term, though, this will change. Over the next several years we will see increasing levels of intelligence in the systems that make up data centers. At that point we will want to turn over certain tactical AI operations to subsets of systems in the data center.

 

For instance, applications may work with network and storage devices to determine the best path or location to route data and work around faults. Or applications may predict upcoming job types and, based on associated SLAs, request migration of data sets to higher performance storage. 

 

It will still make sense to have an overarching AI analyze and execute strategic operational decisions, but tactically it is important to allow subsets of devices to make real-time decisions about how they work together to achieve discrete goals and overcome local obstacles.

 

Ultimately, Step 5 is about taking the concepts of the initial stages and applying them to groups of devices that must work together. As before, this will require an API-driven interface for devices to communicate, a shared language, and a hierarchy of leadership for making joint decisions. How this plays out is still being defined and will likely expand Step 5 into multiple stages. Until then, the best thing to do is look for offerings that have a roadmap for device-to-device communications (a speculative sketch follows).
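
To make the idea a little more tangible, here is a purely speculative sketch of a shared message shape and a trivial leadership rule for joint decisions. Nothing in it reflects an existing standard:

    # Speculative sketch of device-to-device coordination.
    from dataclasses import dataclass

    @dataclass
    class Proposal:
        sender: str    # device making the request
        goal: str      # e.g. "migrate dataset-7 to faster storage"
        priority: int  # higher wins when proposals conflict

    def elect_leader(proposals):
        # Trivial hierarchy: the highest-priority proposal coordinates the group.
        return max(proposals, key=lambda p: p.priority)

    group = [Proposal("app-tier", "migrate dataset-7", 3),
             Proposal("storage-01", "defer migration, rebuild running", 5)]
    print(elect_leader(group))  # storage-01's constraint takes precedence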

 

 

So there you have it. I hope you found this informative. This is a topic I've been waiting months to blog on, and I'm excited to discuss it. If you have questions or comments, let me know. I'm happy to discuss and always open to ideas on how we can expand the conversation.
