Sevag Mekhsian

Omitting SAP Unplanned Downtime

Blog Post created by Sevag Mekhsian on Jan 14, 2016

SAP is one of the most mission-critical applications for any organization, because it runs the entire business. An IDC study, published in 2015, determined that for a Fortune 1000 company, unplanned downtime of a mission-critical application costs between $500,000 and $1,000,000 per hour. In fact, I can think of many organizations for which the damage resulting from unplanned SAP downtime will be significantly higher than $1 million per hour, because these businesses rely on SAP for conducting their ongoing operations, including sales.

 

I recently had a discussion with a colleague about the main causes of unplanned downtime in the SAP environment, and how these can be prevented. That discussion led to the blog you’re now reading. The content and advice in this blog are based not only on my own 8 years of experience as an SAP expert and team leader, but also on the cumulative experience of nearly 600 SAP experts working at oXya, a Hitachi Data Systems company specializing in managed services for SAP customers. oXya has been managing SAP environments for enterprise customers for the past 18 years; we are currently managing SAP environments for more than 260 enterprises around the world, with more than 250,000 SAP users.

 

I’ll begin by listing some of the most common causes that lead to a crash of SAP, and then move on to some proactive steps that can be taken to avoid these crashes.

 

What are the main causes for downtime of an SAP environment?

 

Outages may occur in the SAP Production environment due to various causes. The following list describes the most common causes we’ve seen over the years for unplanned downtime of SAP environments:

 

Lack of Monitoring. This is by far the leading cause of outages in the SAP Production environment. For example, we have customers that, prior to working with oXya, never monitored their servers, not even the Production servers. By “monitoring the servers” I mean, for instance, monitoring disk drives: when drives were running out of space, no one was aware of it. If the drive in question is crucial for database operations and the database can no longer write to it, that brings down the entire Production instance, and users are unable to do their work.
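
To make the disk example concrete, here is a minimal sketch in Python of the kind of check that was missing. It is only an illustration, not oXya's tooling: the function names and the 80% and 90% thresholds are assumptions chosen for this example, and "." stands in for whatever volumes hold your database files.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Return the percentage of used space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def classify(percent_used: float, warn: float = 80.0, crit: float = 90.0) -> str:
    """Map a usage percentage to an alert level (thresholds are assumptions)."""
    if percent_used >= crit:
        return "CRITICAL"
    if percent_used >= warn:
        return "WARNING"
    return "OK"


def check_paths(paths):
    """Check each path and return (path, percent_used, level) tuples."""
    results = []
    for path in paths:
        pct = disk_usage_percent(path)
        results.append((path, round(pct, 1), classify(pct)))
    return results


if __name__ == "__main__":
    # "." is a placeholder for the database data and log volumes.
    for path, pct, level in check_paths(["."]):
        print(f"{level:8} {path} {pct}% used")
```

In practice a check like this would run on a schedule and feed an alerting system, so that a WARNING reaches a person well before the database can no longer write.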

 

Hardware issues. When a device fails on the system’s server, whether it is a hard drive failure, a motherboard issue, or any other type of hardware issue, and there is no redundancy in place, this can bring down the Production instance.

 

Network related issues. Network issues can occur for various reasons. They can be hardware-related, such as a case in which one of the firewalls or switches has an issue, but they don’t have to be. When a network issue occurs, users are blocked from accessing the SAP system. Hence, even though the SAP system itself is up and functioning properly, this type of issue is considered an outage, because the business is not able to connect to the SAP systems, and users are unable to do their work. In many cases, network issues occur due to lack of redundancy, meaning systems that were not built with sufficient redundancy in place. If there is a single point of failure somewhere, and that single point does fail, it will cause an outage or outage-like consequences. The key here is to identify and minimize single points of failure as much as possible.
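
One simple way to look for single points of failure is to model the path from the users to SAP as a graph, remove one device at a time, and see whether the users can still get through. The sketch below does exactly that with a breadth-first search. The topology and device names are hypothetical and only serve the example; a real landscape would have far more components.

```python
from collections import deque


def reachable(graph, start, goal, removed=None):
    """Breadth-first search from start to goal, skipping the removed node."""
    if start == removed or goal == removed:
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbor in graph.get(node, ()):
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def single_points_of_failure(graph, start, goal):
    """Return the devices whose failure alone cuts start off from goal."""
    candidates = [n for n in graph if n not in (start, goal)]
    return sorted(
        n for n in candidates if not reachable(graph, start, goal, removed=n)
    )


# Hypothetical topology: two redundant switches, but only one firewall.
topology = {
    "users": ["switch_a", "switch_b"],
    "switch_a": ["users", "firewall"],
    "switch_b": ["users", "firewall"],
    "firewall": ["switch_a", "switch_b", "sap"],
    "sap": ["firewall"],
}

if __name__ == "__main__":
    print(single_points_of_failure(topology, "users", "sap"))  # ['firewall']
```

Either switch can fail without cutting off the users, but the lone firewall cannot, which is the kind of finding that should lead to adding a second device.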

 

Performance related issues. A performance issue is not necessarily an outage, yet it can impact the users to the point where the SAP system becomes unusable. For example, during end-of-month activities on an SAP Production system, there are critical tasks that must be completed within a certain timeframe. If there is a performance issue, such as slow database performance, the business is unable to complete the work within the allocated timeframe. This is considered an outage to some extent, because the system is not working as intended; it does not provide what it needs to, even though it is still running and didn’t crash.

 

Insufficient proactivity or reactivity to the system. From a purely technical perspective, and this is also tied to monitoring: if there is a critical issue and it is not resolved, then down the line, if it continues to be neglected, there’s a high chance that it will come back and bite your behind. This is one of the sure ways to guarantee unplanned downtime. Let’s begin with an example of what I call ‘lack of reactivity’. Assume that we observe a disk that is on its way to getting full. It still has some space, so we’re not taking any steps now. Then it indeed gets full and causes a system crash. This type of crash can fall under the ‘lack of monitoring’ category, but it also reflects a failure to take the necessary steps when the warning signs are visible. This is lack of reactivity.

 

By proactivity, I mean making sure that alarms are not triggered on the system in the first place. For example, in SAP there are health check and housekeeping jobs that need to run, such as jobs that clean up tables. If these health checks and housekeeping jobs are not performed, it can lead to issues that cause downtime. Proactivity here means that at oXya, we have people who handle customers’ systems on a daily basis, making sure that the systems are in a healthy state. If this proactive treatment is missing on your Production environment, it can certainly cause issues and even downtime at some point.
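
Returning to the disk that is on its way to getting full: a rough estimate of how many days remain turns "it still has some space" into a deadline someone has to act on. The sketch below assumes simple linear growth between the first and last reading, which is a simplification, and the sample numbers are made up for illustration.

```python
def days_until_full(samples, capacity_gb):
    """Estimate days until the disk fills, from (day, used_gb) samples.

    Uses the average growth between the first and last sample and
    returns None when usage is flat or shrinking.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (first_day, first_used), (last_day, last_used) = samples[0], samples[-1]
    elapsed = last_day - first_day
    if elapsed <= 0:
        raise ValueError("samples must span a positive number of days")
    growth_per_day = (last_used - first_used) / elapsed
    if growth_per_day <= 0:
        return None
    return (capacity_gb - last_used) / growth_per_day


if __name__ == "__main__":
    # Hypothetical readings: 400 GB used on day 0, 435 GB on day 7, 500 GB disk.
    print(days_until_full([(0, 400.0), (7, 435.0)], capacity_gb=500.0))  # 13.0
```

With these readings the disk grows by 5 GB per day and has 65 GB left, so there are about 13 days to extend the volume or clean up before the crash described above.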

 

Testing. We view insufficient testing as a major cause of downtime, and it has both technical and business aspects. From a technical aspect, patches and updates (software, firmware) need to be applied in order to keep the system up to date. If a system is out of date (patch-wise), there can be a software failure that causes the system to go offline. Lack of testing, when applying a patch to the Production system, can also cause issues. For example, let’s say that there’s a Windows patch that needs to be applied to the server, because it’s a critical software patch that fixes a known bug. Let’s further assume that this bug is known for its ability to crash the SAP system. The correct way to perform testing is to apply this patch to your Dev environment and test it for a week, to make sure that everything is working fine. Then you install the patch on your Quality environment and do another week of testing to make sure there are no issues. Only then do you install the patch on your Production environment.

 

We’ve seen cases where this testing routine was not performed, and the patch was applied simultaneously to all three systems (Dev, QA, Production). If this is done and the patch, for whatever reason, causes an issue with the SAP application, you only discover it once Production is already affected, which can lead to an outage.

 

The key is to perform rigorous testing whenever a change is introduced to the SAP environment, whether it’s a patch, a functionality that is being implemented by the business, etc. Such changes need to be gradually applied and thoroughly tested, with a large-enough time window, so that any issue is caught before the change hits Production.
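
The Dev, then Quality, then Production routine described above can be expressed as a small gate that refuses to promote a patch until the previous environment has had its week of testing. This is a sketch under stated assumptions, not a description of any SAP or oXya tool: the stage names, the seven-day window taken from the example, and the function name are all illustrative.

```python
from datetime import date, timedelta

STAGES = ["DEV", "QA", "PROD"]  # promotion order from the example above
SOAK = timedelta(days=7)        # one week of testing per stage (assumption)


def next_allowed_stage(history, today):
    """Return the next stage the patch may be applied to, or None.

    `history` maps a stage name to the date the patch was applied there.
    """
    for index, stage in enumerate(STAGES):
        if stage in history:
            continue
        if index == 0:
            return stage  # nothing applied yet: start with DEV
        previous = STAGES[index - 1]
        if today - history[previous] >= SOAK:
            return stage  # previous stage has been tested long enough
        return None       # previous stage is still being tested
    return None           # patch is already on every stage


if __name__ == "__main__":
    applied = {"DEV": date(2016, 1, 4)}
    print(next_allowed_stage(applied, date(2016, 1, 7)))   # None
    print(next_allowed_stage(applied, date(2016, 1, 11)))  # QA
```

A gate like this makes the "all three systems at once" shortcut impossible by construction, since Production only becomes available after Quality has been patched and the testing window has passed.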

 

Updates & patches. Keeping the system up to date should be a top priority for any SAP team; unfortunately, we see many systems that are not kept up to date, and this is especially the case with Production systems. We do ‘understand’ the reasoning for systems that are not up to date; downtime, especially planned downtime where you have to bring a system down to apply a patch, is not something that can be easily provided by the customer/business. Hence, it’s always a challenge to keep the systems up to date, whether it’s an operating system patch, a firmware update, or any other type of patching that needs to be applied.

 

If the entire system is not kept up to date, there are potential bugs or issues that can occur, and can bring down the system. We’ve seen that in the past – there was a bug in the Windows environment that crashed systems; this occurred because a specific patch, that was released two years earlier, was never applied to the systems that crashed. If patching and updating of the environment is neglected, this can (and most likely – will) eventually cause downtime to your SAP environment.
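
A patch that sits unapplied for two years is easy to spot if someone regularly compares what has been released against what has been installed. The sketch below shows that comparison. The patch identifiers, dates, and the 90-day limit are invented for the example; the real inventory would come from your vendor's patch listings and your own change records.

```python
from datetime import date


def overdue_patches(released, applied, today, max_age_days=90):
    """List (patch_id, age_in_days) for patches released more than
    max_age_days ago that were never applied, oldest first."""
    overdue = []
    for patch_id, release_date in released.items():
        age = (today - release_date).days
        if patch_id not in applied and age > max_age_days:
            overdue.append((patch_id, age))
    return sorted(overdue, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical inventory: identifiers and dates are illustrative only.
    released = {
        "PATCH-001": date(2014, 1, 14),
        "PATCH-002": date(2015, 12, 1),
        "PATCH-003": date(2015, 6, 1),
    }
    applied = {"PATCH-003"}
    print(overdue_patches(released, applied, today=date(2016, 1, 14)))
    # [('PATCH-001', 730)]
```

Here the two-year-old patch is flagged, while the recent one is still inside the window in which it can be tested and scheduled with the business.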

 

Unplanned Downtime & Human Error

 

The majority of unplanned downtime events can be tied to human error, whether it is the result of lack of planning, lack of testing, lack of reactivity or proactivity, and so on. Of course, downtime can also occur without human error, for example if there is a circuit outage at the datacenter and the entire datacenter goes offline. Such a case would usually be considered force majeure. Still, one can say that even a circuit outage can be considered a human error, because something wasn’t tested, and/or something wasn’t done right, and/or there should have been a disaster recovery (DR) site so that the entire system (spread across more than one site) never fully crashes, and so on. However, at the end of the day, in such cases it’s difficult to put the blame on human error, because there are a lot of teams and many elements in play.

 

Some items are usually attributed to human errors and some aren’t. The items that are usually attributed to human error are:

 

Performance issues. Let’s look at an Oracle database, for example. The database has tables, and the tables have indexes. If an index is in ‘bad shape’ and hasn’t been rebuilt (because the database administrators did not catch it), this can lead to performance issues. Furthermore, it can lead to a scenario where it is the end of the month, and the business cannot run specific actions on time. This causes headaches and ‘downtime’ for the users. If the administration team, through their proactivity and monitoring tools, finds that this index needs to be rebuilt and catches it before month-end, this downtime can be avoided.

 


Patching (or lack of) issues. If a crash resulted from a known bug, and a patch for that bug existed but was not installed, then that also falls into the human error category. Such patches should be installed, after scheduling with the customer.

 

Lack of DR and/or single points of failure. Many Production systems have disaster recovery and replication in place, especially for enterprise customers, so that even if the main Production system goes offline, it is brought up on the replicated DR site pretty quickly. Still, there are many smaller companies that do not have a DR site. Such companies often host their SAP onsite, at their own offices, sometimes in a small onsite datacenter. Such installations typically have many single points of failure, such as relying on a single source of electricity with no generator; having no DR and replication for the system; depending on a single network device with no backup; and so on.

 

If we take an overall, honest look at human error, we have to admit that mistakes do happen. For example, administrators have above-average access to the system, so they can accidentally delete files, shut down systems, and so on. It’s not unheard of for an administrator to delete a critical piece of software that the system needs in order to be up and running. So, human errors along those lines can and do cause outages.

 

The constant struggle is to have some sort of system in place that minimizes such human errors as much as possible. This involves having second checks in place. I can tell you how we operate at oXya: if there’s a critical action being performed on a Production system over the weekend, for example, we don’t let a single person perform that action. We have a minimum of two experts, and sometimes three, working on the same item. The purpose of having multiple people working together on the same item is to double check everything and go over the actions that any specific person would take, in order to eliminate human mistakes as much as possible.

 

Through these types of check systems, we can eliminate or at least significantly reduce the types of human errors that occur. At oXya, there’s a second person assigned to checking all the work being done, especially when dealing with the customer’s Production environment.
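
The "second person" rule can also be enforced by tooling rather than by habit alone. As a rough sketch, a change request for Production could be rejected unless at least two distinct people are attached to it. The field names and the dictionary layout below are assumptions made for illustration, not the format of any real ticketing system.

```python
def may_execute(change, minimum_people=2):
    """Allow a Production change only if enough distinct people are involved.

    `change` is a dict with "environment", "executor", and "reviewers" keys
    (a hypothetical layout used only for this example).
    """
    if change["environment"] != "PROD":
        return True  # the rule in this sketch only guards Production
    people = {change["executor"], *change["reviewers"]}
    return len(people) >= minimum_people


if __name__ == "__main__":
    solo = {"environment": "PROD", "executor": "alice", "reviewers": []}
    paired = {"environment": "PROD", "executor": "alice", "reviewers": ["bob"]}
    print(may_execute(solo))    # False
    print(may_execute(paired))  # True
```

Using a set means that listing the same person as both executor and reviewer does not satisfy the rule, which is the whole point of a second check.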

 

How can unplanned downtime events be minimized or omitted?

 

I’ve already provided some advice earlier, like always having more than one person when applying changes to the Production system; or keeping the systems up to date.

 

Taking the ten thousand foot view, the key to minimizing downtime is proactivity. All of the monitoring and automated checks that exist on all systems (or on most of them) are great, but they’re not enough. You need something extra on top of those automated systems, and that something extra comes from the “human touch”. Through proactivity, and the experience that oXya has in handling these sophisticated landscapes, we provide this extra layer of security for our customers.

 

What do I mean by “human touch”?

 

At oXya, of course we have all of the possible monitoring on the Production landscapes, including automated monitoring. However, we don’t settle for these alone. The “human touch” aspect is that on a daily basis, we have people logging into these systems, performing manual daily checks, and making sure that the systems are up, running, and in good condition. We believe that in addition to the automated checks, which are very thorough, what really ensures that everything is caught is having human interaction with the systems on a daily basis; and if something is caught, it is handled immediately by SAP experts.

 

Furthermore, testing is very important, especially with patches, to prevent unplanned outages. Make sure that both the hardware (firmware) and software are up to date, and that these updates go through rigorous testing. If there’s a database patch that needs to be applied, for example, we would coordinate that very carefully with the business; if there is a sandbox environment, we’ll apply the patch there first, to have less impact on the business.

 

To summarize, I see four key elements to omitting unplanned downtime: patching, monitoring, reactivity to issues, and lots of proactivity. These are the key elements that oXya has been following over the years, and that have led to having lots of success with our customers.

 

------------------------------------

 

Sevag Mekhsian is a Service Delivery Manager at oXya, a Hitachi Data Systems company. Sevag is an SAP expert with more than eight years of experience, the last four of which were spent leading a team of ten SAP professionals at oXya’s service center in New Jersey.
