Is The Cloud Down, Or Just Disconnected?

News Room

Websites fail. We’ve all been exposed to various web-based sites, software and services that have failed to work, seeming to jam, hang or variously corrupt in one form or another. It’s easy to blame the cloud and the hyperscaler Cloud Services Provider (CSP) organizations that deliver these services, or point to some gremlin-like anomaly that has created a bug somewhere down the pipe.

In enterprise software application development and the associated realm of data science, we tend to refer to these occurrences as ‘SaaS outages’, when one or more Software-as-a-Service (SaaS) functions fail to make their way out of the datacenter and traverse the expanses of the cloud and the web to arrive on our mobile device, desktop or other machine.

When is an outage not an outage?

But not every SaaS outage is actually an outage. According to Mike Hicks, principal solutions analyst at Cisco ThousandEyes, the inherent internal complexity of cloud applications means that we should look inside what’s actually happening to any given cloud service and application before we decide who to blame, what’s at fault and how to fix it. It’s pretty clear that, as SaaS adoption has increased, these applications have become more complex and distributed in nature.

“Today, most applications rely on a vast web of interconnected dependencies to function,” Hicks reminds us. “If one of these dependencies [sections, components or libraries of software code required by another part of the software code structure in order to function], like a search function for example, is meddled with through updates or planned maintenance, it can in effect create a single point of failure that can render an application unusable. This is why a single point of failure (SPOF) in applications is often mistaken for an outage.”

What this means is that just because one small part of a now very cloud-interconnected codebase experiences a quirk (let’s say that an update got installed, but an incomplete set of software code was delivered, or it was corrupted for some reason), it’s not a question of the ‘cloud being down’; there is a more subtle internal reason for the whole disconnection. Hicks says that recent disruptions to Slack and X are good examples of this. The inability to send or load messages on Slack and brief server timeouts on X were initially thought to be the result of back-end connectivity issues. However, after taking a closer look, the Cisco ThousandEyes team says the user disruption was apparently caused by bug fixes and system changes to certain functions within these services.
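The knock-on effect Hicks describes can be sketched in a few lines of Python. This is a minimal illustration, not how Cisco ThousandEyes or any SaaS vendor models its services: the component names and the DEPENDS_ON map are hypothetical, and the function simply walks the dependency map upwards to list everything that relies, directly or indirectly, on one changed component.

```python
from collections import deque

# Hypothetical dependency map (an assumption for illustration):
# each component lists the components it directly relies on.
DEPENDS_ON = {
    "web_app": ["auth", "messaging", "search"],
    "messaging": ["message_store"],
    "search": ["search_index"],
    "auth": [],
    "message_store": [],
    "search_index": [],
}


def impacted_by(failed, depends_on):
    """Return every component that directly or transitively relies on `failed`."""
    # Invert the map so we can walk from a dependency up to its dependents.
    dependents = {name: set() for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(name)

    # Breadth-first walk upwards from the failed component.
    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return impacted
```

In this toy map, a faulty update to search_index reaches search and, through it, web_app, even though every server is still up and reachable. To a user, that looks like an outage.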

“Overcoming this issue starts with engineers and IT teams being able to see the big picture and gaining visibility over where planned maintenance work from other teams might impact an application down the line. This is difficult to do when you don’t have ownership of the infrastructure. Without the right tools to overcome this, IT teams are often left with an onslaught of recurring issues with wider user-facing impacts,” explained Hicks.

The ravages of outages

Through its own research, the Cisco ThousandEyes team notes that it has seen these SaaS degradations (a perhaps slightly softer term than ‘outage’, as SaaS vendors would argue that many other services are up and online) becoming more frequent as businesses become increasingly dependent on cloud applications. That dependence is a positive development in one sense, but one that also creates an inherently greater level of complexity all round. Without proper visibility and understanding of these types of system disruptions, these issues will continue to impact business performance.

“Odds are some of an organization’s most business-critical applications today are SaaS apps. Nowadays, even the most traditional on-premises applications have or are starting to transition into SaaS-based offerings. That trajectory is great and there’s no question about cloud-powered apps outperforming legacy apps,” enthused Hicks. “But, as we come to rely on apps that are serviced from SaaS network infrastructure that the enterprise itself does not actually ‘own’ because they are maintained by external service providers outside of the company’s control perimeter, these applications are also connected over cloud and Internet networks that the organization cannot see. So then, how do you go about troubleshooting outages and disruptions that are impacting either your users, be they employees or customers?”

Talking from experiences drawn through customer interactions that play out at exactly this level, Hicks points to ‘status pages’ as a good place to start analyzing the state of any given SaaS app. It’s here that we can find a long list of distinct services such as login information, Application Programming Interfaces (APIs), messaging protocols and so on.

Drowning in a ‘sea of green’ indicators

“But, as I’m sure any technical operative has experienced, it’s not uncommon to encounter a status page that displays a ‘sea of green’ indicators, declaring all services to be online and working, despite rising user complaints and the clear presence of issues. Why is that? It is because it’s in the ‘stitching’ within the distributed architecture powering the SaaS app where many issues arise,” clarified Hicks. “In short, monitoring SaaS app infrastructure is crucial, but you need to consider the entirety of the service delivery chain.”

The problem described here is of course real: companies like Cisco ThousandEyes wouldn’t exist if we didn’t need network intelligence and observability controls of this kind to ‘see into’ the increasingly abstracted world of cloud virtualization.

The company itself works to monitor network infrastructure, troubleshoot application delivery and map Internet performance, all from its own SaaS-based platform (on which it presumably turns its own mapping controls in order to make sure that the mapping process itself doesn’t break too often, if at all). Within its toolbox, Cisco ThousandEyes is able to emulate the experience of real users with technologies that look at interactions such as page loading and analyze multi-step transactions performed by users. This enables its engineers to display snapshots, perform service segmentation and present detailed waterfalls (a waterfall chart is a time-based representation of data that displays the relationship between events) and performance metrics.
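To make the idea of a multi-step transaction and its waterfall concrete, here is a rough Python sketch. It is an assumption-laden illustration rather than a description of the ThousandEyes product: run_transaction and render_waterfall are hypothetical helper names, and each step is any Python callable standing in for a real user action such as logging in or loading a page.

```python
import time


def run_transaction(steps, clock=time.perf_counter):
    """Run (name, callable) steps in order; record each start offset and duration in seconds."""
    origin = clock()
    timings = []
    for name, action in steps:
        start = clock()
        action()  # stand-in for a real user action such as a page load
        end = clock()
        timings.append((name, start - origin, end - start))
    return timings


def render_waterfall(timings, width=40):
    """Draw one text row per step, indented by when the step started."""
    total = max((start + duration for _, start, duration in timings), default=0.0)
    scale = width / total if total > 0 else 0.0
    rows = []
    for name, start, duration in timings:
        pad = int(start * scale)            # blank space before the bar
        bar = max(1, int(duration * scale))  # always draw at least one mark
        rows.append(f"{name:<12}|{' ' * pad}{'#' * bar} {duration * 1000:.0f} ms")
    return "\n".join(rows)
```

Passing a list such as [("login", do_login), ("load_inbox", do_load)], where do_login and do_load are your own functions, returns per-step timings that render_waterfall turns into rows of bars. A slow step then stands out visually even when every individual service reports itself as healthy.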

Agreeing with many of these comments but preferring to widen and clarify the argument further is Roman Spitzbart, VP EMEA solutions engineering from unified observability and security platform company Dynatrace. Encouraging us to think about system health in the widest sense possible, Spitzbart says that synthetic testing and real user monitoring capabilities are of course essential to IT operations teams’ ability to understand and manage the experience provided by their SaaS applications.

Dynatrace introduced these capabilities to its own platform in 2019 to enable IT departments to monitor SaaS application performance through users’ web browsers. This illuminated what had previously been a ‘black box’ understood only by cloud service providers.

“However, these capabilities are only effective if they are part of a joined-up approach to monitoring. Fragmented monitoring is the old world,” said Spitzbart, definitively. “It simply doesn’t work if you just look at SaaS application performance in isolation, as there are countless other factors that can impact the users’ experience. This could include anything from their network connection, to whether the browser they’re using is up to date, or if the functionality they’re using is reliant on a third-party plug-in from another service provider.”

“To chart a clear path through this complexity, SaaS application monitoring needs to be ingrained as part of a unified approach to observability and security,” Spitzbart added. “Without this, IT teams will be reliant on stitching together insights from multiple monitoring tools to find the answers they need to understand and manage the user experience for their SaaS applications effectively.”

Is your cloud down? Maybe, but probably not: the hyperscalers are really good at cleaning, balancing and strengthening the SaaS delivery pipes. There are plenty of ghosts in any machine, so it’s worth looking deep inside first.
