In this post, one of our DevOps specialists explains a recent incident which occurred on our production systems, and how we used it to improve our service.
How do we encounter systems failure? Sometimes it's genuinely a rapid phenomenon, but more often it happens, as Hemingway would have it, in two ways: "gradually, then suddenly". The final failure is often a surprise – an irrecoverable resource shortage, a mistyped command wiping a production database, a massive Denial of Service attack by a malicious actor. Unpredictable, one might say; unfortunate but statistically unavoidable. This is partly true, but it's far from the whole story.
Early signs of a problem are often subtle. Somewhat increased memory usage on a production cluster with no obvious increase in the number of services. A few more servers being created by the autoscaler at times when high usage is unexpected. Some systems, or the maintainers of those systems, will notice these irregularities and raise the alarm, but a significant majority will either not notice them or dismiss them as the normal variation that all metrics driven by user activity undergo. Even if an anomaly is noticed and flagged, it can take quite some time to recognise that it represents a potential problem. In part this is because any given anomaly may not represent a problem; a little more or less time or resource usage than expected may be neither here nor there in isolation, especially for a growing application which is gathering additional responsibilities.
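Spotting these subtle shifts can be partly automated by comparing each new sample against a rolling baseline. The sketch below is purely illustrative – the window size, threshold and metric values are assumptions, not our actual monitoring configuration:

```python
from collections import deque
from statistics import mean, stdev

def make_anomaly_checker(window=48, threshold=3.0):
    """Flag a metric sample that deviates from a rolling baseline.

    Illustrative only: real monitoring stacks offer far richer
    anomaly detection, but the principle is the same.
    """
    history = deque(maxlen=window)

    def check(sample):
        anomalous = False
        if len(history) >= 10:
            mu, sigma = mean(history), stdev(history)
            # Guard against a perfectly flat baseline (sigma == 0).
            if sigma > 0 and abs(sample - mu) > threshold * sigma:
                anomalous = True
        history.append(sample)
        return anomalous

    return check

check = make_anomaly_checker()
for ram_mb in [512, 515, 510, 514, 511, 513, 512, 510, 514, 512]:
    check(ram_mb)          # builds the baseline; none of these flag
print(check(900))          # prints True: well outside the baseline
```

The point of routing every metric through a check like this is that the decision to raise the alarm no longer depends on a human happening to notice a slightly higher number on a dashboard.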
In mid-November we at Market Dojo encountered one of these situations. We noticed slightly elevated RAM utilisation on one of our test environments. The situation didn't worsen and no particular alarm was generated. One of our staff did look into it, as we always have an eye to the performance of the application, but didn't designate it a significant problem. The software was released, and all seemed to be as it should, for a while. That release included a new feature which retrieves and displays a timeline of all the events which concern a user. After a few days of production use we realised we were beginning to see concerningly high RAM usage, and spent some time looking into it. Unfortunately, even at this point we were investigating in a spirit of performance and code quality, rather than with urgency and an expectation of imminent problems.
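Without publishing the offending code, a common shape for this class of fault is a timeline that materialises a user's entire event history in memory on every request, so memory use grows with account age and the creep looks gradual. Every name in this sketch is hypothetical; it illustrates the pattern, not our actual implementation:

```python
def build_timeline_unbounded(fetch_events, user_id):
    # Anti-pattern: loads the user's entire history at once, so peak
    # memory grows without bound as the account accumulates events.
    return [format_event(e) for e in fetch_events(user_id)]

def build_timeline_paged(fetch_events_page, user_id, page_size=100):
    # Streams the history one page at a time, bounding peak memory
    # regardless of how many events the user has accumulated.
    page = 0
    while True:
        events = fetch_events_page(user_id, page, page_size)
        if not events:
            break
        yield from (format_event(e) for e in events)
        page += 1

def format_event(event):
    # Hypothetical rendering of one timeline entry.
    return f"{event['at']}: {event['kind']}"
```

Both versions produce the same timeline; the difference only shows up as the event count grows, which is exactly why this kind of fault passes testing and surfaces slowly in production.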
We saw the first minor performance degradation for some users shortly after this, which triggered our incident response procedure, dedicating someone to investigate the cause, the circumstances and potential rapid mitigation. The available evidence was thin – consequences but unclear causes; slow requests that could be either the cause or the effect. After some digging, and with full attention of our internal DevOps personnel, we narrowed it down to a particular API endpoint which appeared to be overusing resources on each request, with cumulative effects. By this point we had encountered a major performance degradation for a small number of users, during which services were entirely disrupted. What had been a slow burn over a few days of “that’s funny…” and “I wonder why…?” had, seemingly suddenly, become an acute and fatal incident which we scrambled to identify, mitigate, document and patch.
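One way to narrow cumulative overuse down to a single endpoint is to record the net memory each request leaves behind. This sketch uses Python's standard-library tracemalloc module; the endpoint name and handler are invented, and our real investigation relied on production metrics rather than this exact tooling:

```python
import tracemalloc
from functools import wraps

tracemalloc.start()   # begin tracing allocations process-wide
request_log = []      # (endpoint, bytes retained) per request

def track_memory(endpoint):
    """Log the net memory a request leaves behind (illustrative only)."""
    def decorator(handler):
        @wraps(handler)
        def wrapped(*args, **kwargs):
            before, _ = tracemalloc.get_traced_memory()
            try:
                return handler(*args, **kwargs)
            finally:
                after, _ = tracemalloc.get_traced_memory()
                request_log.append((endpoint, after - before))
        return wrapped
    return decorator

_cache = []  # simulates state that grows on every request

@track_memory("/api/timeline")
def timeline_handler(user_id):
    _cache.append([0] * 10_000)  # each request retains more memory
    return len(_cache)

for _ in range(5):
    timeline_handler(1)

worst = max(request_log, key=lambda r: r[1])
print(worst[0])  # the endpoint retaining the most memory per request
```

An endpoint whose requests consistently retain memory after completing, rather than merely using it transiently, is the signature of the cumulative effect described above.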
The elapsed time from being sure there was a problem to issuing a code and infrastructure update – one which both removed the proximate cause and mitigated the worst of the effects should any other cause arise – was approximately two days, but the underperforming code had been in production for nearly two weeks. Awareness of a possible fault, like the underlying resource leakage itself, had grown gradually. The systems failure, when it occurred, was sudden, unexpected and total.
In this case, Market Dojo personnel were able to apprehend, analyse and mitigate the problem in a relatively short time, largely thanks to having solid processes in place for responding to service interruptions. Our ability to scale our platform to accommodate almost any load enabled us to force a short-term oversupply of the overused resources, minimising customer impact once the cause was identified. Our caution around performance meant that by the time the anomaly became a problem it was already under part-time investigation, with an individual available who knew where to look. Our failure to treat the anomaly as a serious issue in and of itself, however, caused avoidable downtime. The frog knew it was getting warm, but still got scalded.
A failure to recognise and react with sufficient attention to these signs can become a habit. An anomaly becomes business as usual; or, if not one particular anomaly, then a perception of the system as 'flakey', or of a metric as 'variable, but it has always been that way', becomes ingrained and forms the new baseline against which other anomalies are measured. In software we most commonly encounter this in terms of performance – memory usage, CPU usage, webpage load speed – but in other fields the warnings might be an increase in intercepted 'chatter', or the annoying but ignorable whine of a pump running beyond its designed throughput. For procurement personnel, it might be a wobble in the share price of a key supplier, or news of political or geological unrest in a country which produces a vital raw material. These things may not individually be a cause for panic, but we ignore or downplay them at our peril.
As we grow, with ever larger clients and worldwide scope, we encounter more risks to consider, intrinsically coupled with more opportunities to improve our platform's scope and resilience. Avoiding this habit and taking advantage of those opportunities is therefore extremely important: at Market Dojo we treat anomalies as warnings, flagging the strange and unusual as risk factors. In the majority of cases the cause is identified and the potential issue removed long before it has a chance to materialise. On those occasions when an issue grows slowly enough, or seems harmless enough, to slip past our checks, we endeavour to take swift and decisive action as soon as it comes to our attention.