
Unable to Access HB


Michael Sharp


All

At around 10:35 this morning, during a routine update to a backend server, an error occurred in our subscription checker. This caused a number of instances to temporarily lose the count of available subscriptions for a subset of applications.

The error was quickly identified and resolved by 10:37 (the change was backed out as per the change request plan).

This type of update to the backend server happens around once a week as we continuously develop and enhance our service, and this is the first time an error of this kind has been witnessed. The issue was not seen in our development or beta environments during testing.

In order to prevent issues like this in future, we are already looking at putting additional tests in place.

 

Kind Regards



Hi @Keith Stevenson, might I suggest that your infrastructure and the product are updated outside of normal business hours? I imagine the platform is critical for your customers' support teams, and I would have thought this is an easy mitigation to put in place.

As a firm, for example, we commit all business-critical changes during evenings and weekends to avoid any potential loss of productivity. Fortunately, on this occasion, the rollback was relatively quick!

Regards,

Mike.


@Michael Sharp

The change was, on the face of it, an innocuous one to allow ETags to be served for configuration information; based on our assessment before it was applied, it should not have had any impact on production systems. The fact that it did have an impact is a problem, because there is now something about the way our platform is functioning that we do not properly understand. We will of course now investigate this and determine why it happened and what we don't yet understand.
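
For context, serving ETags simply lets a client fetch configuration conditionally: it presents the ETag it last saw, and the server replies 304 Not Modified if nothing has changed. A minimal sketch of that mechanism, using a hypothetical endpoint and ETag value rather than our actual service, might look like this:

```cpp
// Minimal sketch of an ETag-based conditional GET for configuration data.
// The endpoint URL and cached ETag value are hypothetical; the point is the
// HTTP mechanics: send If-None-Match, and a 304 reply means the cached
// configuration is still current.
#include <curl/curl.h>
#include <iostream>
#include <string>

static size_t collect(char* data, size_t size, size_t nmemb, void* out) {
    static_cast<std::string*>(out)->append(data, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL* curl = curl_easy_init();
    if (!curl) return 1;

    std::string body;
    curl_easy_setopt(curl, CURLOPT_URL, "https://config.example.com/instance/config");
    // Present the ETag cached from the previous successful fetch.
    curl_slist* headers = curl_slist_append(nullptr, "If-None-Match: \"abc123\"");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);

    if (curl_easy_perform(curl) == CURLE_OK) {
        long status = 0;
        curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &status);
        if (status == 304)
            std::cout << "Configuration unchanged; keep the cached copy\n";
        else if (status == 200)
            std::cout << "New configuration received (" << body.size() << " bytes)\n";
    }

    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}
```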

The essence of the problem is this: each instance reads its internal configuration from one of two web services (we have two for redundancy), and for some reason the change affected our application code's ability to read its configuration data. Now, this system is designed to be fault tolerant inasmuch as, in the event of being unable to read configuration data, it should continue with the configuration data it already has in memory. Clearly this did not happen, so this has actually highlighted another problem. Rest assured, we will investigate, get to the bottom of why this is, and do something about it.
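
To show the intended fault-tolerant behaviour in outline (the class, function and URLs below are illustrative assumptions, not our actual code): try the primary configuration service, fall back to the secondary, and if both are unreachable carry on with the last copy held in memory.

```cpp
// Sketch of reading configuration from redundant services with an
// in-memory last-known-good fallback.
#include <optional>
#include <string>
#include <vector>

// Stand-in for the real HTTP fetch; for this sketch, pretend the service is unreachable.
std::optional<std::string> fetchConfig(const std::string& /*serviceUrl*/) {
    return std::nullopt;
}

class ConfigCache {
public:
    // Returns the freshest configuration available without failing just
    // because the configuration services are temporarily unreachable.
    const std::string& refresh() {
        for (const auto& url : services_) {
            if (auto fresh = fetchConfig(url)) {
                cached_ = *fresh;           // update the in-memory copy
                return cached_;
            }
        }
        return cached_;                     // both services down: last known good
    }

private:
    std::vector<std::string> services_{
        "https://config1.example.com/config",   // primary (hypothetical)
        "https://config2.example.com/config"};  // secondary (hypothetical)
    std::string cached_;                        // last successfully read configuration
};
```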

So I want to assure you that we do not make changes during production hours without considering risk. We were not expecting any problems, and for whatever reason we got that wrong this time, so we can only apologise. I would also like to take this opportunity to say that we make many changes a week as we evolve and improve our service for our customers. We get it right almost all of the time; it is just that when we do get it wrong, it is very visible.

As a point of interest, and in the interests of transparency, I will post back here with an update once we have done a post-mortem and figured out what went wrong.

Gerry


A quick update. 

We have now identified the root cause of the problem: a configuration issue that caused the configuration data to be unavailable to our application services. This in itself should not have caused production systems to fail; however, this condition exposed another, previously unknown problem. We have now identified this primary problem and are currently adding monitoring to these services for this particular condition, so our systems will identify the problem automatically should it happen again in the future.
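
As a rough sketch of the kind of monitoring we mean (illustrative only, not our actual monitoring code), a probe that alerts after a few consecutive failed reads of the configuration service is enough to surface the condition automatically:

```cpp
// Illustrative availability probe for a configuration service.
#include <chrono>
#include <iostream>
#include <string>
#include <thread>

// Stand-in for a real reachability check; for this sketch, pretend the service is down.
bool configServiceReachable(const std::string& /*url*/) { return false; }

void monitorConfigService(const std::string& url) {
    int consecutiveFailures = 0;
    const int alertThreshold = 3;          // alert after three failed checks
    for (int i = 0; i < 5; ++i) {          // bounded for the sketch; production would run continuously
        if (configServiceReachable(url)) {
            consecutiveFailures = 0;
        } else if (++consecutiveFailures >= alertThreshold) {
            // A real system would raise an incident / page an operator here.
            std::cerr << "ALERT: " << url << " unreachable for "
                      << consecutiveFailures << " consecutive checks\n";
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}

int main() { monitorConfigService("https://config1.example.com/config"); }
```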

In the case of the secondary problem, where the configuration information checksum data is unavailable, the application servers were also throwing an error. This was a design issue, as the code should have been far more graceful in its handling of this condition. As a result, we have now added code to ensure that application services will allow 8 hours of grace time should any configuration information be inaccessible. This was a problem waiting to happen, as the condition and the subsequent behaviour were previously unencountered and therefore unknown to us, which means it would have caught us out at some point. The change will be verified, tested and rolled out into production over the next 72 hours, ensuring that production instances will no longer be affected by our internal configuration services having a problem.
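
In outline, the grace-time behaviour looks something like the following sketch (the names and structure are assumptions for illustration, not the real implementation): keep running on the configuration already held, and only treat the missing checksum as an error once the 8-hour window has passed.

```cpp
// Sketch of an 8-hour grace window for unavailable configuration checksums.
#include <chrono>
#include <optional>
#include <string>

using Clock = std::chrono::steady_clock;

// Stand-in for reading the configuration checksum from the config service;
// an empty result means the checksum data is currently unavailable.
std::optional<std::string> readConfigChecksum() { return std::nullopt; }

class ConfigState {
public:
    // True while the service should carry on with its current configuration.
    bool withinGrace() {
        if (auto checksum = readConfigChecksum()) {
            lastChecksum_ = *checksum;
            lastSuccess_ = Clock::now();    // a successful read resets the window
            return true;
        }
        // Checksum unavailable: tolerate the condition for up to 8 hours.
        return Clock::now() - lastSuccess_ < std::chrono::hours(8);
    }

private:
    Clock::time_point lastSuccess_ = Clock::now();
    std::string lastChecksum_;
};
```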

You have to love continuous deployment and the speed with which we can identify, solve and release solutions. Once again, our apologies for any impact caused; even though it was less than two minutes of downtime, it still hurts our pride :)

Gerry

 



Hi Gerry,

Fantastic news, and also refreshing to see such a quick turnaround from a vendor. My issues weren't really in relation to today's outage, more the perceived presumption that your systems are immune to the unexpected!

Regards,

Mike.


@Michael Sharp

Our systems are not immune to the unexpected, but we always aim to minimise the unexpected - and we don't get everything right. The strategy is to have enough command over our systems that, should we get something wrong, we can fix it very quickly; that's what enables us to move forward quickly - in Agile terms it's called "failing forwards".

You might also like to know that, from a platform point of view, when it comes to defects that have a material impact on production we do not even have a concept of logging a defect in the traditional sense - there is no queue or backlog; we fix it there and then. A fix generally includes the following steps:

  • Diagnose based on the post-mortem; we aim to understand the problem without needing to reproduce it
  • With that understanding of the problem, we do a code review and a code fix, and in parallel with this happening...
  • We write a test case which fails on the current (pre-fix) code stream (a minimal illustration follows below)
  • We then build and push to 'test', where our tests run and the new code is verified to pass
  • And we loop around if we find any problem, either with the fix we applied or any other problem we inadvertently caused
In the core code (C++) domain, once we fully understand the problem, we can often be out to test with a fix within 30 minutes or so. Our standard practice, once all tests are passing on our test stream, is 24 hours on 'dev', 48 hours on 'beta', and then we push to live. This pipeline is continuously rolling most of the time, and in practice we have multiple fixes per build too. From time to time, maybe 4-5 times a year, we might do a hot-fix where a very specific single code change is pushed straight into production, but that is a lot of hard work and a far more manual process, so we avoid this scenario at all costs.
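
To make the failing-test step in the list above concrete, here is a purely illustrative sketch; the function and values are invented for the example and are not our real code or test harness. The new test must fail against the pre-fix code and pass once the fix is in, so the defect cannot silently return.

```cpp
// Illustrative regression test written before (or alongside) the fix.
#include <cassert>

// Hypothetical function under test: it should fall back to the cached count
// when the configuration service is unreachable (the pre-fix code returned 0).
int availableSubscriptions(bool configServiceUp, int liveCount, int cachedCount) {
    return configServiceUp ? liveCount : cachedCount;  // post-fix behaviour
}

int main() {
    // This assertion fails on the pre-fix behaviour and passes after the fix.
    assert(availableSubscriptions(/*configServiceUp=*/false, /*liveCount=*/0,
                                  /*cachedCount=*/42) == 42);
    return 0;
}
```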
 
The one other thing we do is that we never branch code into production; that means we *always* build from trunk, and trunk is always buildable and releasable. If we do introduce a problem along the way, we fix and push out rather than roll back in almost every single case.
 
So I hope you can see that we don't make any assumptions about the robustness of our platform, but we take providing a high-quality, robust service very seriously.
 
Gerry
 
