Service outage across several instances [RESOLVED]


davidrb84
Message added by Victor

We are experiencing issues with the service across several instances

The issue is now resolved. RCA is now complete and available.

Please follow https://status.hornbill.com/



At 14:40 GMT, a large number of customer instances were unavailable for around 30 minutes. The problem was caused by a deployment issue with our configuration data, which prevented our nodes from serving API requests. This meant that customers were unable to log in or access any Hornbill services once the backup caches timed out.

Timeline
14:36 - Change made to the configuration database, which is replicated to all data centers within 5 minutes.
14:40 - Instances began taking on the new configuration data, which was malformed, causing instances to fail and become unable to respond to API requests.
14:45 - Root cause identified and located.
14:58 - Fix implemented and deployed.
15:00 - 15:19 - All instance services were recovered and restarted.
15:20 - Service across all regions and data centers fully restored and verified.

Description
One of the most critical elements of our service is our configuration infrastructure. It’s a distributed system with high levels of redundancy in every data center, but in this instance the structural integrity of the configuration data (XML content) was broken, and this caused an almost immediate problem for every instance. A redundancy scheme is in place to ensure that, should configuration information be unavailable, our nodes retain cached copies in memory for up to 8 hours. This should have prevented an immediate problem, but unfortunately this unusual event exposed a secondary problem which meant the cached configuration was not used, causing many processes to fail.
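For illustration only, the sketch below (in Python, with hypothetical names such as fetch_replicated_config and CACHE_TTL_SECONDS that are not Hornbill's actual code) shows the fallback behaviour the design intended: if the freshly replicated XML is missing or fails to parse, the node should serve from its in-memory cached copy for up to 8 hours rather than failing.

```python
import time
import xml.etree.ElementTree as ET

# Hypothetical sketch only: fetch_replicated_config() and CACHE_TTL_SECONDS
# are illustrative names, not Hornbill's actual implementation.

CACHE_TTL_SECONDS = 8 * 60 * 60  # cached config is considered usable for up to 8 hours

_cache = {"xml": None, "loaded_at": 0.0}


def fetch_replicated_config() -> str:
    """Placeholder for reading the latest replicated configuration XML."""
    raise NotImplementedError("stand-in for the real distribution-layer read")


def load_config() -> ET.Element:
    """Return parsed configuration, falling back to the in-memory cache
    when the freshly replicated copy is missing or malformed."""
    try:
        raw = fetch_replicated_config()
        root = ET.fromstring(raw)  # fails fast on structurally broken XML
        _cache["xml"], _cache["loaded_at"] = raw, time.time()
        return root
    except (OSError, ET.ParseError):
        # Design intention: serve from the cached copy rather than failing.
        age = time.time() - _cache["loaded_at"]
        if _cache["xml"] is not None and age < CACHE_TTL_SECONDS:
            return ET.fromstring(_cache["xml"])
        raise  # no usable cache left; the node genuinely cannot serve requests
```

In the incident described above, the equivalent of that except branch was not taken due to a secondary defect, so the malformed configuration caused processes to fail outright.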

As a result of this outage, we have made a number of important changes to our configuration infrastructure to remove any possibility of this particular problem occurring again in the future.
 

  • We have added integrity checks on configuration data generated from our CMDB before it is written to disk; this ensures that no broken configuration data can be replicated beyond our distribution servers (see the sketch after this list).
  • We have fixed a defect in our handling of configuration data so that cached data will be used in the case of a configuration data error, as per the original design intention.
  • We are reviewing our operating procedures for restarting processes to reduce the time taken.
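
As a rough sketch of the first change above (assumed details; the function and file names are illustrative, not the actual Hornbill tooling), an integrity check of this kind rejects malformed XML before it ever reaches disk, and therefore before it can replicate:

```python
import xml.etree.ElementTree as ET

# Illustrative only: validate_and_write() is a hypothetical helper, not the
# actual Hornbill tooling. The point is that malformed XML is rejected before
# it can reach the replication/distribution layer.

def validate_and_write(xml_text: str, path: str) -> None:
    """Refuse to publish configuration XML that is not well-formed."""
    try:
        ET.fromstring(xml_text)  # raises ET.ParseError if the XML is broken
    except ET.ParseError as err:
        raise ValueError(f"refusing to publish malformed configuration: {err}") from err
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(xml_text)
```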


Ultimately, this service outage was caused by a software change we made which had an unexpected knock-on effect. We had believed we were resilient to that kind of failure, but this turned out not to be the case. We take the quality of our service very seriously and pride ourselves on providing a very high-quality service from a reliability and availability point of view. We would like to apologize unreservedly for both the outage and for any disruption caused. We have done everything we can to ensure this cannot reoccur. Thank you for your patience and understanding.
