Service outage across several instances [RESOLVED]


davidrb84
Message added by Victor

We are experiencing issues with the service across several instances

The issue is now resolved. RCA is now complete and available.

Please follow https://status.hornbill.com/



At 14:40 GMT, a large number of customer instances were unavailable for around 30 minutes. The problem was caused by a deployment issue with our configuration data, which prevented our nodes from serving API requests. This meant that customers were unable to log in or access any Hornbill services once the backup caches timed out.

Timeline
14:36 - Change made to the configuration database, which is replicated to all data centers within 5 minutes.
14:40 - Instances began taking on the new configuration data, which was malformed, causing instances to fail and become unable to respond to API requests.
14:45 - Root cause identified and located.
14:58 - Fix implemented and deployed.
15:00 - 15:19 - All instance services were recovered and restarted.
15:20 - Service across all regions and data centers fully restored and verified.

Description
One of the most critical elements of our service is our configuration infrastructure. It’s a distributed system with high levels of redundancy in every data center, but in this instance the structural integrity of the configuration data (XML content) was broken, and this caused an almost immediate problem for every instance. A redundancy scheme is in place to ensure that, should configuration information be unavailable, our nodes retain cached copies in memory for up to 8 hours. This should have prevented an immediate problem, but unfortunately this unusual event exposed a secondary problem which meant the cached configuration was not used, causing many processes to fail.
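For illustration only, the sketch below (in Python, with hypothetical names such as fetch_replicated_config and CACHE_TTL_SECONDS that are not Hornbill's actual code) shows the fallback behaviour the design intended: if the freshly replicated XML is missing or fails to parse, the node should serve from its in-memory cached copy for up to 8 hours rather than failing.

```python
import time
import xml.etree.ElementTree as ET

# Hypothetical sketch only: fetch_replicated_config() and CACHE_TTL_SECONDS
# are illustrative names, not Hornbill's actual implementation.

CACHE_TTL_SECONDS = 8 * 60 * 60  # cached config is considered usable for up to 8 hours

_cache = {"xml": None, "loaded_at": 0.0}


def fetch_replicated_config() -> str:
    """Placeholder for reading the latest replicated configuration XML."""
    raise NotImplementedError("stand-in for the real distribution-layer read")


def load_config() -> ET.Element:
    """Return parsed configuration, falling back to the in-memory cache
    when the freshly replicated copy is missing or malformed."""
    try:
        raw = fetch_replicated_config()
        root = ET.fromstring(raw)  # fails fast on structurally broken XML
        _cache["xml"], _cache["loaded_at"] = raw, time.time()
        return root
    except (OSError, ET.ParseError):
        # Design intention: serve from the cached copy rather than failing.
        age = time.time() - _cache["loaded_at"]
        if _cache["xml"] is not None and age < CACHE_TTL_SECONDS:
            return ET.fromstring(_cache["xml"])
        raise  # no usable cache left; the node genuinely cannot serve requests
```

In the incident described above, the equivalent of that except branch was not taken due to a secondary defect, so the malformed configuration caused processes to fail outright.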

As a result of this outage, we have made a number of important changes to our configuration infrastructure to remove any possibility of this particular problem occurring again in the future.
 

  • We have added integrity checks on configuration data generated from our CMDB before it is written to disk; this ensures that no broken configuration data can be replicated beyond our distribution servers (see the sketch after this list).
  • We have fixed a defect in our handling of configuration data so that cached data will be used in the case of a configuration data error, as per the original design intention.
  • We are reviewing our operating procedures for restarting processes to reduce the time taken.
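
As a rough sketch of the first change above (assumed details; the function and file names are illustrative, not the actual Hornbill tooling), an integrity check of this kind rejects malformed XML before it ever reaches disk, and therefore before it can replicate:

```python
import xml.etree.ElementTree as ET

# Illustrative only: validate_and_write() is a hypothetical helper, not the
# actual Hornbill tooling. The point is that malformed XML is rejected before
# it can reach the replication/distribution layer.

def validate_and_write(xml_text: str, path: str) -> None:
    """Refuse to publish configuration XML that is not well-formed."""
    try:
        ET.fromstring(xml_text)  # raises ET.ParseError if the XML is broken
    except ET.ParseError as err:
        raise ValueError(f"refusing to publish malformed configuration: {err}") from err
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(xml_text)
```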


Ultimately, this service outage was caused by a software change we made which had an unexpected knock-on effect. We had believed we were resilient to that kind of failure, but this turned out not to be the case. We take the quality of our service very seriously and pride ourselves on providing a very high-quality service from a reliability and availability point of view. We would like to apologize unreservedly for both the outage and for any disruption caused. We have done everything we can to ensure this cannot reoccur. Thank you for your patience and understanding.
