Victor Posted January 24, 2020 Service is unavailable across multiple instances. We are looking into this.
Stephen.whittle Posted January 24, 2020 Ours is back up, Victor, and seems OK performance-wise. Are you able to provide any assurance yet with regard to the resolution?
Victor Posted January 24, 2020 Author The issue that caused the service outage has been fixed. We are investigating to identify what caused it and will come back with updates. We are truly sorry for all the trouble this issue has caused. @Stephen.whittle, as with any issue like this, we need to investigate and understand the cause (what and how). Once this is done, we will post the details on https://status.hornbill.com/ and on this forum thread.
Victor Posted January 24, 2020 Author @nasimg your instance should now be back up and running.
Victor Posted January 28, 2020 Author @all We have now completed the RCA for this incident. Details as follows: On Friday, 24th January, at 10:47, our monitoring systems alerted us to a problem with our configuration servers. The root cause has been identified and relates to an error that caused a partial replication of our configuration database. During this time, some customer instances were affected and exhibited extended API response times, which for all intents and purposes made the affected instances unavailable. The problem was detected and rectified by 10:58, 11 minutes after the first alert was raised. We have levels of redundancy built into our configuration database deployment, but this was a new failure scenario that we had not previously seen, and our resilience strategy failed under these conditions. We have now implemented a fix and deployed it to production to ensure that, under similar circumstances in the future, we will not see the same failure mode. We unreservedly apologize for any inconvenience this caused.