
IMPORTANT! Hornbill Instances Unavailable [RESOLVED]


Victor


We are experiencing a connectivity issue across all Hornbill instances. We are investigating where the issue lies and working to have it fixed. This is now fixed.

Please follow status.hornbill.com for further updates on this issue.

 

Issue started at: 16:12
Issue reported at: 16:18
Issue found at: 16:20
Issue fixed at: 16:21 - 16:25

Current Status: RCA complete

Note: some (if not all) instances should be up and running. The issue lies with our infrastructure configuration, but I do not have the details yet. Given the nature of the issue, I am not ruling out a recurrence, although it is unlikely. I mention it just so you are aware.


Our infrastructure team is investigating. At this point, I am unable to say where the issue occurred/occurs (i.e. whether it is in our internal systems or something external). I will keep updating this thread as we get more information.


We have identified what caused the issue and are working on a fix now. Some instances are up and running (all of them should be, really), but I am not ruling out a recurrence of the issue, so please bear with us for a bit longer.


@Joyce the info is in the post I made. The issue occurred at 16:18 and we found the cause at 16:25, at which point we started working on a fix. As I mentioned, it is possible that the issue might recur (I am still waiting for updates on this myself), but the service should have been back around that time. So my estimate now is about 10 minutes of downtime.

I'll post more updates as I get more info from our teams.


@all

The issue was caused by a failure in our CMDB infrastructure system, which affected all Hornbill instances. The issue occurred at 16:12 and we started to fix it at 16:23. We are still analysing the issue to ensure everything is stable and the service is available for all instances, but I will venture to say the service should now be fully working for all instances.

Speaking for the Hornbill team, I want to apologise for any inconvenience this issue caused.

I will continue to post further updates as I get them.


@all The RCA for this event is now complete. Please find the details below:

On 24/04/2018 at 16:13, our instance monitoring systems and customers started to see failures when attempting to connect to Hornbill instances, causing a temporary service disruption. Customers would have noticed timeouts or operations failing with various errors. These were the result of errors returned by the Hornbill configuration servers, which provide each customer instance/portal with information about its infrastructure resources. At 16:19 the errors began to clear, and by 16:21 the service was restored and fully available on all instances.

The root cause was a failure of the tool Hornbill uses to synchronise the Hornbill instance configuration files from the master repository to each of the primary/secondary configuration servers found in each data center (the synchronisation happens every 5 minutes). This resulted in the instance configuration files being temporarily unavailable.

To prevent a future occurrence of this issue, we have improved the synchronisation process so that the configuration servers are synchronised in sequence rather than in parallel. Each synchronisation is confirmed before the process moves on to the next server, which ensures that there will always be one set of configuration files intact.
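
For readers who want to picture the sequential approach, here is a minimal sketch in Python. It is not Hornbill's actual tooling: the function `sync_in_sequence`, the `sync_one` and `verify_one` callbacks, and the in-memory "servers" are all hypothetical names made up for illustration. It only shows the general idea of updating one server at a time and stopping as soon as a synchronisation cannot be confirmed, so the servers not yet touched keep their previous files.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def sync_in_sequence(
    servers: Sequence[str],
    sync_one: Callable[[str], None],
    verify_one: Callable[[str], bool],
) -> Tuple[List[str], Optional[str], List[str]]:
    """Synchronise servers one at a time, confirming each before moving on.

    Returns (confirmed, failed, untouched):
      confirmed - servers that were synchronised and passed verification
      failed    - the server whose verification failed, or None
      untouched - servers that were never started, so they still hold
                  their previous configuration files
    """
    confirmed: List[str] = []
    for index, server in enumerate(servers):
        sync_one(server)
        if not verify_one(server):
            # Stop here: the remaining servers are left alone on purpose.
            return confirmed, server, list(servers[index + 1:])
        confirmed.append(server)
    return confirmed, None, []


# Hypothetical in-memory stand-ins for a master repository and two servers.
master = {"instance.cfg": "v2"}
config_servers = {
    "primary": {"instance.cfg": "v1"},
    "secondary": {"instance.cfg": "v1"},
}


def copy_from_master(name: str) -> None:
    # Replace the server's files with a copy of the master set.
    config_servers[name] = dict(master)


def matches_master(name: str) -> bool:
    # Confirm the server now holds exactly the master set.
    return config_servers[name] == master


result = sync_in_sequence(["primary", "secondary"], copy_from_master, matches_master)
print(result)
```

The trade-off in a design like this is that a full synchronisation pass takes longer than a parallel one, but a failure partway through leaves at least one server with a known-good set of files.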

