
Hornbill Connectivity Issues


Keith Stevenson


As with any customer-impacting issue (performance degradation or outage), we conduct a review/post-mortem of the event to understand the root cause, identify any changes that are needed, and put steps in place to reduce the likelihood of a similar event in the future. This week there have been three short time windows during which around 10% of our customers in production would have experienced issues (either slowness or timeout errors), so in the interests of being totally transparent we are publishing more information about these incidents.

Hornbill has seen significant take-up as our customers adopt and expand their use and find new and interesting ways of using our platform, APIs and integrations. We monitor all aspects of our systems and application performance and take the quality of the service very seriously. The ever-evolving patterns of use mean we are always tuning and optimising our systems. We are currently re-structuring our platform as we transition to a 100% microservices architecture; service discovery is a fundamental part of that, and to properly facilitate it we have had to roll out a new TCP addressing scheme. In parallel, we periodically introduce new compute resources (servers, storage and so on) to ensure we can take advantage of the latest hardware for optimum performance. It is natural for a service provider like us to make these types of changes, and they are generally completely transparent to our customers. However, the overall root cause of this week's issues was ultimately a combination of unexpected individual outcomes of the work we are doing in these areas.

In summary, the issues would have affected around 5% of our EU customers for a total of two hours, during which time they would have seen intermittent errors or slow performance (around 2% would have experienced both). So while on the face of it these looked like one problem, there were actually three completely separate issues that presented in the same way.

HTL INCIDENT REPORT 131120170001
At 14:00 on Monday 13th Nov 2017 our monitoring systems detected slow queries on one of our primary database servers. The root cause was found to be exceptional disk load on its hypervisor due to the migration of some services from one hypervisor to another. Under normal circumstances a migration of this nature would be fine and quite fast, but there was also a high load (IOPS) on the underlying storage, which slowed things right down. It was unwise to terminate the migration once started, so we made the decision to let it complete. The action completed at 15:48 and all monitors reported normal. We do not migrate services like this every day, and the constant storage writes and the impact they had on the performance of the service for some customers were unexpected.

We have now changed our process so that any virtual machine with either a disk over 100GB or high IOPS will not be moved during the core hours for its physical location. We have good statistical information about when services are busy and quiet, so this is easy to plan for.
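To give a feel for what that policy looks like in practice, here is a simplified sketch of a pre-migration gate (written in Go purely for illustration; the thresholds, field names and VM name are placeholders, not our actual tooling):

```go
package main

import (
	"fmt"
	"time"
)

// vmStats describes the virtual machine being considered for migration.
// The field names are illustrative, not a real schema.
type vmStats struct {
	Name      string
	DiskGB    int
	AvgIOPS   float64
	CoreStart int // start of core hours for the VM's physical location, 24h clock
	CoreEnd   int // end of core hours
}

// canMigrateNow applies the policy above: VMs with a disk over 100GB or with
// high IOPS are not moved during their location's core hours.
func canMigrateNow(vm vmStats, now time.Time, highIOPSThreshold float64) bool {
	heavy := vm.DiskGB > 100 || vm.AvgIOPS > highIOPSThreshold
	inCoreHours := now.Hour() >= vm.CoreStart && now.Hour() < vm.CoreEnd
	return !(heavy && inCoreHours)
}

func main() {
	vm := vmStats{Name: "db-example-01", DiskGB: 350, AvgIOPS: 1200, CoreStart: 8, CoreEnd: 18}
	if canMigrateNow(vm, time.Now(), 500) {
		fmt.Println("migration allowed")
	} else {
		fmt.Println("defer migration until outside core hours")
	}
}
```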


HTL INCIDENT REPORT 141120170001
At 09:45 on Tuesday 14th Nov 2017 a subset of customers started to report failures caused by SQL queries erroring. It was the same set of customers impacted by 131120170001, and the cause was zombied database connections. In our platform, all SQL queries are run over HTTP using our own in-house developed HTTP <-> MySQL proxy, which handles connection pooling, reuse and database security. As a result of the migration of services and the IP changes, our data service proxy ended up with TCP connections that it thought were still active but were not. A simple restart of the data proxies resolved the issue, but it took us a few minutes to figure out what was happening. The resolution was to restart the data services on the affected nodes, and all connections were restored by 09:56.

Post incident, we are making some code changes to the data service proxies to deal with unexpected zombie connections under these specific circumstances. This fix will be rolled out in the next week or so.
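For readers curious what that kind of fix looks like, the sketch below is illustrative only and is not our actual proxy code (the DSN, timings and driver choice are placeholders). It bounds the lifetime of pooled MySQL connections and validates them before use, so a connection left half-open by an IP change is re-dialled rather than reused:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql" // MySQL driver; the DSN below is a placeholder
)

func main() {
	// Placeholder DSN, not a real endpoint.
	db, err := sql.Open("mysql", "user:pass@tcp(db.internal:3306)/app")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Recycle pooled connections regularly so a stale (zombie) TCP connection
	// cannot sit in the pool indefinitely after an address change.
	db.SetConnMaxLifetime(5 * time.Minute)
	db.SetConnMaxIdleTime(1 * time.Minute)
	db.SetMaxIdleConns(10)

	// Validate-on-borrow: ping with a short timeout before doing work,
	// so a dead connection surfaces as a fast, retryable error.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	if err := db.PingContext(ctx); err != nil {
		log.Printf("connection check failed, pool will re-dial: %v", err)
	}
}
```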


HTL INCIDENT REPORT 141120170002
At 17:20 on Tuesday 14th Nov 2017 monitoring detected slow queries on one of our database servers. The root cause was found to be a virtual disk lock applied by XFS, which was preventing subsequent writes, so all queries were stacking up as pending in memory. The system was performing slowly because it was reaching memory limits and swapping. Once we recognised the problem we cleared the lock at 17:26 and all monitors reported normal. The lock was caused, in part, by the original issue on Monday and the creation of the large delta file for the migrated virtual machine.

Post incident, a new monitoring check has been added to specifically watch the XFS cache/locks and alert on any issues before they can impact the service in the way they did here.
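As a rough illustration of this style of check (not the exact monitor we deployed; the mount point and threshold below are placeholders), a probe can time a small synchronous write on the XFS volume and raise an alert if the fsync stalls:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// probeWriteLatency writes and fsyncs a small file on the given mount and
// returns how long the synchronous write took. A blocked or locked filesystem
// shows up as a very slow (or hung) fsync.
func probeWriteLatency(mount string) (time.Duration, error) {
	f, err := os.CreateTemp(mount, "io-probe-*")
	if err != nil {
		return 0, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	start := time.Now()
	if _, err := f.Write([]byte("probe")); err != nil {
		return 0, err
	}
	if err := f.Sync(); err != nil { // fsync: force the write through to disk
		return 0, err
	}
	return time.Since(start), nil
}

func main() {
	mount := filepath.Clean("/var/lib/mysql") // illustrative mount point
	const threshold = 500 * time.Millisecond  // illustrative alert threshold

	latency, err := probeWriteLatency(mount)
	if err != nil {
		fmt.Printf("ALERT: probe failed on %s: %v\n", mount, err)
		return
	}
	if latency > threshold {
		fmt.Printf("ALERT: fsync on %s took %v (threshold %v)\n", mount, latency, threshold)
	} else {
		fmt.Printf("OK: fsync on %s took %v\n", mount, latency)
	}
}
```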

HTL INCIDENT REPORT 151120170001
At 19:01 on Thursday 16th Nov 2017 customers reported that new login requests were unsuccessful; existing connections remained functional. The root cause was found to be related to the IP address changes made earlier in the day, which went undetected until the DNS cache expired at around 19:00. There was a reference in our front-end application code to a legacy API endpoint name that we had changed some months ago, and the new IP scheme did not include this legacy endpoint.

Post incident, we have reduced the TTL on all internal DNS records and have reviewed our change process, adding the step that any IP change requires a corresponding DNS review. We are not expecting a recurrence of this type of problem because we have no expectation that there will ever be a requirement to change our IPv4 addressing scheme again.
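To illustrate that additional DNS step (again a simplified sketch rather than our actual change tooling; the hostnames and address range are placeholders), a validation pass can resolve every endpoint name the applications reference, including legacy aliases, and fail the change if anything resolves outside the new scheme:

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Placeholder endpoint names and address range, not our real ones.
	endpoints := []string{"api.example.internal", "legacy-api.example.internal"}
	_, allowed, _ := net.ParseCIDR("10.20.0.0/16")

	failed := false
	for _, host := range endpoints {
		addrs, err := net.LookupHost(host)
		if err != nil {
			fmt.Printf("FAIL: %s does not resolve: %v\n", host, err)
			failed = true
			continue
		}
		for _, a := range addrs {
			if ip := net.ParseIP(a); ip == nil || !allowed.Contains(ip) {
				fmt.Printf("FAIL: %s resolves to %s, outside the new scheme\n", host, a)
				failed = true
			}
		}
	}
	if !failed {
		fmt.Println("OK: all endpoints resolve within the new addressing scheme")
	}
}
```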

We apologise unreservedly for any inconvenience caused this week, and we will continue to do everything in our power to ensure we do not see a repeat of the above issues. The nature of IT means these sorts of things happen from time to time, and when they do we take them very seriously and worry about them so that our customers never have to.
