
Keith Stevenson

Root Admin
  • Content count

    2,237
  • Joined

  • Last visited

  • Days Won

    11

Keith Stevenson last won the day on August 17

Keith Stevenson had the most liked content!

Community Reputation

30 Excellent

About Keith Stevenson

  • Rank
    Senior Member

Profile Information

  • Gender
    Male
  • Location
    London
  • Interests
    SupportWorks, AssetWorks, Sci-Fi

Contact Methods

  • Website URL
    http://www.hornbill.com

Recent Profile Visitors

842 profile views
  1. Nasim, Thanks for the confirmation. Unfortunately the error message is somewhat generic and doesn't relate to the issues last week (see https://community.hornbill.com/topic/11663-hornbill-connectivity-issues/ for an overview of those). We will continue our investigation and look at providing a more specific error message. Kind Regards Keith Stevenson
  2. Nasim, We have looked at your instance and can see people/analysts performing requests at a fair rate. Can you confirm whether this is still affecting people and, if so, whether it affects all analysts? Kind regards Keith Stevenson
  3. Hornbill Connectivity Issues

    As with any customer-impacting issue (performance or outage) we conduct a review/post-mortem of the event to understand the root cause, identify any changes that are needed and put in place steps to reduce the likelihood of a similar event in the future. This week there were a number of short time windows during which around 10% of our production customers would have experienced issues (either slowness or timeout errors), so in the interest of being totally transparent we wanted to publish more information about these incidents.

    Hornbill has seen significant take-up as our customers adopt and expand their use and find new and interesting ways of using our platform, APIs and integrations. We monitor all aspects of our systems and application performance and take the quality of the service very seriously. The ever-evolving patterns of use mean we are always tuning and optimising our systems. We are currently restructuring our platform as we transition to a 100% microservices architecture; a fundamental function of such an architecture is service discovery, and to properly facilitate that we have had to roll out a new TCP addressing scheme. In parallel with that we are periodically introducing new compute resources (servers, storage etc.) to ensure we can take advantage of the latest hardware and get optimum performance. It is natural for a service provider like us to make these types of changes, and it is generally completely transparent to our customers. However, the overall root cause of the issues was ultimately a combination of unexpected individual outcomes of the work we are doing in these areas. In summary, the issues would have affected around 5% of our EU customers for a total of two hours, during which time they would have seen intermittent errors or slow performance (around 2% would have experienced both), so while on the face of it this was all the same problem, there were actually several completely separate issues that looked like the same thing.

    HTL INCIDENT REPORT 131120170001
    At 14:00 on Monday 13th Nov 2017 our monitoring systems detected slow queries on one of our primary database servers. The root cause was found to be exceptional disk load on its hypervisor due to the migration of some services from one hypervisor to another. Under normal circumstances a migration of this nature would be fine and quite fast, but there was also a high load (IOPS) on the underlying storage, which slowed things right down. It was unwise to terminate the migration once started, so we made the decision to let it complete. The migration completed at 15:48 and all monitors reported normal. We do not migrate services like this every day, but the constant storage writes, and the impact they had on the performance of the service for some customers, were unexpected. We have now changed our process so that any virtual machine with either a disk over 100GB or high IOPS will not be moved during core hours for its physical location. We have good statistical information about when services are busy and quiet, so this is easy to plan for.

    HTL INCIDENT REPORT 141120170001
    At 09:45 on Tuesday 14th Nov 2017 a subset of customers started to report failures caused by some SQL queries failing. It was the same set of customers impacted by 131120170001, and the cause was zombied database connections. In our platform, all SQL queries are run over HTTP using our own in-house developed HTTP <-> MySQL proxy. The proxy handles connection pooling, reuse and database security. As a result of the migration of services and the IP changes, our data service proxy ended up with TCP connections that it thought were still active but were not. A simple restart of the data proxies resolved this, although it took us a few minutes to figure out what was happening; the data services on the affected nodes were restarted and all connections were restored by 09:56. Post-incident, we are making code changes to the data service proxies to deal with unexpected zombie connections under these specific circumstances (the sketch at the end of this post illustrates the general idea). This fix will be rolled out in the next week or so.

    HTL INCIDENT REPORT 141120170002
    At 17:20 on Tuesday 14th Nov 2017 monitoring detected slow queries on one of our database servers. The root cause was found to be a virtual disk lock applied by XFS, which was preventing subsequent writes, so all queries were stacked as pending in memory. The system was performing slowly because it was reaching memory limits and swapping. Once we recognised the problem we cleared the lock at 17:26 and all monitors reported normal. The lock was caused, in part, by the original issue on Monday and the creation of the large delta file for the migrated virtual machine. Post-incident, a new monitoring check has been added to specifically watch the XFS cache/locks and alert on any issues before they can have the kind of impact seen here.

    HTL INCIDENT REPORT 151120170001
    At 19:01 on Thursday 16th Nov 2017 customers reported that new login requests were unsuccessful; existing connections remained functional. The root cause was found to be related to the IP address changes made earlier in the day, which went undetected until the DNS cache expired around 19:00. There was a reference in our frontend application code to a legacy API endpoint name that we changed some months ago, and the new IP scheme did not include this legacy endpoint. Post-incident we have reduced the TTL on all internal DNS records and have reviewed our change process, adding the step that any IP change requires a corresponding DNS review. We are not expecting a recurrence of this type of problem because we have no expectation that there will ever be a requirement to change our IPv4 addressing scheme in the future.

    We apologise unreservedly for any inconvenience caused during this week, and we will continue to do everything in our power to ensure we do not see a repeat of the above issues. The nature of IT means these sorts of things happen from time to time, and when they do we take them very seriously and worry about them so our customers don't ever have to.
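    The zombie-connection issue described in incident 141120170001 (and the stale-connection issue in post 15 below) can be illustrated with a minimal sketch of a connection pool that validates a MySQL connection before reusing it, so a connection the operating system has silently dropped is discarded instead of handed out. This is not Hornbill's proxy code; the pool class, the ping-based check and the placeholder credentials are assumptions for illustration only.

```python
# Minimal sketch (not Hornbill's data service proxy): a pool that checks
# each pooled MySQL connection before handing it out, so "zombie"
# connections left behind by network/IP changes are dropped, not reused.
import queue

import mysql.connector
from mysql.connector import Error


class ValidatingPool:
    def __init__(self, size, **conn_args):
        self._conn_args = conn_args          # host/user/password are placeholders
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(mysql.connector.connect(**conn_args))

    def acquire(self):
        while True:
            try:
                conn = self._idle.get_nowait()
            except queue.Empty:
                # Pool exhausted: open a fresh connection instead.
                return mysql.connector.connect(**self._conn_args)
            try:
                # is_connected() pings the server; a zombie connection
                # fails the ping and is simply discarded.
                if conn.is_connected():
                    return conn
            except Error:
                pass
            conn.close()

    def release(self, conn):
        self._idle.put(conn)


# Usage (placeholder credentials):
# pool = ValidatingPool(4, host="db.example.internal", user="svc", password="...")
# conn = pool.acquire()
# try:
#     cur = conn.cursor(); cur.execute("SELECT 1"); cur.fetchall()
# finally:
#     pool.release(conn)
```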
  4. email issue

    Giuseppe, Thanks for the reply. From the logs we can see that this stopped working at 00:10 this morning. Can you confirm that you can log in to the Outlook portal as servicedesk@datalogic.com (rather than logging in as another user and switching to the mailbox)? If so, re-entering the password on the mail connector properties in Hornbill Admin might resolve this. Kind regards Keith Stevenson
  5. email issue

    Giuseppe, Thanks for the post. The error message states that Outlook cannot verify the login credentials supplied. This may occur if the account has changed or the password has expired. This is a Microsoft Office 365 account error; you should be able to confirm this with your Office 365 admin. Kind regards
  6. Issues all over the place since last update

    Lyonel, Can you confirm that this is now working again? Kind Regards
  7. Unable to Access HB

    All, At around 10:35 this morning, during a routine update to a backend server, an error occurred in our subscription checker. This caused a number of instances to temporarily lose the count of available subscriptions for a subset of applications. The error was quickly identified and resolved by 10:37 (the change was backed out as per the change request plan). This type of backend update happens around once a week as we continuously develop and enhance our service, and this is the first time such an error has been witnessed; the issue was not seen in our development or beta environments during testing. In order to prevent issues like this in future we are already looking at putting additional tests in place. Kind Regards
  8. All, This has now been resolved. Please check and confirm. We will provide a full explanation shortly. Kind Regards Keith Stevenson
  9. Extremely High Bandwidth from Mailbox to Hornbill

    Micheal, Thanks for the reply. Your instance is configured for us to download email from your internal mail server via IMAP, so we would be connecting and pulling, and uploading nothing (I suppose your router may count the commands we use to connect to and view the IMAP mailbox, and to download anything it finds, as uploads, but that's going to be far less than 37.4GB - probably less than 10MB for the entire day). Can you provide the report as a private message? Perhaps it's the wording that's wrong? (A rough way to measure the traffic from a single IMAP poll is sketched below.) Kind Regards Keith Stevenson
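    To put the numbers above in context, here is a minimal sketch of measuring how many bytes a single IMAP poll actually transfers. The host, account, folder and credentials are placeholders, and this is not Hornbill's mail connector code; it simply fetches unread messages and sums their sizes.

```python
# Rough sketch: measure the bytes fetched by one IMAP poll.
# Host, credentials and folder names are placeholders.
import imaplib


def poll_and_measure(host, user, password, folder="INBOX"):
    total_bytes = 0
    imap = imaplib.IMAP4_SSL(host)
    try:
        imap.login(user, password)
        imap.select(folder)
        # Find unread messages, then fetch each one and count its size.
        _, data = imap.search(None, "UNSEEN")
        for num in data[0].split():
            _, msg_data = imap.fetch(num, "(RFC822)")
            for part in msg_data:
                if isinstance(part, tuple):      # (envelope info, raw message)
                    total_bytes += len(part[1])
    finally:
        imap.logout()
    return total_bytes


# Example (placeholder values):
# size = poll_and_measure("mail.example.com", "servicedesk@example.com", "secret")
# print(f"Downloaded {size / 1024:.1f} KB in this poll")
```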
  10. Extremely High Bandwidth from Mailbox to Hornbill

    Micheal, Thanks for the post. Can you confirm where those stats come from? Kind Regards
  11. unable to log on to Service Manager

    Prathmesh, Thanks for the confirmation. We noticed at 8:39 that one of our nodes was acting strangely and refusing connections. Investigation showed that this was the result of a deadlock on the backend database. The lock was cleared at around 8:48, and once that was done your instance could connect again. We are investigating the root cause and will add additional monitoring (along the lines of the check sketched below) to ensure that this does not occur again. Kind Regards Keith Stevenson
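    As an illustration of the kind of additional monitoring mentioned above (not Hornbill's actual tooling), a simple check can poll InnoDB's transaction view and flag transactions that have been running, and potentially holding locks, for too long. The connection details and the 30-second threshold below are assumptions.

```python
# Sketch: flag long-running InnoDB transactions that may be holding locks.
# Connection details and the threshold are placeholders.
import mysql.connector

LONG_RUNNING_SECONDS = 30


def find_stuck_transactions(conn):
    cur = conn.cursor(dictionary=True)
    cur.execute(
        """
        SELECT trx_id, trx_mysql_thread_id, trx_started, trx_query
        FROM information_schema.innodb_trx
        WHERE trx_started < NOW() - INTERVAL %s SECOND
        """,
        (LONG_RUNNING_SECONDS,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = mysql.connector.connect(host="db.example.internal",
                                   user="monitor", password="secret")
    for trx in find_stuck_transactions(conn):
        # A real monitor would raise an alert here; killing the offending
        # thread (KILL <trx_mysql_thread_id>) remains a manual decision.
        print(f"Long-running trx {trx['trx_id']} "
              f"(thread {trx['trx_mysql_thread_id']}): {trx['trx_query']}")
    conn.close()
```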
  12. Hornbill Down?

    All, Following the above we have now made changes to the Analytics engine and the Admin tool used to manage it (preventing multiple joins that would return an extraordinary number of records - the kind of safeguard sketched below). This is currently on our beta testing stream and, once tested, will be rolled out to all. We are also changing the way services handle large data sets (in excess of 30GB) and adding additional self-correcting mechanisms to avoid issues (these will be rolled out as part of our continuous delivery over the weeks ahead). Looking even further ahead, we already have plans to overhaul reporting in a way that will drastically reduce the impact of any large data sets. Kind Regards Keith Stevenson
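    The join safeguard described above can be sketched, purely as an illustration and not as Hornbill's implementation, by asking the database for an estimated row count before executing a user-defined report and refusing or capping queries whose estimate is too large. The thresholds, table names and connection details below are assumptions.

```python
# Sketch: guard a free-form report query against runaway joins by checking
# MySQL's EXPLAIN row estimates and enforcing a hard LIMIT.
# Thresholds and connection details are placeholders.
import mysql.connector

MAX_ESTIMATED_ROWS = 1_000_000   # refuse reports estimated above this
HARD_ROW_LIMIT = 10_000          # cap what any report may return


def run_report(conn, report_sql):
    cur = conn.cursor(dictionary=True)

    # EXPLAIN returns one row per joined table; the product of the
    # per-table "rows" estimates approximates the work the join will do.
    cur.execute("EXPLAIN " + report_sql)
    estimate = 1
    for row in cur.fetchall():
        estimate *= max(row["rows"] or 1, 1)
    if estimate > MAX_ESTIMATED_ROWS:
        raise ValueError(f"Report rejected: ~{estimate} rows estimated")

    # Even an accepted report is capped so it cannot exhaust server memory.
    cur.execute(f"SELECT * FROM ({report_sql}) AS report LIMIT {HARD_ROW_LIMIT}")
    return cur.fetchall()


# Usage with a hypothetical report query:
# conn = mysql.connector.connect(host="db.example.internal", user="reports", password="...")
# rows = run_report(conn, "SELECT r.id, a.name FROM requests r "
#                         "JOIN accounts a ON a.id = r.owner_id")
```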
  13. Hornbill Down?

    All, We have just witnessed the issue again (same subset of customers). This occurrence of the issue has now been resolved. The root cause is that a set of temporary resources was exhausted, causing a bottleneck whilst other requests waited; the fix was to clear the block. This issue has become apparent due to a recent addition to the Service Manager application, advanced analytics, which allows the creation of free-form reports and joins. If these are not carefully crafted they can create unexpectedly large result sets. We are working to identify any instance that may suffer from this issue and will be contacting the primary/secondary contacts for each instance to provide guidance on report creation whilst we investigate a more permanent solution. We apologise for any inconvenience. Kind Regards Keith Stevenson
  14. Hornbill Down?

    All, Thanks for the posts. We can inform you that our monitoring system detected an issue at 13:05 which affected a subset of instances. The root cause was identified and steps were taken to resolve it at 13:08; however, it was then found that this required a restart of one of the backend servers, which took approximately 5 minutes. All services were fully restored at 13:18. We are still investigating the root cause and will shortly provide a further update with our analysis and the steps we will undertake to ensure this issue cannot happen again. Kind regards
  15. Failed to initialize application - operation timed out

    Alex, Glad to hear that this is resolved. We have now identified the root cause (an issue with stale database connections being reused rather than new ones being created) and will change our code to ensure that this doesn't happen again. We will also add additional monitoring checks to ensure that, if such an event occurs in the future, we are notified (and can take action) before it becomes an issue for you; a simple version of that kind of check is sketched below. Kind regards Keith Stevenson
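    The sort of monitoring check mentioned above can be sketched as a periodic synthetic test that runs a trivial query through the database path and raises an alert on failure or high latency. The endpoint, thresholds and alerting mechanism are placeholders; this is illustrative only, not Hornbill's monitoring.

```python
# Sketch: periodic synthetic check of the database path, alerting on
# failures or slow responses. Endpoint, thresholds and the alert hook
# are placeholders.
import time

import mysql.connector
from mysql.connector import Error

CHECK_INTERVAL_SECONDS = 60
LATENCY_THRESHOLD_SECONDS = 2.0


def alert(message):
    # Placeholder: a real monitor would page an on-call engineer here.
    print(f"ALERT: {message}")


def check_once(conn_args):
    start = time.monotonic()
    try:
        conn = mysql.connector.connect(connection_timeout=5, **conn_args)
        cur = conn.cursor()
        cur.execute("SELECT 1")     # trivial query through the full data path
        cur.fetchall()
        conn.close()
    except Error as exc:
        alert(f"database check failed: {exc}")
        return
    elapsed = time.monotonic() - start
    if elapsed > LATENCY_THRESHOLD_SECONDS:
        alert(f"database check slow: {elapsed:.2f}s")


if __name__ == "__main__":
    conn_args = {"host": "db.example.internal", "user": "monitor", "password": "secret"}
    while True:
        check_once(conn_args)
        time.sleep(CHECK_INTERVAL_SECONDS)
```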