Michael Sharp Posted August 14, 2017
Please see below message when trying to log into Hornbill - this is affecting all users. Please can this be addressed urgently? Cheers, Mike.
Michael Sharp Posted August 14, 2017
This seems to have now remedied itself....
Guest Paul Alexander Posted August 14, 2017
We've just had the same problem.......it's rectified now but we all got kicked out for a while.
nasimg Posted August 14, 2017
Yes, we had the same thing occur - it's "fixed" itself too. Would appreciate an update from Hornbill.
Alex8000 Posted August 14, 2017
Someone must've pushed the big red button! Good that everything is back up with only ~2 min downtime.
Victor Posted August 14, 2017
@Michael Sharp @Paul Alexander @nasimg @Alex8000 the brief downtime was caused by a minor infrastructure change which, due to an error, affected live instances for about 2-3 minutes. It was corrected almost immediately by our infrastructure team. Apologies for any inconvenience caused.
Keith Stevenson Posted August 14, 2017
All, at around 10:35 this morning, during a routine update to a backend server, an error occurred in our subscription checker. This caused a number of instances to temporarily lose the count of available subscriptions for a subset of applications. The error was quickly identified and resolved by 10:37 (the change was backed out as per the change request plan). The type of update that occurred on the backend server happens around once a week as we continuously develop/enhance our service, and this is the first time an error has been witnessed. The issue was not seen in our development or beta environments during testing. In order to prevent issues like this in future, we are already looking at putting additional tests in place. Kind regards
Michael Sharp Posted August 14, 2017
Hi @Keith Stevenson, might I suggest that your infrastructure and the product are updated outside of normal business hours? I imagine the platform is critical for your customers' support teams and would have thought this is an easy mitigation? As a firm, for example, we commit all business-critical changes during evenings and weekends to avoid any potential loss of productivity. Fortunately on this occasion, the roll back was relatively quick....! Regards, Mike.
Steve Giller Posted August 14, 2017
@Michael Sharp - but for which time zone? It's a global 24/7 product, there are no "business hours"
Michael Sharp Posted August 14, 2017
@DeadMeatGF ideally carried out at 00:00 BST which would avoid critical business hours for the vast majority of firms globally......
Gerry Posted August 14, 2017
@Michael Sharp The change was, on the face of it, an innocuous change to allow ETags to be served for configuration information; it should not have had any impact on production systems based on our assessment of the change before it was applied. The fact that it did have an impact is a problem, because there is now something about the way our platform is functioning that we do not properly understand. We will of course now investigate this and determine why this happened. The essence of the problem is this - each instance reads its internal configuration from one of two web services (we have two for redundancy), and for some reason the change affected our application code's ability to read its configuration data. Now this system is designed to be fault tolerant, inasmuch as in the event of being unable to read configuration data, it should continue with the configuration data it already has in memory. Clearly this did not happen, so this has actually highlighted another problem. Rest assured we will investigate, get to the bottom of why this is, and do something about it. So I want to assure you, we do not make changes during production times without considering risk; we were not expecting any problems, and for whatever reason we got that wrong this time, so we can only apologise. But I would like to take this opportunity to say that we make many changes a week as we evolve and improve our service for our customers. We get it right almost all of the time; it is just that when we do get it wrong, it's very visible. As a point of interest and of transparency, I will post back here with an update once we have done a post-mortem and figured out what went wrong. Gerry
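The fault-tolerant configuration read Gerry describes - try each redundant service in turn, and carry on with the copy already in memory if every read fails - can be sketched roughly as follows. This is a minimal illustration, not Hornbill's actual code; the `ConfigReader` class, endpoint names and the injected `fetch` callback are all assumptions:

```python
class ConfigReader:
    """Reads instance configuration from redundant endpoints, falling
    back to the last copy held in memory if every endpoint fails."""

    def __init__(self, endpoints, fetch):
        self.endpoints = endpoints  # redundant configuration services
        self.fetch = fetch          # fetch(url) -> dict, raises OSError on failure
        self.cached = None          # last-known-good configuration

    def read(self):
        for url in self.endpoints:
            try:
                self.cached = self.fetch(url)
                return self.cached
            except OSError:
                continue  # endpoint unreachable; try the next one
        if self.cached is not None:
            # Fault-tolerant path: keep running on in-memory configuration.
            return self.cached
        raise RuntimeError("no configuration available and nothing cached")
```

The incident suggests it was this fallback path (returning `self.cached` when both endpoints fail) that did not behave as designed.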
Gerry Posted August 14, 2017
A quick update. We have now identified the root cause of the problem, which was a configuration problem that caused the configuration data to be unavailable to our application services. This in itself should not have caused production systems to fail; however, this condition showed up another, previously unknown problem. We have now identified this primary problem and are currently adding monitoring to these services for this particular condition, so our systems will identify the problem automatically should it happen in the future. However, in the case of the secondary problem, where the configuration information checksum data is unavailable, the application servers were also throwing an error. This was a design issue, as the code should have been far more graceful in its handling of this condition. As a result, we have now added code that will ensure that application services allow 8 hours of grace time should any configuration information be inaccessible. This was a problem waiting to happen, as the condition and the subsequent behaviour were previously un-encountered and therefore unknown to us, which means it would have caught us out at some point. The change will be verified, tested and rolled out into production over the next 72 hours, ensuring that production instances will no longer be affected by our internal configuration services having a problem. You have to love continuous deployment and the speed with which we can identify, solve and release solutions. Once again our apologies for any impact caused; even though it was less than two minutes of downtime, it still hurts our pride. Gerry
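The 8-hour grace window described here - keep serving the last-known configuration for a bounded time after the configuration service becomes unreachable, and only fail once that window is exceeded - might look something like this minimal sketch. The class and field names are hypothetical, not the platform's actual code:

```python
import time

GRACE_SECONDS = 8 * 60 * 60  # the 8-hour grace window described above


class GracefulConfig:
    """Serves the last-known configuration for a grace period after
    the configuration service becomes unreachable."""

    def __init__(self, clock=time.time):
        self.clock = clock       # injectable clock, for testing
        self.config = None
        self.last_success = None  # time of last successful read

    def update(self, new_config):
        """Record a successful configuration read."""
        self.config = new_config
        self.last_success = self.clock()

    def current(self):
        """Return the configuration, tolerating outages up to the grace window."""
        if self.config is None:
            raise RuntimeError("never received a configuration")
        age = self.clock() - self.last_success
        if age > GRACE_SECONDS:
            raise RuntimeError("configuration stale beyond 8-hour grace period")
        return self.config
```

The design choice is that a transient configuration-service outage becomes invisible to production traffic, while a prolonged outage still surfaces as an error rather than serving arbitrarily stale data.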
Michael Sharp Posted August 14, 2017
Thanks @Gerry, great to know Hornbill has a road map for future upgrades. Regards, Mike.
Michael Sharp Posted August 14, 2017
Hi Gerry, Fantastic news, and also refreshing to see such a quick turnaround from a vendor. My issues weren't really in relation to today's outage, more the perceived presumption that your systems are immune to the unexpected! Regards, Mike.
Gerry Posted August 15, 2017
@Michael Sharp Our systems are not immune to the unexpected, but we aim to minimise the unexpected always - and we don't get everything right. The strategy, though, is to have enough command over our systems that should we get something wrong we can fix it very quickly; that's what enables us to move forward quickly - in Agile terms it's called "failing forwards". You might also like to know that, from a platform point of view, when it comes to defects that have a material impact on production we do not even have a concept of logging a defect in the traditional sense - there is no queue or backlog, we fix it there and then. A fix generally includes the following steps:
- Diagnose based on the post-mortem; we aim to understand the problem without the need to reproduce it.
- With the understanding of the problem gained, we do a code review and a code fix, and in parallel to this happening...
- We make a test case which fails on the current (pre-fix) code stream.
- We then build and push to 'test', where our tests run and the new code is verified to pass.
- And we loop around if we find any problem, either with the fix we applied or any other problem we inadvertently caused.
In the core code (C++) domain, once we fully understand the problem, we can be out to test with a fix often within 30 minutes or so. Our standard practice, once all tests are passing on our test stream, is 24 hours on 'dev', 48 hours on 'beta' and then we push to live. This pipeline is continuously rolling most of the time, and in practice we have multiple fixes per build too. We will also from time to time, maybe 4-5 times a year, do a hot-fix, where a very specific single code change is pushed straight into production, but that's a lot of hard work and a far more manual process, so we avoid this scenario at all costs.
The one other thing we do is we never branch code into production; that means we *always* build from trunk, and trunk is always buildable and releasable. If we do introduce a problem along the way, we fix and push out rather than roll back, in almost every single case. So I hope you can see, we don't make any assumptions about the robustness of our platform, but we take providing a high-quality, robust service very seriously. Gerry
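The soak times Gerry mentions (24 hours on 'dev', then 48 hours on 'beta', before a push to live) amount to a simple promotion gate. A rough sketch of that rule, with the stage names taken from the post but the `Build` class and its API purely illustrative:

```python
import time

# Minimum soak times before a build may leave each stage, per the
# pipeline described above. The dict keys and STAGES list mirror the
# stage names in the post; everything else is an assumption.
SOAK_HOURS = {"dev": 24, "beta": 48}
STAGES = ["test", "dev", "beta", "live"]


class Build:
    """Tracks a build moving through test -> dev -> beta -> live."""

    def __init__(self, clock=time.time):
        self.clock = clock          # injectable clock, for testing
        self.stage = "test"
        self.entered = self.clock()  # when the build entered this stage

    def hours_in_stage(self):
        return (self.clock() - self.entered) / 3600

    def promote(self):
        """Advance to the next stage, but only once the current stage's
        minimum soak time has elapsed."""
        required = SOAK_HOURS.get(self.stage, 0)
        if self.hours_in_stage() < required:
            raise RuntimeError(
                f"build has only soaked {self.hours_in_stage():.0f}h "
                f"on '{self.stage}'; needs {required}h")
        self.stage = STAGES[STAGES.index(self.stage) + 1]
        self.entered = self.clock()
```

A hot-fix, in these terms, would be skipping the gate entirely, which is why the post treats it as a rare, costly exception.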