
Unable to Access HB


Michael Sharp


All

At around 10:35 this morning, during a routine update to a backend server, an error occurred in our subscription checker. This caused a number of instances to temporarily lose the count of available subscriptions for a subset of applications.

The error was quickly identified and resolved by 10:37 (the change was backed out as per the change request plan).

This type of update to the backend server happens around once a week as we continuously develop and enhance our service, and this is the first time an error of this kind has been witnessed. The issue was not seen in our development or beta environments during testing.

In order to prevent issues like this in future, we are already looking at putting additional tests in place.

 

Kind Regards



Hi @Keith Stevenson, might I suggest that your infrastructure and the product are updated outside of normal business hours? I imagine the platform is critical for your customers' support teams, and I would have thought this is an easy mitigation to put in place.

As a firm, for example, we commit all business-critical changes during evenings and weekends to avoid any potential loss of productivity. Fortunately, on this occasion, the rollback was relatively quick!

Regards,

Mike.


@Michael Sharp

The change was, on the face of it, an innocuous one to allow ETags to be served for configuration information; based on our assessment before it was applied, it should not have had any impact on production systems. The fact that it did have an impact is a problem, because there is now something about the way our platform is functioning that we do not properly understand. We will of course now investigate this and determine why it happened and what we don't yet understand.
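
For context, serving ETags simply lets a client fetch configuration conditionally: it presents the ETag it last saw, and the server replies 304 Not Modified if nothing has changed. A minimal sketch of that mechanism, using a hypothetical endpoint and ETag value rather than our actual service, might look like this:

```cpp
// Minimal sketch of an ETag-based conditional GET for configuration data.
// The endpoint URL and cached ETag value are hypothetical; the point is the
// HTTP mechanics: send If-None-Match, and a 304 reply means the cached
// configuration is still current.
#include <curl/curl.h>
#include <iostream>
#include <string>

static size_t collect(char* data, size_t size, size_t nmemb, void* out) {
    static_cast<std::string*>(out)->append(data, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL* curl = curl_easy_init();
    if (!curl) return 1;

    std::string body;
    curl_easy_setopt(curl, CURLOPT_URL, "https://config.example.com/instance/config");
    // Present the ETag cached from the previous successful fetch.
    curl_slist* headers = curl_slist_append(nullptr, "If-None-Match: \"abc123\"");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);

    if (curl_easy_perform(curl) == CURLE_OK) {
        long status = 0;
        curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &status);
        if (status == 304)
            std::cout << "Configuration unchanged; keep the cached copy\n";
        else if (status == 200)
            std::cout << "New configuration received (" << body.size() << " bytes)\n";
    }

    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}
```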

The essence of the problem is this: each instance reads its internal configuration from one of two web services (we have two for redundancy), and for some reason the change affected our application code's ability to read its configuration data. Now, this system is designed to be fault tolerant inasmuch as, in the event of being unable to read configuration data, it should continue with the configuration data it already has in memory. Clearly this did not happen, so this has actually highlighted another problem. Rest assured, we will investigate, get to the bottom of why this is, and do something about it.
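
To show the intended fault-tolerant behaviour in outline (the class, function and URLs below are illustrative assumptions, not our actual code): try the primary configuration service, fall back to the secondary, and if both are unreachable carry on with the last copy held in memory.

```cpp
// Sketch of reading configuration from redundant services with an
// in-memory last-known-good fallback.
#include <optional>
#include <string>
#include <vector>

// Stand-in for the real HTTP fetch; for this sketch, pretend the service is unreachable.
std::optional<std::string> fetchConfig(const std::string& /*serviceUrl*/) {
    return std::nullopt;
}

class ConfigCache {
public:
    // Returns the freshest configuration available without failing just
    // because the configuration services are temporarily unreachable.
    const std::string& refresh() {
        for (const auto& url : services_) {
            if (auto fresh = fetchConfig(url)) {
                cached_ = *fresh;           // update the in-memory copy
                return cached_;
            }
        }
        return cached_;                     // both services down: last known good
    }

private:
    std::vector<std::string> services_{
        "https://config1.example.com/config",   // primary (hypothetical)
        "https://config2.example.com/config"};  // secondary (hypothetical)
    std::string cached_;                        // last successfully read configuration
};
```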

So I want to assure you that we do not make changes during production hours without considering risk. We were not expecting any problems, and for whatever reason we got that wrong this time, so we can only apologise. I would also like to take this opportunity to say that we make many changes a week as we evolve and improve our service for our customers. We get it right almost all of the time; it is just that when we do get it wrong, it is very visible.

As a point of interest, and in the interests of transparency, I will post back here with an update once we have done a post-mortem and figured out what went wrong.

Gerry


A quick update. 

We have now identified the root cause of the problem: a configuration issue that caused the configuration data to be unavailable to our application services. This in itself should not have caused production systems to fail; however, this condition exposed another, previously unknown problem. We have now identified this primary problem and are currently adding monitoring to these services for this particular condition, so our systems will identify the problem automatically should it happen again in the future.
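
As a rough sketch of the kind of monitoring we mean (illustrative only, not our actual monitoring code), a probe that alerts after a few consecutive failed reads of the configuration service is enough to surface the condition automatically:

```cpp
// Illustrative availability probe for a configuration service.
#include <chrono>
#include <iostream>
#include <string>
#include <thread>

// Stand-in for a real reachability check; for this sketch, pretend the service is down.
bool configServiceReachable(const std::string& /*url*/) { return false; }

void monitorConfigService(const std::string& url) {
    int consecutiveFailures = 0;
    const int alertThreshold = 3;          // alert after three failed checks
    for (int i = 0; i < 5; ++i) {          // bounded for the sketch; production would run continuously
        if (configServiceReachable(url)) {
            consecutiveFailures = 0;
        } else if (++consecutiveFailures >= alertThreshold) {
            // A real system would raise an incident / page an operator here.
            std::cerr << "ALERT: " << url << " unreachable for "
                      << consecutiveFailures << " consecutive checks\n";
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}

int main() { monitorConfigService("https://config1.example.com/config"); }
```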

In the case of the secondary problem, where the configuration information checksum data is unavailable, the application servers were also throwing an error. This was a design issue, as the code should have been far more graceful in its handling of this condition. As a result, we have now added code to ensure that application services will allow 8 hours of grace time should any configuration information be inaccessible. This was a problem waiting to happen, as the condition and the subsequent behaviour were previously unencountered and therefore unknown to us, which means it would have caught us out at some point. The change will be verified, tested and rolled out into production over the next 72 hours, ensuring that production instances will no longer be affected by our internal configuration services having a problem.
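
In outline, the grace-time behaviour looks something like the following sketch (the names and structure are assumptions for illustration, not the real implementation): keep running on the configuration already held, and only treat the missing checksum as an error once the 8-hour window has passed.

```cpp
// Sketch of an 8-hour grace window for unavailable configuration checksums.
#include <chrono>
#include <optional>
#include <string>

using Clock = std::chrono::steady_clock;

// Stand-in for reading the configuration checksum from the config service;
// an empty result means the checksum data is currently unavailable.
std::optional<std::string> readConfigChecksum() { return std::nullopt; }

class ConfigState {
public:
    // True while the service should carry on with its current configuration.
    bool withinGrace() {
        if (auto checksum = readConfigChecksum()) {
            lastChecksum_ = *checksum;
            lastSuccess_ = Clock::now();    // a successful read resets the window
            return true;
        }
        // Checksum unavailable: tolerate the condition for up to 8 hours.
        return Clock::now() - lastSuccess_ < std::chrono::hours(8);
    }

private:
    Clock::time_point lastSuccess_ = Clock::now();
    std::string lastChecksum_;
};
```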

You have to love continuous deployment and the speed with which we can identify, solve and release solutions. Once again, our apologies for any impact caused; even though it was less than two minutes of downtime, it still hurts our pride :)

Gerry

 



Hi Gerry,

Fantastic news, and also refreshing to see such a quick turnaround from a vendor. My issues weren't really in relation to today's outage, more the perceived presumption that your systems are immune to the unexpected!

Regards,

Mike.


@Michael Sharp

Our systems are not immune to the unexpected, but we always aim to minimise the unexpected - and we don't get everything right. The strategy is to have enough command over our systems that, should we get something wrong, we can fix it very quickly; that's what enables us to move forward quickly - in Agile terms it's called "failing forwards".

You might also like to know that, from a platform point of view, when it comes to defects that have a material impact on production we do not even have a concept of logging a defect in the traditional sense - there is no queue or backlog; we fix it there and then. A fix generally includes the following steps:

  • Diagnose based on the post-mortem; we aim to understand the problem without needing to reproduce it
  • With that understanding of the problem, we do a code review and a code fix, and in parallel with this happening...
  • We write a test case which fails on the current (pre-fix) code stream (a minimal illustration follows below)
  • We then build and push to 'test', where our tests run and the new code is verified to pass
  • And we loop around if we find any problem, either with the fix we applied or any other problem we inadvertently caused
In the core code (C++) domain, once we fully understand the problem, we can often be out to test with a fix within 30 minutes or so. Our standard practice, once all tests are passing on our test stream, is 24 hours on 'dev', 48 hours on 'beta', and then we push to live. This pipeline is continuously rolling most of the time, and in practice we have multiple fixes per build too. From time to time, maybe 4-5 times a year, we might do a hot-fix where a very specific single code change is pushed straight into production, but that is a lot of hard work and a far more manual process, so we avoid this scenario at all costs.
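
To make the failing-test step in the list above concrete, here is a purely illustrative sketch; the function and values are invented for the example and are not our real code or test harness. The new test must fail against the pre-fix code and pass once the fix is in, so the defect cannot silently return.

```cpp
// Illustrative regression test written before (or alongside) the fix.
#include <cassert>

// Hypothetical function under test: it should fall back to the cached count
// when the configuration service is unreachable (the pre-fix code returned 0).
int availableSubscriptions(bool configServiceUp, int liveCount, int cachedCount) {
    return configServiceUp ? liveCount : cachedCount;  // post-fix behaviour
}

int main() {
    // This assertion fails on the pre-fix behaviour and passes after the fix.
    assert(availableSubscriptions(/*configServiceUp=*/false, /*liveCount=*/0,
                                  /*cachedCount=*/42) == 42);
    return 0;
}
```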
 
The one other thing we do is that we never branch code into production; that means we *always* build from trunk, and trunk is always buildable and releasable. If we do introduce a problem along the way, we fix and push out rather than roll back in almost every single case.
 
So I hope you can see that we don't make any assumptions about the robustness of our platform, but we take providing a high-quality, robust service very seriously.
 
Gerry
 
