
CRITICAL - Issue after automated upgrade to build 2892


Keith


Hi Chris,

It won't be strictly related to when the ticket was logged, although that can be a good reference. It's more about the time when the BPM tried to perform the operation that failed. Therefore, if the BPM operation took place before the fix was applied, the BPM will fail.

If the BPM can be successfully restarted, that indicates the underlying cause has been addressed. If you are still experiencing problems with more recent Change requests, this may indicate that an issue still exists.

I hope that helps,
Dan


3 hours ago, chrisnutt said:

Hi @Gerry, I spoke too soon.

Just been notified of one, it was raised just after 1pm today. I guess that it will happen on any Change raised prior to the fix being deployed?

Chris

Hi Chris,

Sorry, just seen this. Did that reoccur, or was it just that one? If you let us have the request reference, we can investigate and see if there is anything we have missed. We have been monitoring today and there do not appear to be any further issues, so as you say, that might just have been one logged before the last update.

Gerry


Following on from the issues above I wanted to give you an update on the issue and a breakdown of the root cause, lessons learned and subsequent actions.

The problem experienced today was caused by an internal server change we made in relation to JSON serialisation. Specifically, this relates to both Flowcode and BPM, both of which use the V8 JavaScript engine under the hood for processing. Most of our data structures are represented in XML, and in order to be JavaScript compatible we transform XML structures to JSON. There are some areas of our code that use a “less than optimum” method of transformation, and we have been slowly working our way through changing this to a better, more performant and standardised way of exposing data structures to JSON. Unfortunately, in our application development endeavours we have “taken advantage” of the incompatible elements of this implementation, so when we fixed the platform serialisation, it caused the Service Manager application to have the issues we experienced today.
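To illustrate the kind of incompatibility described above, here is a minimal Python sketch. It is not the platform's actual code: the two converter functions, the `<request>` document, and the `has_impact` caller are all hypothetical, and the specific difference shown (dropping empty elements versus emitting them as empty strings) is only an assumed example of how two XML-to-JSON conventions can disagree.

```python
import xml.etree.ElementTree as ET


def legacy_to_json(elem):
    """Hypothetical 'legacy' convention: empty elements are dropped entirely."""
    out = {}
    for child in elem:
        if len(child):  # element has children of its own
            out[child.tag] = legacy_to_json(child)
        elif child.text and child.text.strip():
            out[child.tag] = child.text.strip()
    return out


def standard_to_json(elem):
    """Hypothetical 'standardised' convention: every element becomes a key."""
    out = {}
    for child in elem:
        if len(child):
            out[child.tag] = standard_to_json(child)
        else:
            out[child.tag] = (child.text or "").strip()
    return out


def has_impact(record):
    # Caller written against the legacy shape: it relies on the key being absent.
    return "impact" in record


# Illustrative document with an empty optional element.
root = ET.fromstring("<request><summary>New laptop</summary><impact/></request>")
legacy = legacy_to_json(root)
standard = standard_to_json(root)

print(legacy, has_impact(legacy))
print(standard, has_impact(standard))
```

Application code that depends on the quirk of the first converter changes behaviour as soon as the second one is swapped in, even though the second is arguably the more correct output.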

To make things worse, we have comprehensive tests that would have covered this and caught the problem before it left our beta stream. As a general rule, platform updates always run ahead of application updates, and we test all applications against live to ensure compatibility with everything in production. However, recently we had to hold back a couple of platform updates for various reasons, so today our release streams got out of sync, so much so that our Service Manager release pipeline was running ahead of our platform release pipeline. The result was that our regression tests missed this problem, which ultimately made it to live.

The problem did not impact all of our customers; somewhere around 30% of instances were affected. It was triggered only under certain conditions, when certain optional data was not present, so it was highly dependent on individual customers' configurations and business processes. Under normal circumstances we catch these things before they ever get to live, and the odd one that does get away we generally identify very quickly. But the nature of this problem meant we were initially unable to re-create it. It took us the best part of 2 hours to identify and fully understand the issue, and another hour to create and test hotfixes before deploying.

We have obviously learned a number of critical lessons here and are making the following changes to our own processes:

  • We are introducing parallel automated testing against both “beta” and “live” streams for Service Manager development.
  • We are reviewing all Service Manager application code to search out any other potential occurrences of this problem.
  • There are a number of other places in the platform where this legacy JSON serialisation is happening. We are going through each area and formulating a plan of action to re-factor these. By eliminating these incompatibilities we will significantly reduce the likelihood of such a problem reoccurring in the future.
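As a rough sketch of what testing against both streams in parallel could look like, the Python below runs the same checks against two stand-in serialisers. Everything here is hypothetical (the `beta_serialise` and `live_serialise` functions, the checks, and the record shape); it only illustrates the idea that a check passing on one stream can fail on the other.

```python
# Hypothetical stand-ins for the serialisation behaviour on each release stream.
def beta_serialise(record):
    # Assumed new behaviour: missing optional values become empty strings.
    return {k: ("" if v is None else v) for k, v in record.items()}


def live_serialise(record):
    # Assumed old behaviour: missing optional values are dropped.
    return {k: v for k, v in record.items() if v is not None}


STREAMS = {"beta": beta_serialise, "live": live_serialise}
SAMPLE = {"summary": "New laptop", "impact": None}


def check_defensive_read(serialise):
    # Application code that tolerates both shapes.
    return serialise(SAMPLE).get("impact", "") == ""


def check_brittle_read(serialise):
    # Application code that relies on the key being absent.
    return "impact" not in serialise(SAMPLE)


def run_against_all_streams(checks, streams):
    """Run every check on every stream and collect 'stream:check' failures."""
    failures = []
    for stream_name, serialise in streams.items():
        for check in checks:
            if not check(serialise):
                failures.append(f"{stream_name}:{check.__name__}")
    return failures


failures = run_against_all_streams([check_defensive_read, check_brittle_read], STREAMS)
print(failures)
```

Running the suite against only one of the two streams would report no failures for the brittle check on that stream, which is the gap that testing both is meant to close.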

We do have a lot of automated testing and, in line with our policy, now that we have seen this error we will of course be expanding our automated testing further to cover the scenarios we witnessed today and any other potential scenarios we are able to envisage.

We have been able to make really great progress with our platform and applications because of our ability to keep pushing forwards. It is natural to question the validity of a continuous delivery model in light of such a failure, but I also think it is incredibly important that we continue to push forward, delivering the features and changes you ask for in order to support your objectives. Finding the right balance here is challenging, and although today's mishap was painful, I want to remind everyone here that we have also deployed many hundreds of trouble-free updates. I hope the overall benefits you get will far outweigh the odd failure we experience.

I apologise unreservedly for the disruption caused to your morning. This is not the standard we set for ourselves, or the Monday morning we aspire to; we care deeply about the quality of the service we are able to provide. I want to thank those of you who were impacted for your feedback and for giving us the space we needed to implement the quickest possible resolution.

Gerry


15 hours ago, Gerry said:

Hi Chris,

Sorry, just seen this. Did that reoccur, or was it just that one? If you let us have the request reference, we can investigate and see if there is anything we have missed. We have been monitoring today and there do not appear to be any further issues, so as you say, that might just have been one logged before the last update.

Gerry

Hi @Gerry

Just that one so far. Restarting the process worked, so we should be OK from here on out. Fingers crossed.

Chris


On 3/12/2018 at 10:29 AM, Paul Alexander said:

@BenKeeley

Have you restarted the affected tickets (by pressing the 'retry' icon next to the HUD)?

I think you need to have the admin role to see this though? I'm sure you have...but it's always worth asking!


@BenKeeley do you know how to show the retry button next to the HUD? I can't see it on mine.

Thanks

 
