MichelleReaney Posted December 5, 2022

Is anyone having similar issues this morning?
Alistair Young Posted December 5, 2022

Just now, MichelleReaney said: Is anyone having similar issues this morning?

So much.
Alistair Young Posted December 5, 2022

Hi all, my Service Manager instance is throwing the error message: 'Could not connect to instance'.
Victor Posted December 5, 2022

@Alistair Young one of our infrastructure nodes is having some issues (they are being addressed), and if I am not mistaken your instance is affected, I'm afraid... we're working on it.
Alistair Young Posted December 5, 2022

1 minute ago, Victor said: @Alistair Young one of our infrastructure nodes is having some issues (they are being addressed), and if I am not mistaken your instance is affected, I'm afraid... we're working on it.

@Victor thank you, Sir!
Steve Giller Posted December 5, 2022

@MichelleReaney @Alistair Young You have both been linked to our internal Request as Impacted. We have begun to implement the fix and hope to have an update for you shortly.
Alistair Young Posted December 5, 2022

Hi all, we're about 90 minutes in and there's a lot of sad-face being directed at me... Do we have a handle on how long we're looking at to recover our instance?
Steve Giller Posted December 5, 2022

Instances should now be restored and available. The recovery process is on its final step, which involves migrating physical files, so access to attachments/linked files (within emails, Requests, Timelines etc.) will become available over the next few hours. Access to the text content of requests, emails etc. should be back to normal. We will have a more detailed update for you later, and if you have any specific problems please let us know. We apologise for any inconvenience this has caused.
Deen Posted December 5, 2022

For any instances still affected by missing files, the recovery process is still ongoing and we expect this to be complete in the next 30 minutes or so. I will confirm here when this has been completed.
Deen Posted December 5, 2022

The recovery process is now complete and all instances have been fully restored. We will be in touch shortly with an RCA, and we apologise for the disruption.
Deen Posted December 7, 2022

For anyone affected, below is the root cause analysis for this disruption:

At 11:18 our monitoring detected unusual disk activity and high RAM usage on one of our NODEs (mdh-p01-node16). This NODE provides core services and file attachments (not database records) to 5 of our customers. Investigation showed that the high RAM usage was a direct result of slow disk access, so all actions were being affected. No obvious cause could be found; all underlying hardware was checked and no errors were detected.

At 11:39 a controlled shutdown of processes was started to try to identify which process was causing the unexpected disk usage. It became clear that even with nearly all services stopped (except Windows core functionality) the issue persisted, and all actions taken to recover the NODE were futile. Recovery was attempted for 30 minutes, and the DR plan was then initiated to restore the affected customer instances to other NODEs. As the timeline shows, all of our DR targets were achieved in less than half the RTO time (DR plan invoked after 30 minutes, emergency level of service within 2 hours, and restoration of key services within 5 hours).

Timeline
11:18 - Alerted by monitoring of high RAM; investigation began
11:19 - Notified HSML of high RAM; instances and customers still available
11:20 - Investigation of root cause
11:30 - Xen toolstack restart
11:39 - Controlled shutdown of non-critical processes within the VM
11:45 - Controlled shutdown of ESP services
11:50 - Instances and customers start to become unavailable
11:55 - All instances on the NODE unavailable
12:01 - Attempted to restart and correct Windows
12:02 - Informed HSML of the Windows restart
12:05 - Windows unavailable; failed to restart
12:15 - Continued Windows recovery; unable to progress
12:24 - Started DR planning
12:25 - Decision to build a new Node16 and migrate existing instances to NODE17/18
14:02 - All affected instances back up and running
16:52 - File restore 100% done

Total non-ER downtime (instance unavailable): from 1 hour 7 minutes to 2 hours
Total recovery: from 2 hours to 4 hours 52 minutes

Root Cause
Due to the nature of the issue, and the loss of any diagnostic logs that may exist in the VM we can no longer access, we are unsure of the root cause. However, the pattern of issues suggests a problem or corruption with the virtual disk containing the Windows system. Because the drives are encrypted, even a small corruption would cause large problems.

Further Planned/Required Action
Rebuild NODE16 and rebalance clusters - The new node already exists; the rebalance will be performed over the next few weeks.
Investigate the original failure - This will continue. We will not only be investigating the root cause (although, as above, with the most helpful logs being inside the VM we don't expect a hard answer), we will also be attempting to recover the NODE and its data, in the hope that should a similar issue occur in future we will be able to recover more quickly.
Storage servers - Already planned (the hardware is already in place and initial code changes have been made) is a move away from having storage (for file attachments etc.) local to the NODE, to dedicated storage servers acting in replicated pairs. With these in place, should we lose a NODE we would not need to restore the data from backup servers. (In this incident that would have meant that once an instance was recreated, everything was available immediately rather than after 2-4 hours.) A sketch of the idea is shown just below.
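For illustration only, here is a minimal Python sketch of that replicated-pair idea. It is a hypothetical sketch, not our actual implementation: the StoragePair class, the directory layout, and the file names are all assumptions made for the example.

```python
# Hypothetical sketch of attachment storage in replicated pairs: every write
# must land on both stores before it is acknowledged, so losing either store
# (or the NODE that previously held the files locally) never forces a restore
# from backup servers.
import shutil
import tempfile
from pathlib import Path


class StoragePair:
    """Writes each file to two independent stores; reads fall back to the twin."""

    def __init__(self, primary: Path, secondary: Path) -> None:
        self.primary = primary
        self.secondary = secondary
        for root in (primary, secondary):
            root.mkdir(parents=True, exist_ok=True)

    def put(self, name: str, data: bytes) -> None:
        # Write to both replicas; only acknowledge once both succeed.
        for root in (self.primary, self.secondary):
            (root / name).write_bytes(data)

    def get(self, name: str) -> bytes:
        # Prefer the primary, but survive its loss by reading the twin.
        try:
            return (self.primary / name).read_bytes()
        except OSError:
            return (self.secondary / name).read_bytes()


if __name__ == "__main__":
    base = Path(tempfile.mkdtemp())
    pair = StoragePair(base / "store-a", base / "store-b")
    pair.put("request-attachment.pdf", b"attachment bytes")
    shutil.rmtree(base / "store-a")  # simulate losing one store entirely
    assert pair.get("request-attachment.pdf") == b"attachment bytes"
```

Because put() only returns once both replicas hold the file, either store can be lost without any data needing to be restored from backup, which is exactly the property described above.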
Microservices - Already planned, and code changes have been made over the last 2 years to support this, along with the storage servers. The use of microservices removes the need for a "home" NODE box, so all NODE boxes can service any instance. With this in place (and the storage servers), the loss of a NODE is no concern (and is actually part of the normal routine); a sketch of that node-agnostic routing follows at the end of this post.

We have never previously had a corruption like this and believe the chances of another occurrence are low. Again, we apologise for the disruption caused by the failure.
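As with the storage sketch above, the following Python is a hypothetical illustration of node-agnostic routing, not our real infrastructure: the Node type, the health flags, and the dispatch() function are all assumptions made for the example.

```python
# Hypothetical sketch: with no "home" NODE affinity, a request for any
# instance can be dispatched to any healthy node, so losing a node simply
# shrinks the pool instead of taking customer instances offline.
import random
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


def dispatch(instance: str, nodes: list[Node]) -> str:
    """Route a request for the given instance to any healthy node."""
    candidates = [n for n in nodes if n.healthy]
    if not candidates:
        raise RuntimeError("no healthy nodes available")
    chosen = random.choice(candidates)  # any node will do; state is shared
    return f"{instance} -> {chosen.name}"


if __name__ == "__main__":
    # node16 (the failed node from the RCA) just drops out of the pool.
    cluster = [Node("node16", healthy=False), Node("node17"), Node("node18")]
    print(dispatch("customer-instance", cluster))
```

In a design like this the router holds no per-instance state, so a node failure requires no instance migration at all: requests simply stop being routed to it.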