IMPORTANT: Some instances experiencing a service outage/disruption (Resolved)


Victor


All,

Please see the full post-mortem below. We have already taken steps to ensure that this issue cannot occur again and have re-escalated it to the developers in the CentOS team.

HTL INCIDENT REPORT 020720180001
At 15:20 on Monday 2nd July 2018, our monitoring detected slow queries and virtual cache locks on one of our database servers. The root cause was found to be an XFS cache/write issue that was preventing subsequent writes, so all queries were stacked as pending in memory. The resolution was to clear the cache locks, which was performed at 15:24, and the incident was fully resolved at 15:30.

A similar issue has occurred before and, unfortunately, was not resolved in a later kernel as anticipated.

Following the incident, we have set up a scheduled job to clear the cache/locks every 30 days and have escalated the issue to the OS developers. Given that we have only ever seen this issue 3 times over 2 years, on fewer than 2% of database servers, we expect this frequency to be more than sufficient. The clearing schedule will be reviewed every few days during the first 30 days to ensure that it is sufficient.
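
As a rough illustration of why a 30-day interval is expected to be sufficient, the figures above can be turned into a quick back-of-the-envelope calculation. This is only a sketch: it assumes the 3 incidents were spread evenly over the 2 years (a simplification), and the variable names are illustrative rather than part of any real tooling.

```python
# Figures taken from the report: 3 incidents observed over 2 years.
INCIDENTS = 3
OBSERVATION_DAYS = 2 * 365
CLEAR_INTERVAL_DAYS = 30  # scheduled clearing interval from the report

# Average gap between incidents, assuming an even spread (a simplification).
mean_gap_days = OBSERVATION_DAYS / INCIDENTS

# Number of scheduled clears that would run during an average gap.
clears_per_gap = mean_gap_days / CLEAR_INTERVAL_DAYS

print(f"Mean gap between incidents: {mean_gap_days:.0f} days")
print(f"Scheduled clears per mean gap: {clears_per_gap:.1f}")
```

Under that assumption, the scheduled job would run several times in the typical gap between incidents, which is the margin the review period is intended to confirm.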

Kind Regards

Keith Stevenson
