
[IMPORTANT] Hornbill service impacted [RESOLVED]


Everton1878

Recommended Posts

@Martyn Houghton,

The Service details Progressive Capture form is not shown in the Portal (where a user clicks on a Service > Catalog to raise a new Request...), so the symptom described only exists in the User App.

The issue in Internet Explorer could affect other areas of the product. We're treating this as an exercise to prevent the same issue occurring elsewhere in the product.

We are preparing a new build, so we hope that this will be addressed very soon.


We've just pushed a new build of Service Manager to the app store, so please update. This solves the problem with Progressive Capture mentioned by Ehsan. He also mentioned other areas that could be affected, and we're looking at those as a priority.

 


2 minutes ago, Martyn Houghton said:

We have told all users to refresh their browsers with 'Ctrl F5' to clear their cache of the previous build code.

^ This

Everyone who updated needs to have their session refreshed. Log off/log back on and/or CTRL+F5, clear the cache. Then try again.


  • Victor changed the title to [IMPORTANT] Hornbill service impacted. Please Read!
  • Deen unpinned this topic

We have now completed our investigation into the issue that occurred on 23/10/2019. The details of the RCA are below:

First, let me express my sincere apologies for the issues caused. We do understand that such issues cause a great deal of trouble and stress, and we are truly sorry for this.

Background

I will begin by giving you a very high-level overview of Hornbill functionality and how it works from the user's browser down to the instance. There are many, many more details to how this works, but for the purpose of this investigation it can be simplified like this:

User browser <-> Hornbill Web Servers <-> Hornbill instance

Any action performed by a user in Hornbill translates into a request sent from the browser to our web servers, which process all requests coming from browsers. Some of these requests are part of the data layer, meaning the request needs to reach the instance. Once the request is performed, a response is sent back to the user's browser following the same path.

Timeline

On 22/10 around 2 PM our monitoring tools alerted us to unusual activity taking place in one of our customer instances. At some point during the progressive capture flow, the user is presented with a selection of catalog items. The list of catalog items comes following a request from the browser (to display the relevant catalog items) sent to our web servers, which then send this request to be processed by the instance. It appeared that the system tried to retrieve these catalog items (which uses the getVisibleCatalogs API) multiple times in quick succession, which was unusual as this should only be performed once. This unusual activity was brought to the attention of our application developers, who began an investigation.

Around 5:30 PM our monitoring tools again alerted on the same unusual activity, now occurring in another instance as well. At this point we started to notice service performance being impacted in these two instances. The investigation continued, but the issue was not classed as critical as it was believed to be isolated to two instances. Moreover, the issue appeared to alleviate shortly after the second alert, mainly due to less activity as office hours ended. The investigation focused on a possible issue with the PCF configuration, although at this point we were still gathering data. We understood at that point that getVisibleCatalogs requested data from the instance (via the web server) and the instance responded with the requested data, but the exchange was, for some as yet unknown reason, stuck in a continuous loop between the web server and the instance.

Between 07:00 and 9:00 PM our monitoring tools alerted on the same issue, now occurring in a few other instances. Although the service was not yet impacted at this point, the issue became the highest priority, as it was obvious that it was not a configuration issue but something else affecting several separate instances. The investigation focus now shifted to finding a code flaw, a product defect. Our infrastructure teams and platform development team were also notified and started to participate in the investigation. We began analysing the recent code and attempting to replicate the issue in house to better understand how it came to be and what the fix would be. During this time our infrastructure team applied short-term interim fixes by terminating the loops when they appeared. The investigation was carried out until shortly after midnight, but we were not able to find any flaw at that point.

On 23/10 between 09:00 and 10:00 AM our monitoring tools started to send out an increasing number of alerts for the same issue, mostly due to increased usage of the service as users began their business hours. The issue that had occurred at the end of the previous business day now started to severely impact the service as more and more users used the service. The service was impacted because the web servers and the instance became overwhelmed with the number of requests received, which led to steadily degrading processing and response times, ultimately resulting in slow performance and, on occasion, disruption of the service. Most of these requests between the web server and the instance were caused by the issue noticed the previous evening. Although all our teams were fully dedicated to finding what caused this and coming up with a solution, the cause still eluded us. We were still unable to replicate the issue in house, nor were we able to find flaws through code reviews.

Around 11:00 AM our infrastructure team found an important piece of information: it appeared that the issue was initiated when using a certain browser, Internet Explorer. Shortly afterwards, one of our customers confirmed that the slow service performance was more pronounced when using Internet Explorer than when using any other browser. As such, the development investigation now tried to understand whether the browser engine was causing the issue rather than the Hornbill code itself.

Around 01:00 PM it became clear that the issue was triggered when using Internet Explorer, and developers found how it happens. As we now understood the mechanism, our Service Manager team started working on a fix and deployed a patch update containing a set of changes to cater for the issue with catalog items (https://community.hornbill.com/topic/16969-hornbill-service-manager-1733/). We then considered the issue resolved, which was also confirmed by customers who applied the patch.

Despite having a solution for the issue, some aspects of our investigation raised the question of whether the actual flaw had been found and whether the investigation fully revealed what the issue was and how it came to be. The issue was clearly established as follows: to display the catalog items, Hornbill retrieves them from the instance database and stores them in an array, like: "1", "2", "3". The browser processes the array and displays the catalog items. Further investigation revealed that on particular IE browser versions (more on that below), enumerating the contents of this array looks like this: "1", "2", "3", "find". The affected browser versions treat the find() method on the array object as an indexed array item; this is incorrect behaviour and would appear to be specific to a limited number of versions of IE 11. This incorrect additional item was the culprit that caused the loop in the first place: because the array was incorrect, the list of catalog items was requested again, but the resulting array was incorrect again, which became the loop. We have been able to re-create this problem in IE 11.356.18362.0 Update Version 11.0.145, which was deployed by Microsoft on 10/09. However, we have found that in IE Update Version 11.0.155, deployed by Microsoft on 08/10, the problem with the incorrect additional array object was fixed. As the browser issue was fixed on 08/10 and the issue in Hornbill occurred on 23/10, it became apparent there was also something else that could cause this issue.
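The thread does not include the affected code, but a minimal sketch (not Hornbill's actual code) of the enumeration behaviour described above might look like the following. The name findShim is hypothetical, used here so the snippet also runs on modern browsers where Array.prototype.find already exists as a non-enumerable built-in:

// Adding a method to Array.prototype with a plain assignment creates an
// enumerable property, so it shows up when a for...in loop walks an array.
Array.prototype.findShim = function (predicate) {
  for (var i = 0; i < this.length; i++) {
    if (predicate(this[i], i, this)) { return this[i]; }
  }
  return undefined;
};

var catalogIds = ["1", "2", "3"];
var keys = [];
for (var key in catalogIds) { // for...in also visits inherited enumerable properties
  keys.push(key);
}
console.log(keys); // ["0", "1", "2", "findShim"] - the unexpected extra "item"

Any code that assumes for...in yields only array indexes will then see an extra entry, which is the kind of unexpected item described above.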

As the investigation continued, around 04:00 PM our Core Collaboration team found an issue with a 3rd party library used by Hornbill. This particular library is used to provide functionality within the activity streams (timelines, workspaces). The development team had moved this file, along with a number of others, into the Hornbill initial page load rather than loading it just before the activity stream is displayed (as we had been doing previously). Because the activity stream is used in so many places it made sense to load it this way, and since we weren't changing the file itself it was not believed it would cause any issues. Unfortunately, unbeknown to us, this library was adding a function to the JavaScript language if that function did not already exist. This is sometimes necessary, as not all browsers support the same functionality, and it allows missing features to be added to those browsers. We already had code to add this functionality, but due to the change in the order in which we load the files, the feature was added by the 3rd party library instead of by our code. The way the library was adding this function was incorrect and caused an issue in IE11 where this function would appear in a way that code in certain places did not expect, which led to that code behaving in unexpected ways and, in particular, causing multiple repeated calls to the server in rapid succession, the issue presented above. As a result, our Core Collaboration development team also deployed a patch update to address this (https://community.hornbill.com/topic/16974-new-update-hornbill-collaboration-1110/). We consider the issue fully resolved at this point.
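For comparison, a polyfill that is installed the way built-in methods are defined (non-enumerable) does not leak into for...in enumeration. This is a generic sketch of the common pattern, not the code from the 3rd party library or from Hornbill:

// A polyfill installed with Object.defineProperty and enumerable: false
// behaves like a native method and does not appear during for...in enumeration.
if (!Array.prototype.find) {
  Object.defineProperty(Array.prototype, 'find', {
    value: function (predicate) {
      for (var i = 0; i < this.length; i++) {
        if (predicate(this[i], i, this)) { return this[i]; }
      }
      return undefined;
    },
    writable: true,
    configurable: true,
    enumerable: false // the key difference from a plain assignment
  });
}

var items = ["1", "2", "3"];
for (var key in items) {
  console.log(key); // "0", "1", "2" only - no stray "find"
}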

Conclusion

While the issue was ongoing, we noticed and had reports of instances being affected by service degradation while not being involved in the issue. There were a number of instances whose users were not using Internet Explorer and therefore could not trigger the issue. However, as described at the beginning, all requests from any user, any browser, in any instance are processed by a web server and, when it comes to data, by the database service as well. Because the web server and database services were simply overwhelmed processing an unusually high number of requests (most created by the looping issue), any user would eventually experience delays in processing requests and delays in receiving responses, even if they did not contribute to creating the issue in the first place. It was, simply put, a DDoS-type scenario.

To address the underlying issue caused by the 3rd party library, our development teams are now reviewing the process by which such libraries are implemented, and we have put additional checks in place to prevent similar scenarios from occurring in the future.

We have a robust infrastructure system in place, and it's designed to cope with the number of requests we usually see in any customer instance. However, for the unlikely scenario where the infrastructure receives an unusual number of requests, we are (and were) looking to implement a more robust system based on microservices to minimise and hopefully eliminate any indirect impact, such as the one experienced in instances that were not directly causing the issue but were affected by it.

 


  • Victor changed the title to [IMPORTANT] Hornbill service impacted [RESOLVED]
