Commercial Application Scripts
Incident Report for TrackVia
Postmortem

Summary

On July 2nd at 6:47 AM MT, TrackVia Support reported an increase in support tickets related to application scripts. A quick investigation attributed the issue to a recent software release. Reverting the application change corrected the issue, and full functionality was restored by 8:00 AM MT. A full timeline is detailed below.

Impact Analysis

A number of accounts encountered failed transactions and corresponding error messages during the incident timeline. No data corruption was identified as a result of the incident.

Scope:

Scope                          Value
Count of Accounts Impacted     84
Count of Failed Transactions   3212
% Total Traffic Impacted       0.04%

Root Cause

A new application script threading model introduced in the 23.43 release was designed to manage app script compilation and execution in a single-threaded manner, per application server. This design was intended to mitigate a long-standing issue. As part of the architectural changes, we also provided a fallback mechanism through an application configuration value that would allow us to utilize the legacy threading logic should any unexpected errors occur.
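
The fallback described above can be pictured with a minimal sketch. It is not TrackVia's implementation: the configuration key `appscript.threading.legacy`, the function names, and the idea that the legacy logic was a multi-worker pool are all assumptions made for illustration, since the report names none of them.

```python
from concurrent.futures import Executor, ThreadPoolExecutor

# Hypothetical configuration key; the report does not name the real parameter.
LEGACY_THREADING_KEY = "appscript.threading.legacy"


def worker_count(config: dict, legacy_workers: int = 8) -> int:
    """Return how many script workers a server should run for this config."""
    if config.get(LEGACY_THREADING_KEY, False):
        # Fallback path: legacy threading, modelled here (an assumption)
        # as a pool with several workers.
        return legacy_workers
    # New model: a single worker per application server, so scripts are
    # compiled and executed one at a time.
    return 1


def build_script_executor(config: dict, legacy_workers: int = 8) -> Executor:
    """Build the executor that compiles and runs application scripts."""
    return ThreadPoolExecutor(max_workers=worker_count(config, legacy_workers))


def run_script(executor: Executor, script, *args):
    """Submit a script callable and block until its result is available."""
    return executor.submit(script, *args).result()
```

Because the choice is read from configuration when the executor is built, switching models in a design like this needs only a config change and fresh server instances, not a code rollback.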

The ultimate root cause is still under investigation. We believe we are running into a contention scenario related to a singleton value shared across application server instances and threads.

Resolution and Recovery

TrackVia Operations added the config parameter, forcing the threading logic to revert to legacy handling routines, which remained unchanged in the codebase.

New application servers were deployed utilizing the configuration override, restoring full service.

Corrective and preventative measures

  • TrackVia will investigate adding an error anomaly monitor to identify these types of issues.
  • The overridden configuration value will remain active until we have fully identified and mitigated the root cause in a future release.
  • QA Testing methodology will be evaluated and adapted to more accurately reproduce contention-at-scale scenarios that were difficult to replicate in our testing environments.
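
As an illustration of the first measure, an error anomaly monitor can be as simple as a sliding-window error-rate check. The sketch below is hypothetical: the class name, window size, and threshold are assumptions and do not describe TrackVia's monitoring stack.

```python
from collections import deque


class ErrorRateMonitor:
    """Flag when the error rate over the last `window` transactions exceeds `threshold`."""

    def __init__(self, window: int = 1000, threshold: float = 0.01):
        # Each entry is True if the transaction failed; old entries fall off
        # automatically once the deque reaches `window` items.
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, failed: bool) -> bool:
        """Record one outcome; return True if the window now looks anomalous."""
        self.outcomes.append(failed)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough data for a meaningful rate yet
        error_rate = sum(self.outcomes) / len(self.outcomes)
        return error_rate > self.threshold
```

A check of this kind would be fed from whatever system already logs script failures, and the threshold would need tuning against normal traffic to avoid noisy alerts.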

Incident Timeline

  • July 1 2024 1:01PM - Initial application deployment
  • July 1 2024 3:00PM - First error reported to SIEM
  • July 2 2024 6:47AM - Support team escalated customer-reported issue to Operations team
  • July 2 2024 7:38AM - Parameter config changed and application deploy begins
  • July 2 2024 8:00AM - Functionality restored
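
The gaps between the timeline entries above can be worked out with a short calculation. The sketch assumes all five timestamps are in the same timezone (MT, as in the summary); the event labels are shortened for readability.

```python
from datetime import datetime

# Timestamps copied from the timeline; assumed to share one timezone.
events = [
    ("deployment", datetime(2024, 7, 1, 13, 1)),
    ("first error", datetime(2024, 7, 1, 15, 0)),
    ("escalation", datetime(2024, 7, 2, 6, 47)),
    ("config change", datetime(2024, 7, 2, 7, 38)),
    ("restored", datetime(2024, 7, 2, 8, 0)),
]


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


# Minutes between each pair of consecutive events.
intervals = {
    f"{a_name} -> {b_name}": minutes_between(a_time, b_time)
    for (a_name, a_time), (b_name, b_time) in zip(events, events[1:])
}

# Minutes from the first logged error until service was restored.
first_error_to_restore = minutes_between(events[1][1], events[4][1])

for label, minutes in intervals.items():
    print(f"{label}: {minutes // 60}h {minutes % 60}m")
```

Under that assumption, the longest gap is the 15 hours 47 minutes between the first logged error and the escalation to Operations, while the time from escalation to restored service is 73 minutes.
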
Posted Jul 02, 2024 - 13:48 MDT

Resolved
On July 2nd at 6:47 AM MT, TrackVia Support reported an increase in support tickets related to application scripts. A quick investigation attributed the issue to a recent software release. Reverting the application change corrected the issue and full functionality was restored by 8:00 AM MT.
Posted Jul 02, 2024 - 08:00 MDT