Drop in the success rate of Forge hosted storage API calls

Incident Report for Jira Work Management

Resolved

We identified a problem with the Forge hosted storage API calls, which resulted in a drop in invocation success rates in the developer console. The impact of this incident has been mitigated and our monitoring tools confirm that the success-rate is back to the pre-incident behaviour. It impacted 16 apps according to our logs, where these apps saw a reduced success rate of storage.get API calls, as listed in https://developer.atlassian.com/platform/forge/runtime-reference/storage-api-basic.

As part of Forge's preparation to support Data Residency, Forge hosted storage has been undergoing a platform and data migration for storing app data. As part of this migration we do comparison checks for data consistency between the old and new platform. The previous incident earlier, https://developer.status.atlassian.com/incidents/9q71ytpjhbtl, had put the data on the new platform out of sync and so comparisons of the data from the old and new platform started showing failures and the migration logic retries on failures to test for consistency issues. This retry behaviour increased latency of these requests which led to 16 apps receiving an increased number of 504 timeout errors.

Checking synchronously was identified by the team as a bug and should have been async. Once the root cause was identified we moved our backing platform rollout to a previous stage. The rollout is split into several stages. The issues we were having were on our blocking stage where we make calls to both the old and new platform and wait for both to complete so we can test any performance issues in the new platform before using it as our source of truth. It was in this blocking stage where we had a bug that included waiting on comparisons when it should've been async.

To recover, we reverted back to our shadow mode stage. In this stage, all operations to the new platform are asynchronous, including comparisons that were blocking in the other stage and resulted in timeout issues and 504 errors being sent to apps. This is the state that Forge hosted storage has been in for several months without any problems.

Here is the timeline of the impact:
- On 2024-02-05 at 06:42 PM UTC, impact started with comparisons start happening on out of sync data in blocking mode
- On 2024-02-05 at 08:57 PM UTC, impact was detected to API by our monitoring systems
- On 2024-02-05 at 11:34 PM UTC, rollout to new platform was reverted to known stable state and impact ended

We will release a public incident review, PIR, here in the upcoming weeks for this and the incident that happened earlier, https://developer.status.atlassian.com/incidents/9q71ytpjhbtl. We will detail all that we can about what caused the issue, and what we are doing to prevent it from happening again.

We apologise for any inconveniences this may have caused our customers and the developer community and committed to preventing further issues with our hosted storage capability.

Posted Feb 06, 2024 - 02:40 UTC