Higher than expected API error rates
Incident Report for AgileMD
Postmortem

New iOS version released; API environment not prepared for the ensuing request volume

At approximately 11:00am, Apple transitioned the next release of the AgileMD app to "ready for sale", which made the app available to all AgileMD members running iOS.

By 11:30am, members who use an iOS device began to update their AgileMD app to version 2.3. The 2.3 release includes, among various smaller bug fixes, a relatively large refactor of the AgileMD sync engine. Internally, we refer to the current engine as Sync3. (As an aside, we will provide some performance data for Sync3 in an upcoming blog post.)

The client upgrade from Sync2 to Sync3 requires a data format change. As a result, the new app must redownload all subscribed content after the upgrade is complete.

At the time members began to upgrade, our AWS environment was running our standard complement of servers. However, because the AgileMD community has grown, both the number of members upgrading immediately and the number of files per member have increased significantly (by between 1 and 2 orders of magnitude).

Our infrastructure is cache-heavy and we triage most content into various tiers with the lowest levels and fallback strategies resolving to our API servers directly. At the time the app was released, our production caching layer was not primed to deliver content and instead began to lazy-load content from the API.
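As an illustration only, the difference between a primed cache and one that lazy-loads can be sketched as below. This is not AgileMD's caching layer: the class name `TieredCache`, the `fetch_from_api` callback, and the use of in-memory dictionaries as tiers are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, List


class TieredCache:
    """Hypothetical tiered cache: check each tier in order, then fall back to the API."""

    def __init__(self, tier_count: int, fetch_from_api: Callable[[str], bytes]) -> None:
        # Each tier is modelled as a plain dict; real tiers would be separate services.
        self.tiers: List[Dict[str, bytes]] = [{} for _ in range(tier_count)]
        self.fetch_from_api = fetch_from_api
        self.api_calls = 0  # how many lookups fell through to the API

    def get(self, key: str) -> bytes:
        for tier in self.tiers:
            if key in tier:
                return tier[key]
        # Miss in every tier: lazy-load from the API and fill the tiers.
        self.api_calls += 1
        value = self.fetch_from_api(key)
        for tier in self.tiers:
            tier[key] = value
        return value

    def prime(self, keys: Iterable[str]) -> None:
        """Load keys ahead of time so later lookups do not reach the API."""
        for key in keys:
            self.get(key)
```

In this sketch, calling `prime` before a release moves the API fetches to a quiet period. Without it, the first request for every file falls through to the API at the moment client traffic is highest. The sketch is sequential, so it does not model many clients missing on the same file at once.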

Client retry logic compounded request volume

By 12:00 noon, the caching layer was overwhelmed with processing tasks and was responding with 503s to between 20% and 50% of requests. This, in turn, pushed request volume back to the API servers.

The retry logic in iOS 2.3 automatically reschedules any sync request that fails. Additionally, if the app is closed or restarted in some way, the sync engine will abandon our primary caching system and begin to request individual files from the API. Version 2.3 does not restrict retry attempts, so file requests are retried ad infinitum at 2-3 second intervals.

These two forces (high cache-miss rates and high client request speed) quickly magnified observed API requests far beyond normal volume. The API servers began throwing 503s to as many as 20% of requests. And thus a vicious circle of cache-miss, request, request-fail, retry began with request volume peaking above 50k requests per minute versus a normal rate of ~1k per minute.
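A back-of-envelope calculation shows how quickly unbounded retries add up. It uses only the figures quoted above (2-3 second retry intervals, ~1k requests per minute normally, 50k at peak). The function names are hypothetical, and the model assumes, purely for illustration, that all peak traffic came from file requests stuck in a retry loop.

```python
def retries_per_minute(interval_seconds: float) -> float:
    """Requests per minute generated by one file request retried forever."""
    return 60.0 / interval_seconds


def stuck_requests_for_volume(requests_per_minute: float, interval_seconds: float) -> float:
    """How many simultaneously failing file requests would produce the given volume."""
    return requests_per_minute / retries_per_minute(interval_seconds)


NORMAL_RPM = 1_000   # "~1k per minute" from the report
PEAK_RPM = 50_000    # "above 50k requests per minute" from the report

amplification = PEAK_RPM / NORMAL_RPM
fast = stuck_requests_for_volume(PEAK_RPM, 2.0)  # retrying every 2 seconds
slow = stuck_requests_for_volume(PEAK_RPM, 3.0)  # retrying every 3 seconds
```

Under that assumption, each stuck file request contributes 20 to 30 requests per minute. A few thousand failing requests across all clients would therefore be enough to reach the observed peak of 50 times normal volume.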

Our temporary solution was to increase API server capacity by spinning up 50% more instances than we normally run. This provided (1) more hardware for our caching layers to complete their tasks and (2) a more stable environment for clients requesting files individually from the API.

By 12:45 the extra server capacity had improved our baseline response rate and API error rate to ordinary operating levels. We will continue to monitor performance as members upgrade their apps.

Going forward

In short, we experienced the equivalent of a self-inflicted DDoS attack, perhaps among the most humbling of systemic failures.

In response, our team has made several changes:

  1. Previously, our latent server capacity was more than enough to compensate for the widespread parallel resync that occurs on upgrade. However, we have grown quickly in the last 6 months and our deployment protocol has remained the same. Starting today, we have updated our release checklist to include hardware scaling to accommodate expected initial load.

  2. We will increase the frequency at which our cache priming system checks for pending updates. This should mitigate scenarios in which the caching layer must prime itself from scratch during the same time that clients are making frequent requests.

  3. Version 2.4 of the iOS app will ship with more sophisticated retry logic (using a combination of time checks and failed request checks).
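One common way to combine a time check with a failed-request check is a bounded retry loop with exponential backoff and jitter. The sketch below is an assumption about what such logic could look like, not the 2.4 implementation: the report does not say which strategy the app will use, and every name and default value here (`fetch_with_retry`, `max_attempts`, `max_elapsed_seconds`, and so on) is hypothetical.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RetryBudgetExceeded(Exception):
    """Raised when the attempt limit or the time limit has been reached."""


def fetch_with_retry(
    request: Callable[[], T],
    max_attempts: int = 5,               # failed-request check (assumed value)
    max_elapsed_seconds: float = 60.0,   # time check (assumed value)
    base_delay_seconds: float = 2.0,
    max_delay_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
    rng: Callable[[], float] = random.random,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    started = clock()
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception as error:  # a real client would catch only transient errors
            last_error = error
        # Exponential backoff, capped, then scaled by a random factor (full jitter)
        # so that many clients do not retry in lockstep.
        delay = min(max_delay_seconds, base_delay_seconds * (2 ** attempt)) * rng()
        out_of_attempts = attempt == max_attempts - 1
        out_of_time = (clock() - started) + delay > max_elapsed_seconds
        if out_of_attempts or out_of_time:
            break
        sleep(delay)
    raise RetryBudgetExceeded(f"gave up after {attempt + 1} attempts") from last_error
```

Unlike a fixed 2-3 second retry with no limit, this loop waits longer after each failure and eventually gives up, so a struggling server sees load fall off instead of pile up. The `sleep`, `clock`, and `rng` parameters exist only so the behavior can be tested without real waiting.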

Posted Mar 25, 2014 - 14:03 PDT

Resolved
This incident has been resolved.
Posted Mar 25, 2014 - 14:00 PDT
Monitoring
We have increased server capacity by 50%; API error rates are decreasing rapidly and sync latency is returning to normal. We will continue to monitor performance throughout the day.
Posted Mar 25, 2014 - 12:45 PDT
Identified
Our infrastructure is reporting far higher than normal latency across all reader applications. Most of the traffic appears to be coming from members upgrading to the latest version of the iOS app. We're adding more hardware to compensate.
Posted Mar 25, 2014 - 12:28 PDT