A recent outage at design platform Canva showed how seemingly minor technical issues can cascade into a major service disruption, as detailed in an incident report by CTO Brendan Humphreys.
The incident began with a routine deployment of a new editor page version. While the code itself was bug-free, the deployment triggered an unexpected chain of events that would temporarily bring down Canva's services.
The Perfect Storm
Three key factors combined to create the outage:
- A stale traffic routing rule in Cloudflare's CDN was directing Asian traffic over the public internet instead of private fiber networks, causing severe latency
- Because of this latency, over 270,000 user requests queued up waiting for the same new JavaScript files, so they all completed at the same moment
- A known performance issue in Canva's API gateway reduced its ability to handle traffic effectively
When the JavaScript files finally became available after a 20-minute delay, all waiting clients attempted to access Canva's API simultaneously, generating 1.5 million requests per second, triple the normal peak load.
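The effect of that synchronization can be illustrated with a small simulation. This is a minimal sketch, not a model of Canva's systems: the function name, the client count, and the number of requests per client are hypothetical values chosen for illustration. It compares the busiest one-second window when every waiting client fires at once with the same clients spread over a random delay.

```python
import random
from collections import Counter

def peak_requests_per_second(num_clients, requests_per_client, jitter_seconds, seed=0):
    """Return the request count in the busiest one-second bucket.

    All values are illustrative assumptions, not figures from the incident.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    buckets = Counter()
    for _ in range(num_clients):
        # Each client starts after a random delay of up to jitter_seconds;
        # with no jitter, every client starts in the same second.
        delay = rng.uniform(0, jitter_seconds) if jitter_seconds > 0 else 0.0
        buckets[int(delay)] += requests_per_client
    return max(buckets.values())

# Everyone released at the same instant: the whole burst lands in one second.
print(peak_requests_per_second(27_000, 5, jitter_seconds=0))   # 135000

# The same clients spread over 30 seconds produce a far lower peak.
print(peak_requests_per_second(27_000, 5, jitter_seconds=30))
```

The total work is identical in both runs; only the timing differs, which is why a delay that lines clients up can be more damaging than the delay itself.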
Cascading Failures
The API gateway, already hampered by performance issues, couldn't handle this massive spike. As individual gateway tasks failed, the load balancer redistributed their traffic to the remaining healthy tasks, pushing those toward failure as well. Auto-scaling attempted to add capacity, but the Linux out-of-memory (OOM) killer was terminating overloaded containers faster than new ones could be provisioned.
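The feedback loop described above can be sketched as a toy model. It assumes a load balancer that spreads a fixed load evenly over healthy nodes, a fixed number of nodes killed per step while nodes are overloaded, and a fixed number added per step by auto-scaling. All names and numbers are hypothetical and much simpler than a real gateway.

```python
def simulate_cascade(total_load, healthy, capacity, killed_per_step, added_per_step, steps):
    """Return the number of healthy nodes after each step of a toy cascade.

    Assumes load is split evenly and every node has the same capacity.
    """
    history = [healthy]
    for _ in range(steps):
        if healthy == 0:
            break  # full outage: nothing left to serve traffic
        if total_load / healthy > capacity:
            # Overloaded: nodes are killed while the autoscaler adds replacements.
            healthy = max(0, healthy + added_per_step - killed_per_step)
        # Otherwise the fleet is stable and the node count is unchanged.
        history.append(healthy)
    return history

# Nodes die faster than they are replaced: each loss raises per-node load.
print(simulate_cascade(1500, 10, 100, killed_per_step=3, added_per_step=1, steps=10))
# [10, 8, 6, 4, 2, 0]

# Replacement outpaces failure: the fleet grows until per-node load fits.
print(simulate_cascade(1500, 10, 100, killed_per_step=3, added_per_step=4, steps=8))
# [10, 11, 12, 13, 14, 15, 15, 15, 15]
```

In the first run the per-node load rises with every lost node, so the system never gets a chance to recover unless the incoming load itself is reduced.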
The Recovery
Canva's engineering team responded in two steps:
- They first attempted to increase system capacity manually
- When this proved insufficient, they used Cloudflare's firewall to temporarily block all incoming traffic
This "circuit breaker" approach allowed the system to stabilize. Traffic was then readmitted gradually, starting with Australian users under strict rate limits, and full service was restored about 35 minutes later.
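A block-then-readmit policy of this kind can be sketched with a token bucket per region. This is an illustrative assumption about how such a gate might work, not Cloudflare's firewall API or Canva's actual rules; the class names, region codes, and rates are hypothetical.

```python
class TokenBucket:
    """Allow roughly rate_per_second requests, with bursts up to `burst`."""

    def __init__(self, rate_per_second, burst):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = burst
        self.last = 0.0

    def allow(self, now):
        # Refill in proportion to the time elapsed since the last check.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class GradualRestore:
    """Block every region by default; opened regions are rate limited."""

    def __init__(self):
        self.buckets = {}  # region -> TokenBucket; absent regions stay blocked

    def open_region(self, region, rate_per_second, burst):
        self.buckets[region] = TokenBucket(rate_per_second, burst)

    def admit(self, region, now):
        bucket = self.buckets.get(region)
        return bucket is not None and bucket.allow(now)


gate = GradualRestore()
print(gate.admit("AU", now=0.0))   # False: everything is blocked at first
gate.open_region("AU", rate_per_second=2, burst=2)
print(gate.admit("AU", now=0.0))   # True: first region readmitted under a limit
print(gate.admit("US", now=0.0))   # False: other regions remain blocked
```

Raising the rate for an open region, or opening further regions, would then bring traffic back in controlled increments rather than all at once.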
Lessons Learned
The incident highlighted several important points about modern system resilience:
- Performance issues can be harder to detect than functional bugs
- Automated systems can sometimes amplify problems during incidents
- Human operators play a critical role in adapting system behavior during crises
- Having flexible configuration options is key for incident response
Canva has since developed detailed procedures for managing similar incidents and continues to work on improving system resilience.