Inside Canva's Global Service Outage: A Chain Reaction of Technical Challenges


A recent outage at design platform Canva showed how seemingly minor technical issues can cascade into a major service disruption, as detailed in a comprehensive incident report by CTO Brendan Humphreys.

The incident began with a routine deployment of a new editor page version. While the code itself was bug-free, the deployment triggered an unexpected chain of events that would temporarily bring down Canva's services.

The Perfect Storm

Three key factors combined to create the outage:

A stale traffic routing rule in Cloudflare's CDN was directing Asian traffic over the public internet instead of private fiber networks, causing severe latency
The latency left more than 270,000 user requests waiting on the same new JavaScript files, so they all completed at the same moment
A known performance issue in Canva's API gateway reduced its ability to handle traffic effectively

When the JavaScript files finally became available after a 20-minute delay, all waiting clients attempted to access Canva's API simultaneously, generating 1.5 million requests per second, triple the normal peak load.
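The synchronization effect described above is often called a thundering herd. A common general mitigation, which the report as summarized here does not attribute to Canva, is to have each client add a random delay (jitter) before its first call. The sketch below is illustrative only: the function names and the 30-second jitter window are assumptions, and the client count is simply the figure quoted above.

```python
import random

def release_times(n_clients, ready_at, max_jitter, rng):
    """Return the time at which each waiting client fires its first API call.

    Each client waits a uniformly random extra delay in [0, max_jitter]
    after the asset becomes available (hypothetical client behavior).
    """
    return [ready_at + rng.uniform(0.0, max_jitter) for _ in range(n_clients)]

def peak_per_second(times):
    """Count the calls that land in the busiest one-second bucket."""
    buckets = {}
    for t in times:
        key = int(t)
        buckets[key] = buckets.get(key, 0) + 1
    return max(buckets.values())

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# No jitter: every client fires the instant the files are ready.
synchronized = release_times(270_000, ready_at=0.0, max_jitter=0.0, rng=rng)

# Assumed 30-second jitter window: the same clients are spread out.
jittered = release_times(270_000, ready_at=0.0, max_jitter=30.0, rng=rng)

print("peak without jitter:", peak_per_second(synchronized))
print("peak with jitter:   ", peak_per_second(jittered))
```

Without jitter the whole population lands in a single second. With jitter the same number of calls is spread across the window, so the busiest second carries only a small fraction of them.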

Cascading Failures

The API gateway, already hampered by performance issues, couldn't handle this massive spike. As individual gateway tasks failed, the load balancer redistributed traffic to remaining healthy nodes, pushing them toward failure as well. While auto-scaling attempted to add capacity, Linux's memory management system was terminating overloaded containers faster than new ones could be provisioned.
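The redistribution dynamic can be illustrated with a toy model. This is a deliberately simplified sketch, not a model of Canva's actual gateway: the node count and per-node capacity are hypothetical, traffic is assumed to be split evenly, and auto-scaling is ignored. Only the 1.5 million requests per second figure, and the normal peak of roughly one third of that, come from the article.

```python
def simulate_cascade(total_rps, nodes, capacity_per_node):
    """Return the healthy-node count after each redistribution round.

    Toy model: traffic is split evenly across healthy nodes. While the
    per-node share exceeds capacity, one node fails per round and the
    survivors absorb the same total load.
    """
    history = [nodes]
    while nodes > 0 and total_rps / nodes > capacity_per_node:
        nodes -= 1
        history.append(nodes)
    return history

# Hypothetical fleet: 100 gateway tasks, each able to serve 10,000 rps.
spike = simulate_cascade(1_500_000, nodes=100, capacity_per_node=10_000)
normal = simulate_cascade(500_000, nodes=100, capacity_per_node=10_000)

print("nodes left after spike:", spike[-1])
print("nodes left at normal peak:", normal[-1])
```

In this model every failure raises the share carried by the remaining nodes, so a fleet that starts out overloaded never reaches a stable state on its own. That is why adding capacity more slowly than tasks are being terminated does not help.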

The Recovery

Canva's engineering team responded in two steps:

First, they attempted to increase system capacity manually
When this proved insufficient, they used Cloudflare's firewall to temporarily block all incoming traffic

This "circuit breaker" approach allowed the system to stabilize. Traffic was then gradually restored, starting with Australian users under strict rate limits, until full service was restored about 35 minutes later.
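A gradual restoration under rate limits can be sketched with a token bucket and a stepped ramp schedule. The code below is a generic illustration under assumed parameters, not Cloudflare's firewall API or Canva's actual procedure. The class, the function, and all rates are hypothetical.

```python
class TokenBucket:
    """Minimal token-bucket rate limiter driven by an explicit clock."""

    def __init__(self, rate, burst):
        self.rate = rate        # tokens added per second
        self.burst = burst      # maximum tokens the bucket can hold
        self.tokens = burst     # start with a full bucket
        self.last = 0.0         # time of the previous call

    def allow(self, now):
        """Return True if a request arriving at time `now` is admitted."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

def ramp_schedule(start_rate, target_rate, factor=2):
    """Return the admitted-rate steps from start_rate up to target_rate."""
    rates = []
    rate = start_rate
    while rate < target_rate:
        rates.append(rate)
        rate *= factor
    rates.append(target_rate)
    return rates

# Hypothetical ramp: double the admitted rate at each step.
steps = ramp_schedule(1_000, 10_000)
print(steps)

bucket = TokenBucket(rate=2, burst=2)
print([bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0)])
```

An operator would hold each step until the backend looks healthy before moving to the next. Passing the clock in explicitly keeps the limiter deterministic and easy to test.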

Lessons Learned

The incident highlighted several important points about modern system resilience:

Performance issues can be harder to detect than functional bugs
Automated systems can sometimes amplify problems during incidents
Human operators play a critical role in adapting system behavior during crises
Having flexible configuration options is key for incident response

Canva has since developed detailed procedures for managing similar incidents and continues to work on improving system resilience.
