171 points by mooreds 6 days ago | 48 comments
shaggie76 6 days ago
One CDN vendor didn't even offer a tiered distribution system, so every edge called home at the same time. Another vendor did have a tiered distribution system designed to avoid this problem, but it was overwhelmed by the absurd number of files we'd serve multiplied by the large user count, so we'd still end up with too much traffic hitting the origin.
The interesting thing was that no vendor we evaluated offered a robust preheating solution, if they offered one at all. One vendor even went so far as to say that they wouldn't allow it because it would let customers unfairly dominate the shared storage cache at the edge (which sort of felt like airlines overbooking seats on a flight to me).
These days we run an army of VMs that fetch all assets from every point of presence we can cover right before launching an update.
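The warm-up pass itself is conceptually simple; a minimal sketch of the idea (hostnames and asset paths are placeholders, not our actual tooling):

    // Pre-launch cache warmer: fetch every asset through every edge we can
    // reach so the first real players hit a warm cache. Everything named
    // here is illustrative.
    const edges = ["edge-ams.example-cdn.net", "edge-sfo.example-cdn.net"];
    const assets = ["/patch/1.2.3/manifest.json", "/patch/1.2.3/chunk-0001.bin"];

    async function warm(edge: string, path: string): Promise<void> {
      const res = await fetch(`https://${edge}${path}`);
      await res.arrayBuffer(); // drain the body so the edge stores the full object
      console.log(`${edge}${path}: ${res.status}`);
    }

    async function warmAll(): Promise<void> {
      for (const edge of edges) {
        // modest per-edge concurrency so the warmer isn't its own thundering herd
        await Promise.all(assets.map((a) => warm(edge, a)));
      }
    }

    warmAll().catch(console.error);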
Another thing we've had to deal with that's mentioned in the article is overloading back-end nodes. Our solution is somewhat ham-fisted but works quite well for us: we cap the connection counts to the back end and return 503s when we saturate. The trick, however, is getting your load balancer to leave the client connection open when this happens -- by default, multiple LBs we've used would slam the connection closed, so when you're serving up 50K 503s a second the firewall would buckle under the runaway connection pool lingering in TIME_WAIT. Good times.
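Roughly the shape of the cap, as a sketch with made-up numbers and a bare Node HTTP handler standing in for the real backend:

    import http from "node:http";

    // Count in-flight requests, shed load with a 503 once saturated, and keep
    // the client connection open so the LB doesn't churn sockets into TIME_WAIT.
    const MAX_IN_FLIGHT = 1000; // made-up cap

    let inFlight = 0;

    const server = http.createServer((req, res) => {
      if (inFlight >= MAX_IN_FLIGHT) {
        // Explicit keep-alive; some LBs/proxies would otherwise slam the
        // connection closed and leave it lingering in TIME_WAIT.
        res.writeHead(503, { "Retry-After": "2", Connection: "keep-alive" });
        res.end();
        return;
      }
      inFlight++;
      res.on("close", () => { inFlight--; });
      // ... real work would go here ...
      res.writeHead(200, { "Content-Type": "text/plain" });
      res.end("ok\n");
    });

    server.listen(8080);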
snackbroken 6 days ago
Edit: Silly me for posting while sleep deprived. It's not the update itself that you're saying is causing thundering herd issues, but the log-ins all being synced up afterwards much like in TFA, duh. My curiosity wrt the apparent lack of P2P game updaters still stands though.
donavanm 6 days ago
The primary business problem is one of visibility and control. The customer UX would be entirely out of your control, and exceedingly variable, based on factors you (the provider) can't even see. At the same time, CDNs were pushing down to cents per GB delivered by 2010, and ~1¢/GB by 2015. At a penny per GB, for higher throughput, better visibility, and control, CDN distribution costs started to not matter compared to other costs and priorities.
Oh! Porn delivery companies, they're an interesting content distribution case. AFAIK commercial CDNs are still way too expensive to meet their business model needs. My recollection is that they all built their own in-house CDNs, like the GP's "run a bunch of VMs" approach, or used a peer's. This was accelerated as all of those companies consolidated, a la MindGeek, in the 2010s.
dikei 6 days ago
pl4nty 5 days ago
Steam recently introduced LAN-based P2P to complement their significant appliance/CDN infrastructure, but I don't know if anyone has pulled it apart yet, and I don't think it does tunnelling like the MSFT network.
masklinn 6 days ago
AndrewDavis 6 days ago
UltraSane 6 days ago
donavanm 6 days ago
> thousands of cold edges would call home simultaneously when all players relogged at the same time.
Our more mature customers (esp. console gaming) would enable early background downloads, spaced out over a few hours, the day/hours before 'launch'. Otherwise, ad hoc/JIT is definitely best effort, though we did a few things to help:
Conceptually, each CDN POP is ~3 logical layers: 1) a client-request-terminating 'hot' cache distributed across all nodes in the POP, 2) a shared POP cache segmented by content/resource ID, and 3) a shared origin-request-facing egress layer. Every layer would attempt to perform request coalescing, with 90% efficacy or more. E.g., 10 client requests to the same layer 1 node _should_ only generate a single request to the segmented layer 2 cache. The same layer 2 node would be serving multiple requests to the layer 1 nodes, while making a single request back towards the origin.
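Request coalescing in miniature, as a sketch (a hand-rolled in-flight map, not what the actual cache nodes run):

    // Concurrent requests for the same key share one in-flight origin fetch
    // instead of each going back to the origin on their own.
    const inFlight = new Map<string, Promise<ArrayBuffer>>();

    async function fetchCoalesced(key: string, url: string): Promise<ArrayBuffer> {
      const existing = inFlight.get(key);
      if (existing) return existing; // piggyback on the request already in flight

      const p = (async () => {
        try {
          const res = await fetch(url);
          return await res.arrayBuffer();
        } finally {
          inFlight.delete(key); // allow a fresh fetch once this one settles
        }
      })();

      inFlight.set(key, p);
      return p;
    }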
Some exceptional behavior occurred, or was driven by, 'load' and trying to account for 1) head-of-line blocking and 2) tail latencies from unequal load distribution. Based on the load for an object, or a node's current total load, we used forward signaling to distribute requests to peers. That is, a 'busy' layer 2 node would signal to the layer 1 nodes to use additional/alternate peers. This increased the number of copies of a popular object in the segmented cache, increasing the total throughput available to populate the 'hot' L1 cache nodes _or_ to serve objects that were not consistently popular enough to stay in that distributed L1 cache. And relevant to your example, we had similar problems when going back to the origin: in the first case we want to minimize the number of new TCP/TLS connections, which have a large RTT setup penalty, by reusing active & idle 'layer 3' connections to the origin. This, however, introduces hotspots and head-of-line blocking on those active origin connections. That, again, based on 'load', would be forward-signaled so that additional layer 3 nodes/processes would be used to fetch _additional_ origin content.
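The "use more peers when an object is hot" part, sketched very loosely (peer names, hashing, and the signaling are all simplified stand-ins):

    import { createHash } from "node:crypto";

    // A layer 1 node normally hashes an object ID to one layer 2 peer; a
    // "busy" forward signal widens the fan-out so extra copies absorb the load.
    const layer2Peers = ["l2-a", "l2-b", "l2-c", "l2-d", "l2-e", "l2-f"];
    const replicaCount = new Map<string, number>(); // objectId -> copies in use

    function hashToIndex(s: string, mod: number): number {
      return createHash("sha1").update(s).digest().readUInt32BE(0) % mod;
    }

    function pickLayer2Peer(objectId: string): string {
      const replicas = replicaCount.get(objectId) ?? 1;
      const base = hashToIndex(objectId, layer2Peers.length);
      const offset = Math.floor(Math.random() * replicas); // spread across the copies
      return layer2Peers[(base + offset) % layer2Peers.length];
    }

    function onBusySignal(objectId: string): void {
      const current = replicaCount.get(objectId) ?? 1;
      replicaCount.set(objectId, Math.min(current + 1, layer2Peers.length));
    }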
Normally this all means 1 origin request can serve a few orders of magnitude more concurrent client requests. For very large content, or exceedingly large client numbers, you'd see the CDN 'scale out' on concurrency in an effort to minimize blocking and maximize throughput in the system.
> One CDN vendor didn't even offer a tiered distribution system so every edge called home at the same time, another vendor did have a tiered distribution system designed to avoid this problem
See above on request coalescing. In the vast, vast majority of cases it was effective in reducing the problem by a few orders of magnitude; AFAIK every CDN does/did that. _In addition_ we did have a distributed hierarchical system for caching between edge POPs and origins, _but_ it was non-public/invite-only/managed by us for a long time. The reason being that the _vast_ majority of customers incurred additional latency (& cost to us) without meaningful benefit from this intermediate cache layer.
> The interesting thing was that no vendor we evaluated offered a robust preheating solution if they offered one at all.
This is interesting to me. AFAIK Akamai NetStorage was sold to solve the origin distribution angle, _and_ drove something like 50% of the revenue from large-object distribution customers. For us, the customer use case of 'prefetch' was a perennial 'top 5' but never one that would drive revenue, and it conflicted with other system tenets.
> One vendor even went so far as to say that they wouldn't allow it because it would let customers unfairly dominate the shared storage cache at the edge
That could have been us. And yes, a huge problem is that you're fundamentally asking for control over a shared resource so that you can bias performance toward _your content_ at the expense of _all other customers_. Even without intentional 'prefetch' control, we still had some customers with pseudo-degenerate access patterns that might consume 25-50% of the shared cache space in a POP. We did build shared quotas and such, but (when I was there) we couldn't see a way to align the pricing & incentives to confidently expose that to customers. It also felt very, very bad to tell a customer 'pay us $$$ to care about your bits' when that was our entire job, and what we were doing to the best extent possible already.
> we cap the connection counts to the back end and return 503s when we saturate.
Depending on the CDN, you may be able to use `max-age` or `s-maxage` to implement pseudo-backoff from the CDN. For us, at least, those 'negative hits' would be cached with a short (seconds by default) TTL to act as a dampener in failure scenarios. Ensure that your client can handle/recover from the 503 as well; I'd expect the CDN to return those all the way through in the response.
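As a sketch, the origin side might look something like this (header values are illustrative, and how cached error responses are treated varies by CDN, so check yours):

    import http from "node:http";

    // When shedding load, attach a short s-maxage so the CDN can cache the 503
    // for a few seconds and dampen the retry storm instead of forwarding every
    // client retry to the origin.
    function overloaded(): boolean {
      // placeholder for the real saturation check (connection counts, queue depth, ...)
      return false;
    }

    const server = http.createServer((req, res) => {
      if (overloaded()) {
        res.writeHead(503, {
          "Cache-Control": "public, s-maxage=5", // CDN-side TTL for the negative hit
          "Retry-After": "5",                    // hint for well-behaved clients
        });
        res.end();
        return;
      }
      res.writeHead(200, { "Cache-Control": "public, max-age=60" });
      res.end("asset bytes\n");
    });

    server.listen(8080);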
donavanm 6 days ago
I should also give a sense of scale here. Hundreds of GB/s to multi-TB/s of throughput for a single customer was pretty normal a decade ago. CDNs, classically, are also biased towards latency & throughput. Once you have millions of client requests per second and are pushing that kind of volume, it's kind of expected/implied that the origin is capable of meeting the demand necessary to maximize that throughput.
While cost-efficiency-maximizing CDNs _were_ a thing, they kind of died out with Red Swoosh AFAIK. We repeatedly investigated 'follow the moon' use cases to take advantage of the diurnal cycle. Outside of a handful of game companies there wasn't any real interest, and the price/revenue wasn't worth investing in compared to other priorities. The market wanted better performance, not minimal costs, in the 2000-10s.
gsck 6 days ago
But you close Warframe after the red text, the game updates pretty fast, even if it's a massive update like 1999 was, and then you're back in the game (unless you say yes to optimising the download cache, which takes an absolute age for some reason, plsfix). Definitely a pretty amazing engineering achievement.
robertlagrant 6 days ago
Animats 6 days ago
The shortest term effects are power supplies recharging their capacitors and incandescent bulbs warming up. That's over within a second.
Then it's the motors, which have 2x-3x their running load when starting as they bring their rotating mass up to speed. That extra load lasts for tens of seconds.
If power has been off for more than a few minutes, everything in heating and cooling which normally cycles on and off will want to start. That high load lasts for minutes.
Bringing up a power grid is thus done by sections, not all at once.
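A back-of-envelope illustration of why that forces sectioned restoration; every number below is invented except the 2x-3x starting multiplier:

    // Hypothetical feeder: normally 100 MW, half of it motors/compressors.
    const normalLoadMW = 100;
    const motorShare = 0.5;
    const startMultiplier = 2.5; // within the 2x-3x range for starting motors

    const motorStartMW = normalLoadMW * motorShare * startMultiplier; // 125 MW
    const otherLoadMW = normalLoadMW * (1 - motorShare);              // 50 MW

    // After a long outage everything wants to start at once, so the transient
    // demand far exceeds what the feeder normally carries:
    console.log(`cold pickup ~${motorStartMW + otherLoadMW} MW vs ${normalLoadMW} MW normal`);
    // -> "cold pickup ~175 MW vs 100 MW normal", hence energizing section by section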
EvanAnderson 6 days ago
ElusiveA 6 days ago
Other electrical devices such as transformers and long overhead power lines also exhibit inrush when they are energised.
_heimdall 6 days ago
Our road used to have a handful of houses on it but now has around 85 (a mix of smaller lots around an acre and larger farming parcels). Power infrastructure to our street hasn't been updated recently and it just barely keeps up.
We had a few days that didn't get above freezing (very unusual here). Power was out for about 6 hours after a limb fell on a line. The power company was actually pretty quick to fix it, but the power went out 3 more times in pretty quick succession.
Apparently a breaker kept blowing as every house regained power and all the various compressors surged on. The solution at the time was for them to jam in a larger breaker. I hope they came back pretty quickly to undo that "fix", but we still haven't had any infrastructure updates to increase capacity.
alvah 6 days ago
I've seen some cowboy sh!t in my time but jeez, that's rough.
cr125rider 6 days ago
cudgy 6 days ago
corint 5 days ago
emmanueloga_ 6 days ago
* "We're adding timeouts to prevent user requests from waiting excessively long to retrieve assets."
When you get to the size of Canva, you can't forget your AbortController and exponential backoff on your Fetch API calls.
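Something along these lines, as a sketch (not Canva's actual code):

    // fetch with a per-attempt timeout (AbortController) and exponential
    // backoff with jitter so retries don't re-synchronize into a new herd.
    async function fetchWithRetry(
      url: string,
      attempts = 4,
      timeoutMs = 5_000,
    ): Promise<Response> {
      for (let attempt = 0; attempt < attempts; attempt++) {
        const controller = new AbortController();
        const timer = setTimeout(() => controller.abort(), timeoutMs);
        try {
          const res = await fetch(url, { signal: controller.signal });
          if (res.ok) return res;
          if (res.status < 500 && res.status !== 429) return res; // don't retry client errors
        } catch {
          // timed out or network error; fall through to backoff and retry
        } finally {
          clearTimeout(timer);
        }
        const delay = Math.min(1_000 * 2 ** attempt, 30_000) * (0.5 + Math.random());
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
      throw new Error(`giving up on ${url}`);
    }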
--
0: https://www.canva.dev/blog/engineering/canva-incident-report...
benatkin 6 days ago