104 points by sestep 4 days ago | 58 comments
andersa 19 hours ago
A more interesting example: if each slot in the target array holds the next index to go to in addition to the value, you introduce a dependency chain that prevents this from happening.
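A minimal C sketch of that dependency chain (the names are illustrative, not from the article):

    /* Each slot stores the next index to visit, so the address of load N+1
       is only known once load N has completed; the misses cannot overlap. */
    #include <stddef.h>

    size_t chase(const size_t *next, size_t start, size_t steps) {
        size_t i = start;
        for (size_t s = 0; s < steps; s++) {
            i = next[i];   /* load-to-load dependency */
        }
        return i;          /* return the final index so the loop isn't optimized away */
    }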
wtallis 16 hours ago
However, on some processors there's a data-dependent prefetcher that will notice the pointer-like value and start prefetching that address before the CPU requests it.
less_less 10 hours ago
At this point I'm kinda expecting CPU vendors to stop putting as many Spectre mitigations in the main core, and just have a small crypto core with full-fat arithmetic, less hardware for memory access, less speculation, and careful side-channel hardening. You still have to block Meltdown and other large vulnerabilities on the main cores, but if someone wants to protect elliptic curves from weird attacks? Try to set the DIT bit, trap into the OS, and get sent to the hardened core.
gpderetta 9 hours ago
deepsun 11 hours ago
wtallis 11 hours ago
hansvm 15 hours ago
elcritch 5 hours ago
Well, in a language that allows you to generate compile-time specialized serde code, like Nim or Zig. Maybe C++ has enough compile-time reflection to do it now as well?
I don't know enough about Rust's serde, but it seems like there'd be a lot of performance overhead with its design and the limits of Rust's macro and compiler system.
Retr0id 5 hours ago
*maddeningly, with mutually incompatible sorting rules.
sestep 3 hours ago
jltsiren 12 hours ago
You can avoid this with two arrays. One contains random query positions, and the target array is also filled with random values. The next index is then a function of the next query position and the previous value read from the target array.
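A sketch of that two-array scheme (names hypothetical): the visited index mixes the next query position with the previously loaded value, so each address still depends on the prior load.

    #include <stdint.h>
    #include <stddef.h>

    /* `queries` holds random positions, `target` holds random values. */
    uint64_t probe(const uint64_t *target, size_t target_len,
                   const size_t *queries, size_t n_queries) {
        uint64_t prev = 0;
        for (size_t q = 0; q < n_queries; q++) {
            size_t idx = (queries[q] ^ prev) % target_len; /* depends on prior load */
            prev = target[idx];
        }
        return prev;
    }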
eru 10 hours ago
Eg start with every element referencing the next element (i.e. i+1 with wrap-around), and then use a random shuffle. That way, you preserve the full cycle.
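One way to realize that (a sketch; `p` is assumed to already hold a random permutation of 0..n-1, e.g. from a Fisher-Yates shuffle): relabel the trivial cycle i -> i+1 (mod n) through the permutation, which keeps a single n-element cycle while randomizing the visiting order.

    #include <stddef.h>

    /* next[p[i]] = p[(i+1) % n]: a conjugated copy of the full cycle. */
    void build_cycle(size_t *next, const size_t *p, size_t n) {
        for (size_t i = 0; i < n; i++) {
            next[p[i]] = p[(i + 1) % n];
        }
    }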
jltsiren 10 hours ago
eru 7 hours ago
Basically, pick a large prime number p as the size of your array and a number 0 < x < p. Then visit your array in the order of (i*x) modulo p.
You can also do something with (x^i) modulo p, if your processor is smart enough to figure out your additive pattern.
Basically, the idea is to look into the same theory they use to produce PRNGs with long cycles.
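A sketch of the (i*x) mod p order under the stated assumptions (p is the prime array length, 0 < x < p), computed incrementally to avoid a wide multiply:

    #include <stdint.h>
    #include <stddef.h>

    uint64_t strided_sum(const uint64_t *a, size_t p /* prime length */,
                         size_t x /* 0 < x < p */) {
        uint64_t sum = 0;
        size_t idx = 0;
        for (size_t i = 0; i < p; i++) {
            idx = (idx + x) % p;   /* incremental form of (i*x) mod p */
            sum += a[idx];         /* visits every slot exactly once */
        }
        return sum;
    }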
jiggawatts 18 hours ago
Another thing I noticed is that the spike on the left hand side of his graphs is the overhead of file access.
Without this overhead, small array random access should have a lot better per-element cost.
sestep 2 hours ago
delusional 13 hours ago
Does x86-64 actually do this data-dependent single-deref prefetch? Because in that case I have some design assumptions I have to reevaluate.
alain94040 2 hours ago
CPU implementations have become too complex to grasp. The only sure way to know how a CPU will behave for a given workload is to run the workload. It's good to have some basic expectations of performance (instructions per cycle, memory bandwidth) to detect if something is off. I guess I'm trying to say it's hard to keep in your head all the details of what ~1B transistors are doing together to run your code. It's just too big.
phi-go 12 hours ago
shakna 11 hours ago
delusional 11 hours ago
wtallis 11 hours ago
kortilla 11 hours ago
cb321 10 hours ago
This is a very small Nim program to demonstrate for "show me the code" and "it must just not be 'random enough'!" skeptics: https://github.com/c-blake/bu/blob/main/memlat.nim It uses the exact dependency idea @andersa mentions, a random cycle of `x[i] = i`, which others elsewhere in this thread say some CPUs these days are smart enough to "see through". On the Intel CPUs I have, the dependency makes things 12x slower at the gigabyte scale (DIMMs).
EDIT: This same effect makes many a naive hash table microbenchmark { e.g., `for key in keys: lookup(key)` } unrepresentative of performance in real programs where each `key` is often not speculatively pre-computable.
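To illustrate (with a direct-mapped array standing in for a real hash table; the names and mixing constant are just placeholders):

    #include <stdint.h>
    #include <stddef.h>

    static inline uint32_t lookup(const uint32_t *slots, size_t nslots, uint64_t key) {
        return slots[(key * 0x9E3779B97F4A7C15ull) % nslots];  /* stand-in probe */
    }

    /* Naive loop: every key is known up front, so the CPU can overlap many
       probes and the benchmark looks unrealistically fast. */
    uint64_t naive_bench(const uint32_t *slots, size_t nslots,
                         const uint64_t *keys, size_t nkeys) {
        uint64_t acc = 0;
        for (size_t i = 0; i < nkeys; i++)
            acc += lookup(slots, nslots, keys[i]);
        return acc;
    }

    /* Dependent loop: (prev >> 32) is always zero for 32-bit slot values, so
       the key is unchanged, but the probe address now depends on the previous
       lookup's result, serializing the misses like chained lookups in a real
       program. */
    uint64_t dependent_bench(const uint32_t *slots, size_t nslots,
                             const uint64_t *keys, size_t nkeys) {
        uint64_t acc = 0, prev = 0;
        for (size_t i = 0; i < nkeys; i++) {
            prev = lookup(slots, nslots, keys[i] ^ (prev >> 32));
            acc += prev;
        }
        return acc;
    }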
gpderetta 9 hours ago
Traversing a contiguous list of pointers in L1 is also slower than accessing those pointers by generating their addresses sequentially, so adding a load-load dependency is not a good way to benchmark random access vs sequential access (it is a good way to benchmark vector traversal vs list traversal, of course).
At the end of the day you have to accept that, just as caching and prefetching speed up sequential access, OoO execution[1] will speed up (to a lesser extent) random access. Instead of memory latency, in this case the bottleneck would be the OoO queue depth, or more likely the maximum number of outstanding L1/L2/L3 (and potentially TLB) misses. As long as the maximum number of outstanding misses is lower than the memory latency for that cache level, then, to a first approximation, the CPU can effectively hide the sequential vs random access cost for independent accesses.
Benchmarking is hard. Making sure that a microbenchmark represents your load effectively, doubly so.
[1] Even many in-order CPUs have some run-ahead capabilities for memory.
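For contrast with the pointer-chase loops above, a sketch of the independent-access pattern being described, where the indices are precomputed and an out-of-order core can keep many misses in flight:

    #include <stdint.h>
    #include <stddef.h>

    uint64_t independent_random(const uint64_t *a, const size_t *idx, size_t n) {
        uint64_t sum = 0;
        for (size_t i = 0; i < n; i++) {
            sum += a[idx[i]];   /* addresses known up front: misses overlap */
        }
        return sum;
    }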
cb321 9 hours ago
I still suspect @kortilla is one of today's lucky 10,000 (https://xkcd.com/1053/) or just read/replied too quickly. :-)
There is a lot written that indicates that the complexity of modern CPUs is poorly disseminated. But there is also wonderful stuff like https://gamozolabs.github.io/metrology/2019/08/19/sushi_roll... { To add a couple of links to underwrite my reply, in full agreement. :-) }
porcoda 18 hours ago
jandrewrogers 16 hours ago
porcoda 15 hours ago
jandrewrogers 13 hours ago
There were proofs of concept by 2010 that the latency-hiding mechanics could be implemented on CPUs in software, which while not as efficient had the advantage of cost and performance, which was a death knell for the MTA. A few attempts to revive that style of architecture have come and gone. It is very difficult to compete with the economics of mass-scale commodity silicon.
I hold out hope that a modern barrel processor will become available at some point but I’m not sanguine about it.
JonChesterfield 13 hours ago
One hint in the same article that random access is not cheap, in contrast with the conclusion, was noticing that the shuffle was unacceptably slow on large data sets.
Still, good to see performance measurements, especially where the curves look roughly like you'd hope them to.
sestep 2 hours ago
archi42 8 hours ago
This means: The system likely uses 3x 8 GB modules. As a result, one channel has two modules with 16 GB total, while the other channel has only a single 8 GB module.
Not sure how big this impact is with the given memory access patterns and assuming [mostly] exclusive single-threaded access. It's just something I noted, and could be a source of unexpected artifacts.
sestep 4 hours ago
CodesInChaos 5 hours ago
Animats 13 hours ago
tiluha 3 hours ago
beng-nl 2 hours ago
And of course the headline of getting a CPU you can't fully use.
Adhyyan1252 20 hours ago
Nevermark 14 hours ago
o11c 13 hours ago
sestep 2 hours ago
anonymousDan 9 hours ago
sestep 2 hours ago
b0a04gl 12 hours ago
FpUser 16 hours ago
Andys 16 hours ago
Presumably if you split the elements into 16 shares (one for each CPU), summed them with 16 threads, and then summed the lot at the end, random would be faster than sorted?
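A rough sketch of that split, using OpenMP as one convenient way to express it (the thread count, e.g. 16, is a run-time choice): each thread sums a contiguous share and the reduction combines the partial sums at the end.

    #include <stdint.h>
    #include <stddef.h>

    uint64_t parallel_sum(const uint32_t *a, size_t n) {
        uint64_t sum = 0;
        /* static schedule splits the array into one contiguous share per thread */
        #pragma omp parallel for reduction(+:sum) schedule(static)
        for (size_t i = 0; i < n; i++) {
            sum += a[i];
        }
        return sum;
    }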
bee_rider 13 hours ago
Although, it looks like that chip has a 1MB L2 cache for each core. If these are 4 Bytes ints, then I guess they won’t all fit in one core’s L2, but maybe they can all start out in their respective cores’ L2 if it is parallelized (well, depends on how you set it up).
Maybe it will be closer. Contiguous should still win.
Surac 12 hours ago
forrestthewoods 19 hours ago
https://www.forrestthewoods.com/blog/memory-bandwidth-napkin...
I’m not sure I agree with the data presentation format. “time per element” doesn’t seem like the right metric.
sestep 2 hours ago
The reason I chose "time per element" follows from that different goal: I was comparing across vastly different array sizes, so no other metric I could think of would have really worked for the charts I was drawing.
klank 17 hours ago
Using something like the overall run time would have such large variations that only the shape of the graph would be particularly useful (to me), not so much the values themselves.
If I were showing a chart like this to "leadership", I'd show the overall run time, as I'd care more about them grasping the "real world" impact than the per-unit impact. But this is written for engineers, so I'd expect a blog like this to focus on per-unit impacts.
However, having said all that, I'd love to hear what your reservations are about using it as a metric.
forrestthewoods 13 hours ago
Perhaps it’s a long time inspiration from this post: https://randomascii.wordpress.com/2018/02/04/what-we-talk-ab...
I also just don't know what to do with "1 ns per element". The scale of 1 to 4 ns per element is remarkably imprecise. Discussing 250 million to 1 billion elements per second feels like a much wider range, even if it's mathematically identical.
Your graphs have a few odd spikes that weren't deeply discussed. If it's under 2 ns per element, who cares!
The logarithmic scale also made it really hard to interpret. Should have drawn clearer lines at the L1/L2/L3/RAM limits.
On skim I don’t think there’s anything wrong. But as presented it’s a little hard for me as an engineer to extract lessons or use this information for good (or evil).
There shouldn’t be a Linux vs Mac issue. Ignoring mmap this should be HW.
I dunno. Those are all just surface level reactions.
sestep 2 hours ago
Agreed that the odd spikes don't matter; that's why I didn't bother discussing them. I was more interested in the data after the array got large enough that random access was actually slower. It looked like all those weird spikes were for arrays small enough to fit in cache anyway.
I agree that it could have been helpful if I'd drawn lines at L1/L2/L3/RAM limits, but I didn't do that because I don't think it's entirely clear where those lines should have been drawn. Specifically because there are two arrays. Should the line show just where the floating-point array is small enough to fit in cache, or where both arrays together are?
Not sure I quite follow what you're saying about mmap on Linux vs Mac; only one of the three sets of experiments used mmap, and the third was explicitly to try to tease out that effect. Especially for the first experiment, I agree that there should be no difference for arrays small enough to fit in RAM, since the whole file gets read into memory first.
alain94040 17 hours ago
> Random access from the cache is remarkably quick. It's comparable to sequential RAM performance
That's actually expected once you think about it; it's a natural consequence of prefetching.
forrestthewoods 11 hours ago
Lots of things are expected when you deeply understand a complex system and think about it. But, like, not everyone knows the system that deeply nor have they thought about it!
delusional 13 hours ago
petermcneeley 15 hours ago