157 points by cpldcpu 4 days ago | 20 comments
Someone 3 days ago
It’s not completely unavoidable: don’t use function parameters (globals are your friends on these CPUs). You can’t avoid having a return stack, but you can make as few function calls as possible (ideally zero, but you may have to write functions to fit things into ROM)
> “To solve this, I flattened the inference code”
I think that’s “make as few function calls as possible”
> and implemented the inner loop in assembly to optimize variable usage.
That _should_ only make a difference for memory usage if your C compiler isn’t perfect (but of course, it never is, certainly on CPUs like this one, which is a poor fit for C)
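A minimal sketch of the "globals, no parameters, no helper calls" style described above. This is not the article's code: the layer sizes, the weight values, and the names `weights`, `input`, `output`, and `infer` are all made up for illustration.

```c
#include <stdint.h>

/* Hypothetical sizes for illustration; not the model from the article. */
#define N_IN  4
#define N_OUT 2

/* Weights are constant, so they can live in ROM. Values are made up. */
static const int8_t weights[N_OUT][N_IN] = {
    { 1, -1,  2, 0 },
    { 0,  3, -2, 1 },
};

/* Inputs and outputs are globals, so nothing is passed as an argument. */
uint8_t input[N_IN];
int16_t output[N_OUT];

/* One flat routine: no arguments, no return value, no nested calls. */
void infer(void)
{
    uint8_t o, i;
    for (o = 0; o < N_OUT; o++) {
        int16_t acc = 0;
        for (i = 0; i < N_IN; i++) {
            acc += (int16_t)weights[o][i] * input[i];
        }
        output[o] = acc < 0 ? 0 : acc;  /* ReLU */
    }
}
```

The caller fills `input`, calls `infer()`, and reads `output`. The only stack use is the single return address for the call.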
cpldcpu 3 days ago
Considering that the PMS150C has an accumulator-based 8-bit architecture that is almost hostile to C, it is safe to assume that the compiler is not perfect :)
whobre 3 days ago
dpassens 3 days ago
ska 3 days ago
varispeed 3 days ago
Lerc 3 days ago
I wonder if you could do it on the full 28*28 by never holding the full image in memory at once, just treating it as an input stream. Say, a 1D convolution on each line as it comes in, turning a [1,28] line into [3,7], and buffer two lines of the [3,7] = 42. Then, once three results of the third line's convolution have been produced ([3,3] = 9), start performing a 2D convolution using the first two lines [2,3,:3], replacing the data at the start (as it has already been processed).
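A rough sketch of only the first stage of this idea: each incoming 28-pixel line is reduced to a [3,7] block, and only two such blocks are kept. The kernel width and stride of 4 are assumptions (chosen because 28 / 4 = 7), the kernel values are made up, and `push_line` and `line_buf` are hypothetical names. The values are kept as 16-bit here for simplicity; a real version on a device with this little RAM would have to quantize them to fit.

```c
#include <stdint.h>

#define ROW_W 28  /* pixels per input line */
#define N_CH  3   /* output channels of the 1D convolution */
#define OUT_W 7   /* outputs per line: 28 / 4 */
#define K     4   /* assumed kernel width and stride */

/* Hypothetical kernel weights, one row per channel. */
static const int8_t kernel[N_CH][K] = {
    {  1, 1, 1,  1 },
    {  1, 0, 0, -1 },
    { -1, 1, 1, -1 },
};

/* Two buffered lines of [3,7] results: 2 * 3 * 7 = 42 values. */
int16_t line_buf[2][N_CH][OUT_W];

/* Convolve one incoming line into slot (row_index % 2) of the buffer,
 * overwriting the older of the two stored lines. */
void push_line(const uint8_t *row, uint8_t row_index)
{
    uint8_t slot = row_index & 1;
    uint8_t c, x, k;
    for (c = 0; c < N_CH; c++) {
        for (x = 0; x < OUT_W; x++) {
            int16_t acc = 0;
            for (k = 0; k < K; k++) {
                acc += (int16_t)kernel[c][k] * row[x * K + k];
            }
            line_buf[slot][c][x] = acc;
        }
    }
}
```

The second stage (the 2D convolution over the buffered lines) is left out, since the comment does not pin down its kernel shape.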
cpldcpu 3 days ago
In this case I am streaming from ROM anyway, so it does not matter whether the inputs are read only once or multiple times.
malwrar 4 days ago
I wish tfa would have found some way to measure the PMS150C implementation the headline brags about, but even the PFS154 (2x mem, 3x price) version is super neat! Interesting to see how the net in particular is built at such small scale. I also wish they included numbers about performance like they do in their linked CH32V003 post. I'm wondering how quick these MCUs are compared to each other and e.g. OP's PC, and how hot they get under sustained load.
cpldcpu 3 days ago
But it is easily possible to estimate the execution time:
- mulacc of one weight takes 11 clock cycles.
- There are 1696 weights in the model, each one is only touched once.
- We can assume ~25%-50% overhead for loops and housekeeping (1:4 unrolled)
=> ~23000-28000 clock cycles per inference, which is less than 2ms at 16MHz
Since this is an MLP, the inference time directly scales with the number of weights. (This would be different for a CNN)
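The estimate above can be reproduced with a few lines of integer arithmetic. The function names `estimate_cycles` and `cycles_to_us` are hypothetical; the inputs (1696 weights, 11 cycles per multiply-accumulate, 25% to 50% overhead, 16 MHz) are the figures quoted in this comment.

```c
#include <stdint.h>

/* Cycles for n_weights multiply-accumulates plus a percentage of overhead. */
uint32_t estimate_cycles(uint32_t n_weights, uint32_t cycles_per_mac,
                         uint32_t overhead_pct)
{
    uint32_t base = n_weights * cycles_per_mac;
    return base + base * overhead_pct / 100;
}

/* Convert a cycle count to whole microseconds at a clock given in MHz. */
uint32_t cycles_to_us(uint32_t cycles, uint32_t clock_mhz)
{
    return cycles / clock_mhz;
}
```

With these inputs, 1696 * 11 = 18656 base cycles, which gives 23320 cycles at 25% overhead and 27984 cycles at 50% overhead, or roughly 1.46 ms to 1.75 ms at 16 MHz. Because the cost is linear in `n_weights`, doubling the weight count doubles the estimate.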
As for verifying on the PMS150C: I considered using an LED for valid/invalid output. But iterating with OTP devices is quite tedious when you do not have an emulator. Since both devices are code-compatible, we can assume that the code works on the smaller device, though.
wongarsu 3 days ago
Though I believe for most people "roughly 2ms" is good enough
pjmlp 3 days ago
However for going into production with something like this, maybe writing everything in Assembly, and not just some parts, would be much better.
But after a quick search it seems the macro assembler story for RISC-V isn't that great.
kragen 3 days ago