42 points by Fake4d 1 week ago | 61 comments
venning 4 days ago
So `n % 2 == 1` should probably [1] be replaced with `n % 2 != 0`.
While this may be obvious with experience, if the code says `n % 2 == 0`, then a future developer who is trying to reverse the operation for some reason must know that they need to change the equality operator, not the right operand. Whereas, with `(n & 1) == 0`, they can change either safely and get the same result.
This feels problematic because the business logic that necessitated the change may be "do this when odd" and it may feel incorrect to implement "don't do this when even".
I really disfavor writing code that could be easily misinterpreted and modified in future by less-experienced developers; or maybe just someone (me) who's tired or rushing. For that reason, and the performance one, I try to stick to the bitwise operator.
[1] Of course, if for some reason you wanted to test for only positive odd numbers, you could use `n % 2 == 1`, but please write a comment noting that you're being clever.
userbinator 4 days ago
That's their problem. Otherwise you're just contributing to the decline.
notfish 4 days ago
Edit: apparently JS, Java, and C all do this (a negative dividend gives a negative remainder, so `-3 % 2 == -1`). That's horrifying
seritools 3 days ago
https://stackoverflow.com/questions/13683563/whats-the-diffe...
arcastroe 1 day ago
Think of a clock which is a ring of size 12. In a clock, going backwards 15 hours (-15) is the same as going backwards 3 hours (-3) which is the same as going forwards 9 hours.
-15 ≡ -3 ≡ 9 (mod 12)
tux3 1 week ago
For instance if you're making a loop to count the bits that are set in a number, the compiler can recognize the entire loop and turn it into a single popcnt instruction (e.g. https://lemire.me/blog/2016/05/23/the-surprising-cleverness-... )
cperciva 4 days ago
To elaborate a bit on the specialness of popcount: It is a generally accepted belief in the computer architecture community that several systems included a popcount iNstruction Solely due to A request from a single "very good customer".
dietr1ch 5 days ago
Look at this --beauty-- eww, thing. Should compilers really spend time trying to figure out how to optimise insane code?
def is_even(n):
    return str(n)[len(str(n))-1] in [str(2*n) for n in range(5)]
refulgentis 5 days ago
I could see that as a novel feedback mechanism for software engineers.
As it stands, I'm glad they design optimizations abstractly, even if that means code I don't like gets the benefits
dietr1ch 5 days ago
refulgentis 4 days ago
tl;dr: there are general optimizations for "this function in a for loop is a constant expression, we don't need to call it 500 times"
or
"this obscure combination of asm instructions is optimal on pentium iii 350 mhz dual core"
not "we need to turn this unholy CS101 student spaghetti code where they do a 500 branch-if into a for loop"
comment over here is attempting to communicate that as well https://news.ycombinator.com/item?id=42705758
I've never, ever, heard the idea that compilers are burdened by the workload of maintaining thousands of type-specific optimizations for hilariously bad code, until today. I've been here since 2009, so it is puzzling to me to see it referred to off hand, in a "this is water" manner https://en.wikipedia.org/wiki/This_Is_Water
dietr1ch 4 days ago
I've heard tons of people complain about slow compilers, so even if compiler devs find it easy to architect their compilers to do multiple kinds of optimisations, there's a cost to it that the devs running those compilers pay.
Also, if you think about it, optimising code has to follow diminishing returns: at some point we're putting in too much CPU time for little to no gain, and with more optimisations it's even possible to get slower code if they interact poorly, or at least no better code for the extra CPU time spent. This is why gcc has -O3 and it's not the default; there's a cost to it that's likely not worth paying.
refulgentis 4 days ago
A slow compiler does not imply the compiler is slow because there are thousands of bespoke optimizations for nonsense code being run
> Also, if you think about it, optimising code has to follow diminishing returns,
Nope, trivially. Though, I'm always eager for a Fermat-style marvelous proof that may have been too big for the initial margin you had. :)
Take a classic case of a buggy compiler generating O(n²) temporary copies due to missed alias analysis. One optimization pass to fix that analysis transforms it to O(n).
> at some point we are putting too much CPU time into little to no gains
It is theoretically possible to design a compiler such that the total time it spends looking for optimizations is greater than the runtime of the program it is optimizing.
For example, an optimizer that is a while loop that checks if the function returns 42, but the function returns 43.
I'm not sure what light that sheds.
I'm not sure that implies that compilers have tons of bespoke optimizations for hand-transforming specific instances of absurd string code.
If they do, I would be additionally surprised because I have never observed that. What I have observed is compilers, universally, optimize code structures of a certain general form
> This is why there's -O3 in gcc and it's not the default, there's a cost to it that's likely not worth paying.
The existence of an argument with a higher processing level than the default does not imply the compiler is slow because there are thousands of bespoke optimizations for nonsense code being run. (n.b. -O3 is understood, in practice, to be risky because it might be too aggressive, not that it might not be worth it)
matt_daemon 5 days ago
def is_even(n):
    return str(n)[-1] in "02468"
dansalvato 5 days ago
gcc likes to use `and edi,1` (logical AND between 32-bit edi register and 1). Meanwhile, clang uses `test dil,1` which is similar, except the result isn't stored back in the register, which isn't relevant in my test case (it could be relevant if you want to return an integer value based on the results of the test).
After the logical AND happens, the CPU's ZF (zero) flag is set if the result is zero, and cleared if the result is not zero. You'd then use `jne` (jump if not equal) or maybe `cmovne` (conditional move - move register if not equal). Note again that there is no explicit comparison instruction. If you don't use O3, the compiler does produce an explicit `cmp` instruction, but it's redundant.
Now, the question is: Which is more efficient, gcc's `and edi,1` or clang's `test dil,1`? The `dil` register was added for x64; it's the same register as `edi` but only the lower 8 bits. I figured `dil` would be more efficient for this reason, because the `1` operand is implied to be 8 bits and not 32 bits. However, `and edi,1` encodes to 3 bytes while `test dil,1` encodes to 4 bytes. I guess the `and` instruction lets you specify the bit size of the operand regardless of the register size.
There is one more option, which neither compiler used: `shr edi,1` will perform a right shift on EDI, which sets the CF (carry) flag if a 1 is shifted out. That instruction only encodes to 2 bytes, so size-wise it's the most efficient.
The right-shift option fascinates me, because I don't think there's really a C representation of "get the bit that was right-shifted out". Both gcc and clang compile `(i >> 1) << 1 == i` the same as `(i & 1) == 0` and `i % 2 == 0`.
Which of the above is most efficient on CPU cycles? Who knows, there are too many layers of abstraction nowadays to have a definitive answer without benchmarking for a specific use case.
I code a lot of Motorola 68000 assembly. On m68k, shifting right by 1 and performing a logical AND both take 8 CPU cycles. But the right-shift is 2 bytes smaller, because it doesn't need an extra 16 bits for the operand. That makes a difference on Amiga, because (other than size) the DMA might be shared with other chips, so you're saving yourself a memory read that could stall the CPU while it's waiting its turn. Therefore, at least on m68k, shifting right is the fastest way to test if a value is even.
userbinator 4 days ago
In isolation it's the smallest, but it's no longer the smallest if you consider that the value, which in this example is the loop counter, needs to be preserved, meaning you'll need at least 2 bytes for another mov to make a copy. With test, the value doesn't get modified.
dansalvato 4 days ago
amiga386 4 days ago
There's also BTST #0,xx but it wastefully needs an extra 16 bits to say which bit to test (even though the bit can only be from 0-31)
> That makes a difference on Amiga, because (other than size) the DMA might be shared with other chips, so you're saving yourself a memory read that could stall the CPU while it's waiting its turn.
That's a load-bearing "could". If the 68000 has to read/write chip RAM, it gets the even cycles while the custom chips get odd cycles, so it doesn't even notice (unless you're doing something that steals even cycles from the CPU, e.g. the blitter is active and you set BLTPRI, or you have 5+ bitplanes in lowres or 3+ bitplanes in highres)
dansalvato 4 days ago
That reminds me, it's theoretically fastest to do `and d1,d0` e.g. in a loop if d1 is pre-loaded with the value (4 cycles and 1 read). `btst d1,d0` is 6 cycles and 1 read.
> the blitter is active and you set BLTPRI
I thought BLTPRI enabled meant the blitter takes every even DMA cycle it needs, and when disabled it gives the CPU 1 in every 4 even DMA cycles. But yes, I'm splitting hairs a bit when it comes to DMA performance because I code game/demo stuff targeting stock A500, meaning one of those cases (blitter running or 5+ bitplanes enabled) is very likely to be true.
amiga386 4 days ago
That's true, although I'd add that ASR/AND are destructive while BTST would be nondestructive, but we're pretty far down a chain of hypotheticals at this point (why would someone even need to test evenness in a loop, when they could unroll the loop to doing 2/4/6/8 items at a time with even/odd behaviour baked in)
> I thought BLTPRI enabled meant the blitter takes every even DMA cycle it needs, and when disabled it gives the CPU 1 in every 4 even DMA cycles
Yes, that is true: https://amigadev.elowar.com/read/ADCD_2.1/Hardware_Manual_gu... "If given the chance, the blitter would steal every available Chip memory cycle [...] If DMAF_BLITHOG is a 1, the blitter will keep the bus for every available Chip memory cycle [...] If DMAF_BLITHOG is a 0, the DMA manager will monitor the 68000 cycle requests. If the 68000 is unsatisfied for three consecutive memory cycles, the blitter will release the bus for one cycle."
> one of those cases is very likely to be true
It blew my mind when I realised this is probably why Workbench is 4 colours by default. If it were 8, an unexpanded Amiga would seem a lot slower to application/productivity users.
Arcuru 1 week ago
> I tried both versions (modulo 2 and bitwise AND) and got the same result. I think the optimizer recognizes modulo 2 and converts it to bitwise AND.
Yes, even without specifying optimizations - https://godbolt.org/z/9se9c6qKT
You can see that the output of the compiler is identical whether you use `i%2 == 0` or `(i&1) == 0`. The bitwise AND is instruction 12 in the output.
Using -O3 like in the post actually compiles to SIMD instructions on x86-64 - https://godbolt.org/z/dWbcK947G