409 points by gnabgib 1 day ago | 150 comments
divbzero 20 hours ago
The optimizations that compilers can achieve kind of amaze me.
Indeed, the latest version of cal from util-linux keeps it simple in the C source:
return ( !(year % 4) && (year % 100) ) || !(year % 400);
https://github.com/util-linux/util-linux/blob/v2.41/misc-uti...cookiengineer 15 hours ago
For more on this, I recommend reading and implementing a function that calculates the day of the week [1]. Then you can join me in the special insanity hell of people that were trying to deal with human calendars.
And then you should implement a test case for the dates between Thursday 4 October 1582 and Friday 15 October 1582 :)
[1] https://en.m.wikipedia.org/wiki/Determination_of_the_day_of_...
LegionMammal978 9 hours ago
The problem is, which "specific" year? The English were using "old-style" dates long after 1582. Better not to try to solve this intractable problem in software, but instead annotate every old date you receive with its correct calendar, which may even be a proleptic Gregorian calendar in some fields of study.
(How do you determine the correct calendar? Through careful inspection of context! Alas, people writing the dates rarely indicated this, and later readers tend to get the calendars hopelessly mangled up. Not to mention the changes in the start of the year. At least the day of week can act as an indicator, when available.)
voxic11 8 hours ago
static int leap_year(const struct cal_control *ctl, int32_t year)
{
if (year <= ctl->reform_year)
return !(year % 4);
return ( !(year % 4) && (year % 100) ) || !(year % 400);
}
Where reform_year is the year the Gregorian calendar was adopted in the specific context specified (defaults to 1752 which is the year it was adopted by GB and therefore also the US).So it does account for Julian dates.
divbzero 8 hours ago
$ cal 9 1752
September 1752
Su Mo Tu We Th Fr Sa
1 2 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30
madcaptenor 6 hours ago
$ ncal -sIT 10 1582
October 1582
Mo 1 18 25
Tu 2 19 26
We 3 20 27
Th 4 21 28
Fr 15 22 29
Sa 16 23 30
Su 17 24 31
France apparently took a couple months to get on board (or maybe just to find out): $ncal -sFR 12 1582
December 1582
Mo 3 20 27
Tu 4 21 28
We 5 22 29
Th 6 23 30
Fr 7 24 31
Sa 1 8 25
Su 2 9 26
`ncal -p` gives a list of the country codes it accepts. (These are current countries so it's a bit ahistorical for, say, Germany.)Sadly they don't implement the weird thing Sweden did in the early 18th century: https://en.wikipedia.org/wiki/Swedish_calendar
sigmoid10 16 hours ago
darkwater 16 hours ago
if ((y % 25) != 0) return true;
was actually checking for different from 0 (which in hindsight makes also sense because the century years by default are not leap unless they divide by 400)_heimdall 23 hours ago
> How does this work? The answer is surprisingly complex.
I don't think anyone is surprised in the complexity of any explanation for that algorithm :D
qingcharles 1 day ago
Does anyone have a collection of these things?
ryao 1 day ago
https://graphics.stanford.edu/~seander/bithacks.html
It is not on the list, but #define CMP(X, Y) (((X) > (Y)) - ((X) < (Y))) is an efficient way to do generic comparisons for things that want UNIX-style comparators. If you compare the output against 0 to check for some form of greater than, less than or equality, the compiler should automatically simplify it. For example, CMP(X, Y) > 0 is simplified to (X > Y) by a compiler.
The signum(x) function that is equivalent to CMP(X, 0) can be done in 3 or 4 instructions depending on your architecture without any comparison operations:
https://www.cs.cornell.edu/courses/cs6120/2022sp/blog/supero...
It is such a famous example, that compilers probably optimize CMP(X, 0) to that, but I have not checked. Coincidentally, the expansion of CMP(X, 0) is on the bit hacks list.
There are a few more superoptimized mathematical operations listed here:
https://www2.cs.arizona.edu/~collberg/Teaching/553/2011/Reso...
Note that the assembly code appears to be for the Motorola 68000 processor and it makes use of flags that are set in edge cases to work.
Finally, there is a list of helpful macros for bit operations that originated in OpenSolaris (as far as I know) here:
https://github.com/freebsd/freebsd-src/blob/master/sys/cddl/...
There used to be an Open Solaris blog post on them, but Oracle has taken it down.
Enjoy!
JdeBP 22 hours ago
eru 17 hours ago
I guess this only applies when the compiler knows what version of > you are using?
Eg it might not work in C++ when < and > are overloaded for eg strings?
ryao 6 hours ago
https://godbolt.org/z/nGbPhz86q
If you did not inline the operator overloads and had them in another compilation unit, do not expect this to simplify (unless you use LTO).
If you have compound comparators in the operator overloads (such that on equality in one field, it considers a second for a tie breaker), I would not expect it to simplify, although the compiler could surprise me.
trollbridge 14 hours ago
kmoser 17 hours ago
ryao 9 hours ago
Even if you only need one of cosf() or sinf(), many CPUs calculate both values at the same time, so taking the other is free. If you only need single precision values, you can do this in double precision to avoid much of the errors you would get by doing this in single precision.
This trick can be used to accelerate the RoPE relative positional encoding calculations used in inference for llama 3 and likely others. I have done this and seen a measurable speed up, although these calculations are such a small part of inference that it was a small improvement.
owl_vision 1 day ago
qingcharles 1 day ago
tylerhou 1 day ago
mshockwave 20 hours ago
tylerhou 19 hours ago
masfuerte 1 day ago
JdeBP 20 hours ago
It was that generally the fast hardware multiplication operations in ALUs didn't have very many bits in the register word length, so multiplications of wider words had to be done with library functions that did long multiplication in (say) base 256.
So this code in the headlined article would not be "three instructions" but three calls to internal helper library functions used by the compiler for long-word multiplication, comparison, and bitwise AND; not markedly more optimal than three internal helper function calls for the three original modulo operations, and in fact less optimal than the bit-twiddled modulo-powers-of-2 version found halfway down the headlined article, which would only need check the least significant byte and not call library functions for two of the 32-bit modulo operations.
Bonus points to anyone who remembers the helper function names in Microsoft BASIC's runtime library straight off the top of xyr head. It is probably a good thing that I finally seem to have forgotten them. (-: They all began with "B$" as I recall.
kruador 13 hours ago
There's also the difference between multiplying by a hard-coded value, which can be implemented with shifts and adds, and multiplying two variables, which has to be done with an algorithm.
The 8086 did have multiply instructions, but they were implemented as a loop in the microcode, adding the multiplicand, or not, once for each bit in the multiplier. More at https://www.righto.com/2023/03/8086-multiplication-microcode.... Multiplying by a fixed value using shifts and adds could be faster.
The prototype ARM1 did not have a multiply instruction. The architecture does have a barrel shifter which can shift one of the operands by any number of bits. For a fixed multiplication, it's possible to compute multiplying by a power of two, by (power of two plus 1), or by (power of two minus 1) in a single instruction. The latter is why ARM has both a SUB (subtract) instruction, computing rd := rs1 - Operand2, and a RSB (Reverse SuBtract) instruction, computing rd := Operand2 - rs1. The second operand goes through the barrel shifter, allowing you to write an instruction like 'RSB R0, R1, R1, #4' meaning 'R0 := (R1 << 4) - R1', or in other words '(R1 * 16) - R1', or R1 * 15.
ARMv2 added in MUL and MLA (MuLtiply and Accumulate) instructions. The hardware ARM2 implementation uses a Booth's encoder to multiply 2 bits at a time, taking up to 16 cycles for 32 bits. It can exit early if the remaining bits are all 0s.
Later ARM cores implemented an optional wider multiplier (that's the 'M' in 'ARM7TDMI', for example) that could multiply more bits at a time, therefore executing in fewer cycles. I believe ARM7TDMI was 8-bit, completing in up to 4 cycles (again, offering early exit). Modern ARM cores can do 64-bit multiplies in a single cycle.
cbm-vic-20 13 hours ago
eru 17 hours ago
Well, we can actually multiply long binary numbers asymptotically faster than Ancient Egyptians.
kens 8 hours ago
It turns out that multiplication in modern ALUs is very different. The Pentium, for instance, does multiplication using base-8, not base-2, cutting the number of additions by a factor of 3. It also uses Booth's algorithm, so much of the time it is subtracting, not adding.
godelski 23 hours ago
https://youtube.com/watch?v=PpaQrzoDW2I
kurthr 1 day ago
genewitch 1 day ago
bobmcnamara 22 hours ago
ryao 24 hours ago
qingcharles 1 day ago
Someone 6 hours ago
dahart 10 hours ago
is_leap_year(unsigned int):
xor eax, eax
test dil, 3
jne .L1
imul edi, edi, -1030792151
mov eax, 1
mov edx, edi
ror edx, 2
cmp edx, 42949672
ja .L1
ror edi, 4
cmp edi, 10737418
setbe al
.L1:
ret
ryao 5 hours ago
https://github.com/openzfs/spl/commit/8fc851b7b5315c9cae9255...
Jason had noticed that GCC’s assembly output did not match the original macro when looking for a solution to the unsigned integer overflow warning that a PaX GCC plugin had output (erroneously in my opinion). He had conjectured we could safely adopt GCC’s version as a workaround. I gave him the proof of correctness for the commit message and it was accepted into ZFS. As you can see from the proof, deriving that from the original required 4 steps. I assume that GCC had gone through a similar process to derive its output.