88 points by vortex_ape 7 hours ago | 26 comments
comex 6 hours ago
First of all, a recent LLVM patch apparently changes codegen to use CMOV instead of a branch:
https://github.com/llvm/llvm-project/pull/102885
Beyond that, Intel recently updated their manual to retroactively define the behavior of BSR/BSF on zero inputs: it leaves the destination register unmodified. This matches the AMD manual, and I suspect it matches the behavior of all existing x86-64 processors (but that will need to be tested, I guess).
If so, you don't need either a branch or CMOV. Just set a register to 32, then run BSR with the same register as destination. If the BSR input is nonzero, the 32 is overwritten with the trailing-zero count. If the BSR input is zero, then BSR leaves the register unmodified and you get 32.
Since this behavior is now guaranteed for future x86-64 processors, and assuming it's indeed compatible with all existing x86-64 processors (maybe even all x86 processors period?), LLVM will no longer need the old path regardless of what it's targeting.
Note that if you're targeting a newer x86-64 version, LLVM will just emit TZCNT, which just does what you'd expect and returns 32 if the input is zero (or 64 for a 64-bit TZCNT). But as the blog post demonstrates, many people still build for baseline x86_64.
(Intel does document one discrepancy between processors: "On some older processors, use of a 32-bit operand size may clear the upper 32 bits of a 64-bit destination while leaving the lower 32 bits unmodified.")
hinkley 3 hours ago
I think the days of trying to branches by trying to remove conditional assignments are either gone or close to it. You may still have a subsequent data race, but the conditional assignment isn't your biggest problem with throughput.
achierius 55 minutes ago
deathanatos 2 hours ago
/// Encode a UTF-8 codepoint.
/// […]
/// Returns a length of zero for invalid codepoints (surrogates and out-of-bounds values);
/// it's up to the caller to turn that into U+FFFD, or return an error.
It's not a "UTF-8 codepoint", that's horridly mangling the terminology. Code points are just code points.The input to a UTF-8 encode is a scalar value, not a code point, and encoding a scalar value is infallible. What doubly kills me is that Rust has a dedicated type for scalar values. (`char`.)
(In languages with non-[USV]-strings…, Python raises an exception, JS emits garbage.)
orlp 5 hours ago
branchless_utf8:
mov rax, rdi
lzcnt ecx, esi
lea rdx, [rip + .L__unnamed_1]
movzx ecx, byte ptr [rcx + rdx]
lea rdx, [rip + example::DEP_AND_OR::h78cbe1dc7fe823a9]
pdep esi, esi, dword ptr [rdx + 8*rcx]
or esi, dword ptr [rdx + 8*rcx + 4]
movbe dword ptr [rdi], esi
mov qword ptr [rdi + 8], rcx
ret
The code: static DEP_AND_OR: [(u32, u32); 5] = [
(0, 0),
(0b01111111_00000000_00000000_00000000, 0b00000000_00000000_00000000_00000000),
(0b00011111_00111111_00000000_00000000, 0b11000000_10000000_00000000_00000000),
(0b00001111_00111111_00111111_00000000, 0b11100000_10000000_10000000_00000000),
(0b00000111_00111111_00111111_00111111, 0b11110000_10000000_10000000_10000000),
];
const LEN: [u8; 33] = [
// 0-10 leading zeros: not valid.
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
// 11-15 leading zeros: 4 bytes.
4, 4, 4, 4, 4,
// 16-20 leading zeros: 3 bytes.
3, 3, 3, 3, 3,
// 21-24 leading zeros: 2 bytes.
2, 2, 2, 2,
// 25-32 leading zeros: 1 byte.
1, 1, 1, 1, 1, 1, 1, 1,
];
pub unsafe fn branchless_utf8(codepoint: u32) -> ([u8; 4], usize) {
let leading_zeros = codepoint.leading_zeros() as usize;
let bytes = LEN[leading_zeros] as usize;
let (mask, or) = *DEP_AND_OR.get_unchecked(bytes);
let ret = core::arch::x86_64::_pdep_u32(codepoint, mask) | or;
(ret.swap_bytes().to_le_bytes(), bytes)
}
koala_man 4 hours ago
jsheard 3 hours ago
JS runtimes do use it but it's useful anywhere you need to do what x86 does, which obviously includes running x86 binaries under emulation.
HeliumHydride 4 hours ago