367 points by CharlesW 19 hours ago | 108 comments
jhj 17 hours ago
Typical entropy of bfloat16 values seen in weights (and activations) is about 10-12 bits (only 65-75% or so of the value range is used in practice). Sign and mantissa bits tend to be incompressible noise.
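A rough way to see where those bits go is to split each bfloat16 value into its sign, exponent, and mantissa fields and measure the empirical entropy of each. The sketch below is only an illustration: it uses synthetic Gaussian values (standard deviation 0.02, an arbitrary choice) as a stand-in for real weights, and it truncates float32 to bfloat16 instead of rounding. Real checkpoints will give different numbers.

```python
import math
import random
import struct
from collections import Counter

def bf16_bits(x: float) -> int:
    """Top 16 bits of the float32 encoding (truncation, not round-to-nearest)."""
    (u32,) = struct.unpack("<I", struct.pack("<f", x))
    return u32 >> 16

def entropy_bits(symbols) -> float:
    """Empirical Shannon entropy of a symbol sequence, in bits per symbol."""
    counts = Counter(symbols)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Assumption: Gaussian "weights" as a stand-in for a real model.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(200_000)]
codes = [bf16_bits(w) for w in weights]

# bfloat16 layout: 1 sign bit, 8 exponent bits, 7 mantissa bits.
h_sign = entropy_bits(c >> 15 for c in codes)
h_exp = entropy_bits((c >> 7) & 0xFF for c in codes)
h_man = entropy_bits(c & 0x7F for c in codes)

print(f"sign     {h_sign:.2f} / 1 bit")
print(f"exponent {h_exp:.2f} / 8 bits")
print(f"mantissa {h_man:.2f} / 7 bits")
print(f"sum      {h_sign + h_exp + h_man:.2f} / 16 bits")
```

On data like this the sign and mantissa fields come out close to their full width, while the exponent needs far fewer than 8 bits, which is the slack an entropy coder can recover.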
This has been exploited several times before in both classical HPC and AI, with lossless compression work from Martin Burtscher's lab (https://userweb.cs.txstate.edu/~burtscher/), fpzip from LLNL (https://computing.llnl.gov/projects/fpzip), and my library dietgpu from 2021 (https://github.com/facebookresearch/dietgpu). We used dietgpu to speed up training on a large GPU cluster by about 10% wall-clock time overall by losslessly compressing all data before sending and decompressing on receipt (e.g., gradients, weights from backup, etc.); because the compression is lossless, the computation is exactly the same as before.
Also, rANS is more efficient and easier to implement in SIMD-like instruction sets than Huffman coding. It would also reduce the latency/throughput penalties of DFloat11 (since we have to decompress before we do the arithmetic).
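For readers who have not seen rANS, here is a minimal sketch of the core state update only. It is not how dietgpu or any production coder is written: it keeps the whole state in one arbitrary-precision Python integer, with no renormalization, no byte stream, and no SIMD or interleaving. The function names and the toy message (skewed byte values standing in for exponents) are made up for illustration.

```python
import random
from collections import Counter

def cumulative(freqs):
    """Map each symbol to the start of its slot range [start, start + freq)."""
    starts, total = {}, 0
    for sym in sorted(freqs):
        starts[sym] = total
        total += freqs[sym]
    return starts, total

def rans_encode(symbols, freqs):
    """Fold symbols into one integer; encode in reverse so decode runs forward."""
    starts, total = cumulative(freqs)
    state = 1
    for sym in reversed(symbols):
        f = freqs[sym]
        state = (state // f) * total + starts[sym] + (state % f)
    return state

def rans_decode(state, count, freqs):
    """Invert rans_encode for a known symbol count."""
    starts, total = cumulative(freqs)
    out = []
    for _ in range(count):
        slot = state % total
        # Linear scan for clarity; real decoders use a lookup table.
        sym = next(s for s in starts if starts[s] <= slot < starts[s] + freqs[s])
        state = freqs[sym] * (state // total) + slot - starts[sym]
        out.append(sym)
    return out

# Hypothetical toy input: a skewed distribution over four byte values.
random.seed(0)
message = [120] * 50 + [121] * 30 + [119] * 15 + [122] * 5
random.shuffle(message)
freqs = Counter(message)

encoded = rans_encode(message, freqs)
decoded = rans_decode(encoded, len(message), freqs)
print(decoded == message, encoded.bit_length(), "bits for", len(message), "symbols")
```

Each decode step is a modulo, a table lookup, and a multiply-add on the state, with no bit-by-bit tree walk, which is the property that makes the practical (renormalized, interleaved) variants map well onto wide hardware.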
iandanforth 16 hours ago
VladVladikoff 14 hours ago
bjornsing 9 hours ago
I doubt that very much. Thing is that inputs are multiplied with weights and added together in a neural network layer, and then the output becomes the input of the next layer in a cycle that can repeat up to a hundred times or more. When you get to the final output layer that 10^6 factor has been applied so many times that it has snowballed to a 10^600 factor.
ironbound 6 hours ago
vessenes 13 hours ago
As we know, quantizations are a critical tool for local LLM runners; RAM is typically the gating factor. Are you aware of other better lossless compression of BF16 weights out there?
The reason I ask is that DFloat11 seems relatively easy to plug into existing quantization workflows, but you seem dismissive of the paper -- I presume it's my gap in understanding, and I'd like to understand.
zorgmonkey 13 hours ago
refibrillator 11 hours ago
Using DFloat11, tokens/sec was higher only relative to running inference with some layers offloaded to CPU.
Classic comp sci tradeoff between space and speed, no free lunch, etc.
badmonster 19 hours ago
latchkey 16 hours ago
Or let one of the neoclouds take care of the infrastructure costs and rent it out from them. Disclosure: I run one of them.
airstrike 16 hours ago
Some unsolicited feedback: I would suggest reworking your landing page so that the language is always from your customers' perspective. Your customers want to solve a real internal problem that they have. Talking about how great your company is will always have less impact than talking about how you know what that problem is and how you intend to solve it.
Your mission is relevant to you and your investors, not to your customers. They care about themselves.
Your "quick start" should be an interactive form. I shouldn't have to remember what to put in an email to reach out to you. Make it easy for me. Also move that to the front page, provide a few "standard" packages and a custom one. Reduce the friction to clicking the CTA.
Since your pricing is transparent, you should be able to tell me what that price will be before I even submit a request. I assume you're cheaper than the competition (otherwise why would I not go with them?) so make that obvious. Check out Backblaze's website for an example page: https://www.backblaze.com/cloud-storage/pricing
Shell out a few grand and hire a designer to make your page look more professional. Something like https://oxide.computer/ but with the points above, as they also make the same mistake of making their home page read like a pitch deck.
latchkey 16 hours ago
Website is intended to be more like documentation instead of a pitch deck or useless splash with a contact us form. I dislike sites like Oxide, I scroll past and don't read or ingest any of the fancy parts. Of course, you're right, this probably needs to be less about me. =)
Friction definitely needs to be improved. That part is being worked on right now. Our intention is to be fully self-service, so that you don't have to talk to us at all, unless you want to. Credit card and go.
We recently lowered our prices to be competitive with the rest of the market vs. focusing on people who care more about what we offer. We weren't trying to be cheaper than everyone else, we were trying to offer a better service. Lesson learned and pricing adjusted. Streisand effect, I don't like to mention the other players much.
Again, thanks!
sundarurfriend 14 hours ago
For anyone else who hadn't heard of this term:
> Neoclouds are startups specializing in AI-specific cloud computing. Unlike their larger competitors, they don’t develop proprietary chips. Instead, they rely heavily on Nvidia’s cutting-edge GPUs to power their operations. By focusing solely on AI workloads, these companies offer specialized solutions tailored to AI developers’ needs.
from https://www.tlciscreative.com/the-rise-of-neoclouds-shaping-...
latchkey 14 hours ago
https://semianalysis.com/2024/10/03/ai-neocloud-playbook-and...
Ringz 13 hours ago
latchkey 12 hours ago
saagarjha 13 hours ago
latchkey 12 hours ago
It is novel equipment that few have ever used before outside of a relatively small HPC community. It regularly breaks and has issues (bugs) that need industry relationships to manage properly. We've had one server down for over a month now cause SMCI can't get their sh/t together to fix it. That's a $250k+ 350lbs paperweight. Good luck to any other small company that wants to negotiate that relationship.
We are offering a very valuable service by enabling easy access to some of the most powerful compute available today. How many people do you think have a good grasp of what it takes to configure RoCEv2 & 8x400G across a cluster of servers? Good luck trying to hire talent that can set that up, they already have jobs.
The capex / opex / complexity involved with deploying this level of gear is huge and only getting larger as the industry shifts to bigger/better/faster (ie: air cooling is dead). Things are moving so quickly, that equipment you purchased a year ago is now already out of date (H100 -> H200 is a great example). You're going to have to have a pretty impressive depreciation model to deploy this yourself.
I wouldn't just dismiss this as moving costs around.
zarathustreal 2 hours ago
…how do you justify marketing yourself in a system like that?
“In general, people in this vertical have difficulty doing their jobs. Luckily we’ve had drinks with most of them” ……
miohtama 17 hours ago
daveguy 17 hours ago
mhitza 16 hours ago
gunalx 15 hours ago
Der_Einzige 11 hours ago
If you live in a glass house, you won't throw stones. No one in the LLM space wants to be litigious.
It's an open secret that DeepSeek used a ton of OpenAI continuations both in pretraining and in the distillation. That totally violates OpenAI's TOS. No one cares.
LoganDark 10 hours ago
Except for OpenAI.
Der_Einzige 11 hours ago
danielmarkbruce 18 hours ago
jhj 14 hours ago
Floating point is just an inefficient use of bits (due to excessive dynamic range), especially during training, so it will always be welcome there. Extreme quantization techniques (some of the <= 4-bit methods, say) also tend to increase entropy in the weights limiting the applicability of lossless compression, so lossless and lossy compression (e.g., quantization) sometimes go against each other.
If you have billions of dollars in inference devices, even reducing the number of devices you need for a given workload by 5% is very useful.
striking 18 hours ago
kadushka 18 hours ago
latchkey 16 hours ago
MI300x is 192GB HBM3, MI325x is 256GB HBM3e, MI355x should be 288GB HBM3e (and support FP4/6).
NBJack 16 hours ago
latchkey 16 hours ago
DrillShopper 15 hours ago
latchkey 14 hours ago
danielmarkbruce 17 hours ago
Nvidia is about to release Blackwell Ultra with 288GB. Go back to maybe 2018 and the max was 16GB, if memory serves.
DeepSeek recently released a 670GB model. A couple of years ago Falcon's 180GB seemed huge.
spoaceman7777 17 hours ago
We've been stuck with the same general caps on standard GPU memory since then though. Perhaps limited in part because of the generational upgrades happening in the bandwidth of the memory, rather than the capacity.
danielmarkbruce 16 hours ago
A one time effective 30% reduction in model size simply isn't going to be some massive unlocker, in theory or in practice.
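To put the 30% figure next to the memory sizes quoted upthread, here is some back-of-the-envelope arithmetic. It is deliberately crude and assumes only the weights have to fit: no KV cache, activations, or framework overhead, and it treats the 670GB and 192/256/288GB numbers from this thread as exact.

```python
import math

def devices_needed(model_gb: float, device_gb: float) -> int:
    """Smallest device count whose combined memory holds the weights alone."""
    return math.ceil(model_gb / device_gb)

MODEL_GB = 670    # model size quoted in the thread
REDUCTION = 0.30  # "effective 30% reduction" quoted in the thread
DEVICE_GB = {"192GB": 192, "256GB": 256, "288GB": 288}

compressed_gb = MODEL_GB * (1 - REDUCTION)
for name, gb in DEVICE_GB.items():
    before = devices_needed(MODEL_GB, gb)
    after = devices_needed(compressed_gb, gb)
    print(f"{name} devices: {before} -> {after}")
```

Under these assumptions the saving is one device per replica at each of the three memory sizes. Whether that counts as a big deal depends on the point being argued here, since it is a one-time step and not a change in the growth trend.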
loufe 19 hours ago
jonplackett 19 hours ago
2 weeks? Two months? Two days? Two minutes?
All of the above are true sometimes! Exciting times indeed.
loufe 8 hours ago
Animats 16 hours ago
eoerl 15 hours ago