173 points by samaysharma 5 days ago | 21 comments
r0b05 5 days ago
Does batching add data from multiple requests into the same context, potentially increasing perplexity? If so, are we trading off perplexity for lower operating costs?
ethan_smith 5 days ago
zettabomb 5 days ago
StochasticLi 5 days ago
0xjunhao 5 days ago
zackangelo 5 days ago
0xjunhao 3 days ago
criemen 5 days ago
I didn't quite get this part:
> Note that during the prefill phase, all prompt tokens from a request can be processed in one batch. This is possible because the query (Q) tensors, calculated from the tokens immediately before them, are available for each prompt token position.
I know that in practice prefill is much faster than decoding. Would watching the 2h video from Karpathy help me understand why?
animan 5 days ago
For decode, on the other hand, you need to generate each token sequentially.
criemen 5 days ago
animan 5 days ago
Decode is the next major step where you start generating output tokens one at a time.
Both run on GPUs but have slightly different workloads:

1. Prefill does relatively little I/O from HBM and is compute-heavy.

2. Decode is light on compute but has to read the keys and values computed in the prefill stage from HBM for every output token.
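A toy PyTorch sketch of that contrast (illustrative shapes and random weights, not vLLM's implementation): prefill scores every prompt position in one batched matmul, while decode issues a single query per step but has to re-read the ever-growing K/V cache.

    import torch

    d = 64                          # head dimension (toy value)
    prompt = torch.randn(10, d)     # stand-ins for 10 prompt-token embeddings
    Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

    # --- Prefill: all prompt positions in one pass (compute-heavy) ---
    Q, K, V = prompt @ Wq, prompt @ Wk, prompt @ Wv
    causal = torch.triu(torch.ones(10, 10, dtype=torch.bool), diagonal=1)
    scores = (Q @ K.T) / d ** 0.5
    scores = scores.masked_fill(causal, float("-inf"))
    out = torch.softmax(scores, dim=-1) @ V

    # --- Decode: one token per step (memory-heavy) ---
    x = out[-1:]                    # pretend this is the next token's embedding
    for _ in range(5):
        q = x @ Wq
        K = torch.cat([K, x @ Wk])  # KV cache grows by one row per step...
        V = torch.cat([V, x @ Wv])
        attn = torch.softmax((q @ K.T) / d ** 0.5, dim=-1)
        x = attn @ V                # ...and the whole cache is read every step

Per output token the compute is only a few vector-matrix products, but the entire K/V cache has to stream in from HBM, which is why decode tends to be memory-bandwidth-bound while prefill is compute-bound.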
dist-epoch 5 days ago
0xjunhao 4 days ago
3abiton 5 days ago
longbeachbass 5 days ago
Curious to understand how we ensure that the same model instance gets requests from the same client/user, since conversations are stateful and the model needs context from previous turns of the conversation.
Is this happening at the load balancer layer?
cyanf 5 days ago
hhh 5 days ago
0xjunhao 4 days ago
Our API endpoint will try to route requests that have the same prefix to the same vLLM instance (similar to longest-prefix matching in networking), and hopefully some KV cache entries for part of the prompt are still there.
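A minimal sketch of that idea (hypothetical names, not vLLM's actual router): remember which instance served each prompt prefix and send new requests to the instance with the longest matching prefix, falling back to round-robin for unseen prompts.

    # Hypothetical prefix-aware router sketch; not vLLM's production code.
    class PrefixRouter:
        def __init__(self, instances):
            self.instances = list(instances)
            self.seen = {}      # remembered prompt prefix -> instance
            self.rr = 0         # round-robin fallback counter

        def route(self, prompt: str) -> str:
            best, best_len = None, 0
            for prefix, inst in self.seen.items():
                n = 0
                while n < min(len(prompt), len(prefix)) and prompt[n] == prefix[n]:
                    n += 1
                if n > best_len:
                    best, best_len = inst, n
            if best is None:    # nothing matches: fall back to round-robin
                best = self.instances[self.rr % len(self.instances)]
                self.rr += 1
            self.seen[prompt[:256]] = best   # remember a bounded prefix
            return best

    router = PrefixRouter(["vllm-0", "vllm-1", "vllm-2"])
    turn1 = "SYSTEM: be helpful\nUSER: hi"
    turn2 = turn1 + "\nASSISTANT: hello\nUSER: thanks"
    assert router.route(turn1) == router.route(turn2)  # same conversation -> same instance

Routing on the shared prefix means a later turn lands on an instance that may still hold the KV cache for the system prompt and earlier turns, so only the newly added tokens need prefill.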
gdiamos 5 days ago
There is more perf you can squeeze out of vLLM.
mhlakhani 5 days ago
geoffbp 5 days ago