How Much Compute Does Your LLM Actually Need?
A practical guide to estimating GPU memory, token costs, and the hardware behind AI models
Last fall, a friend of mine — a backend engineer at a mid-size startup — got tapped to figure out whether his company could run their own language model instead of paying OpenAI per token. His boss had read something about Llama 2 being "open source and free" and figured, how hard could it be? Just download it and run it.
Three weeks and several surprise cloud bills later, my friend had learned a lesson that the AI industry is remarkably bad at communicating: knowing a model has 70 billion parameters tells you almost nothing about what it takes to actually run the thing. The number sounds impressive in a press release. But what does it mean in terms of GPUs you need to rent? How much VRAM? What will it cost per month? These are the questions that matter when you're the one writing the check.
I've been poking at this problem for a while now, and I built an LLM resource calculator to help make the math concrete. But before you go punch numbers into a tool, it's worth understanding the intuition behind why these calculations work the way they do. Because the relationship between parameter count and hardware requirements isn't just linear — it's shaped by precision formats, batch sizes, and a bunch of other knobs that can change the answer by 4x or more.
The Parameter-to-Memory Pipeline
Here's the thing nobody explains well: a "parameter" in a neural network is just a number. A floating-point number. And in the standard FP16 (half-precision) format that most models use for inference, each parameter takes up 2 bytes of memory.
So, a 7-billion-parameter model at FP16 needs roughly 14 GB of VRAM just to load the weights. That's the model sitting in memory doing absolutely nothing yet — no input processing, no text generation, just existing. An NVIDIA A100 has 80 GB of VRAM in its beefier configuration, so a 7B model fits comfortably on one card with plenty of room left for the activations and KV cache that pile up during actual inference.
A 13B model? About 26 GB. Still fits on one A100. You could even squeeze it onto a consumer RTX 4090 with its 24 GB, though you'd be cutting it close once you account for overhead.
But a 70B model at FP16? That's roughly 140 GB. No single GPU on the market holds that much memory. You need at least two A100-80GB cards, and realistically you'd want four to leave enough headroom. This is where the costs start to get genuinely alarming. Renting a single A100 on a major cloud provider runs somewhere around $1.50 to $3.00 per hour. Four of them, running 24/7? You're looking at roughly $4,500 to $9,000 a month before you've served a single user request.
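The arithmetic here is simple enough to sketch in a few lines. This is a rough sizing helper, not a precise accounting; real deployments add KV cache and activation overhead on top:

```python
# Approximate VRAM needed just to hold model weights. FP16 stores each
# parameter in 2 bytes, so the footprint is simply parameter count
# times bytes per parameter (using 1 GB = 1e9 bytes).
def weights_vram_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    return params_billions * bytes_per_param

for size in (7, 13, 70):
    print(f"{size}B @ FP16: ~{weights_vram_gb(size):.0f} GB")
# 7B -> ~14 GB, 13B -> ~26 GB, 70B -> ~140 GB
```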
Quantization: The Best Trick Nobody Told You About
This is where things get interesting. Remember how each parameter takes 2 bytes at FP16? What if you could represent each parameter with fewer bits?
That's quantization. INT8 quantization represents each weight as an 8-bit integer instead of a 16-bit float. Your 7B model that needed 14 GB now needs about 7 GB. The 70B model drops from 140 GB to around 70 GB — suddenly you can fit it on a single high-end GPU. Go further to INT4, and that 70B model is down to approximately 35 GB. That fits on a single A100-40GB, if a bit snugly once you account for overhead, or even a pair of consumer cards.
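To see how precision reshapes the hardware picture, here's the same arithmetic extended to quantized formats, counting how many 80 GB cards a 70B model needs at each precision. This counts weights only; treat the card counts as a floor, not a recommendation:

```python
import math

# Bytes per parameter at each precision, applied to a 70B model.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def cards_needed(params_billions: float, precision: str,
                 gpu_vram_gb: int = 80) -> tuple:
    gb = params_billions * BYTES_PER_PARAM[precision]
    return gb, math.ceil(gb / gpu_vram_gb)

for p in BYTES_PER_PARAM:
    gb, cards = cards_needed(70, p)
    print(f"70B @ {p}: ~{gb:.0f} GB of weights -> {cards} x 80 GB GPU(s)")
# fp16 -> 2 cards, int8 -> 1 card, int4 -> 1 card
```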
The tradeoff, predictably, is quality. You're compressing the model's knowledge into a smaller numerical representation. Think of it like JPEG compression for images — at high quality settings you can barely tell the difference, but crank the compression and things start to get blurry. In practice, INT8 quantization on well-made models like Llama 3 produces output that's remarkably close to the FP16 original. Most users can't tell the difference. INT4 introduces more noticeable degradation, especially on complex reasoning tasks, but for straightforward text generation it's often good enough.
The libraries for doing this have gotten shockingly good. GPTQ, AWQ, and bitsandbytes make it almost trivially easy to quantize a model and start serving it. Two years ago this was PhD-level work. Now it's a pip install and a flag.
Inference vs. Fine-Tuning: Two Very Different Beasts
There's a crucial distinction that gets conflated constantly: running a model (inference) and training or fine-tuning a model are completely different workloads with completely different hardware requirements.
Inference is relatively gentle. You load the model weights into VRAM, feed in a prompt, and generate tokens one at a time. The memory footprint is dominated by the model weights themselves plus a KV cache that grows with sequence length. For a 7B model serving one request at a time, you might need 16-20 GB total.
Fine-tuning is a different animal. You need the model weights, yes, but also the optimizer states (Adam keeps two extra values for every parameter: momentum and variance), the gradients, and the activations for backpropagation. A rough rule of thumb: full fine-tuning requires about 4-6x the memory of inference. That 7B model that needed 14 GB to run? Full fine-tuning at FP16 needs closer to 60-80 GB. A 70B model? You're looking at multiple nodes with 8 GPUs each. The numbers get absurd fast.
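That 4-6x rule of thumb falls out of simple bookkeeping. Here's a sketch, assuming FP16 everywhere (real trainers often keep optimizer states in FP32, which pushes the number even higher):

```python
# Full fine-tuning memory: weights + gradients + Adam's two optimizer
# states, each roughly the size of the weights, plus activations that
# vary with batch size and sequence length (hence the 4x-6x spread).
def full_finetune_gb(params_billions: float, bytes_per_param: float = 2.0) -> tuple:
    weights = params_billions * bytes_per_param   # the model itself
    gradients = weights                           # one gradient per parameter
    adam_states = 2 * weights                     # momentum + variance
    fixed = weights + gradients + adam_states     # 4x the weights
    return fixed, fixed * 1.5                     # low / high estimate

lo, hi = full_finetune_gb(7)
print(f"7B full fine-tune: roughly {lo:.0f}-{hi:.0f} GB")  # roughly 56-84 GB
```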
Which is exactly why LoRA (Low-Rank Adaptation) took off. Instead of fine-tuning all 7 billion parameters, you freeze the base model and train a small adapter — typically adding less than 1% new parameters. Suddenly that 70B fine-tuning job that needed a cluster fits on a couple of GPUs.
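The arithmetic behind "less than 1%" is easy to check. For each adapted weight matrix of shape (d_out, d_in), LoRA trains two small factors: A (r x d_in) and B (d_out x r). The configuration below is illustrative, loosely Llama-7B-shaped (32 layers, hidden size 4096, rank 8, adapting only the query and value projections), not the config of any specific model:

```python
# Trainable parameters in a LoRA adapter. All shape numbers are
# illustrative assumptions for a 7B-class transformer.
def lora_trainable_params(layers: int = 32, d: int = 4096,
                          rank: int = 8, matrices_per_layer: int = 2) -> int:
    per_matrix = rank * (d + d)   # A contributes r*d_in, B contributes d_out*r
    return layers * matrices_per_layer * per_matrix

adapter = lora_trainable_params()
print(f"{adapter / 1e6:.1f}M trainable params, "
      f"{adapter / 7e9:.3%} of a 7B base model")
# ~4.2M params, about 0.06% of the base model
```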
Batch Size and Throughput: Where the Economics Get Real
If you're running a model for yourself — local Llama on your gaming PC — batch size is basically 1 and throughput doesn't matter much. But the moment you're serving multiple users, batch size becomes the single most important knob for cost efficiency.
Here's why: GPUs are massively parallel processors. They're designed to do the same operation on thousands of data points simultaneously. When you process one request at a time, most of those parallel units sit idle. The model weights are loaded into memory and the GPU is technically "busy," but it's only doing a fraction of the work it could handle. Processing a batch of 8 or 16 requests simultaneously barely increases latency for any individual request, but it multiplies your throughput dramatically.
The catch is memory. Each concurrent request needs its own KV cache, which for long-context models can be substantial. A single request with a 4,096-token context on a 7B model might use 500 MB of KV cache. Batch 32 of those, and you've added 16 GB of memory overhead on top of the model weights. This is where careful planning — figuring out your target batch size, your average sequence length, and your total VRAM budget — becomes essential. And it's exactly the kind of calculation that's tedious to do by hand but straightforward with the right tool.
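The KV cache math works out like this. The shape numbers below are assumptions roughly matching a grouped-query-attention model like Llama 3 8B (32 layers, 8 KV heads, head dimension 128, FP16 cache); older models with full multi-head attention need about 4x more:

```python
# Per-request KV cache: two tensors (keys and values) per layer, each
# kv_heads * head_dim wide, one entry per token in the context.
def kv_cache_gb(seq_len: int, batch: int = 1, layers: int = 32,
                kv_heads: int = 8, head_dim: int = 128,
                bytes_per_el: int = 2) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_el
    return batch * seq_len * per_token / 1024**3

print(f"1 request @ 4096 tokens: {kv_cache_gb(4096):.2f} GB")           # ~0.50 GB
print(f"batch of 32:             {kv_cache_gb(4096, batch=32):.1f} GB")  # ~16 GB
```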
The Cost Per Token Question
Most people encounter LLMs through APIs where the pricing is per token — something like $0.01 per 1,000 input tokens and $0.03 per 1,000 output tokens for a frontier model. Those numbers seem tiny until you start doing volume. A customer support chatbot handling 10,000 conversations per day, each averaging 2,000 tokens? That's 20 million tokens daily, or roughly $600/day if you price it all at the output rate. Over $18,000 a month.
Self-hosting flips the cost structure. Instead of paying per token, you're paying for GPU time — and whether you generate 1 token or 1 million in an hour, the hourly rental price is the same. The breakeven point depends on your volume, but for many companies doing serious LLM workloads, self-hosting a quantized open model becomes cheaper somewhere around 50,000-100,000 requests per day. Below that, the API is almost always the better deal once you factor in the engineering time to run your own infrastructure.
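A toy breakeven calculation makes the shape of this tradeoff visible. Every price here is a made-up round number for illustration: a blended $0.002 per 1,000 tokens for a mid-tier API, and four GPUs at $2 per hour for self-hosting. It also ignores the engineering time mentioned above, which in practice pushes the real breakeven higher:

```python
# API billing scales with tokens; GPU rental is flat regardless of volume.
def api_cost_per_day(requests_per_day: int, tokens_per_request: int = 2000,
                     usd_per_1k_tokens: float = 0.002) -> float:
    return requests_per_day * tokens_per_request / 1000 * usd_per_1k_tokens

def selfhost_cost_per_day(num_gpus: int = 4, usd_per_gpu_hour: float = 2.0) -> float:
    return num_gpus * usd_per_gpu_hour * 24   # same price for 1 token or 1 million

for reqs in (10_000, 50_000, 100_000):
    api, hosted = api_cost_per_day(reqs), selfhost_cost_per_day()
    cheaper = "self-host" if hosted < api else "API"
    print(f"{reqs:>7,} req/day: API ${api:,.0f} vs GPUs ${hosted:,.0f} -> {cheaper}")
```

Under these assumptions the crossover lands near 50,000 requests per day, in line with the range above.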
A detail that often gets missed in these calculations: the cost isn't just the GPU. You need fast storage to load model weights (NVMe, ideally), high-bandwidth networking if you're splitting across multiple GPUs, and someone who actually knows how to operate this stuff. The pure GPU math is necessary but not sufficient for a real cost estimate.
The Environmental Bill
All this compute has a physical cost that I think the AI industry is weirdly reluctant to talk about honestly. An A100 GPU draws about 300 watts under load. Four of them running 24/7 to serve a 70B model consume roughly 28.8 kWh per day — comparable to what an average American household uses. And that's just the GPUs, not the cooling, networking, storage, and other datacenter overhead that roughly doubles the total energy draw.
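The household comparison is straightforward to verify (the EIA puts average US residential usage at roughly 29 kWh per day):

```python
# GPU-only energy for an always-on deployment. Cooling, networking,
# and other datacenter overhead roughly double the real total.
def gpu_kwh_per_day(num_gpus: int = 4, watts_per_gpu: float = 300) -> float:
    return num_gpus * watts_per_gpu * 24 / 1000

kwh = gpu_kwh_per_day()
print(f"{kwh} kWh/day from the GPUs alone; ~{2 * kwh} kWh/day with overhead")
# 28.8 kWh/day -- about one average US household
```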
I wrote about this more extensively in a piece on the energy consumption of AI models, and the numbers are worth sitting with. Every token generated has an energy cost. Every model hosted around the clock has a carbon footprint. Quantization doesn't just save you money — it cuts energy consumption proportionally. Running a 70B model at INT4 instead of FP16 doesn't just mean fewer GPUs; it means roughly half the electricity draw. If you care about the environmental impact of your AI infrastructure, choosing the most efficient precision format that meets your quality bar is one of the highest-leverage decisions you can make.
The water usage is another factor that gets overlooked. Datacenters use enormous amounts of water for cooling. Microsoft reported a 34% increase in water consumption in 2022, which they attributed largely to AI workloads. When you're planning infrastructure, the energy and water costs deserve a line item right next to the GPU rental.
A Practical Cheat Sheet
After spending way too much time thinking about this, here's the mental model I use for quick estimation. For inference at FP16, multiply the parameter count (in billions) by 2 to get the VRAM in gigabytes. So 7B needs ~14 GB, 13B needs ~26 GB, 70B needs ~140 GB. For INT8, multiply by 1. For INT4, multiply by 0.5. These are just the model weights — add 20-50% overhead for KV cache and activations depending on your sequence length and batch size.
For fine-tuning with LoRA at FP16, take the inference requirement and add about 20-30%. For full fine-tuning, multiply the base weight size by 4-6x. These are rough numbers, but they'll get you in the right neighborhood for initial planning.
Then map that to hardware. An RTX 3090 or 4090 has 24 GB of VRAM. An A100 comes in 40 GB and 80 GB versions. An H100 has 80 GB. From there, the math is division: how many GPUs do you need to hold your total memory requirement? Multiply by the hourly rate, and you've got a monthly cost estimate.
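Put together, the whole cheat sheet is a few lines of arithmetic. The $2 per GPU-hour rate is an illustrative assumption within the range quoted earlier, and remember that real deployments add headroom cards on top of the floor this computes:

```python
import math

# Map a total VRAM requirement onto a GPU count and a monthly rental bill.
def plan(total_vram_gb: float, gpu_vram_gb: int = 80,
         usd_per_gpu_hour: float = 2.0) -> tuple:
    gpus = math.ceil(total_vram_gb / gpu_vram_gb)
    monthly = gpus * usd_per_gpu_hour * 24 * 30
    return gpus, monthly

# 70B @ FP16 (~140 GB of weights) on A100-80GB cards:
gpus, monthly = plan(140)
print(f"{gpus} GPUs at minimum, ~${monthly:,.0f}/month")  # 2 GPUs, ~$2,880/month
```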
Or — and I genuinely think this is the smarter move — just plug your numbers into the LLM resource calculator and skip the back-of-envelope arithmetic. It factors in model size, precision format, batch size, and gives you GPU memory requirements, estimated cost, and throughput projections. It's the kind of tool I wish existed when my friend was trying to figure out if his startup could afford to self-host.
Start With the Numbers, Not the Hype
The AI industry has a marketing problem: it sells capability without context. A new model announcement will trumpet "405 billion parameters!" as if that number alone tells you something actionable. It doesn't. What tells you something actionable is: this model needs roughly 810 GB of VRAM at FP16, which means a minimum of eleven A100-80GB GPUs, which at the rental rates above will cost your team somewhere around $12,000 to $24,000 per month before you add headroom and overhead. That's the information you need to make a decision.
The gap between "we could use AI for this" and "we can actually afford to run AI for this" is where most projects stall. Not because the technology doesn't work, but because nobody did the infrastructure math until three weeks into the proof of concept. Do the math first. Figure out what hardware you need, what it will cost, and what tradeoffs (quantization, smaller models, API vs. self-hosting) are available. Then decide.
The compute requirements of LLMs aren't mysterious. They're just arithmetic that nobody bothered to make accessible. That's what I'm trying to fix.