Learn how to read GPU benchmarks and get the most bang for your buck.
How-To Geek’s Post
More Relevant Posts
-
I've been trying for several days to get PyTorch running. Yes, I could have just used Google Colab, but I ended up learning something about environments this way: it turns out I wasn't using the environment I had been installing packages into. Now I know how to automatically use every CPU thread, or switch to the GPU (after installing CUDA). My GPU isn't the most impressive one on the market, but this shows it's no slouch: 112 Xeon threads, and my GPU is STILL 500 times faster in this case! It's clear why NVIDIA has exploded in recent years.
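For anyone fighting the same setup, here is a minimal sketch of the thread/device selection described above. The 4096x4096 matmul workload and the timing approach are illustrative choices, not the original benchmark.

```python
import os
import time

import torch

# Use every hardware thread for CPU intra-op parallelism.
torch.set_num_threads(os.cpu_count())

# Prefer the GPU when a CUDA build and driver are present, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"CPU threads: {torch.get_num_threads()}, device: {device}")

def time_matmul(dev: torch.device, n: int = 4096) -> float:
    """Time a single n x n matrix multiply on the given device."""
    a = torch.randn(n, n, device=dev)
    b = torch.randn(n, n, device=dev)
    if dev.type == "cuda":
        torch.cuda.synchronize()  # finish the allocations before timing
    start = time.perf_counter()
    _ = a @ b
    if dev.type == "cuda":
        torch.cuda.synchronize()  # wait for the kernel to complete
    return time.perf_counter() - start

cpu_s = time_matmul(torch.device("cpu"))
print(f"CPU: {cpu_s:.4f} s")
if device.type == "cuda":
    gpu_s = time_matmul(device)
    print(f"GPU: {gpu_s:.4f} s  ({cpu_s / gpu_s:.0f}x faster)")
```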
-
New on Modal: GPU Fallbacks ✨ You can now specify a list of possible GPU types for functions that are compatible with multiple options, e.g. gpu=["h100", "a100"] below. We'll also respect the ordering of this list, trying to allocate the more preferred options before falling back to others. Read more here: https://lnkd.in/gkdQYqWe
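A minimal sketch of what that might look like in app code, assuming Modal's current App / @app.function API; the app name, function body, and nvidia-smi check are illustrative, and only the gpu=["h100", "a100"] list comes from the announcement.

```python
import subprocess

import modal

app = modal.App("gpu-fallback-demo")  # app name is a placeholder

# Ask for an H100 first; if none are available, fall back to an A100.
@app.function(gpu=["h100", "a100"])
def which_gpu() -> str:
    # Illustrative check: report whichever GPU type was actually allocated.
    return subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True,
    ).stdout.strip()

@app.local_entrypoint()
def main():
    print(which_gpu.remote())
```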
-
Low GPU utilization with the Decision Transformer - Models https://lnkd.in/dGGduthz
-
I have always been a hardware guy, and even though I spend my days building cloud architecture, in my free time (what's that?) I work with open source Large Language Models (LLMs) and AI in my workshop lab. The community has constant questions about hardware, GPU vs CPU, and the perennial question of evaluation tokens per second. The answer depends on your needs, your model, the quant size, the size of your bank account, and what you consider acceptable. I also like to ask people: can you process and answer questions, or write (and iterate on) code, as fast as a 6-year-old CPU? How about a nearly 10-year-old GPU? Let's apply some actual data to these questions, using #localai #kubernetes #streamlit #langchain to run some tests (a minimal throughput sketch follows the video link below). i9 CPU vs Tesla M40 (old GPU) vs 4060 Ti (moderate-budget gaming GPU) vs A4500 (enterprise GPU) https://lnkd.in/gftgfBt7
LocalAI LLM Testing: i9 CPU vs Tesla M40 vs 4060Ti vs A4500
https://www.youtube.com/
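For anyone who wants to reproduce the rough idea at home, here is a minimal throughput sketch against a LocalAI (OpenAI-compatible) endpoint. The base URL, model name, and prompt are placeholder assumptions, and it measures simple end-to-end tokens per second rather than the exact methodology used in the video.

```python
import time

import requests

# LocalAI exposes an OpenAI-compatible API; these values are placeholders.
BASE_URL = "http://localhost:8080/v1"
MODEL = "llama-3-8b-instruct"  # hypothetical model name

def tokens_per_second(prompt: str) -> float:
    """Fire one chat completion and return generated tokens per second."""
    start = time.perf_counter()
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,
        },
        timeout=600,
    ).json()
    elapsed = time.perf_counter() - start
    # Prefer the server-reported token count; fall back to a rough word-based estimate.
    usage = resp.get("usage", {})
    completion_tokens = usage.get("completion_tokens") or int(
        len(resp["choices"][0]["message"]["content"].split()) * 1.33
    )
    return completion_tokens / elapsed

print(f"{tokens_per_second('Explain CUDA in three sentences.'):.1f} tok/s")
```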
-
People or apps (or both) will need to get a lot chattier for anyone to make $1M selling LLM inference. Speeds increase and prices drop, but the math doesn't add up yet. If ~750,000 words cost $1, then 750,000,000,000 words = $1M. That's a lot of chit-chat! And at a recently published 450 tokens per second, it would take roughly 90 GPUs working around the clock for a year. That doesn't include power, cooling, people, or the new models that appear every few months and need to be supported. Here's a quick comparison of how many GPUs each provider would need to put to work to make $1M on model inference (FTG = full-time GPU at 100% utilization).
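The back-of-the-envelope math is easy to parameterize. In this sketch the ~1.33 tokens-per-word ratio and 100% utilization are assumptions, which is exactly why the resulting GPU count swings around the ~90 figure cited above.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def gpus_needed(revenue_usd: float,
                usd_per_million_words: float,
                tokens_per_second: float,
                tokens_per_word: float = 1.33,  # assumption: ~0.75 words per token
                utilization: float = 1.0) -> float:
    """Full-time GPUs (FTG) needed to generate `revenue_usd` of inference in one year."""
    words = revenue_usd / usd_per_million_words * 1_000_000
    tokens = words * tokens_per_word
    tokens_per_gpu_year = tokens_per_second * utilization * SECONDS_PER_YEAR
    return tokens / tokens_per_gpu_year

# ~$1 per 750k words, 450 tok/s, one year at 100% utilization.
print(round(gpus_needed(1_000_000, 1 / 0.75, 450)))
```

With these particular inputs the sketch lands around 70 full-time GPUs; nudging the token ratio or the utilization factor is enough to reach the ~90 cited above, and that sensitivity to assumptions is really the point.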
-
Here's a helpful formula to calculate the GPU memory required for serving LLMs. As a rough guide I usually just double the number of params to get the number of GB needed, but this formula takes quantization into account, which is nice for bigger models: https://lnkd.in/g2FSds7c
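The linked formula isn't reproduced in the post, so here is a sketch of the common rule of thumb that is consistent with the "double the params" heuristic (2 bytes per parameter at 16-bit), extended with a quantization bit-width and a serving-overhead factor; the 1.2 overhead is an assumption.

```python
def gpu_memory_gb(params_billion: float, bits: int = 16, overhead: float = 1.2) -> float:
    """Rough GPU memory estimate for serving an LLM.

    bits / 8 gives bytes per parameter; `overhead` covers KV cache,
    activations, and framework buffers (the 1.2 factor is an assumption).
    """
    return params_billion * (bits / 8) * overhead

# 70B model: ~168 GB at fp16, ~42 GB at 4-bit quantization.
print(gpu_memory_gb(70, bits=16))  # 168.0
print(gpu_memory_gb(70, bits=4))   #  42.0
```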
-
GPU Performance - Reduce Instruction Cache Misses
-
How much stress can your server handle when you're self-hosting LLMs? In the transition from prototype to production, this is a crucial question. 🧐 When we host models like Llama 3.1 8B, we have to make infrastructure decisions that affect performance, cost, and user experience. Here's what I recently learned through load testing.
🎯 I tested different infrastructure options: starting with a single A40 GPU, scaling up to two GPUs, and finally upgrading to an L40S GPU. I simulated real-world traffic by using Postman to fire requests from 50 to 100 virtual users. Here's what happened:
🖥️ A single A40 GPU handled the traffic reasonably well but had slower response times.
🔄 Doubling the GPUs provided only a marginal improvement, not worth the cost.
🚀 The L40S GPU was faster but far more expensive; despite its performance, the cost didn't justify the switch for general use.
For most use cases, sticking with a single A40 GPU gave the best balance of performance and cost-efficiency. But for mission-critical workloads, upgrading to a newer-generation GPU like the L40S might be necessary.
💡 Key takeaway: always load test your infrastructure before scaling! It helps you make informed decisions about how to handle user traffic without overspending. Postman is a great tool for this, and it's free. For a deeper dive, check out the article: https://lnkd.in/eUKMj5gT
#infrastructure #AI #LLMs #LoadTesting #GPU #Postman #CostEfficiency
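Postman's runner works well for this, but for anyone who prefers code, here is a minimal Python sketch of the same idea: N concurrent virtual users hitting an OpenAI-compatible chat endpoint and reporting latency percentiles. The URL, model name, payload, and user counts are placeholders, not the setup from the article.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholders: point these at your own deployment.
URL = "http://localhost:8000/v1/chat/completions"
PAYLOAD = {
    "model": "llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Summarize load testing in one line."}],
    "max_tokens": 64,
}

def one_request(_: int) -> float:
    """Fire a single request and return its end-to-end latency in seconds."""
    start = time.perf_counter()
    requests.post(URL, json=PAYLOAD, timeout=300).raise_for_status()
    return time.perf_counter() - start

def load_test(virtual_users: int, requests_per_user: int = 5) -> None:
    """Run virtual_users concurrent workers and print latency percentiles."""
    total = virtual_users * requests_per_user
    with ThreadPoolExecutor(max_workers=virtual_users) as pool:
        latencies = sorted(pool.map(one_request, range(total)))
    print(f"users={virtual_users}  p50={statistics.median(latencies):.2f}s  "
          f"p95={latencies[int(0.95 * len(latencies))]:.2f}s")

load_test(50)
load_test(100)
```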