Learn how to read GPU benchmarks and get the most bang for your buck.
How-To Geek’s Post
More Relevant Posts
-
I've been trying for several days to get PyTorch running. Yes, I could have just used Google Colab, but I ended up learning something about environments this way: it turns out I wasn't using the environment I had been installing packages into. Now I know how to automatically use every CPU thread, or switch to the GPU (after installing CUDA). My GPU isn't the most impressive one on the market, but this shows it's no slouch: 112 Xeon threads, and my GPU is STILL 500 times faster in this case! It's clear why NVIDIA has exploded in recent years.
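For anyone fighting the same setup, here is a minimal sketch of the thread/device selection described above. The 4096x4096 matmul workload and the timing approach are illustrative choices, not the original benchmark.

```python
import os
import time

import torch

# Use every hardware thread for CPU intra-op parallelism.
torch.set_num_threads(os.cpu_count())

# Prefer the GPU when a CUDA build and driver are present, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"CPU threads: {torch.get_num_threads()}, device: {device}")

def time_matmul(dev: torch.device, n: int = 4096) -> float:
    """Time a single n x n matrix multiply on the given device."""
    a = torch.randn(n, n, device=dev)
    b = torch.randn(n, n, device=dev)
    if dev.type == "cuda":
        torch.cuda.synchronize()  # finish the allocations before timing
    start = time.perf_counter()
    _ = a @ b
    if dev.type == "cuda":
        torch.cuda.synchronize()  # wait for the kernel to complete
    return time.perf_counter() - start

cpu_s = time_matmul(torch.device("cpu"))
print(f"CPU: {cpu_s:.4f} s")
if device.type == "cuda":
    gpu_s = time_matmul(device)
    print(f"GPU: {gpu_s:.4f} s  ({cpu_s / gpu_s:.0f}x faster)")
```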
-
New on Modal: GPU Fallbacks ✨ You can now specify a list of possible GPU types for functions that are compatible with multiple options, e.g. gpu=["h100", "a100"] below. We'll also respect the ordering of this list, trying to allocate the more preferred options before falling back to others. Read more here: https://lnkd.in/gkdQYqWe
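A minimal sketch of what that might look like in app code, assuming Modal's current App / @app.function API; the app name, function body, and nvidia-smi check are illustrative, and only the gpu=["h100", "a100"] list comes from the announcement.

```python
import subprocess

import modal

app = modal.App("gpu-fallback-demo")  # app name is a placeholder

# Ask for an H100 first; if none are available, fall back to an A100.
@app.function(gpu=["h100", "a100"])
def which_gpu() -> str:
    # Illustrative check: report whichever GPU type was actually allocated.
    return subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True,
    ).stdout.strip()

@app.local_entrypoint()
def main():
    print(which_gpu.remote())
```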
-
Low GPU utilization with the Decision Transformer - Models https://lnkd.in/dGGduthz
-
I have always been a hardware guy, and even though I spend my days building cloud architecture, in my free time (what's that?) I work with open source Large Language Models (LLMs) and AI in my workshop lab. The community has constant questions about hardware, GPU vs CPU, and the perennial question of evaluation tokens per second. The answer depends on your needs, your model, the quant size, the size of your bank account, and what you consider acceptable. I also like to ask people: can you process and answer questions, or write (and iterate on) code, as fast as a 6-year-old CPU? How about a nearly 10-year-old GPU? Let's apply some actual data to these questions, using #localai #kubernetes #streamlit #langchain to run some tests (a minimal throughput sketch follows the video link below). i9 CPU vs Tesla M40 (old GPU) vs 4060 Ti (moderate-budget gaming GPU) vs A4500 (enterprise GPU) https://lnkd.in/gftgfBt7
LocalAI LLM Testing: i9 CPU vs Tesla M40 vs 4060Ti vs A4500
https://www.youtube.com/
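For anyone who wants to reproduce the rough idea at home, here is a minimal throughput sketch against a LocalAI (OpenAI-compatible) endpoint. The base URL, model name, and prompt are placeholder assumptions, and it measures simple end-to-end tokens per second rather than the exact methodology used in the video.

```python
import time

import requests

# LocalAI exposes an OpenAI-compatible API; these values are placeholders.
BASE_URL = "http://localhost:8080/v1"
MODEL = "llama-3-8b-instruct"  # hypothetical model name

def tokens_per_second(prompt: str) -> float:
    """Fire one chat completion and return generated tokens per second."""
    start = time.perf_counter()
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,
        },
        timeout=600,
    ).json()
    elapsed = time.perf_counter() - start
    # Prefer the server-reported token count; fall back to a rough word-based estimate.
    usage = resp.get("usage", {})
    completion_tokens = usage.get("completion_tokens") or int(
        len(resp["choices"][0]["message"]["content"].split()) * 1.33
    )
    return completion_tokens / elapsed

print(f"{tokens_per_second('Explain CUDA in three sentences.'):.1f} tok/s")
```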
-
People or apps (or both) will need to get a lot chattier for anyone to make $1M selling LLM inference. Speeds increase and prices drop, but the math doesn't add up yet. If ~750,000 words cost $1, then 750,000,000,000 words = $1M. That's a lot of chit-chat! And at a recently published 450 tokens per second, it would take roughly 90 GPUs working around the clock for a year. That doesn't include power, cooling, people, or the new models that appear every few months and need to be supported. Here's a quick comparison of how many GPUs each provider would need to put to work to make $1M on model inference (FTG = full-time GPU at 100% utilization).
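The back-of-the-envelope math is easy to parameterize. In this sketch the ~1.33 tokens-per-word ratio and 100% utilization are assumptions, which is exactly why the resulting GPU count swings around the ~90 figure cited above.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def gpus_needed(revenue_usd: float,
                usd_per_million_words: float,
                tokens_per_second: float,
                tokens_per_word: float = 1.33,  # assumption: ~0.75 words per token
                utilization: float = 1.0) -> float:
    """Full-time GPUs (FTG) needed to generate `revenue_usd` of inference in one year."""
    words = revenue_usd / usd_per_million_words * 1_000_000
    tokens = words * tokens_per_word
    tokens_per_gpu_year = tokens_per_second * utilization * SECONDS_PER_YEAR
    return tokens / tokens_per_gpu_year

# ~$1 per 750k words, 450 tok/s, one year at 100% utilization.
print(round(gpus_needed(1_000_000, 1 / 0.75, 450)))
```

With these particular inputs the sketch lands around 70 full-time GPUs; nudging the token ratio or the utilization factor is enough to reach the ~90 cited above, and that sensitivity to assumptions is really the point.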
-
Here's a helpful formula to calculate the GPU memory required for serving LLMs. As a rough guide I usually just double the number of params to get the number of GB needed, but this formula takes quantization into account, which is nice for bigger models: https://lnkd.in/g2FSds7c
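The linked formula isn't reproduced in the post, so here is a sketch of the common rule of thumb that is consistent with the "double the params" heuristic (2 bytes per parameter at 16-bit), extended with a quantization bit-width and a serving-overhead factor; the 1.2 overhead is an assumption.

```python
def gpu_memory_gb(params_billion: float, bits: int = 16, overhead: float = 1.2) -> float:
    """Rough GPU memory estimate for serving an LLM.

    bits / 8 gives bytes per parameter; `overhead` covers KV cache,
    activations, and framework buffers (the 1.2 factor is an assumption).
    """
    return params_billion * (bits / 8) * overhead

# 70B model: ~168 GB at fp16, ~42 GB at 4-bit quantization.
print(gpu_memory_gb(70, bits=16))  # 168.0
print(gpu_memory_gb(70, bits=4))   #  42.0
```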
-
GPU Performance - Reduce Instruction Cache Misses
-
How much stress can your server handle when you're self-hosting LLMs? In the transition from prototype to production, this is a crucial question. 🧐 When we host models like Llama 3.1 8B, we have to make infrastructure decisions that affect performance, cost, and user experience. Here's what I recently learned through load testing.
🎯 I tested different infrastructure options: starting with a single A40 GPU, scaling up to two GPUs, and finally upgrading to an L40S GPU. I simulated real-world traffic by using Postman to fire requests from 50 to 100 virtual users. Here's what happened:
🖥️ A single A40 GPU handled the traffic reasonably well but had slower response times.
🔄 Doubling the GPUs provided only a marginal improvement, not worth the cost.
🚀 The L40S GPU was faster but far more expensive; despite its performance, the cost didn't justify the switch for general use.
For most use cases, sticking with a single A40 GPU gave the best balance of performance and cost-efficiency. But for mission-critical workloads, upgrading to a newer-generation GPU like the L40S might be necessary.
💡 Key takeaway: always load test your infrastructure before scaling! It helps you make informed decisions about how to handle user traffic without overspending. Postman is a great tool for this, and it's free. For a deeper dive, check out the article: https://lnkd.in/eUKMj5gT
#infrastructure #AI #LLMs #LoadTesting #GPU #Postman #CostEfficiency
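Postman's runner works well for this, but for anyone who prefers code, here is a minimal Python sketch of the same idea: N concurrent virtual users hitting an OpenAI-compatible chat endpoint and reporting latency percentiles. The URL, model name, payload, and user counts are placeholders, not the setup from the article.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholders: point these at your own deployment.
URL = "http://localhost:8000/v1/chat/completions"
PAYLOAD = {
    "model": "llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Summarize load testing in one line."}],
    "max_tokens": 64,
}

def one_request(_: int) -> float:
    """Fire a single request and return its end-to-end latency in seconds."""
    start = time.perf_counter()
    requests.post(URL, json=PAYLOAD, timeout=300).raise_for_status()
    return time.perf_counter() - start

def load_test(virtual_users: int, requests_per_user: int = 5) -> None:
    """Run virtual_users concurrent workers and print latency percentiles."""
    total = virtual_users * requests_per_user
    with ThreadPoolExecutor(max_workers=virtual_users) as pool:
        latencies = sorted(pool.map(one_request, range(total)))
    print(f"users={virtual_users}  p50={statistics.median(latencies):.2f}s  "
          f"p95={latencies[int(0.95 * len(latencies))]:.2f}s")

load_test(50)
load_test(100)
```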