Low GPU utilization with the Decision Transformer - Models https://lnkd.in/dGGduthz
-
Here's a helpful formula to calculate the GPU memory required for serving LLMs. As a rough guide I usually just double the number of parameters (in billions) and read that as the number of GB needed, but this formula takes quantization into account, which is nice for bigger models: https://lnkd.in/g2FSds7c
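For illustration, here is a minimal sketch of that kind of estimate, assuming the commonly used form memory ≈ params × bytes-per-param × ~1.2 overhead; the function name and the overhead factor are my own choices and may not match the linked article exactly.

# A minimal sketch of a serving-memory estimate (assumption: the common
# rule of thumb memory ≈ params x bytes-per-param x ~1.2 overhead).
def estimate_serving_memory_gb(params_billion: float,
                               bits_per_param: int = 16,
                               overhead: float = 1.2) -> float:
    """Rough GPU memory (GB) needed to serve a model.

    The ~1.2 overhead factor is a placeholder for KV cache, activations,
    and framework buffers; real usage depends on context length, batch
    size, and the serving stack.
    """
    bytes_per_param = bits_per_param / 8
    return params_billion * bytes_per_param * overhead

# Example: a 70B-parameter model
print(estimate_serving_memory_gb(70))                     # fp16 -> ~168 GB
print(estimate_serving_memory_gb(70, bits_per_param=4))   # 4-bit -> ~42 GB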
-
Building scaled #distributed systems is hard; the laws of computer science and physics still apply.
Why do 16k GPU jobs fail? The Llama3 paper has many cool details -- but notably, has a huge infrastructure section that covers how we parallelize, keep things reliable, etc. We hit an overall 90% effective-training-time. Read the full paper here: https://lnkd.in/e56fbzfj
-
This is the first discussion I'm aware of that covers the nuts and bolts of training frontier-scale LLMs. A very good read for anyone hoping to train large models.
-
The Meta Llama 3.1 paper has a section on the HW/SW infrastructure used to train the model on 16k GPUs. It spans 7 pages and covers MANY engineering challenges, decisions, trade-offs, and novel ideas needed to make such training possible and efficient (see the list below). This is an insane amount of work by many, many world-class engineers. It is unfortunate, however, that almost all of it is required only because GPUs are not the right tool for this job! If only a better solution existed... one that could reduce this list to zero, making it possible for all kinds of companies (not just those with top-notch engineering teams) to achieve similar results (it does exist; search for Cerebras CS-3). Some of the challenges/decisions mentioned in the paper:
- Handling bursty checkpoint writes to minimize GPU pause time.
- Ensuring equivalent performance across InfiniBand and RoCE networks.
- Employing a three-layer Clos network to minimize cross-pod communication.
- Avoiding congestion using deep-buffer switches and the Enhanced-ECMP protocol.
- Addressing memory and computation imbalances in pipeline parallelism.
- Partitioning sequences for memory efficiency and training on sequences up to 128K tokens.
- Optimizing parallelism dimensions to reduce communication overhead.
- Developing a memory consumption estimator to explore various parallelism configurations (a rough sketch follows this list).
- Creating NCCLX, a customized version of NVIDIA's NCCL, to improve performance on higher-latency networks.
- Manually managing job interruptions and automating recovery to increase effective training time.
- Identifying load/store stalls that are in fact a failing remote GPU or NVLink.
- Handling power consumption swings caused by synchronized GPU activity.
- Managing performance variations caused by diurnal temperature changes.
- Debugging mixed NVLink and RoCE network issues.
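As an aside, that kind of memory consumption estimator is easy to sketch at a back-of-the-envelope level. The following is a rough illustration, not the paper's actual tool; all byte sizes, the activation placeholder, and the candidate configurations are assumptions.

def estimate_per_gpu_memory_gb(
    params_b: float,              # model parameters, in billions
    tp: int,                      # tensor-parallel degree
    pp: int,                      # pipeline-parallel degree
    dp_shard: int = 1,            # optimizer-state sharding degree (ZeRO/FSDP style)
    bytes_weights: float = 2,     # bf16 weights
    bytes_grads: float = 2,       # bf16 gradients
    bytes_optim: float = 8,       # fp32 Adam moments (2 x 4 bytes)
    activation_gb: float = 10.0,  # placeholder; depends on seq len, batch, recomputation
) -> float:
    """Very rough per-GPU memory estimate for one parallelism configuration.

    Weights and gradients are split across TP x PP ranks; optimizer state
    is additionally sharded across the data-parallel group. Activations
    are a fixed placeholder here because they depend heavily on sequence
    length, micro-batch size, and activation recomputation.
    """
    model_shard = tp * pp
    weights = params_b * bytes_weights / model_shard
    grads = params_b * bytes_grads / model_shard
    optim = params_b * bytes_optim / (model_shard * dp_shard)
    return weights + grads + optim + activation_gb

# Sweep a few hypothetical configurations for a 405B-parameter model.
for tp, pp, dp_shard in [(8, 16, 8), (8, 8, 16), (16, 8, 8)]:
    gb = estimate_per_gpu_memory_gb(405, tp, pp, dp_shard)
    print(f"TP={tp} PP={pp} optim-shard={dp_shard}: ~{gb:.0f} GB per GPU")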
-
What will happen to older-generation GPU clusters once we upgrade? Can we repurpose them for general compute applications? And what will businesses do with these assets?
-
Llama3 is the result of a lot of hard work by Meta GenAI and infra teams (including AI HW/SW co-design). It's great to see so many infrastructure details shared in the paper (https://lnkd.in/e56fbzfj), including parallelization, collective communication, fp8 quantization, data center design/operations, and so on. Some of the techniques, for example the parallelization methods, were known from research papers, but we ran into various challenges when applying them in practice at this scale (more can be found in the paper).
-
Training large language models (LLMs) in AI is a colossal task, often requiring thousands of GPUs. The sheer scale leads to frequent training failures due to hardware issues, a modern nod to the "law of large numbers." This reminds me of the origin of the term "computer bug," when an actual insect caused a malfunction by getting caught in a relay of an early electromechanical computer. Today, "bugs" are more likely to be hardware failures among the 16,000 GPUs needed to train these advanced models.
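To make the law-of-large-numbers point concrete, here is a toy calculation with a purely hypothetical per-GPU failure rate; the MTBF number below is invented for illustration and is not taken from the paper.

# Toy illustration of why large clusters fail often.
GPUS = 16_000
MTBF_HOURS_PER_GPU = 50_000   # hypothetical mean time between failures per GPU

cluster_failures_per_hour = GPUS / MTBF_HOURS_PER_GPU
hours_between_failures = 1 / cluster_failures_per_hour
print(f"~{cluster_failures_per_hour:.2f} failures/hour, "
      f"i.e. one roughly every {hours_between_failures:.1f} hours")
# Even with very reliable individual parts, a 16k-GPU job sees
# interruptions every few hours, so recovery has to be automated.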
-
The vLLM v0.6.0 update introduces significant performance improvements, with throughput increases of 1.8-2.7x compared to v0.5.3. The performance bottleneck in previous versions was high CPU overhead and a lack of asynchronicity, which led to 62% of processing time being spent on non-GPU tasks. v0.6.0 addresses this with three changes:
1. Separating the API server and inference engine into different processes.
2. Batching multiple scheduling steps together.
3. Implementing asynchronous output processing for better GPU utilization (sketched below).
Blog: https://lnkd.in/dnETM2fx
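For intuition on the third point, here is a conceptual sketch, not vLLM's actual code: all function names are placeholders, and the idea is simply to overlap CPU-side output processing for step N with the GPU forward pass for step N+1.

from concurrent.futures import ThreadPoolExecutor

def gpu_step(batch):
    """Placeholder for one model forward pass on the GPU."""
    return [f"token_ids_for_{req}" for req in batch]

def process_outputs(raw_outputs):
    """Placeholder for CPU-side detokenization and response building."""
    return [out.upper() for out in raw_outputs]

def serve(batches):
    results = []
    with ThreadPoolExecutor(max_workers=1) as cpu_pool:
        pending = None
        for batch in batches:
            raw = gpu_step(batch)                       # GPU work for this step
            if pending is not None:
                results.extend(pending.result())        # collect previous step's CPU work
            pending = cpu_pool.submit(process_outputs, raw)  # overlap CPU work with the next step
        if pending is not None:
            results.extend(pending.result())
    return results

print(serve([["req1", "req2"], ["req3"]]))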
-
Understanding nvidia-smi
The "GPU-Util" percentage reported by nvidia-smi is often misunderstood. It measures the fraction of time when at least one kernel is executing on the GPU. That's it.
100% GPU-Util doesn't mean the GPU is fully utilized or busy with intense computations. It means at least one thread in one kernel is executing.
Low GPU-Util doesn't mean the GPU is idle. It could mean kernels are executing, but with significant idle time between them or with many threads waiting for memory accesses or other dependencies.
Get a More Accurate Picture
Use other metrics to get a more accurate picture of GPU utilization:
- Memory bandwidth utilization
- Compute utilization
- Power utilization
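One quick way to sample some of those extra metrics is through NVML. Below is a minimal sketch using the pynvml Python bindings (assuming the nvidia-ml-py package is installed and at least one NVIDIA GPU is present); the sampling interval and the chosen metrics are just examples.

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

for _ in range(5):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)       # .gpu: kernel-activity %, .memory: % of time device memory was read/written
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)               # used / total device memory in bytes
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # milliwatts -> watts
    print(f"GPU-Util {util.gpu}%  Mem-BW-Util {util.memory}%  "
          f"Mem {mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB  "
          f"Power {power_w:.0f} W")
    time.sleep(1)

pynvml.nvmlShutdown()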
-
Learn how to read GPU benchmarks and get the most bang for your buck.