ModernBERT is an open-source embedding model that represents a significant leap in encoder-only transformer architecture, combining:
> Long context (8k tokens)
> Local-global attention for efficiency
> Up-to-date code and text training data (2 trillion tokens)
> FlashAttention & unpadding to push performance on consumer and server GPUs
> High retrieval and classification accuracy that sets new benchmarks across GLUE, BEIR, code tasks, and more.
Learn more about this exciting new embedding model here: https://lnkd.in/en5ud9VE
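For readers who want to kick the tires, here is a minimal sketch of pulling mean-pooled sentence embeddings from an encoder-only model with Hugging Face transformers. The checkpoint name and the pooling choice are assumptions for illustration, not details from the post:

```python
# Minimal sketch: mean-pooled sentence embeddings from an encoder-only model.
# The checkpoint name below is an assumption for illustration; substitute the
# ModernBERT variant you actually intend to use.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "answerdotai/ModernBERT-base"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentences = ["ModernBERT supports an 8k-token context window."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state        # (batch, seq, dim)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = batch["attention_mask"].unsqueeze(-1)          # (batch, seq, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)
```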
-
Pixtral 12B in short:
- Natively multimodal, trained with interleaved image and text data
- Strong performance on multimodal tasks, excels in instruction following
- Maintains state-of-the-art performance on text-only benchmarks
Architecture:
- New 400M parameter vision encoder trained from scratch
- 12B parameter multimodal decoder based on Mistral Nemo
- Supports variable image sizes and aspect ratios
- Supports multiple images in the long context window of 128k tokens
Use:
- License: Apache 2.0
- Try it on La Plateforme or on Le Chat
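As a rough sketch of the "try it on La Plateforme" route, calling Pixtral through the Mistral Python client might look like the following; the client version, model name, and image URL are assumptions based on the public SDK docs, not details from the post:

```python
# Hypothetical sketch of querying Pixtral 12B via La Plateforme with the
# mistralai Python SDK (v1-style client). Model name and image URL are
# illustrative assumptions.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

response = client.chat.complete(
    model="pixtral-12b-2409",  # assumed API model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": "https://example.com/chart.png"},
        ],
    }],
)
print(response.choices[0].message.content)
```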
-
Why GPU utilization for LLM inference is trickier than you might think 👇
Naive LLM inference on GPUs uses <1% of available compute power. The reason? It's all about the balance between memory bandwidth and FLOPS (floating-point operations per second).
Think of GPU processing like this:
- Memory bandwidth: loads the model weights (this cost stays relatively flat as batch size grows)
- FLOPS: handles the calculations (this cost grows linearly with batch size)
The crossover point matters:
- Below it? You're memory-bandwidth limited.
- Above it? You're FLOPS limited.
For modern GPUs, you need hundreds of tokens to hit that sweet spot. Single-token decoding is especially wasteful: if the crossover is at 512 tokens, one decode step = ~0.2% FLOPS utilization. That's throwing away 99.8% of your GPU's potential! 🤯
Through innovative engineering and optimization, Augment Code has achieved 25% FLOPS utilization - a massive leap forward in GPU efficiency. This breakthrough enables us to deliver snappy, context-rich code completions with exceptionally low latency. Here's how we did it: https://lnkd.in/gVT5kQRg
#MachineLearning #GPUOptimization #TechnicalOptimization #AI #Engineering #Innovation
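A back-of-the-envelope sketch of the roofline reasoning above; the hardware numbers and model size are illustrative assumptions (roughly A100-class), not figures from the post:

```python
# Roofline back-of-the-envelope for one decoding step: the step is limited by
# whichever is slower, streaming the weights from memory or doing the math.
# All numbers below are illustrative assumptions, not measurements.
PEAK_FLOPS = 312e12        # assumed peak FP16 tensor throughput, FLOP/s
MEM_BW = 2.0e12            # assumed HBM bandwidth, bytes/s
N_PARAMS = 13e9            # assumed model size (parameters)
PARAM_BYTES = N_PARAMS * 2 # FP16 weights

def decode_step(batch_tokens: int):
    """Return (step time in s, FLOPS utilization) for one forward pass."""
    flops_needed = 2 * N_PARAMS * batch_tokens       # ~2 FLOPs per parameter per token
    compute_time = flops_needed / PEAK_FLOPS         # time if compute-bound
    weight_load_time = PARAM_BYTES / MEM_BW          # time to stream the weights once
    step_time = max(compute_time, weight_load_time)  # the slower resource wins
    return step_time, flops_needed / (step_time * PEAK_FLOPS)

for tokens in (1, 64, 256, 1024):
    t, util = decode_step(tokens)
    print(f"{tokens:5d} tokens/step -> {t * 1e3:7.2f} ms, {util:6.1%} FLOPS utilization")
```

With these assumed numbers the crossover lands at roughly 150 tokens per step, the same qualitative picture the post paints with its 512-token example.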
-
Conditional Computation in LLMs (MoEs): Understanding the Concept of MoEs
The Mixture of Experts (MoE) works on the idea of conditional computation, allowing the model to only run through some parts (an "expert") for a given input, compared to a dense model where all parameters are used. It consists of a gating mechanism that selects the experts for the given input, enabling the model to scale without increasing the required computation. Popular LLM architectures like Mixtral and Grok use the same concept.
What is an MoE?
The Mixture of Experts (MoE) makes a simple modification to the usual architecture by replacing all or some of the feed-forward layers with an MoE layer. The MoE layer consists of:
1. Router: A parametric gating mechanism that selects the experts used to process the given input by producing a probability distribution over experts (a minimal sketch of this routing appears below).
2. Experts: Standalone NN modules with independent sets of parameters, trained jointly with the router. More complex architectures create hierarchical MoE modules by implementing each expert as another MoE.
- Implementation in Switch Transformers: training was found to be 7x faster than training the base T5 while keeping the computational budget (FLOPs per token) constant.
- GShard and MeshTransformer: these frameworks propose efficient methods to train MoE architectures using parallelisation of models across GPUs/TPUs.
Challenges with MoEs and Solutions
1. Inconsistent Batch Size for Experts: Can be overcome with the use of "soft" constraints in the training loss.
2. Load Balancing Across Experts: The issue of repeatedly using the same few experts while training can be overcome by including better balancing mechanisms (auxiliary load-balancing losses).
3. Model Instability Issues: Use of the router z-loss as an auxiliary loss to encourage smaller values within the router mechanism.
4. Model Overfitting: Requires different hyperparameter settings for effective fine-tuning. It was found that models benefit from high learning rates and smaller batch sizes.
Recommended Reading/References
1. Hugging Face Blog on MoE - https://lnkd.in/g36gQJTx
2. Cameron Wolfe's Substack on Conditional Computation - https://lnkd.in/gtpjMHnJ
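To make the router/expert split concrete, here is a minimal top-k routing sketch in PyTorch. It illustrates the mechanism only; the dimensions, expert count, and the naive per-expert loop are simplifying assumptions, not the Mixtral or Switch Transformer implementation:

```python
# Minimal sketch of a top-k MoE layer (illustrative only, not an optimised or
# production routing implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)   # gating mechanism
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (tokens, d_model)
        logits = self.router(x)                             # (tokens, n_experts)
        weights, chosen = logits.topk(self.top_k, dim=-1)   # pick top-k experts per token
        weights = F.softmax(weights, dim=-1)                 # renormalise over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e                   # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 64)                                  # 16 tokens, d_model = 64
print(MoELayer(d_model=64, d_ff=256)(tokens).shape)           # -> torch.Size([16, 64])
```

Each token only ever runs through `top_k` of the experts, which is exactly the conditional computation described above: parameter count grows with the number of experts, per-token compute does not.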
-
𝐀𝐧𝐧𝐨𝐮𝐧𝐜𝐢𝐧𝐠 𝐭𝐡𝐞 𝐒𝐤𝐲𝐃𝐞𝐜𝐤 𝐕𝐞𝐜𝐭𝐨𝐫 𝐒𝐞𝐫𝐯𝐞𝐫: 𝐌𝐚𝐫𝐤 𝐈! At SkyDeck.ai our passion for vector embedding and RAG at scale drives us to innovate continually. That's why we designed and built our own hardware. And now we are sharing it with you. Our vector database server is optimized for this task with just the right balance of NVMe SSD, SAS storage, memory, and CPU. As a result, you get exceptional performance at a much lower cost than an equivalent system from vendors like Dell. Available as bare metal, or with Ubuntu Server and a pre-configured PostgreSQL and pgvector vector database installed. 𝐷𝑒𝑠𝑖𝑔𝑛𝑒𝑑, 𝑏𝑢𝑖𝑙𝑡, 𝑎𝑛𝑑 𝑑𝑒𝑝𝑙𝑜𝑦𝑒𝑑 𝑖𝑛 𝑆𝑖𝑙𝑖𝑐𝑜𝑛 𝑉𝑎𝑙𝑙𝑒𝑦. Check out our latest blog on how we pick our embedding model: https://lnkd.in/gZC86SD6
#vectorembedding #vectordatabase #VB #embed
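For orientation, a minimal sketch of what a nearest-neighbour query against a pre-installed PostgreSQL + pgvector setup might look like; the table name, vector dimension, and connection settings are illustrative assumptions, not SkyDeck specifics:

```python
# Minimal sketch of a nearest-neighbour query against a pgvector-enabled
# PostgreSQL instance. Table name, vector dimension, and connection settings
# are illustrative assumptions.
import psycopg2

conn = psycopg2.connect("dbname=vectors user=postgres host=localhost")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("CREATE TABLE IF NOT EXISTS docs (id serial PRIMARY KEY, embedding vector(3));")
cur.execute("INSERT INTO docs (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');")

# L2 nearest neighbours to a query vector, using pgvector's <-> operator.
cur.execute("SELECT id FROM docs ORDER BY embedding <-> '[3,1,2]' LIMIT 5;")
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()
```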
-
Deploying Large Language Models with OCI Data Science using NVIDIA GPUs and Triton Inference Server. A typical workflow for deploying custom containers involves the following steps:
1) Prepare and register a model artifact.
2) Build a Triton image and push it to the OCI Container Registry (OCIR).
3) Deploy the image using a Data Science model deployment.
4) Generate tokens by invoking the predict endpoint (a sketch of this step follows below).
https://lnkd.in/esUJ_z9G
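As a rough illustration of step 4, the call below signs a request with the OCI Python SDK and posts a KServe-v2-style payload that the model deployment forwards to Triton. The endpoint URL, input tensor name, and payload shape are assumptions for illustration; they depend on how the model was packaged:

```python
# Hypothetical sketch of invoking a model deployment's /predict endpoint, which
# forwards to Triton's inference API inside the container. Endpoint URL and
# tensor names are illustrative assumptions.
import requests
import oci

config = oci.config.from_file()              # reads ~/.oci/config
signer = oci.signer.Signer(
    tenancy=config["tenancy"],
    user=config["user"],
    fingerprint=config["fingerprint"],
    private_key_file_location=config["key_file"],
)

endpoint = "https://modeldeployment.<region>.oci.customer-oci.com/<deployment-ocid>/predict"
payload = {
    "inputs": [
        {"name": "text_input", "shape": [1], "datatype": "BYTES", "data": ["Hello, world"]}
    ]
}

response = requests.post(endpoint, json=payload, auth=signer)  # signer handles OCI auth
print(response.json())
```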
-
A long context window without a long *effective* context is effectively a small context window. According to analysis from NVIDIA, "leaderboard models claim to offer long context windows with lightning latency, but upon thorough testing, their 'effective' context lengths, where the models still produce good results, are half of what's advertised or less." And yet, "Jamba-Instruct offers an inference #contextwindow of 256,000 tokens and can actually use all those 256,000 tokens effectively," as noted by Or Dagan. To read more about how #LLM architecture is paving the way towards #AgenticAI, check out this article by Daniel Singer.
#Agents #ContextWindow #Agentic #Jamba
-
This quantisation technique will cut inference costs for both enterprises and consumers, and will put larger models within reach of more modest hardware! I'm a big fan of quantisation, and the VPTQ described here is designed to allow 2-bit quantisation with minimal loss by using lookup tables, letting you run large models on undersized hardware. I have previously run 2-bit versions of Llama-3-70B, which took up about 23GB of VRAM; compared with that setup, this technique would double tokens/s and greatly improve accuracy. I expect to see it integrated more widely in the near future. Catchy name too. https://lnkd.in/egzxCn_G
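To make the lookup-table idea concrete, here is a minimal NumPy sketch of codebook-based (vector) quantisation: weights are grouped into short vectors, each vector is stored as an index into a shared codebook, and dequantisation is a table lookup. The codebook construction here is a deliberately naive assumption; VPTQ itself is far more careful about how the codebook is built:

```python
# Toy lookup-table (codebook) quantisation: each group of 4 weights is replaced
# by an 8-bit index into a 256-entry codebook, i.e. ~2 bits per weight.
# The random codebook is a simplifying assumption; real methods build it much
# more carefully (k-means, second-order information, etc.).
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((512, 512)).astype(np.float32)  # toy weight matrix

VEC_LEN = 4          # weights are grouped into vectors of 4 values
CODEBOOK_SIZE = 256  # one uint8 index per vector -> 8 bits / 4 weights = 2 bits/weight

vectors = weights.reshape(-1, VEC_LEN)
codebook = vectors[rng.choice(len(vectors), CODEBOOK_SIZE, replace=False)]

# Quantise: index of the nearest codebook entry (squared Euclidean distance).
dists = (
    (vectors ** 2).sum(axis=1, keepdims=True)
    - 2.0 * vectors @ codebook.T
    + (codebook ** 2).sum(axis=1)
)
indices = dists.argmin(axis=1).astype(np.uint8)   # this is all that gets stored

# Dequantise: a pure table lookup, cheap enough to do on the fly at inference.
reconstructed = codebook[indices].reshape(weights.shape)

print("bits per weight:", indices.itemsize * 8 / VEC_LEN)
print("reconstruction MSE:", float(((weights - reconstructed) ** 2).mean()))
```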
-
In case you missed it, I recently released a very robust, high-quality, potentially state-of-the-art text-to-speech model for Hindi on Hugging Face. It has quickly become popular in the Indian AI community, amassing thousands of downloads. It's completely open source, including the data, all licensed under the extremely permissive CC-BY-4.0. If you haven't tried it yet, give it a try.
Model: https://lnkd.in/ddGMZ6iD
-
Aria, a new model, has just been introduced. What makes this noteworthy is that Aria is a fully open-source multimodal LLM; I believe it is one of the first fully open-source multimodal models of its kind in the wild. An interesting early observation is that it isn't exceptionally large: it uses a Mixture-of-Experts (MoE) framework, engaging only the experts needed for the task at hand. There is the usual benchmark analysis, which you can take with a grain of salt, but on the surface the figures do indicate that this model can stand up to many better-known models. It has been released under an Apache 2.0 license, allowing adaptation and augmentation of the model if one desires. The model is freely available on Hugging Face, but it still has demanding memory and GPU requirements: around 80GB of VRAM and a couple of powerful GPUs. It is worth watching Rhymes and where this model goes. It is really nice to see that proprietary models aren't, or at least won't be, the only game in town. https://lnkd.in/gh_4zm_6
-
Through massive parallel processing and flexible quantization, GSI Technology's APU reduces index build time by 85% compared to traditional CPU-based solutions. https://lnkd.in/guGqATFt