GPU Programming Insights


  • Steve Nouri

    The largest AI Community 14 Million Members | Advisor @ Fortune 500 | Keynote Speaker

    1,735,931 followers

    🚀 DeepSeek Just Dropped 3 Powerful Open-Source Releases: Here’s Why They Matter

    They’re rewriting the rulebook on efficient LLM training and deployment. Today, they open-sourced three small yet powerful repositories, each addressing a key bottleneck in large-scale AI infrastructure. 👇

    1️⃣ Profiling Data for AI Training Efficiency

    On the surface, this might not seem groundbreaking, but this dataset is a goldmine. It provides a real-world breakdown of how DeepSeek keeps GPUs busy during training and inference, so that as few compute cycles as possible go to waste.

    ✅ Optimized scheduling = faster, cheaper AI training
    ✅ Helps teams visualize GPU workload distribution (viewable in Chrome tracing tools)
    ✅ A rare, transparent look into state-of-the-art AI scaling techniques

    I wish more open-source teams would release this kind of data, because training efficiency is the #1 challenge at massive scale.

    2️⃣ Load Balancing for Mixture of Experts (MoE)

    MoE is a major reason why AI models can scale efficiently, but there’s always been one major problem: some GPUs get overloaded while others sit idle. DeepSeek’s Expert Parallelism Load Balancer (EPLB) addresses this by:

    ✅ Duplicating heavily loaded experts and redistributing them across GPUs
    ✅ Minimizing inter-node traffic, reducing delays
    ✅ Balancing workloads to prevent bottlenecks

    This is huge! MoE models are notoriously tricky to optimize, and this tool simplifies deployment for anyone working with expert-based architectures. If you’re serious about scaling efficient MoE models, this is an absolute must-try.

    3️⃣ The Game-Changer: DualPipe and Zero-Bubble Parallelism 🔥

    This is THE most exciting part of today’s release. Pipeline Parallelism (PP) splits LLM training across GPUs, but it comes with inefficiencies: idle time (bubbles) between forward and backward passes. DualPipe shrinks these bubbles by overlapping forward and backward computation with communication, pushing large-scale training toward a “zero-bubble regime”.

    💡 Why is this huge?
    - Full computation-communication overlap (far fewer wasted cycles)
    - Reduces training time and cost significantly
    - First-of-its-kind implementation, never reported before in SOTA training

    If you work with distributed AI training, this could dramatically improve efficiency and lower costs across the board.

    Final Thoughts

    DeepSeek is doing open-source right. Instead of just releasing models, they’re sharing the critical tools and techniques that power SOTA AI training.

    - GPU efficiency matters: profiling data like this is rare and invaluable.
    - Mixture of Experts isn’t magic: it needs proper balancing, and EPLB makes that easier.
    - Near-zero-bubble training is becoming a reality. DualPipe might become the new standard!

    How do you see AI training evolving? Links in the comments.
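
The "duplicate hot experts, then spread them out" idea from point 2 of the post above can be sketched in a few lines. This is a toy illustration only, not DeepSeek's EPLB code: the function name `balance_experts`, the greedy strategy, and the example loads are all assumptions, and node topology (the inter-node traffic part) is ignored entirely.

```python
import heapq


def balance_experts(loads, num_gpus, num_slots):
    """Toy replicate-then-pack balancer (hypothetical, not DeepSeek's EPLB).

    loads: per-expert load, e.g. routed token counts.
    num_slots: total expert replicas to place (at least one per expert).
    Returns (placement, gpu_loads); placement[g] lists (expert, load_share).
    """
    if num_slots < len(loads):
        raise ValueError("need at least one slot per expert")

    # Step 1: give each spare slot to the expert with the highest load per replica.
    replicas = [1] * len(loads)
    heap = [(-load, e) for e, load in enumerate(loads)]
    heapq.heapify(heap)
    for _ in range(num_slots - len(loads)):
        _, e = heapq.heappop(heap)
        replicas[e] += 1
        heapq.heappush(heap, (-loads[e] / replicas[e], e))

    # Step 2: place replicas heaviest-first onto the currently lightest GPU.
    shards = sorted(
        ((loads[e] / replicas[e], e)
         for e in range(len(loads))
         for _ in range(replicas[e])),
        reverse=True,
    )
    gpus = [(0.0, g) for g in range(num_gpus)]
    heapq.heapify(gpus)
    placement = [[] for _ in range(num_gpus)]
    for share, e in shards:
        total, g = heapq.heappop(gpus)
        placement[g].append((e, share))
        heapq.heappush(gpus, (total + share, g))

    gpu_loads = [sum(share for _, share in p) for p in placement]
    return placement, gpu_loads
```

With made-up loads `[90, 10, 10, 10]` on two GPUs, four slots (no duplication) leave one GPU carrying the hot expert alone, while six slots let the hot expert be split three ways and the two GPUs end up even.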

  • Paolo Perrone

    Shipping Production AI: Agents, Inference, GPU. Read by 1M+ AI engineers.

    133,074 followers

    "You're learning CUDA all wrong," the NVIDIA engineer said.

    Then he showed me their internal training path.

    "Wait, you DON'T start with code?"

    Here's the exact 90-day roadmap they use 👇

    Phase 1️⃣ Intuition (Week 1-2)

    Don't touch CUDA yet. Seriously. Build your mental model of the hardware and the why first.

    ▶︎ UC Berkeley CS 61C, Lecture 17
    This is the physics layer. Understand why a GPU differs from a CPU
    🔗 https://lnkd.in/gVi6Bsut

    ▶︎ Coursera Parallel Computing Course (first 3 modules only)
    Learn parallel algorithms and thinking
    🔗 https://lnkd.in/g4FtxbE5

    ▶︎ Stanford CS231n Lecture 15 - Hardware/Software interface
    See how frameworks like PyTorch use hardware for AI
    🔗 https://lnkd.in/gzaR7xrZ

    Phase 2️⃣ CUDA Basics (Week 3-4)

    Now we code.

    ▶︎ NVIDIA's official CUDA C++ Programming Guide (Chapters 1-5 only)
    Learn threads, blocks, grids and kernel structure
    🔗 https://lnkd.in/gsZsEqPp

    ▶︎ cuda-samples repo
    Reading isn't enough. Compile, run, and modify official NVIDIA examples
    🔗 https://lnkd.in/gGRgvm7G

    ```cuda
    // Each thread adds one pair of elements.
    __global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {  // guard: the last block may extend past the end of the arrays
            c[i] = a[i] + b[i];
        }
    }
    ```

    If this doesn't make sense yet, you skipped Phase 1.

    Phase 3️⃣ Memory Mastery (Week 5-8)

    Where 90% of developers fail, and where all performance hides.

    ▶︎ Mark Harris's GTC Talk on Coalesced Memory Access
    Single most important CUDA performance concept. Learn how threads must access global memory in aligned groups
    🔗 https://lnkd.in/gz6Nbe5H

    ▶︎ GPU Gems 3, Chapter 39 - "Parallel Prefix Sum with CUDA"
    Masterclass in using shared memory and avoiding bank conflicts, a fundamental optimization
    🔗 https://lnkd.in/gNhZRCHE

    ▶︎ CUDA C++ Best Practices Guide - "Memory Optimizations" Chapter
    Read to understand the Global, Shared, Constant, and Texture memory models
    🔗 https://lnkd.in/grbhz7_V

    Phase 4️⃣ Real Kernels (Week 9-12)

    Stop playing with toy arrays. Build something that matters.

    • Implement softmax (harder than you think)
    • Write a basic GEMM that doesn't suck
    • Port one PyTorch operation to CUDA

    Repos that ship:

    ▶︎ tiny-cuda-nn by NVIDIA
    Goldmine of highly optimized, real-world kernels for NN
    🔗 https://lnkd.in/gGbFzVsb

    ▶︎ FlashAttention
    Reading this code teaches more on memory-aware kernel design than any book
    🔗 https://lnkd.in/g6sMnBsC

    ▶︎ Triton Language Examples
    Modern, Pythonic way to write efficient GPU code, simplifying raw CUDA boilerplate
    🔗 github.com/openai/triton

    ⚡ NVIDIA engineers' 6-month shortcut
    Skip CUDA. Learn Triton first (handles 80% of use cases better). Then return to CUDA when hitting limits.

    The difference between you and everyone else? You have the map.

    90 days from now, you'll be shipping production kernels, not stuck debugging tutorials.

    ♻️ Repost to give someone the shortcut you wish you had
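
One reason softmax is "harder than you think" in the roadmap above is numerical range: exponentiating raw scores overflows quickly. The sketch below shows the usual fix (subtract the maximum first) in plain Python rather than CUDA, purely to make the idea concrete. In a real kernel the `max` and `sum` steps become parallel reductions, which is where most of the difficulty lives. The function names are illustrative.

```python
import math


def naive_softmax(xs):
    """Direct translation of the formula; overflows for large inputs."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def stable_softmax(xs):
    """Subtract the max first so every exponent is <= 0 and exp() stays finite."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

Both versions agree on small inputs, but `naive_softmax([1000.0, 1000.0])` raises `OverflowError`, while the stable version returns an even split.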

  • Greg Coquillo

    AI Infrastructure Product Leader | Scaling GPU Clusters for Frontier Models | Microsoft Azure AI & HPC | Former AWS, Amazon | Startup Investor | Linkedin Top Voice | I build the infrastructure that allows AI to scale

    232,578 followers

    A couple of weeks ago, among other things, I called out that DeepSeek AI’s FlashMLA announcement introduced a suite of efficiency solutions that improve GPU utilization and speed for AI workloads.

    🔸 TLDR: It’s fascinating to see such quick innovations in CUDA programming right after DeepSeek, aiming to achieve substantial efficiency gains in variable-length prompt processing and small-batch inference scenarios.

    🔹 As such, Stanford researchers soft-launched ThunderMLA, an optimized GPU decoding mechanism designed to accelerate large language model inference by implementing a fully fused “megakernel” for attention decoding.

    🔹 In other words, this megakernel consolidates multiple kernel operations into a single launch, reducing the per-kernel overhead (such as setup and teardown times) while mitigating tail effects and improving memory bandwidth utilization.

    🔹 By leveraging custom scheduling strategies, including static and makespan-backward schedulers, ThunderMLA optimizes task execution order and resource allocation, achieving a 20-35% speedup over FlashMLA.

    🔹 Behind this performance gain, we find ThunderKittens, an embedded domain-specific language (DSL) developed by the researchers. It simplifies writing high-performance AI kernels for GPUs.

    🔹 ThunderKittens maintains extensibility and uses fundamental objects that align with tensor cores for optimal utilization, while abstracting complex GPU programming tasks.

    🔹 It provides a PyTorch-like API, making it accessible while remaining hardware-transparent for developers needing fine-grained control.

    Looking forward to the technical report, as well as an extension of this Multi-Head Latent Attention speedup to other areas. I’ll be glad to share it! See more below

    #genai #technology #artificialintelligence

  • Karu Sankaralingam

    Principal Research Scientist at NVIDIA and Professor at UW Madison

    4,690 followers

    Excited to share our latest research paper, Kitsune, which tackles a fundamental challenge in GPU architecture. I’ve worked on dataflow architectures in various forms throughout my career, so it is deeply satisfying to demonstrate a method for orchestrating dataflow on one of the world's most ubiquitous silicon solutions: the GPU. The research asks a critical question: "Can modest adjustments to the current GPU architecture enable efficient dataflow execution, thereby circumventing the constraints of vertical fusion without necessitating a clean-slate architecture design?" The answer, we found, lies in a surprisingly elegant solution. The heart of our idea is an ultra-fast producer/consumer queue. By implementing this via a software-only ring queue and a modest grid scheduler adjustment, we can unlock efficient dataflow execution without abandoning the hardware and established software codebase we already have. Check out the full paper here: https://lnkd.in/gnQFvhup Great working with Michael Davies and Neal Crago on this!
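
To make the producer/consumer ring queue in the post above concrete, here is a minimal single-producer, single-consumer ring buffer. It is a generic textbook structure written in plain Python, not the Kitsune implementation: the class name `RingQueue` and its methods are hypothetical, and a GPU version would additionally need atomics and memory fences so that producer and consumer thread blocks see each other's updates.

```python
class RingQueue:
    """Fixed-capacity single-producer/single-consumer ring buffer (toy model)."""

    def __init__(self, capacity):
        self._buf = [None] * capacity
        self._capacity = capacity
        self._head = 0  # total number of items popped so far
        self._tail = 0  # total number of items pushed so far

    def try_push(self, item):
        """Producer side: returns False instead of blocking when full."""
        if self._tail - self._head == self._capacity:
            return False
        self._buf[self._tail % self._capacity] = item
        self._tail += 1
        return True

    def try_pop(self):
        """Consumer side: returns (False, None) instead of blocking when empty."""
        if self._tail == self._head:
            return False, None
        item = self._buf[self._head % self._capacity]
        self._head += 1
        return True, item
```

Only the producer writes `_tail` and only the consumer writes `_head`, which is the property that lets such queues avoid locks; the bounded capacity is what applies back-pressure when the consumer falls behind.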

  • Yangqing Jia

    Co-founder & CEO of Lepton AI (now part of NVidia). Hiring top talents.

    9,795 followers

    People often ask why prices like $2.8 per million tokens for Llama 405B, while being super fast, are still profitable at Lepton AI. We've even been asked by a leading GPU provider! So, I figured we should share some technical analysis. This information could benefit the community. We've taken these statistics and analysis for granted, but they might not be obvious to everyone.

    1. Big batches: Each request receives an output of ~30 tokens/second. Batching (processing multiple requests together) significantly improves total throughput, often 10x or higher than a single request. GPUs are more efficient with larger batches.

    2. Dynamic batching: This technique immediately adds a new request to an existing batch instead of making it wait, ensuring the GPU always works at high capacity.

    3. Input tokens: The ~30 tokens/second refers to output tokens. Input tokens are processed much faster (known as "prefilling"). Typically, the input length is many times larger than the output (3x to 10x). This increases the total number of tokens processed, explaining why there is often separate billing for input and output.

    4. Quantization: Using 8-bit integers or 8-bit floats instead of 16-bit floats reduces memory usage and speeds up processing because the GPU accesses less memory. Newer GPUs also have hardware instructions for lower bit widths, increasing speed further. For example, the new Nvidia Blackwell GPU supports 4-bit floats (fp4). Quantization also saves memory, allowing even bigger batches from point 1, making it more economical.

    5. Speculative decoding: This method uses a smaller model to draft the next tokens, which the large model then verifies. For example, predicting "you" after "it is good to see" doesn't require a large model. Smaller models make such predictions faster. The Medusa algorithm by Tianle Cai is a related example that uses extra decoding heads rather than a separate small model.

    6. Prompt caching: LLMs often encounter repeated prefixes, such as "you are a smart AI agent" in system prompts. Caching these prefilled prompts avoids recalculating them, speeding up repeated requests.

    7. Optimizing GPU setups: This involves using large GPUs for big models, small GPUs for small models, and matching GPUs to specific tasks: some are better for prefilling, others for decoding. There are many optimization opportunities here.

    This is not a complete list. We integrate these methods (and a growing number of others) in our runtime to ensure profitability with reasonable traffic.

    Lepton is created by experts who have developed key AI software over the past decade (Caffe, ONNX, PyTorch) alongside cloud experts like the creator of etcd and core contributors to Kubernetes. We provide not only LLM APIs, but also a full cloud-native experience to help you find, use, and optimize GPUs on our cloud platform.

    We love the open-source and open-access community. What AI technical explanation would you like to hear next?
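
Point 5 in the list above (speculative decoding) is easiest to see in code. The sketch below is the greedy variant with two stand-in "models" that are just arithmetic functions; `target_next`, `draft_next`, and the parameter `k` are hypothetical names chosen for illustration, and nothing here reflects Lepton's runtime. The property worth noticing is that the output is identical to decoding with the large model alone, while the large model is consulted in fewer rounds.

```python
def greedy_decode(next_token, prompt, num_tokens):
    """Baseline: one call to the (expensive) target model per generated token."""
    seq = list(prompt)
    for _ in range(num_tokens):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def speculative_decode(target_next, draft_next, prompt, num_tokens, k=4):
    """Greedy speculative decoding; returns (tokens, verification_rounds)."""
    seq = list(prompt)
    goal = len(prompt) + num_tokens
    rounds = 0
    while len(seq) < goal:
        # 1. The cheap draft model guesses k tokens ahead.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. The target checks the guesses. A real system scores all k
        #    positions in one batched forward pass; this toy loops instead.
        rounds += 1
        for guess in draft:
            correct = target_next(seq)
            seq.append(correct)  # always keep the target's own token
            if correct != guess or len(seq) == goal:
                break  # first mismatch: the remaining guesses are discarded
    return seq[len(prompt):], rounds


def target_next(seq):
    """Stand-in for the large model (hypothetical toy rule)."""
    return (seq[-1] * 3 + 1) % 7


def draft_next(seq):
    """Stand-in for the small model: agrees with the target except after token 2."""
    guess = (seq[-1] * 3 + 1) % 7
    return guess if seq[-1] != 2 else (guess + 1) % 7


expected = greedy_decode(target_next, [1], 12)
tokens, rounds = speculative_decode(target_next, draft_next, [1], 12, k=4)
```

Here `tokens` matches `expected`, and `rounds` is smaller than the 12 sequential target calls the baseline needs. The saving depends entirely on how often the draft model agrees with the target.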

  • Emilio Andere

    Co-Founder and CEO at Wafer - Hardware Acceleration for AI

    16,001 followers

    nvidia now releases its most optimized inference kernels through a PhD student's open-source project. here's a breakdown of FlashInfer:

    FlashInfer is a GPU kernel library built specifically for LLM serving. it won Best Paper at MLSys 2025, powers both SGLang and vLLM, and NVIDIA now actively ships TensorRT-LLM kernels through it. the creator, Zihao Ye, built it during his PhD at UW and now works at NVIDIA full-time.

    LLM serving has a combinatorial explosion of attention kernels. every combination of KV-cache layout (paged, radix tree, tree masks), attention variant (GQA, MLA, RoPE-fused, sliding window), and batch mode (prefill, decode, append, shared prefix) needs a different kernel.

    FlashInfer's insight was: all KV-cache layouts are special cases of block-sparse matrices. paged attention is just block-sparse with page_size as block width. radix tree? block-sparse. tree attention for speculative decoding? block-sparse. one abstraction can replace what used to be separate kernel implementations.

    then you get JIT compilation to handle the variant explosion, in the form of CUDA/CUTLASS templates that get specialized at runtime.

    there are two other major innovations built on top of this:

    1. cascade attention
    when multiple requests share a prefix (document QA, system prompts), FlashInfer decomposes attention into two stages: a multi-query kernel for the shared prefix (loaded once into SMEM, reused across all queries) and a batch decode kernel for unique suffixes. results merge using an associative operator on partial attention states. 31x speedup over vLLM's PagedAttention for 32K-token shared prefixes at batch size 256.

    2. plan/run scheduling for CUDAGraph
    LLM serving has dynamic sequence lengths. CUDAGraphs need static configurations. FlashInfer solves this with a two-phase pattern: plan() inspects request shapes and computes balanced scheduling metadata, run() launches kernels. you plan once per decode step, then replay across all transformer layers.

    FlashInfer is an amazing project that i deeply respect, so also want to share some links for anyone that wants to go deeper:
    - paper (MLSys 2025 Best Paper): https://lnkd.in/gc_CTbnf
    - github: https://lnkd.in/gwfQ8B72
    - NVIDIA blog: https://lnkd.in/gzs_uquk
    - cascade attention deep dive: https://lnkd.in/gHGqdNTV
    - docs: https://docs.flashinfer.ai
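
The "associative operator on partial attention states" mentioned under cascade attention can be written down directly. The sketch below uses plain Python lists for a single query and is not FlashInfer's API: `attention_state` and `merge_states` are hypothetical names, and the usual 1/sqrt(d) scaling is left out for brevity. Each partial state is an output vector plus the log-sum-exp of its scores, which is enough to combine chunks exactly.

```python
import math


def attention_state(q, keys, values):
    """Attention of one query over one KV chunk; returns (output, log_sum_exp)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom for d in range(dim)]
    return out, m + math.log(denom)


def merge_states(a, b):
    """Combine two partial states as if attention had run over both chunks at once."""
    (out_a, lse_a), (out_b, lse_b) = a, b
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    out = [(wa * x + wb * y) / (wa + wb) for x, y in zip(out_a, out_b)]
    return out, m + math.log(wa + wb)
```

Because the merge is exact and associative, the shared-prefix state can be computed once and then merged with each request's suffix state, which is the decomposition the post describes.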

  • Charles Frye

    Building useful technology with large neural networks

    3,973 followers

    The Rise of Python in NVIDIA's CUDA Ecosystem: A Paradigm Shift at GTC 2025

    At this year's GTC, one thing became crystal clear: we've entered the "year of CUDA Python." The shift is more strategic than you might think.

    Stephen Jones, lead architect of the CUDA ecosystem, spent his talks showcasing ways to avoid writing traditional CUDA code. The highlight? cuTile - a new high-performance kernel library that ships exclusively with a Python interface. Not only is no C/C++ layer required -- it's not even available!

    What's driving this Python-first approach? In my estimation, it's the increasing complexity of properly programming Tensor Cores. As NVIDIA's hardware capabilities advance, the traditional CUDA programming model struggles to efficiently harness these powerful components - yet using them effectively is essential to justify the hardware investment.

    Even the CUTLASS project (NVIDIA's home for their highest-performance kernels) has completely reimagined its Python interface in version 4. This isn't a shallow wrapper - it's a comprehensive redesign that slashes compile times from minutes to mere seconds.

    The message is clear: NVIDIA recognizes that accessibility and developer experience are now as important as raw performance. Python's approachability opens GPU programming to a much wider audience while new abstractions help manage the growing complexity of modern GPU architectures.

    What do you think about this Python-first direction? Will it successfully democratize high-performance GPU programming?

    #CUDA #Python #GPU #NVIDIA #TensorCores #Engineering #DeveloperExperience

  • Hao Hoang

    I share daily insights on AI agents, LLMs, Data Science, Machine Learning | I help AI engineers crack top-tier interviews | 66K+ community | LLM System Design, RAG, Agents

    65,440 followers

    I just trimmed 25% off my Qwen3-14B QLoRA run. Same GPU. Same code. One `pip install -U`.

    The Unsloth AI team shipped a collab with NVIDIA that fixes three things most training stacks were quietly bleeding time on. No new model. No accuracy hit. No hyperparameter tuning.

    Here's what each fix is actually doing under the hood:

    1️⃣ Cached packed-sequence metadata
    Every transformer layer was rebuilding the same boundary info (cu_seqlens, max_seqlen, mask structure) and forcing a GPU-CPU sync per layer. Now it's built once per batch and reused L times. +43.3% forward, +14.3% per batch on Qwen3-14B QLoRA SFT.

    2️⃣ Double-buffered checkpoint reloads
    Activation reloads from pinned CPU memory were serializing on a single buffer: copy, wait, compute, next copy. Two buffers run copy + compute in parallel. +8.4% on 8B, +6.7% on 14B, +4.6% on 32B. Memory overhead stays under 0.5 GB.

    3️⃣ Argsort + bincount MoE routing
    The naive torch.where(router_indices == expert_idx) loop was triggering one CPU-GPU sync per expert. Now: one stable sort, one bincount, and the offsets are reused everywhere. +23% forward on the GPT-OSS routing path.

    The pattern across all three: the math kernels were already fast. The bottleneck was glue code: rebuilding metadata, serializing copies, querying the runtime once per expert.

    Group once. Cache once. Overlap the rest.

    Auto-enabled on RTX laptops, B200 data center GPUs, and DGX Spark. Apache 2.0. Zero accuracy loss.

    If you train models, this is one update away. Link in the comments 👇
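
The third fix in the post above (one stable sort plus one bincount instead of a per-expert `torch.where` loop) comes down to a grouping pattern that can be shown without a GPU. The sketch below is pure Python with a hypothetical function name and top-1 routing assumed; it is not the Unsloth code. In PyTorch the same two steps would map onto `torch.argsort(..., stable=True)` and `torch.bincount`.

```python
def group_tokens_by_expert(router_indices, num_experts):
    """Group token positions by expert with one stable sort and one count pass.

    router_indices[t] is the expert chosen for token t (top-1 routing assumed).
    Returns (order, offsets): order lists token positions sorted by expert, and
    order[offsets[e]:offsets[e + 1]] are the tokens routed to expert e.
    """
    # Stable argsort: tokens for the same expert keep their original order.
    order = sorted(range(len(router_indices)), key=router_indices.__getitem__)

    # Bincount: how many tokens each expert received.
    counts = [0] * num_experts
    for e in router_indices:
        counts[e] += 1

    # An exclusive prefix sum turns counts into slice boundaries.
    offsets = [0]
    for c in counts:
        offsets.append(offsets[-1] + c)
    return order, offsets
```

Every expert then reads its tokens as a contiguous slice of `order`, so nothing has to be recomputed or queried per expert.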

  • Pascal Biese

    AI Lead at PwC </> Daily AI highlights for 80k+ experts 📲🤗

    85,602 followers

    AI just delivered a computation breakthrough: Translating PyTorch to CUDA isn’t just a human problem anymore.

    Modern AI relies on GPU-optimized CUDA kernels, but handcrafting these requires rare expertise spanning algorithms, hardware, and memory hierarchies. This bottleneck now has a scalable solution: The AI CUDA Engineer.

    Sakana AI’s new framework uses Large Language Models (LLMs) to convert PyTorch operations into correct CUDA kernels and evolutionary optimization to iteratively maximize runtime efficiency.

    Key innovations:
    1. Automatic translation (91% success rate) via error feedback loops
    2. LLM-guided evolution combining model-generated variants with profiling data
    3. Innovation Archive: a repository of 17K optimized kernels that seed future optimizations via RAG

    The results? A median 1.52x speedup over native PyTorch, with extreme gains like 54x faster diagonal matrix multiplications. Their system even translated and optimized full ResNet architectures into CUDA, achieving 1.44x speedups via fused shared-memory kernels.

    Why this matters: LLMs are moving beyond code generation to optimization, mastering hardware-specific constraints without human priors. With models writing code for 72% of PyTorch operations faster than torch.compile, democratizing GPU programming is no longer hypothetical.

    It's open for everyone: you can explore their open-sourced kernels or probe limitations right now.

    For industries like agriculture seeking location-specific AI, or anyone battling CUDA complexity, automating kernel engineering might just be the compute multiplier you need.

    For more on the AI CUDA Engineer and other AI highlights, check out this week's LLM Watch: https://lnkd.in/dfPZhpt6

  • Daily Papers

    Machine Learning Engineer at Hugging Face

    13,261 followers

    Training large language models typically means renting expensive A100s or H100s. But what if you could fine-tune a 32B parameter model on a single RTX 4090 instead?

    Researchers from Wuhan University and Peking University just released RoundPipe, a new training framework that makes this practical.

    Pipeline parallelism on consumer GPUs has always struggled with the "weight binding" problem, where uneven model stages create idle bubbles that waste precious VRAM and compute. RoundPipe breaks this constraint by treating GPUs as a pool of stateless workers, dynamically dispatching computation in a round-robin fashion to achieve near-zero pipeline bubbles.

    The results are striking. On an 8× RTX 4090 server, RoundPipe delivers 1.48–2.16× speedups over existing approaches. It enables full fine-tuning of 32B models, or LoRA fine-tuning of models up to 235B parameters, with sequence lengths exceeding 64K tokens on just 24GB of VRAM.

    Best of all, it feels like vanilla PyTorch. There is no complex parallel programming to learn, no training loop rewrites required for multi-GPU scaling, and it runs on NVIDIA, AMD, and Ascend hardware alike. Installation is as simple as pip install roundpipe.

    For researchers and developers working outside of hyperscaler budgets, this significantly lowers the barrier to training production-scale models.

    Paper: https://lnkd.in/ejc7-RNT
    Code: https://lnkd.in/eFTzZ4Tw
    Documentation: https://lnkd.in/eCgcHRWS
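
For a sense of scale on "pipeline bubbles", the classic synchronous schedule (GPipe or 1F1B style) has a simple closed-form idle fraction. The helper below computes that textbook baseline under the assumption of equal-cost stages; it is not RoundPipe's performance model, and the function name is made up for this example.

```python
def bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of a synchronous GPipe/1F1B-style pipeline schedule.

    Assumes equal-cost stages: with p stages and m micro-batches, each device
    is idle for (p - 1) out of (m + p - 1) time slots.
    """
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)
```

With 8 stages and 8 micro-batches, 7 of every 15 slots are idle. More micro-batches shrink that fraction, but they also raise activation memory, which is exactly the resource a 24GB card is short on.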
