Tagged "vllm"

NVIDIA Levels Up Local AI Agents Across RTX PCs and DGX Spark 1 June 2026
Microsoft and Nvidia to Unveil First Windows PCs with Nvidia CPUs and AI Capabilities 31 May 2026
Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference 27 May 2026
vLLM vs Ollama 2026: Performance Benchmark Reveals 9x Throughput Gap 25 May 2026
How to Self-Host LibreChat with Docker 23 May 2026
AMD Unveils Ryzen AI Halo Developer Platform for On-Device AI Workloads 23 May 2026
Deploying Hermes Agent for Free on AMD Developer Cloud with Open Models and vLLM 22 May 2026
Intel llm-scaler-vllm 1.4 Released With Updated Components and Arc Pro B70 Support 21 May 2026
Local LLMs Offer Unique Advantages That Cloud AI Services Cannot Match 18 May 2026
Linux 7.1-rc4 Released: Kernel Updates Relevant to Local LLM Inference 18 May 2026
AMD's Lemonade SDK Advances macOS Support for Local AI Inference with ROCm 7.13 18 May 2026
The AI Layoff Receipts: Market Consolidation Accelerates Open-Source Model Adoption 18 May 2026
Google Limits Gemini Intelligence to New Flagships—Hardware Requirements for Local Deployment 17 May 2026
SynapseKit: A New Production Framework for Deploying LLMs 16 May 2026
Orthrus Reshapes Economics of Local AI Inference with New Optimization Approach 16 May 2026
Local LLM Persistent Context Prevents Repetitive Mistakes 14 May 2026
Lucebox Brings Faster Local AI Inference to AMD Strix Halo 13 May 2026
AMD's vLLM-ATOM Plugin Supercharges DeepSeek-R1 and Kimi-K2 Inference on MI350/MI400 12 May 2026
DFlash Speculative Decoding Delivers 8.5x Speed Improvement for LLM Inference 11 May 2026
Supercharging LLM Inference on Google TPUs: Achieving 3X Speedups With Diffusion-Style Speculative Decoding 5 May 2026
Ubuntu is Going All In on Generative AI and Other Linux Distros Might Follow 1 May 2026
Linux Setup for Local LLMs Takes Minutes Compared to Windows Hours 1 May 2026
Local AI Isn't Just Ollama—Here's the Ecosystem That Actually Makes It Useful 28 April 2026
Elastic KV Cache Memory Breakthrough Enables Efficient Bursty LLM Serving and GPU Sharing 26 April 2026
Can IBM's RITS Platform and vLLM Reset the Bar for Enterprise AI Access? 26 April 2026
Build Your Own Local AI Stack with 5 Docker Containers and Eliminate ChatGPT Subscriptions 25 April 2026
I Built a Local AI Stack With 5 Docker Containers, and Now I'll Never Pay for ChatGPT Again 24 April 2026
Intel LLM-Scaler vLLM 0.14.0 Released With Official Arc Pro B70 Support 23 April 2026
AI Quota Inflation Is No Token Effort. It's Baked In 20 April 2026
Local AI Isn't Just Ollama—Here's the Ecosystem That Actually Makes It Useful 19 April 2026
Unweight: Lossless MLP Weight Compression for LLM Inference 18 April 2026
Sorting 1M u64 KV-Pairs in 20ms on i9-13980HX Using Branchless Rust Implementation 18 April 2026
Researcher Discovers 221 Bugs in vLLM Stemming From Single Root Cause 16 April 2026
Prefill Is Compute-Bound, Decode Is Memory-Bound: Optimizing GPU Utilization for LLM Inference 16 April 2026
DotLLM – Building an LLM Inference Engine in C# 15 April 2026
DGX Spark Setup Guide: Running vLLM and PyTorch for Local LLM Inference Backend 15 April 2026
OpenNebula 7.2 "Dark Horse" Released with Enhanced Infrastructure Support 14 April 2026
Intel Arc Pro B70 32GB Achieves 12 Tokens/Sec on Qwen 3.5-27B 11 April 2026
Warp Decode vs. vLLM's Triton Kernel: Performance Crossover Analysis 10 April 2026
Ollama's Limitations for Production Local LLM Deployments 10 April 2026
Speculative Decoding Made My Local LLM Actually Usable 9 April 2026
Hugging Face Moves Safetensors Under PyTorch Foundation 9 April 2026
Ollama is Still the Easiest Way to Start Local LLMs, But It's the Worst Way to Keep Running Them 9 April 2026
GPU Memory for LLM Inference (Part 1) 6 April 2026
Satsgate: Monetize AI Agents and APIs with Lightning L402 Protocol 5 April 2026
GPUs vs. TPUs: Decoding the Powerhouses of AI 4 April 2026
5 Useful Docker Containers for Agentic Developers 4 April 2026
OpenUMA – Apple-Style Unified Memory for x86 AI Inference 3 April 2026
NVIDIA Accelerates Gemma 4 for Local Agentic AI on RTX GPUs 3 April 2026
Intel's $949 GPU Has 32GB of VRAM for Local AI, but Software is Why Nvidia Keeps Winning 2 April 2026
ROCm Integration in Ubuntu 26.04 Advances Linux GPU Inference 1 April 2026
Local AI Ecosystem Extends Far Beyond Ollama 1 April 2026
Gemini CLI – Open-Source AI Agent for Terminal Integration 1 April 2026
Is Anyone Working on an AI Operating System? 1 April 2026
Samsung launches Galaxy Book6 series in India with Nvidia RTX 5070 graphics and on-device AI 31 March 2026
Intel's $949 GPU has 32GB of VRAM for local AI, but the software is why Nvidia keeps winning 31 March 2026
RotorQuant: 10-19x Faster Quantisation Alternative Using Clifford Algebra 27 March 2026
Qwen 3.5 27B Achieves 1.1M Tokens/Second on B200 GPUs with Optimized vLLM Config 27 March 2026
Plugable's TBT5-AI: First Thunderbolt Dock Explicitly Targeting Local LLM Workstations 26 March 2026
Researcher Successfully Runs Local LLMs on Legacy "Dead" GPU With Surprising Results 25 March 2026
Developer Builds Fully Local Multi-Agent System Using vLLM and Parallel Inference 22 March 2026
Build a $1,500 AI Server with DeepSeek-R1 on RTX 4090 21 March 2026
Community Converges on Optimal KV Cache Quantization Strategies for Qwen 3.5 Models 20 March 2026
LMCache Dramatically Accelerates LLM Inference on Oracle Data Science Platform 20 March 2026
Kimi Introduces Attention Residuals: 1.25x Compute Performance at <2% Overhead 17 March 2026
OpenClaw vs Eigent vs Claude Cowork: Comparing Open-Source AI Collaboration Platforms 15 March 2026
AMD Launches Agent System Optimized for Local AI Inference With Ryzen and Radeon 15 March 2026
P-EAGLE: Faster LLM Inference with Parallel Speculative Decoding in vLLM 14 March 2026
Runpod Report: Qwen Has Overtaken Meta's Llama As The Most-Deployed Self-Hosted LLM 13 March 2026
Intel Updates LLM-Scaler-vLLM With Support For More Qwen3/3.5 Models 13 March 2026
How to Install OpenClaw with Ollama (Step-by-Step Tutorial) 13 March 2026
Nvidia Pushes Jetson as Edge Hub for Open AI Models 12 March 2026
cuTile.jl Brings Nvidia CUDA Tile-Based Programming to Julia 12 March 2026
Show HN: Aver – a Language Designed for AI to Write and Humans to Review 11 March 2026
Nemotron 9B Powers Large-Scale Local Inference: Patent Classification and Real-Time Applications 9 March 2026
HP Refreshes Lineup with AI-Focused Workstations 8 March 2026
Intel Arc Pro B70 Workstation GPU Confirmed via vLLM AI Release Notes 3 March 2026
Framework Choice Critical: llama.cpp and vLLM Outperform Ollama for Qwen 3.5 Testing 3 March 2026
AMD Expands Ryzen AI 400 Series Portfolio for Consumer and Enterprise AI PC Options 2 March 2026
Huawei's SuperPoD Portfolio Creates New Option for Global Computing at MWC Barcelona 2026 1 March 2026
DeepSeek Releases DualPath: Addressing Storage Bandwidth Bottlenecks in Agentic Inference 26 February 2026
DeepSeek Paper – DualPath: Breaking the Bandwidth Bottleneck in LLM Inference 26 February 2026
Enterprise Infrastructure Guide: Running Local LLMs for 70-150 Developers 24 February 2026
Breaking the Speed Limit: Strategies for 17k Tokens/Sec Local Inference 23 February 2026
LayerScale Launches Inference Engine Faster Than vLLM, SGLang, and TRT-LLM 19 February 2026
Self-Hosted AI: A Complete Roadmap for Beginners 17 February 2026
Open-Source Models Now Comprise 4 of Top 5 Most-Used Endpoints on OpenRouter 17 February 2026
High Bandwidth Flash Memory Could Alleviate VRAM Constraints in Local LLM Inference 17 February 2026
Critical vLLM RCE Vulnerability Allows Remote Code Execution via Video Links 14 February 2026
OpenClaw with vLLM Running for Free on AMD Developer Cloud 12 February 2026
Heaps Do Lie: Debugging a Memory Leak in vLLM 12 February 2026
Mistral AI Debugs Critical Memory Leak in vLLM Inference Engine 11 February 2026