Tagged "vllm"
- Local AI Isn't Just Ollama—Here's the Ecosystem That Actually Makes It Useful
- Elastic KV Cache Memory Breakthrough Enables Efficient Bursty LLM Serving and GPU Sharing
- Can IBM's RITS Platform and vLLM Reset the Bar for Enterprise AI Access?
- Build Your Own Local AI Stack with 5 Docker Containers and Eliminate ChatGPT Subscriptions
- I Built a Local AI Stack With 5 Docker Containers, and Now I'll Never Pay for ChatGPT Again
- Intel LLM-Scaler vLLM 0.14.0 Released With Official Arc Pro B70 Support
- AI Quota Inflation Is No Token Effort. It's Baked In
- Unweight: Lossless MLP Weight Compression for LLM Inference
- Sorting 1M u64 KV-Pairs in 20ms on i9-13980HX Using Branchless Rust Implementation
- Researcher Discovers 221 Bugs in vLLM Stemming From Single Root Cause
- Prefill Is Compute-Bound, Decode Is Memory-Bound: Optimizing GPU Utilization for LLM Inference
- DotLLM – Building an LLM Inference Engine in C#
- DGX Spark Setup Guide: Running vLLM and PyTorch for Local LLM Inference Backend
- OpenNebula 7.2 "Dark Horse" Released with Enhanced Infrastructure Support
- Intel Arc Pro B70 32GB Achieves 12 Tokens/Sec on Qwen 3.5-27B
- Warp Decode vs. vLLM's Triton Kernel: Performance Crossover Analysis
- Ollama's Limitations for Production Local LLM Deployments
- Speculative Decoding Made My Local LLM Actually Usable
- Hugging Face Moves Safetensors Under PyTorch Foundation
- Ollama is Still the Easiest Way to Start Local LLMs, But It's the Worst Way to Keep Running Them
- GPU Memory for LLM Inference (Part 1)
- Satsgate: Monetize AI Agents and APIs with Lightning L402 Protocol
- GPUs vs. TPUs: Decoding the Powerhouses of AI
- 5 Useful Docker Containers for Agentic Developers
- OpenUMA – Apple-Style Unified Memory for x86 AI Inference
- NVIDIA Accelerates Gemma 4 for Local Agentic AI on RTX GPUs
- Intel's $949 GPU Has 32GB of VRAM for Local AI, but Software is Why Nvidia Keeps Winning
- ROCm Integration in Ubuntu 26.04 Advances Linux GPU Inference
- Local AI Ecosystem Extends Far Beyond Ollama
- Gemini CLI – Open-Source AI Agent for Terminal Integration
- Is Anyone Working on an AI Operating System?
- Samsung launches Galaxy Book6 series in India with Nvidia RTX 5070 graphics and on-device AI
- Intel's $949 GPU has 32GB of VRAM for local AI, but the software is why Nvidia keeps winning
- RotorQuant: 10-19x Faster Quantisation Alternative Using Clifford Algebra
- Qwen 3.5 27B Achieves 1.1M Tokens/Second on B200 GPUs with Optimized vLLM Config
- Plugable's TBT5-AI: First Thunderbolt Dock Explicitly Targeting Local LLM Workstations
- Researcher Successfully Runs Local LLMs on Legacy "Dead" GPU With Surprising Results
- Developer Builds Fully Local Multi-Agent System Using vLLM and Parallel Inference
- Build a $1,500 AI Server with DeepSeek-R1 on RTX 4090
- Community Converges on Optimal KV Cache Quantization Strategies for Qwen 3.5 Models
- LMCache Dramatically Accelerates LLM Inference on Oracle Data Science Platform
- Kimi Introduces Attention Residuals: 1.25x Compute Performance at <2% Overhead
- OpenClaw vs Eigent vs Claude Cowork: Comparing Open-Source AI Collaboration Platforms
- AMD Launches Agent System Optimized for Local AI Inference With Ryzen and Radeon
- P-EAGLE: Faster LLM Inference with Parallel Speculative Decoding in vLLM
- Runpod Report: Qwen Has Overtaken Meta's Llama As The Most-Deployed Self-Hosted LLM
- Intel Updates LLM-Scaler-vLLM With Support For More Qwen3/3.5 Models
- How to Install OpenClaw with Ollama (Step-by-Step Tutorial)
- Nvidia Pushes Jetson as Edge Hub for Open AI Models
- cuTile.jl Brings Nvidia CUDA Tile-Based Programming to Julia
- Show HN: Aver – a Language Designed for AI to Write and Humans to Review
- Nemotron 9B Powers Large-Scale Local Inference: Patent Classification and Real-Time Applications
- HP Refreshes Lineup with AI-Focused Workstations
- Intel Arc Pro B70 Workstation GPU Confirmed via vLLM AI Release Notes
- Framework Choice Critical: llama.cpp and vLLM Outperform Ollama for Qwen 3.5 Testing
- AMD Expands Ryzen AI 400 Series Portfolio for Consumer and Enterprise AI PC Options
- Huawei's SuperPoD Portfolio Creates New Option for Global Computing at MWC Barcelona 2026
- DeepSeek Releases DualPath: Addressing Storage Bandwidth Bottlenecks in Agentic Inference
- DeepSeek Paper – DualPath: Breaking the Bandwidth Bottleneck in LLM Inference
- Enterprise Infrastructure Guide: Running Local LLMs for 70-150 Developers
- Breaking the Speed Limit: Strategies for 17k Tokens/Sec Local Inference
- LayerScale Launches Inference Engine Faster Than vLLM, SGLang, and TRT-LLM
- Self-Hosted AI: A Complete Roadmap for Beginners
- Open-Source Models Now Comprise 4 of Top 5 Most-Used Endpoints on OpenRouter
- High Bandwidth Flash Memory Could Alleviate VRAM Constraints in Local LLM Inference
- Critical vLLM RCE Vulnerability Allows Remote Code Execution via Video Links
- OpenClaw with vLLM Running for Free on AMD Developer Cloud
- Heaps Do Lie: Debugging a Memory Leak in vLLM
- Mistral AI Debugs Critical Memory Leak in vLLM Inference Engine