Tagged "consumer-gpu"
-
Building a Local AI Stack: Five Docker Containers to Replace ChatGPT Subscriptions
-
Local AI Isn't Just Ollama—Here's the Ecosystem That Actually Makes It Useful
-
Hipfire: A Rust-Native AMD Inference Engine That Outperforms llama.cpp
-
Google's Gemma 4: Powerful AI Models Optimized for Your Phone and Laptop
-
Economic Implications of AI Adoption: Why Local Deployment Matters for Cost Control
-
Unsloth's Custom Kernels Make LLM Fine-Tuning Viable on Consumer GPUs
-
Google's Gemma 4 Could Put Powerful AI on Your Phone and Laptop
-
Pluggable's TBT5-AI: First Thunderbolt Dock Explicitly Targeting Local LLM Workstations
-
Elastic KV Cache Memory Breakthrough Enables Efficient Bursty LLM Serving and GPU Sharing
-
Google's Gemma 4 Could Put Powerful AI on Your Phone and Laptop
-
Fixing Hallucination in LLM Prediction With Only One 48GB GPU
-
GPU Passthrough to LXCs in Proxmox Outperforms VMs and Simplifies Local AI Infrastructure
-
Google's Gemma 4 Brings Powerful On-Device AI to Phones and Laptops
-
I Replaced My Local LLM With a Model Half Its Size and Got Better Results
-
Show HN: We built an OCR server that can process 270 dense images/s on a 5090
-
Llama 4 Scout on MLX: The Complete Apple Silicon Guide (2026)
-
Intel OpenVINO 2026.1 Integrates llama.cpp with Wildcat Lake and Arc Pro B70
-
Intel LLM-Scaler vLLM 0.14.0 Released With Official Arc Pro B70 Support
-
Externalization in LLM Agents: Unified Review of Memory and Harness Engineering
-
10GB VRAM Local LLM: The Complete Setup Guide (2026)
-
Llama.cpp's Auto Fit Feature Quietly Reshapes Local AI Inference on Consumer Hardware
-
Google's Gemma 4 Finally Makes Local LLM Deployment Compelling for Practitioners
-
The Open-Source AI Ecosystem Keeps Treating llama.cpp Like a Second-Class Citizen
-
ZeusHammer: Built an AI Agent That Thinks Locally
-
llama.cpp Merges Speculative Checkpointing for Major Inference Speed Boost
-
Intel Extends AI PC Reach With New Core Ultra Series 3 Launch
-
Running DeepSeek R1 Locally: Your Complete Setup Guide
-
PCMind: Local AI Analysis of Docs, Audio, Video and Images
-
Gemma 4 Just Replaced My Whole Local LLM Stack
-
Unweight: Lossless MLP Weight Compression for LLM Inference
-
We Built a Local Model Arena in 30 Minutes — Infrastructure Mattered More Than the App
-
Laimark – 8B LLM That Self-Improves on Consumer GPUs
-
Show HN: I Can't Write Python. It Works Anyway – Local LLM Automation
-
Sorting 1M u64 KV-Pairs in 20ms on i9-13980HX Using Branchless Rust Implementation
-
Intel's $949 GPU Has 32GB of VRAM for Local AI, but the Software Is Why Nvidia Keeps Winning
-
Community Computer: Collaborative Autoresearch on a Peer-to-Peer Network
-
Prefill Is Compute-Bound, Decode Is Memory-Bound: Optimizing GPU Utilization for LLM Inference
-
Google's Gemma 4: The Most Practical Local LLM Despite Not Being The Smartest
-
SigMap – Shrink AI Coding Context 97% with Auto-Scaling Token Budget
-
Noi Enables Running ChatGPT and Claude Side-by-Side on Your Desktop
-
Dynamic Expert Cache in llama.cpp Achieves 27% Faster Inference on Large MoE Models
-
GPU Passthrough to LXCs in Proxmox Simplifies Local Inference Infrastructure
-
Google's Gemma 4 Brings Game-Changing Performance to Local Laptop Inference
-
Sovereign AI: Why the Next GPT Will Be Born in Our Living Rooms
-
Qwen 3.5 Small – On-Device Multimodal Models Released
-
MiniMax M2.7 Achieves SOTA Performance Under 64GB on Mac with TQ Quantization
-
Speculative Decoding Achieves 29% Speed Boost for Gemma-4 31B
-
Qwen3 Audio and Vision Support Now Available in llama.cpp
-
MiniMax-M2.7 Delivers Exceptional Performance on Consumer Hardware
-
Audio Processing Support Lands in llama.cpp with Gemma-4
-
Unsloth Completes Comprehensive MiniMax M2.7 GGUF Quantization Suite
-
A Deep Dive into Tinygrad AI Compiler
-
On-Device AI: Achieving Powerful AI Capabilities Without Internet Connectivity
-
MiniMax M2.7 Released: New Model Available for Local Deployment
-
MiniMax M2.7 Advances Scalable Agentic Workflows on NVIDIA Platforms for Complex AI Applications
-
Google Gemma 4 Delivers Exceptional Speed and Accuracy for Local Inference
-
DFlash Speculative Decoding Achieves 3.3x Speedup on Apple Silicon
-
Intel Arc Pro B70 32GB Achieves 12 Tokens/Sec on Qwen 3.5-27B
-
Gemma 4 31B vs Qwen 3.5 27B: Comprehensive Long Context Benchmark
-
ASUS ExpertBook P1 Integrates On-Device AI for Enterprise Collaboration
-
AIYO Wisper: Local Voice-to-Text for macOS Using WhisperKit
-
Warp Decode vs. vLLM's Triton Kernel: Performance Crossover Analysis
-
Qwen 3.5 122B Achieves 198 Tokens/sec on Dual RTX PRO 6000 Blackwell GPUs
-
5 Open-Source Projects Running Transformers on CPUs to GPUs in Pure Java
-
Energy Consumption: The Final Frontier for AI and Local Inference
-
VoxCPM2: New Open-Source TTS Model with Voice Cloning and Design
-
Speculative Decoding Made My Local LLM Actually Usable
-
I Replaced My Local LLM With a Model Half Its Size and Got Better Results — and It Wasn't About the Parameters
-
Intel Releases OpenVINO 2026.1 With Backend For Llama.cpp, New Hardware Support
-
Gemma 4 Support Stabilized in Llama.cpp
-
Gemma 4 GGUF Models Updated with Critical Quantization Fixes
-
EXAONE 4.5 33B Model Released with Multiple Quantization Formats
-
Google's Gemma 4 Brings Powerful On-Device AI to Android and iOS
-
Running AI Natively on Windows 11 Using an eGPU
-
Quansloth Using Google's Turboquant Breaks the VRAM Wall for Local LLMs
-
Your Next Assistant is Your PC: How On-Device AI is Transforming Work, One Workflow at a Time
-
TurboQuant-Optimized llama.cpp Fork Delivers GFX906 GPU Acceleration
-
Gemma 4 26B Achieves Impressive Local Performance With Proper Configuration
-
AMD Announces Day 0 Support for Google Gemma 4 Across Processors and GPUs
-
Verbatim 140W GAN: One of the First Chargers With USB PD 3.2 AVS (SPR) Support
-
TurboQuant in Llama.cpp Achieves 6X Smaller KV Cache
-
Quantization Strategy Comparison: Balancing Quality and Speed on Consumer Laptops
-
Context Window Optimization: Extending Gemma 4 Context Length Through Efficient Projection Quantization
-
Show HN: Lightweight LLM Tracing Tool with CLI
-
HunyuanOCR 1B: High-Quality OCR Now Viable on Budget Consumer Hardware
-
GPU Memory for LLM Inference (Part 1)
-
Google AI Edge Gallery Tops App Store Charts with On-Device Gemma 4
-
Real-time Multimodal AI on Apple Silicon: Gemma E2B Demo Shows Practical Edge Deployment
-
Gemma 4 31B Achieves Exceptional Performance on Local Hardware
-
Show HN: Turn Photos Into Wordle Puzzles with AI That Runs 100% in Your Browser
-
Qwen 3.5 397B Reduced to 35% Parameters With Usable Quality on 96GB GPU
-
DGX Spark Hardware Limitations: Missing NVFP4 Support Undermines Local AI Value Proposition
-
Gemma 4 31B Achieves Third Place on FoodTruck Bench, Beating Larger Models
-
Gemma 4 26B MoE Emerges as Optimal All-Around Local Model for Consumer Hardware
-
Samsung Launches Galaxy Book6 Series with NVIDIA RTX 5070 and On-Device AI
-
NVIDIA and Google Optimize Gemma 4 AI Models for Local RTX Deployment
-
GPUs vs. TPUs: Decoding the Powerhouses of AI
-
Google Launches Gemma 4 For Advanced On-Device AI
-
Gemma 4 31B Outperforms GLM 5.1 in Real-World Testing
-
Gemma 4 KV Cache Memory Issues Fixed in llama.cpp
-
AMD Rolls Out Gemma 4 Model Support Across Full Range of GPUs & CPUs
-
SkillCompass – Diagnose and Improve AI Agent Skills Across 6 Dimensions
-
OpenUMA – Apple-Style Unified Memory for x86 AI Inference
-
NVIDIA Accelerates Gemma 4 for Local Agentic AI on RTX GPUs
-
VRAM Optimization Technique Cuts Gemma 4 Memory Usage by 3x
-
Google Gemma 4 Released with GGUF Quantizations
-
Google Launches Gemma 4 Open Models for Local On-Device AI
-
Gemma 4 Makes Local AI Agents Practical
-
AMD Provides Day 0 Support for Gemma 4 on Ryzen AI Processors and GPUs
-
TurboQuant Enables Qwen 3.5-27B on 16GB Consumer GPUs
-
Apple Silicon Macs Run Local AI Faster with Ollama's New MLX Support
-
TinyGPU Adds Mac Support for External Nvidia GPU Acceleration
-
Intel's $949 GPU Has 32GB of VRAM for Local AI, but Software is Why Nvidia Keeps Winning
-
Show HN: Extra-Platforms, Python Library to Detect OS, Arch, Shell, CI, AI
-
Bonsai 1-Bit Models Deliver Exceptional Local Inference Performance
-
ROCm Integration in Ubuntu 26.04 Advances Linux GPU Inference
-
Qwen 3.5-27B Demonstrates Superior Performance vs Gemini 3.1 Pro and GPT-5.3
-
Intel's Arc GPU Offers 32GB VRAM for Local AI, But Software Ecosystem Lags Behind
-
ByteShape Releases Qwen 3.5 9B Quantisations with Hardware-Matched Tuning Guide
-
Is Anyone Working on an AI Operating System?
-
Samsung launches Galaxy Book6 series in India with Nvidia RTX 5070 graphics and on-device AI
-
Intel's $949 GPU has 32GB of VRAM for local AI, but the software is why Nvidia keeps winning
-
Select the Right Hardware for Your Local LLM Deployment with This Online Guide
-
Samsung Launches Galaxy Book6 Series in India with NVIDIA RTX 5070 Graphics and On-Device AI
-
Dell Technologies Unveils 10 AI PC Models for Business, from Ultralight Laptops to Ultracompact Desktops
-
DeepSeek V3 Complete Guide: Deploy and Optimize Local AI in 2026
-
TurboQuant: Understanding the Quantization Breakthrough
-
Google's TurboQuant Shows Memory Constraints Remain Critical for Local LLM Inference
-
Scion: Running Concurrent LLM Agents with Isolated Identities and Workspaces
-
Samsung Galaxy Book6 Brings Consumer-Grade On-Device AI Hardware to Market
-
Mixed KV Cache Quantization: Performance Risks and Pitfalls
-
IBM Granite 4.0 3B Vision: Compact Enterprise-Grade Document AI
-
DaVinci-MagiHuman: Open-Source AI Model for Realistic Video Generation
-
TurboQuant KV Cache Compression Achieves 22.8% Faster Decoding at 32K Context
-
Samsung Galaxy Book6 Series Brings Intel Core Ultra Chips for On-Device LLM Inference
-
Qwen3 512k Context via TurboQuant on Mac mini
-
GPU Passthrough to LXCs in Proxmox Simplifies Local LLM Deployment
-
TurboQuant Benchmarked in Llama.cpp: Google's Extreme Compression Research Tested in Practice
-
RotorQuant: 10-19x Faster Quantisation Alternative Using Clifford Algebra
-
Coding Implementation to Run Qwen3.5 Reasoning Models Distilled With Claude-Style Thinking Using GGUF and 4-Bit Quantization
-
Hold on to Your Hardware: Implications for Local LLM Deployment
-
Pluggable's TBT5-AI: First Thunderbolt Dock Explicitly Targeting Local LLM Workstations
-
NVIDIA Releases GPT-OSS-Puzzle-88B, a Deployment-Optimized Model
-
Show HN: Beforeyouship – Pre-Build Tool to Estimate LLM Cost
-
Liquid AI's LFM2-24B Achieves 50 Tokens/Second in Web Browser via WebGPU
-
Intel Launches Arc Pro B70/B65 with 32GB VRAM for Local AI Inference
-
Google's TurboQuant: The Unsexy AI Breakthrough Worth Watching
-
Google TurboQuant: Extreme Compression for Local LLM Deployment
-
OmniCoder v2 Released: Improved Code Generation for Local Deployment
-
Researcher Successfully Runs Local LLMs on Legacy "Dead" GPU With Surprising Results
-
Llama.cpp Benchmark: RTX 5090 vs Enterprise Systems Compared
-
Running a Private AI Brain on Windows PC as Alternative to Cloud Services
-
Powerful AI Search Engine Built on Single GeForce RTX 5090
-
Ditching Paid AI Services: Building Self-Hosted LLM Solutions as ChatGPT, Claude, and Gemini Alternatives
-
Rust Project Perspectives on AI
-
Qwen 3.5 122B Uncensored (Aggressive) Released with New K_P Quantisations
-
Nvidia Nemotron Cascade 2 30B Emerges as Powerful Alternative to Qwen Models
-
Developer Builds Fully Local Multi-Agent System Using vLLM and Parallel Inference
-
Llama 8B Matches 70B Performance on Multi-Hop QA Using Structured Prompting
-
ik_llama.cpp Fork Delivers 26x Faster Prompt Processing on Qwen 3.5 27B
-
Why You Should Use Both ChatGPT and Local LLMs: A Practical Hybrid Approach
-
Careless Whisper – Personal Local Speech to Text
-
AI Playground for Developers Built in Vite and Python
-
Qwen 3.5 397B emerges as top-performing local coding model
-
Multi-Token Prediction support coming to MLX-LM for Qwen 3.5
-
Apple M5 Max 128GB real-world performance benchmarks for local inference
-
Local AI Coding Assistant: Free Cursor Alternative with VS Code, Ollama & Continue
-
DeepSeek R1 RTX 4090 vs Apple M3 Max: Benchmark & Performance Guide
-
Build a $1,500 AI Server with DeepSeek-R1 on RTX 4090
-
Qwen 3.5 Emerges as Top Performer for Local Deployment with Extensive Quantization Options
-
Community Converges on Optimal KV Cache Quantization Strategies for Qwen 3.5 Models
-
Repurpose Old GPUs as Dedicated AI Inference Accelerators
-
NVIDIA Nemotron Cascade 2 30B Delivers 120B-Class Performance in Compact Form Factor
-
NVIDIA Nemotron 3 Nano 4B Enables On-Device Inference Directly in Web Browsers via WebGPU
-
Llamafile 0.10 Released with GPU Support and Rebuilt Core
-
Meet Sarvam Edge: India's AI Model That Runs on Phones and Laptops With No Internet
-
Tether's QVAC Introduces Cross-Platform Bitnet LoRA Framework for On-Device AI Training
-
Unsloth Studio: Open-Source Web UI for Training and Running LLMs Locally
-
Snapdragon 8 Elite Gen 5 Hands the Galaxy S26 the AI Upgrade We've Been Waiting For
-
MiniMax-M2.7: New Compact Model Announced for Local Deployment
-
Mamba 3: State Space Model Architecture Optimized for Inference
-
I Switched to a Local LLM for These 5 Tasks and the Cloud Version Hasn't Been Worth It Since
-
Custom GPU Multiplexer Achieves 0.3ms Model Switching on Legacy Hardware
-
Run LLMs Locally with Llama.cpp
-
I Ran Local LLMs on a 'Dead' GPU, and the Results Surprised Me
-
Qwen 3.5 4B Outperforms Nvidia Nemotron 3 4B in Local Benchmarks
-
Mistral Small 4 119B Released with NVFP4 Quantisation Support
-
Mistral Releases Small 4 Open-Source Model Under Apache 2.0
-
Kimi Introduces Attention Residuals: 1.25x Compute Performance at <2% Overhead
-
OpenClaw Isn't the Only Raspberry Pi AI Tool—Here Are 4 Others You Can Try This Week
-
OmniCoder-9B: Efficient Coding Model for 8GB GPUs
-
This External GPU Enclosure Tries to Break Cloud Dependence for Local AI Inference
-
Dictare – Open-source Voice Layer for AI Coding Agents (100% Local)
-
AMD Declares 'AI on the PC Has Crossed an Important Line' – Agent Computers as Next Breakthrough
-
Nvidia's Nemotron 3 Super: Understanding the Significance for Local LLM Deployment
-
Running Qwen3.5-27B Across Multiple GPUs Over LAN Achieves Practical Speed for Local Inference
-
Startup Transforms Mac Mini Into Full-Powered AI Inference System With External GPU
-
Two Local Models Prove Competitive Enough to Replace ChatGPT, Gemini, and Copilot
-
India's Mobile-First AI Strategy Could Accelerate Local Inference Adoption in Emerging Markets
-
Hybrid AI Desktop Layer Combining DOM-Automation and API-Integrations
-
Open-Source GreenBoost Driver Augments NVIDIA GPU VRAM With System RAM and NVMe Storage
-
AMD Launches Agent System Optimized for Local AI Inference With Ryzen and Radeon
-
Achieving 2000 Tokens Per Second with QWEN 3.5 27B on RTX-5090
-
P-EAGLE: Faster LLM Inference with Parallel Speculative Decoding in vLLM
-
Local Manga Translator: Production LLM Pipeline with YOLO, OCR, and Inpainting
-
Best Local LLM Models 2026: Developer Comparison
-
3-Path Agent Memory: 8 KB Recurrent State vs. 156 MB KV Cache at 10K Tokens
-
Linux 7.0 AMDGPU Fixing Idle Power Issue For RDNA4 GPUs After Compute Workloads
-
How to Install OpenClaw with Ollama (Step-by-Step Tutorial)
-
Show HN: VmExit – An Experiment in AI-Native Computing
-
Sarvam Open-Sources 30B and 105B Reasoning Models
-
Qwodel – An Open-Source Unified Pipeline for LLM Quantization
-
Quantization Explained: Q4_K_M vs AWQ vs FP16 for Local LLMs
-
Nvidia Pushes Jetson as Edge Hub for Open AI Models
-
Nvidia Releases Nemotron 3 Super: 120B MoE Model for Local Deployment
-
Apple M5 Max 128GB Benchmark Results for Local LLM Inference
-
The $1,500 Local AI Setup: DeepSeek-R1 on Consumer Hardware
-
Local AI Coding Assistant: Complete VS Code + Ollama + Continue Setup
-
Cutile.jl Brings Nvidia CUDA Tile-Based Programming to Julia
-
Experiment: 0.8B Model Self-Improvement on MacBook Air Yields Surprising Results
-
Texas Instruments Launches NPU-Powered MCUs for Low-Power Edge AI
-
Sarvam Open-Sources 30B and 105B Reasoning Models
-
Qwen 3.5-35B Uncensored GGUF Models Now Available
-
Llama.cpp Celebrates Major Milestone: From Leak to Industry Standard
-
Qwen 3.5 Ultra-Compact Models Enable On-Device AI from Watches to Gaming
-
HP OMEN MAX 16 Review: Is Local AI on a Laptop Viable in 2026?
-
Fine-Tuned Qwen SLMs (0.6–8B) Demonstrate Competitive Performance Against Frontier LLMs on Specialized Tasks
-
Strix Halo (Ryzen AI Max+ 395) Achieves Strong Local Inference Performance with ROCm 7.2
-
Sarvam Open-Sources 30B and 105B Reasoning Models
-
Qwen 3.5 Family Benchmark Comparison Shows Strong Performance Across Smaller Models
-
Qwen 3.5 Derestricted Model Available for Local Deployment
-
When Running Ollama on Your PC for Local AI, One Thing Matters More Than Most
-
Nemotron 9B Powers Large-Scale Local Inference: Patent Classification and Real-Time Applications
-
Gyro-Claw – Secure Execution Runtime for AI Agents
-
Engram – Open-Source Persistent Memory for AI Agents
-
Qwen 3.5 27B Achieves Strong Local Inference Performance
-
Mistral AI Prepares Workflows Integration for Le Chat
-
Llama.cpp Prompt Processing Optimization: Ubatch Size Configuration Guide
-
ETH Zurich Research Challenges Context-Length Assumptions in LLM Agents
-
Apple Launches MacBook Neo with A18 Pro Chip for Affordable Local AI Inference
-
Windows 11 Notepad Gets On-Device AI Text Generation Without Subscription
-
Alibaba Releases Qwen 3.5 AI Model with On-Device AI Support
-
The Emerging Role of SRAM-Centric Chips in AI Inference
-
Final Qwen3.5 Unsloth GGUF Update with Improved Size/Quality Tradeoffs
-
Alibaba Releases Qwen 3.5 AI Model with On-Device AI Support
-
Kakao Launches Kanana AI for On-Device Schedule and Recommendation Management
-
Qwen 3.5-35B-A3B Achieves 37.8% on SWE-bench Verified Hard
-
Qwen 3.5-27B Q4 Quantization Comparison and Analysis
-
On-Device AI Laptop Lineups Become Standard Across Major Manufacturers
-
AMD Launches Copilot+ Desktop Chips to Compete in On-Device AI Market
-
VibeWhisper – macOS Voice-to-Text with 100% Local Processing Option
-
Qwen 3.5 0.8B Running in Browser with WebGPU via Transformers.js
-
Intel Arc Pro B70 Workstation GPU Confirmed via vLLM AI Release Notes
-
AMD Ryzen AI 400 Series Desktop Processors Launch with Integrated 60 TOPS NPU
-
Local LLM Performance Improvements: A Year of Progress Since DeepSeek R1 Moment
-
Jan Releases Code-Tuned 4B Model for Efficient Local Code Generation and Development Tasks
-
HP ZBook Ultra 14 G1a Workstation Reclaims Local AI Workflows for Professionals
-
Browser Use vs. Claude Computer Use: Comparing Agent Automation Frameworks
-
Apple Neural Engine Reverse-Engineered for Local Model Training on Mac Mini M4
-
Qwen 3.5-35B-A3B Emerges as Efficient Daily Driver, Replacing 120B Models
-
Nummi – AI Companion with Memory and Daily Guidance
-
4 Free Tools to Run Powerful AI on Your PC Without a Subscription
-
Apple Intelligence, Galaxy AI, Gemini: Why Your AI-Powered Phone Is Worth Repairing
-
Unsloth Dynamic 2.0 GGUFs
-
Qwen3.5-35B Unsloth Dynamic GGUFs Achieve SOTA Across Nearly All Quantisation Levels
-
Qwen3.5-35B RTX 5080 Experiments Confirm KV q8_0 as Free Lunch, Q4_K_M Remains Optimal
-
Qwen 3.5-35B Unsloth Dynamic GGUFs Achieve SOTA Quantisation Benchmarks
-
Qwen 3.5-27B Demonstrates Exceptional Performance with Thoughtful Prompt Engineering
-
On-Device AI in Mobile Apps: What Should Run on the Phone vs the Cloud (A 2026 Decision Guide)
-
The ML.energy Leaderboard
-
LLmFit: Terminal Tool for Right-Sizing LLM Models to Your Hardware
-
LLmFit: One-Command Hardware-Aware Model Selection Across 497 Models and 133 Providers
-
Krasis: Hybrid CPU/GPU MoE Runtime Achieves 3,324 Tokens/Second Prefill on RTX 5080
-
Krasis Hybrid MoE Runtime Achieves 3,324 tok/s Prefill on Single RTX 5080
-
Accuracy vs. Speed in Local LLMs: Finding Your Sweet Spot
-
5 Useful Docker Containers for Agentic Developers
-
Show HN: Caret – Tab to Complete at Any App on Your Mac
-
Qwen 3.5 MoE Delivers 100K Context Window at 40+ TPS on RTX 5060 Ti
-
Qwen 3.5 Underperforms on Hard Coding Tasks—APEX Benchmark Analysis
-
Qwen3.5 122B Achieves 25 tok/s on 72GB VRAM Setup
-
Researchers Develop Persistent Memory System for Local LLMs—No RAG Required
-
DeepSeek Releases DualPath: Addressing Storage Bandwidth Bottlenecks in Agentic Inference
-
DeepSeek Paper – DualPath: Breaking the Bandwidth Bottleneck in LLM Inference
-
The Complete Developer's Guide to Running LLMs Locally: From Ollama to Production
-
Qwen3.5-35B-A3B Emerges as Game-Changer for Agentic Coding Tasks
-
Qwen3.5-27B Identified as Sweet Spot for Mid-Range Local Deployment
-
PyTorch Foundation Announces New Members as Agentic AI Demand Grows
-
Show HN: Pluckr – LLM-Powered HTML Scraper That Caches Selectors and Auto-Heals
-
Mirai Announces $10M to Advance On-Device AI Performance for Consumer Devices
-
Show HN: 100% LLM Accuracy–No Fine-Tuning, JSON Only
-
Advanced Quantization Techniques Show Surprising Performance Gains Over Standard Methods
-
How AI is Redefining Price and Performance in Modern Laptops
-
Enterprise Infrastructure Guide: Running Local LLMs for 70-150 Developers
-
Breaking the Speed Limit: Strategies for 17k Tokens/Sec Local Inference
-
South Korea to Launch $687 Million Project to Develop On-Device AI Semiconductors
-
Qwen3's Voice Embeddings Enable Local Voice Cloning and Mathematical Voice Manipulation
-
Custom Portable Workstation Optimized for Local AI Inference Builds
-
Open-Source Framework Achieves Gemini 3 Deep Think Level Performance Through Local Model Scaffolding
-
Nvidia Could Launch Its First Laptops With Its Own Processors
-
Local GPT-OSS 20B Model Demonstrates Practical Agentic Capabilities
-
A Tool to Tell You What LLMs Can Run on Your Machine
-
Open-Source llama.cpp Finds Long-Term Home at Hugging Face
-
GPT-OSS 20B Demonstrates Practical Agentic Capabilities Running Fully Locally
-
GLM-5 Becomes Top Open-Weights Model on Extended NYT Connections Benchmark
-
Elastic Introduces Best-in-Class Embedding Models for High Performance Semantic Search
-
Breaking the Speed Limit: Strategies for 17k Tokens/Sec Local Inference
-
Yet Another Fix Coming for Older AMD GPUs on Linux – Thanks to Valve Developer
-
Ouro 2.6B Thinking Model GGUFs Released with Q8_0 and Q4_K_M Quantization
-
O-TITANS: Orthogonal LoRA Framework for Gemma 3 with Google TITANS Memory Architecture
-
At India AI Impact Summit, Intel Showcases AI PCs and Cost-Efficient Frugal AI
-
Strix Halo Performance Benchmarks: Minimax M2.5, Step 3.5 Flash, Qwen3 Coder
-
Qwen3 Coder Next Remains Effective at Aggressive Quantization Levels
-
[Release] Ouro-2.6B-Thinking: ByteDance's Recurrent Model Now Runnable Locally
-
At India AI Impact Summit, Intel Showcases Its AI PCs and Cost-Efficient Frugal AI
-
GGML.AI Acquired by Hugging Face
-
Qwen3 Coder Next 8FP Demonstrates Exceptional Long-Context Performance on 128GB System
-
PaddleOCR-VL Now Integrated into llama.cpp for Multilingual OCR
-
NVIDIA Releases Dynamo v0.9.0: Infrastructure Overhaul With FlashIndexer and Multi-Modal Support
-
Mirai Secures $10M to Optimize On-Device AI Amid Cloud Cost Surge
-
Free ASIC-Accelerated Llama 3.1 8B Inference at 16,000 Tokens/Second
-
AI Integration in Sublime Text: Practical Local LLM Editor Enhancement
-
Enhanced Quantization Visualization Methods for Understanding LLM Compression Trade-offs
-
LayerScale Launches Inference Engine Faster Than vLLM, SGLang, and TRT-LLM
-
Kitten TTS V0.8 Released: State-of-the-Art Super-Tiny Text-to-Speech Model Under 25MB
-
Hardware Economics Shift: DDR5 RDIMM Pricing Now Comparable to GPUs for Local Inference
-
Qwen3-Next 80B MoE Achieves 39 Tokens/Second on RTX 5070/5060 Ti Dual-GPU Setup
-
Qwen 3.5-397B-A17B Now Available for Local Inference with Aggressive Quantisation
-
High Bandwidth Flash Memory Could Alleviate VRAM Constraints in Local LLM Inference
-
Cohere Releases Tiny Aya: Efficient 3.3B Multilingual Model for 70+ Languages
-
ASUS Zenbook 14 Launches in India with AI-Capable Hardware, Starting at Rs 1,15,990
-
Ask HN: What is the best bang for buck budget AI coding?
-
GPU-Accelerated DataFrame Library for Local Inference Workloads
-
Alibaba Unveils Major AI Model Upgrade Ahead of DeepSeek Release
-
NVIDIA's Dynamic Memory Sparsification Cuts LLM Inference Costs by 8x
-
MiniMax-M2.5 230B MoE Model Released with GGUF Support for Local Deployment
-
LLaDA2.1 Introduces Token Editing for Massive Speed Gains in Local Inference
-
GPT-OSS 20B Now Runs 100% Locally in Browser via WebGPU
-
GNOME's AI Assistant Newelle Adds llama.cpp Support and Command Execution
-
Context Management Identified as Real Bottleneck in AI-Assisted Coding
-
Ring-1T-2.5 Released with SOTA Deep Thinking Performance
-
The Future of AI Slop Is Constraints - Implications for Local Models
-
Samsung's REAM: Alternative Model Compression Technique
-
Running Mistral-7B on Intel NPU Achieves 12.6 Tokens/Second
-
GLM-5 Released: 744B Parameter MoE Model Targeting Complex Tasks
-
I Tried a Claude Code Rival That's Local, Open Source, and Completely Free
-
NAS System Achieves 18 tok/s with 80B LLM Using Only Integrated Graphics
-
Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts
-
Carmack Proposes Using Long Fiber Lines as L2 Cache for Streaming AI Data
-
Community Member Builds 144GB VRAM Local LLM Powerhouse