Tagged "inference-speed"

I Thought My Local AI Would Replace My Claude Subscription — Then I Tried Automating My PC 17 July 2026
llama.cpp's 4.26× Intel Gain Has a Narrow Catch 16 July 2026
Nvidia Boosts Token Throughput 5x With Software Optimizations, Reshaping AI Inference Economics 14 July 2026
Study: Cerebellum Helps AI Ignore the Ordinary for More Efficient Computing 11 July 2026
Show HN: OpenVole 4.5 Is Out 10 July 2026
Exploiting Sparsity for Long Context Inference: Million Token on Commodity GPUs 10 July 2026
Ollama's New MLX Engine Delivers Significant Performance Gains on Mac 5 July 2026
Theoretical Bottlenecks for Scaling LLM Inference to Achieve Higher Token per Second 2 July 2026
How to Choose Between Small and Frontier Models 30 June 2026
TriAttention Solves KV Cache Memory Bottleneck in Local LLM Inference 28 June 2026
ORA: Smaller Models. Same Intelligence 25 June 2026
NVIDIA DFlash Block Diffusion Accelerates Autoregressive LLM Inference 25 June 2026
Boost Inference Performance up to 15x on NVIDIA Blackwell Using DFlash Speculative Decoding 24 June 2026
Samsung's UFS 5.0 Addresses Critical Memory Bandwidth Bottleneck in Mobile AI Inference 23 June 2026
FlashRT: Execution State for Latency-First AI 20 June 2026
Ray Serve LLM Achieves 24x Performance Improvement in Distributed Inference 19 June 2026
Google's DiffusionGemma Brings Novel Text Generation to Local LLMs 17 June 2026
CacheWise Optimizes KVCache Reuse for LLM Coding Agents 16 June 2026
AMD Brings Data Center-Level AI Performance to PCs 16 June 2026
RTX 5080 and RTX 3090 Setup Achieves 80 Tok/s on Qwen 3.6 27B Q8 13 June 2026
vLLM vs Ollama 2026: 793 vs 41 TPS Performance Benchmark 12 June 2026
Google's DiffusionGemma Achieves 4x Faster Text Generation for Local Deployment 12 June 2026
Hermes with Ollama Emerges as Top Choice for Desktop AI Tools 11 June 2026
Apple Enhances Siri With On-Device AI for Faster, Private Voice Responses 8 June 2026
NVIDIA Unveils First PC Chips at Computex 2026; CEO Jensen Huang Details New Hardware 7 June 2026
Best Local LLM Setup for RTX 5090: llama.cpp Fork with TurboQuant 7 June 2026
NVIDIA Dynamo Snapshot Accelerates AI Inference Startup on Kubernetes 6 June 2026
Show HN: Lowfat – Pluggable CLI Filter Saving 91.8% of LLM Tokens 5 June 2026
Longsys Redefines On-Device AI with Groundbreaking Edge Memory Solutions 4 June 2026
Perplexity Unveils Hybrid Local-Cloud Inference System for Intelligent Task Distribution 3 June 2026
NVIDIA RTX Spark Superchip Delivers 6,144 CUDA Cores for Consumer Local AI Inference 3 June 2026
Tether AI Upgrades QVAC SDK With TurboQuant for Data Center-Sized Memory on Everyday Devices 2 June 2026
Qualcomm Reveals Snapdragon C with Advanced On-Device AI Engine 1 June 2026
NVIDIA Launches N1X/N1 CPU-GPU SoC for PC Market, Targeting Heavy On-Device AI Users 1 June 2026
Netflix Wiz Creates App to Slash AI Bills by Pruning Agent Instructions, Then Open-Sources It 31 May 2026
Real-time LLM Inference on Standard GPUs: 3k tokens/s per request 29 May 2026
Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference 27 May 2026
Dell Launches 14 Plus Laptop with Intel Core Ultra 9 and 32GB RAM at $1,499.99, Enabling Local Model Inference 26 May 2026
New 8B Local LLM Design Marks Biggest Shift Since DeepSeek R1 23 May 2026
110 Tokens/Second on RTX 4070 Super with Qwen 3.6 35B 22 May 2026
llama.cpp Checkpoint Fix Accelerates Local Coding Agents 22 May 2026
Google Makes Gemini 3.5 Flash the Default AI Model for Billions of Users 22 May 2026
Intel llm-scaler-vllm 1.4 Released With Updated Components and Arc Pro B70 Support 21 May 2026
Hardware LLM Taalas Reaches >14,000 TPS on Llama 3.1 8B 21 May 2026
AMD's New Ryzen AI Max Pro 400 with 192GB LPDDR5X Memory 21 May 2026
Google Tensor SDK Beta with LiteRT Enables Efficient On-Device AI 20 May 2026
llama.cpp Adds Multi-Token Prediction, Doubles Qwen 3.6B Throughput for Local Inference 19 May 2026
Orthrus Reshapes Economics of Local AI Inference with New Optimization Approach 16 May 2026
llama.cpp Delivers Sharp Performance Gains for AMD RDNA3 Users 15 May 2026
DFlash Speculative Decoding Delivers 8.5x Speed Improvement for LLM Inference 11 May 2026
One LM Studio Setting Makes Local LLMs Competitive With Cloud Models 10 May 2026
Bun's Experimental Rust Rewrite Achieves 99.8% Test Compatibility on Linux 9 May 2026
Lemonade Gives AMD Startups a Wider Path to Local Inference 9 May 2026
Google Accelerates Gemma 4 Inference Speed 3x With Multi-Token Prediction Drafters 6 May 2026
Supercharging LLM Inference on Google TPUs: Achieving 3X Speedups With Diffusion-Style Speculative Decoding 5 May 2026
Gemma 4 Just Replaced My Whole Local LLM Stack 4 May 2026
Anker's Thus Chip Puts AI On-Device, Promising Faster Responses And Better Privacy 4 May 2026
PFlash Claims 10x Prefill Speedup Over llama.cpp 2 May 2026
Local AI Isn't Just Ollama—Here's the Ecosystem That Actually Makes It Useful 28 April 2026
Hipfire: A Rust-Native AMD Inference Engine That Outperforms llama.cpp 28 April 2026
Linux Crushes Windows on llama.cpp Inference by Double Digits 27 April 2026
Show HN: We built an OCR server that can process 270 dense images/s on a 5090 23 April 2026
llama.cpp Merges Speculative Checkpointing for Major Inference Speed Boost 20 April 2026
LlaMa.cpp Robot Wars 19 April 2026
Unweight: Lossless MLP Weight Compression for LLM Inference 18 April 2026
Sorting 1M u64 KV-Pairs in 20ms on i9-13980HX Using Branchless Rust Implementation 18 April 2026
DFlash Doubles Token Generation Speed of Qwen3.5 27B on Mac M5 Max 15 April 2026
Fine-Tuned Qwen3.5-0.8B for OCR Outperforms Previous 2B Release 14 April 2026
oMLX Framework Implements DFlash Attention for Optimized Inference 14 April 2026
Speculative Decoding Achieves 29% Speed Boost for Gemma-4 31B 13 April 2026
On-Device AI: Achieving Powerful AI Capabilities Without Internet Connectivity 12 April 2026
Google Gemma 4 Delivers Exceptional Speed and Accuracy for Local Inference 12 April 2026
DFlash Speculative Decoding Achieves 3.3x Speedup on Apple Silicon 12 April 2026
The Best Local AI Model for Home Assistant Isn't Always the Biggest One 12 April 2026
Intel Arc Pro B70 32GB Achieves 12 Tokens/Sec on Qwen 3.5-27B 11 April 2026
Google's Gemini Nano 4 Offers Faster, Smarter Local Inference Capabilities 11 April 2026
DMax: New Parallel Decoding Paradigm for Diffusion Language Models 11 April 2026
Qwen 3.5 122B Achieves 198 Tokens/sec on Dual RTX PRO 6000 Blackwell GPUs 10 April 2026
Speculative Decoding Made My Local LLM Actually Usable 9 April 2026
TurboQuant-Optimized llama.cpp Fork Delivers GFX906 GPU Acceleration 7 April 2026
Gemma 4 26B Achieves Impressive Local Performance With Proper Configuration 7 April 2026
TurboQuant in Llama.cpp Achieves 6X Smaller KV Cache 6 April 2026
HunyuanOCR 1B: High-Quality OCR Now Viable on Budget Consumer Hardware 6 April 2026
Ollama Gets Blazing Fast on Macs with Full MLX Support and 2× Speedups 5 April 2026
GMKtec NucBox K17 Launches with 97 TOPS AI Performance for Local Inference 5 April 2026
Mixed Precision Quantization on MLX with TurboQuant Implementation 4 April 2026
Kokoro TTS Achieves 20× Realtime Speed on CPU-Only On-Device Inference 4 April 2026
OpenUMA – Apple-Style Unified Memory for x86 AI Inference 3 April 2026
NVIDIA Accelerates Gemma 4 for Local Agentic AI on RTX GPUs 3 April 2026
Google Gemma 4 Released with GGUF Quantizations 3 April 2026
Gemma 4 26B A4B Outperforms Qwen 3.5 35B on Apple Silicon 3 April 2026
Apple Silicon Macs Run Local AI Faster with Ollama's New MLX Support 2 April 2026
TinyGPU Adds Mac Support for External Nvidia GPU Acceleration 2 April 2026
Ollama Adopts Apple's MLX Framework for Faster Local AI on Mac 1 April 2026
Llama.cpp Merging TurboQuant Lite (attn-rot) with Major Performance Gains 1 April 2026
TurboQuant: Understanding the Quantization Breakthrough 29 March 2026
Linux Significantly Outperforms Windows for Local LLM Inference 29 March 2026
TurboQuant KV Cache Compression Achieves 22.8% Faster Decoding at 32K Context 28 March 2026
M5 Max Delivers 1.7x Faster Inference Than M3 Max on Qwen 3.5 Models 28 March 2026
TurboQuant Benchmarked in Llama.cpp: Google's Extreme Compression Research Tested in Practice 27 March 2026
RotorQuant: 10-19x Faster Quantisation Alternative Using Clifford Algebra 27 March 2026
Qwen 3.5 27B Achieves 1.1M Tokens/Second on B200 GPUs with Optimized vLLM Config 27 March 2026
Liquid AI's LFM2-24B Achieves 50 Tokens/Second in Web Browser via WebGPU 26 March 2026
Google's TurboQuant: The Unsexy AI Breakthrough Worth Watching 26 March 2026
Llama.cpp Benchmark: RTX 5090 vs Enterprise Systems Compared 25 March 2026
Critical: LiteLLM Supply Chain Attack Detected, Bifrost Alternative Released 25 March 2026