Tagged "inference-speed"
- Tether AI Upgrades QVAC SDK With TurboQuant for Data Center-Sized Memory on Everyday Devices
- Qualcomm Reveals Snapdragon C with Advanced On-Device AI Engine
- NVIDIA Launches N1X/N1 CPU-GPU SoC for PC Market, Targeting Heavy On-Device AI Users
- Netflix Wiz Creates App to Slash AI Bills by Pruning Agent Instructions, Then Open-Sources It
- Real-time LLM Inference on Standard GPUs: 3k tokens/s per request
- Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference
- Dell Launches 14 Plus Laptop with Intel Core Ultra 9 and 32GB RAM at $1,499.99, Enabling Local Model Inference
- New 8B Local LLM Design Marks Biggest Shift Since DeepSeek R1
- 110 Tokens/Second on RTX 4070 Super with Qwen 3.6 35B
- llama.cpp Checkpoint Fix Accelerates Local Coding Agents
- Google Makes Gemini 3.5 Flash the Default AI Model for Billions of Users
- Intel llm-scaler-vllm 1.4 Released With Updated Components and Arc Pro B70 Support
- Hardware LLM Taalas Reaches >14,000 TPS on Llama 3.1 8B
- AMD's New Ryzen AI Max Pro 400 with 192GB LPDDR5X Memory
- Google Tensor SDK Beta with LiteRT Enables Efficient On-Device AI
- llama.cpp Adds Multi-Token Prediction, Doubles Qwen 3.6B Throughput for Local Inference
- Orthrus Reshapes Economics of Local AI Inference with New Optimization Approach
- llama.cpp Delivers Sharp Performance Gains for AMD RDNA3 Users
- DFlash Speculative Decoding Delivers 8.5x Speed Improvement for LLM Inference
- One LM Studio Setting Makes Local LLMs Competitive With Cloud Models
- Bun's Experimental Rust Rewrite Achieves 99.8% Test Compatibility on Linux
- Lemonade Gives AMD Startups a Wider Path to Local Inference
- Google Accelerates Gemma 4 Inference Speed 3x With Multi-Token Prediction Drafters
- Supercharging LLM Inference on Google TPUs: Achieving 3X Speedups With Diffusion-Style Speculative Decoding
- Gemma 4 Just Replaced My Whole Local LLM Stack
- Anker's Thus Chip Puts AI On-Device, Promising Faster Responses And Better Privacy
- PFlash Claims 10x Prefill Speedup Over llama.cpp
- Local AI Isn't Just Ollama—Here's the Ecosystem That Actually Makes It Useful
- Hipfire: A Rust-Native AMD Inference Engine That Outperforms llama.cpp
- Linux Crushes Windows on llama.cpp Inference by Double Digits
- Show HN: We built an OCR server that can process 270 dense images/s on a 5090
- llama.cpp Merges Speculative Checkpointing for Major Inference Speed Boost
- llama.cpp Robot Wars
- Unweight: Lossless MLP Weight Compression for LLM Inference
- Sorting 1M u64 KV-Pairs in 20ms on i9-13980HX Using Branchless Rust Implementation
- DFlash Doubles Token Generation Speed of Qwen3.5 27B on Mac M5 Max
- Fine-Tuned Qwen3.5-0.8B for OCR Outperforms Previous 2B Release
- oMLX Framework Implements DFlash Attention for Optimized Inference
- Speculative Decoding Achieves 29% Speed Boost for Gemma-4 31B
- On-Device AI: Achieving Powerful AI Capabilities Without Internet Connectivity
- Google Gemma 4 Delivers Exceptional Speed and Accuracy for Local Inference
- DFlash Speculative Decoding Achieves 3.3x Speedup on Apple Silicon
- The Best Local AI Model for Home Assistant Isn't Always the Biggest One
- Intel Arc Pro B70 32GB Achieves 12 Tokens/Sec on Qwen 3.5-27B
- Google's Gemini Nano 4 Offers Faster, Smarter Local Inference Capabilities
- DMax: New Parallel Decoding Paradigm for Diffusion Language Models
- Qwen 3.5 122B Achieves 198 Tokens/sec on Dual RTX PRO 6000 Blackwell GPUs
- Speculative Decoding Made My Local LLM Actually Usable
- TurboQuant-Optimized llama.cpp Fork Delivers GFX906 GPU Acceleration
- Gemma 4 26B Achieves Impressive Local Performance With Proper Configuration
- TurboQuant in Llama.cpp Achieves 6X Smaller KV Cache
- HunyuanOCR 1B: High-Quality OCR Now Viable on Budget Consumer Hardware
- Ollama Gets Blazing Fast on Macs with Full MLX Support and 2× Speedups
- GMKtec NucBox K17 Launches with 97 TOPS AI Performance for Local Inference
- Mixed Precision Quantization on MLX with TurboQuant Implementation
- Kokoro TTS Achieves 20× Realtime Speed on CPU-Only On-Device Inference
- OpenUMA – Apple-Style Unified Memory for x86 AI Inference
- NVIDIA Accelerates Gemma 4 for Local Agentic AI on RTX GPUs
- Google Gemma 4 Released with GGUF Quantizations
- Gemma 4 26B A4B Outperforms Qwen 3.5 35B on Apple Silicon
- Apple Silicon Macs Run Local AI Faster with Ollama's New MLX Support
- TinyGPU Adds Mac Support for External Nvidia GPU Acceleration
- Ollama Adopts Apple's MLX Framework for Faster Local AI on Mac
- Llama.cpp Merging TurboQuant Lite (attn-rot) with Major Performance Gains
- TurboQuant: Understanding the Quantization Breakthrough
- Linux Significantly Outperforms Windows for Local LLM Inference
- TurboQuant KV Cache Compression Achieves 22.8% Faster Decoding at 32K Context
- M5 Max Delivers 1.7x Faster Inference Than M3 Max on Qwen 3.5 Models
- TurboQuant Benchmarked in Llama.cpp: Google's Extreme Compression Research Tested in Practice
- RotorQuant: 10-19x Faster Quantisation Alternative Using Clifford Algebra
- Qwen 3.5 27B Achieves 1.1M Tokens/Second on B200 GPUs with Optimized vLLM Config
- Liquid AI's LFM2-24B Achieves 50 Tokens/Second in Web Browser via WebGPU
- Google's TurboQuant: The Unsexy AI Breakthrough Worth Watching
- Llama.cpp Benchmark: RTX 5090 vs Enterprise Systems Compared
- Critical: LiteLLM Supply Chain Attack Detected, Bifrost Alternative Released