Tagged "inference-optimization"
-
FlashAttention-4 Delivers 2.7x Faster Inference with 1613 TFLOPs/s on Blackwell GPUs
-
Qwen 3.5 Models: Optimal Settings and Reduced Overthinking Configuration
-
LM Studio Releases Reworked Plugins with Fully Local Web Research
-
Rust Project Perspectives on AI
-
Nvidia Nemotron Cascade 2 30B Emerges as Powerful Alternative to Qwen Models
-
ik_llama.cpp Fork Delivers 26x Faster Prompt Processing on Qwen 3.5 27B
-
Multi-Token Prediction support coming to MLX-LM for Qwen 3.5
-
Apple M5 Max 128GB real-world performance benchmarks for local inference
-
DeepSeek R1 on RTX 4090 vs. Apple M3 Max: Benchmark & Performance Guide
-
Repurpose Old GPUs as Dedicated AI Inference Accelerators
-
NVIDIA Nemotron 3 Nano 4B Enables On-Device Inference Directly in Web Browsers via WebGPU
-
LMCache Dramatically Accelerates LLM Inference on Oracle Data Science Platform
-
Llamafile 0.10 Released with GPU Support and Rebuilt Core
-
AI's Impact on Mathematics Analogous to Car's Impact on Cities
-
Meet Sarvam Edge: India's AI Model That Runs on Phones and Laptops With No Internet
-
Multiverse Computing Targets On-Device AI With Compressed Models and New API Portal
-
Kilo Is the VS Code Extension That Actually Works With Every Local LLM I Throw At It
-
Dell Pro Max 16 Plus Launches With Enterprise-Grade Discrete NPU for On-Device AI
-
Tether's QVAC Introduces Cross-Platform Bitnet LoRA Framework for On-Device AI Training
-
Snapdragon 8 Elite Gen 5 Hands the Galaxy S26 the AI Upgrade We've Been Waiting For
-
Mamba 3: State Space Model Architecture Optimized for Inference
-
I Switched to a Local LLM for These 5 Tasks and the Cloud Version Hasn't Been Worth It Since
-
You're Using Your Local LLM Wrong If You're Prompting It Like a Cloud LLM
-
Run LLMs Locally with Llama.cpp
-
I Ran Local LLMs on a 'Dead' GPU, and the Results Surprised Me
-
A New Magnetic Material for the AI Era
-
Mistral Small 4 119B Released with NVFP4 Quantisation Support
-
Kimi Introduces Attention Residuals: 1.25x Compute Performance at <2% Overhead
-
Practical Fix for Qwen 3.5 Overthinking in llama.cpp
-
This External GPU Enclosure Tries to Break Cloud Dependence for Local AI Inference
-
AMD Declares 'AI on the PC Has Crossed an Important Line' – Agent Computers as Next Breakthrough
-
Qwen3.5-397B Achieves 282 tok/s on 4x RTX PRO 6000 Blackwell Through Custom CUTLASS Kernel
-
Nvidia's Nemotron 3 Super: Understanding the Significance for Local LLM Deployment
-
Running Qwen3.5-27B Across Multiple GPUs Over LAN Achieves Practical Speed for Local Inference
-
Startup Transforms Mac Mini Into Full-Powered AI Inference System With External GPU
-
AMD Launches Agent System Optimized for Local AI Inference With Ryzen and Radeon
-
Achieving 2000 Tokens Per Second with Qwen 3.5 27B on RTX 5090
-
P-EAGLE: Faster LLM Inference with Parallel Speculative Decoding in vLLM
-
Intel OpenVINO Backend Support Now Available in llama.cpp
-
Memory Should Decay: Implementing Temporal Memory Decay in Local LLM Systems
-
Lemonade v10 Brings Linux NPU Support and Multi-Modal Capabilities
-
Fine-Tuned 14B Model Outperforms Claude Opus 4.6 on Ada Code Generation
-
Runpod Report: Qwen Has Overtaken Meta's Llama As The Most-Deployed Self-Hosted LLM
-
Intel Updates LLM-Scaler-vLLM With Support For More Qwen3/3.5 Models
-
Qwodel – An Open-Source Unified Pipeline for LLM Quantization
-
Nvidia Pushes Jetson as Edge Hub for Open AI Models
-
Nvidia Releases Nemotron 3 Super: 120B MoE Model for Local Deployment
-
The $1,500 Local AI Setup: DeepSeek-R1 on Consumer Hardware
-
Llama.cpp Adds True Reasoning Budget Support
-
Cutile.jl Brings Nvidia CUDA Tile-Based Programming to Julia
-
NVIDIA Jetson Brings Open Models to Life at the Edge
-
Qwen 3.5 Ultra-Compact Models Enable On-Device AI from Watches to Gaming
-
8 Local LLM Settings Most People Never Touch That Fixed My Worst AI Problems
-
Google Delivers On-Device AI Features in New Chromebook Plus Model
-
Fish Audio Open-Sources S2: Expressive Text-to-Speech with Natural Language Control and 100ms Latency
-
Fine-Tuned Qwen SLMs (0.6–8B) Demonstrate Competitive Performance Against Frontier LLMs on Specialized Tasks
-
M5 Max and M5 Ultra Chipsets Demonstrate Significant Bandwidth Improvements for Local LLM Inference
-
Strix Halo (Ryzen AI Max+ 395) Achieves Strong Local Inference Performance with ROCm 7.2
-
When Running Ollama on Your PC for Local AI, One Thing Matters More Than Most
-
Nemotron 9B Powers Large-Scale Local Inference: Patent Classification and Real-Time Applications
-
Samsung Opens Registration for Vision AI QLED and OLED Television Integration
-
Qwen 3.5 27B Achieves Strong Local Inference Performance
-
Mistral AI Prepares Workflows Integration for Le Chat
-
Benchmark: Local Open-Source LLMs Competitive in Real-Time Trading Applications
-
Llama.cpp Prompt Processing Optimization: Ubatch Size Configuration Guide
-
Building PyTorch-Native Support for IBM Spyre Accelerator
-
Mojo: Creating a Programming Language for an AI World with Chris Lattner
-
Alibaba Releases Qwen 3.5 AI Model with On-Device AI Support
-
Show HN: TLDR – Free Chrome Extension for AI-Powered Article Summarization
-
Final Qwen3.5 Unsloth GGUF Update with Improved Size/Quality Tradeoffs
-
HyperExcel Seeks 150 Billion Won Series B to Scale LPU and Verda in Korea
-
Kakao Launches Kanana AI for On-Device Schedule and Recommendation Management
-
SynthesisOS – A Local-First, Agentic Desktop Layer Built in Rust
-
Apple Unveils MacBook Pro With M5 Pro and M5 Max for On-Device AI
-
AMD Launches Copilot+ Desktop Chips to Compete in On-Device AI Market
-
Qwen 3.5 0.8B Successfully Deployed on 7-Year-Old Samsung S10E Using llama.cpp
-
Alibaba's Qwen 3.5 Small Model Runs Directly on iPhone 17
-
Qualcomm Launches Snapdragon Wear Elite for On-Device AI on Wearables
-
Local LLM Performance Improvements: A Year of Progress Since DeepSeek R1 Moment
-
HP ZBook Ultra 14 G1a Workstation Reclaims Local AI Workflows for Professionals
-
RAG-Enterprise – 100% Local RAG System for Enterprise Documents
-
Switch Qwen 3.5 Thinking Mode On/Off Without Model Reload Using setParamsByID
-
Qwen 3.5-35B-A3B Emerges as Efficient Daily Driver, Replacing 120B Models
-
Google Research Finds Longer Chain-of-Thought Correlates Negatively With Accuracy
-
4 Free Tools to Run Powerful AI on Your PC Without a Subscription
-
Bare-Metal LLM Inference: UEFI Application Boots Directly Into LLM Chat
-
Unsloth Dynamic 2.0 GGUFs
-
Qwen3.5-35B RTX 5080 Experiments Confirm KV q8_0 as Free Lunch, Q4_K_M Remains Optimal
-
Qwen3.5-35B Successfully Runs on Raspberry Pi 5 at 3+ Tokens/Second
-
Qwen 3.5-27B Demonstrates Exceptional Performance with Thoughtful Prompt Engineering
-
The ML.energy Leaderboard
-
LLmFit: Terminal Tool for Right-Sizing LLMs to Your Hardware
-
Krasis: Hybrid CPU/GPU MoE Runtime Achieves 3,324 Tokens/Second Prefill on RTX 5080
-
Snapdragon 8 Elite Gen 5 Powers Galaxy S26 Series With Enhanced On-Device AI
-
Extracting 100K Concepts from an 8B LLM
-
Show HN: Caret – Tab to Complete at Any App on Your Mac
-
Arduino, Qualcomm Bring On-Device AI and Robotics Learning to Indian School Systems
-
Qwen 3.5 MoE Delivers 100K Context Window at 40+ TPS on RTX 5060 Ti
-
Qwen3.5 122B Achieves 25 tok/s on 72GB VRAM Setup
-
Researchers Develop Persistent Memory System for Local LLMs—No RAG Required
-
Ollama for JavaScript Developers: Building AI Apps Without API Keys
-
DeepSeek Releases DualPath: Addressing Storage Bandwidth Bottlenecks in Agentic Inference
-
The Complete Developer's Guide to Running LLMs Locally: From Ollama to Production
-
New Era of On-Device AI Driven by High-Speed UFS 5.0 Storage
-
Qwen3.5 Thinking Mode Can Be Disabled for Production Inference Optimization
-
Qwen3.5-35B-A3B Emerges as Game-Changer for Agentic Coding Tasks
-
Qwen3.5-27B Identified as Sweet Spot for Mid-Range Local Deployment
-
PyTorch Foundation Announces New Members as Agentic AI Demand Grows
-
Show HN: Pluckr – LLM-Powered HTML Scraper That Caches Selectors and Auto-Heals
-
Mirai Announces $10M to Advance On-Device AI Performance for Consumer Devices
-
Show HN: 100% LLM Accuracy–No Fine-Tuning, JSON Only
-
Which Web Frameworks Are Most Token-Efficient for AI Agents?
-
Breaking the Speed Limit: Strategies for 17k Tokens/Sec Local Inference
-
South Korea to Launch $687 Million Project to Develop On-Device AI Semiconductors
-
Custom Portable Workstation Optimized for Local AI Inference Builds
-
GPT-OSS 20B Demonstrates Practical Agentic Capabilities Running Fully Locally
-
Show HN: Tickr – AI Project Manager That Lives Inside Slack (Replaces Jira)
-
How Slow Local LLMs Are on My Framework 13 AMD Strix Point
-
At India AI Impact Summit, Intel Showcases AI PCs and Cost-Efficient 'Frugal AI'
-
Show HN: Horizon – My AI-Powered Personal News Aggregator and Summarizer
-
DietPi Releases New Version v10.1
-
Asus ExpertBook B3 G2 with 50 TOPS AI Sets New Enterprise Standard
-
Vellium v0.3.5: Major Writing Mode Overhaul and Native KoboldCpp Support
-
Taalas Etches AI Models onto Transistors to Rocket Boost Inference
-
Strix Halo Performance Benchmarks: Minimax M2.5, Step 3.5 Flash, Qwen3 Coder
-
I Run Local LLMs in One of the World's Priciest Energy Markets, and I Can Barely Tell
-
[Release] Ouro-2.6B-Thinking: ByteDance's Recurrent Model Now Runnable Locally
-
Google Is Exploring Ways to Use Its Financial Might to Take on Nvidia
-
GGML.AI Acquired by Hugging Face
-
Apple Researchers Develop On-Device AI Agent That Interacts With Apps for You
-
TemplateFlow – Build AI Workflows, Not Prompts
-
Qwen3 Coder Next FP8 Demonstrates Exceptional Long-Context Performance on 128GB System
-
The Path to Ubiquitous AI (17k tokens/sec)
-
Free ASIC-Accelerated Llama 3.1 8B Inference at 16,000 Tokens/Second
-
AI Integration in Sublime Text: Practical Local LLM Editor Enhancement
-
Self-Hosted Local LLMs for Document Management with Paperless-ngx
-
Sarvam Brings AI to Feature Phones, Cars, and Smart Glasses
-
Enhanced Quantization Visualization Methods for Understanding LLM Compression Trade-offs
-
Local-First RAG: Vector Search in SQLite with Hamming Distance
-
LayerScale Launches Inference Engine Faster Than vLLM, SGLang, and TRT-LLM
-
Alibaba's Qwen3.5-397B Achieves #3 Position in Open Weights Model Rankings
-
Qualcomm Ventures Positions India as Blueprint for Affordable On-Device AI Infrastructure
-
Cloudflare Releases Agents SDK v0.5.0 with Rust-Powered Infire Engine for Edge Inference
-
Can We Leverage AI/LLMs for Self-Learning?
-
AMD Announces Day 0 Support for Qwen 3.5 LLM on Instinct GPUs
-
Qwen3-Next 80B MoE Achieves 39 Tokens/Second on RTX 5070/5060 Ti Dual-GPU Setup
-
Open-Source Models Now Make Up 4 of the Top 5 Most-Used Endpoints on OpenRouter
-
High Bandwidth Flash Memory Could Alleviate VRAM Constraints in Local LLM Inference
-
Cohere Releases Tiny Aya: Efficient 3.3B Multilingual Model for 70+ Languages
-
Chinese AI Chipmaker Axera Semiconductor Plans $379 Million Hong Kong IPO for Edge Inference Hardware
-
Asus ExpertBook B3 G2 Laptop Features Ryzen AI 9 HX 470 CPU in 1.41kg Ultraportable Form Factor
-
Ask HN: What is the best bang-for-buck budget AI coding setup?
-
GPU-Accelerated DataFrame Library for Local Inference Workloads
-
Alibaba Unveils Major AI Model Upgrade Ahead of DeepSeek Release
-
Switching From Ollama And LM Studio To llama.cpp: A Performance Comparison
-
LLaDA2.1 Introduces Token Editing for Massive Speed Gains in Local Inference
-
First Vibecoded AI Operating System for Local Deployment
-
Ring-1T-2.5 Released with SOTA Deep Thinking Performance
-
The Future of AI Slop Is Constraints – Implications for Local Models
-
ByteDance Releases Seedance 2.0 AI Development Platform
-
Running Mistral-7B on Intel NPU Achieves 12.6 Tokens/Second
-
Use Recursive Language Models to Address Huge Contexts for Local LLMs
-
Mistral AI Debugs Critical Memory Leak in vLLM Inference Engine
-
NAS System Achieves 18 tok/s with 80B LLM Using Only Integrated Graphics
-
Energy-Based Models Compared Against Frontier AI for Sudoku Solving
-
Carmack Proposes Using Long Fiber Lines as L2 Cache for Streaming AI Data