Tagged "inference-optimization"
-
FlashAttention-4 Delivers 2.7x Faster Inference with 1613 TFLOPs/s on Blackwell GPUs
-
Qwen 3.5 Models: Optimal Settings and Reduced Overthinking Configuration
-
LM Studio Releases Reworked Plugins with Fully Local Web Research
-
Rust Project Perspectives on AI
-
Nvidia Nemotron Cascade 2 30B Emerges as Powerful Alternative to Qwen Models
-
ik_llama.cpp Fork Delivers 26x Faster Prompt Processing on Qwen 3.5 27B
-
Multi-Token Prediction support coming to MLX-LM for Qwen 3.5
-
Apple M5 Max 128GB real-world performance benchmarks for local inference
-
DeepSeek R1 on RTX 4090 vs. Apple M3 Max: Benchmark & Performance Guide
-
Repurpose Old GPUs as Dedicated AI Inference Accelerators
-
NVIDIA Nemotron 3 Nano 4B Enables On-Device Inference Directly in Web Browsers via WebGPU
-
LMCache Dramatically Accelerates LLM Inference on Oracle Data Science Platform
-
Llamafile 0.10 Released with GPU Support and Rebuilt Core
-
AI's Impact on Mathematics Analogous to Car's Impact on Cities
-
Meet Sarvam Edge: India's AI Model That Runs on Phones and Laptops With No Internet
-
Multiverse Computing Targets On-Device AI With Compressed Models and New API Portal
-
Kilo Is the VS Code Extension That Actually Works With Every Local LLM I Throw At It
-
Dell Pro Max 16 Plus Launches With Enterprise-Grade Discrete NPU for On-Device AI
-
Tether's QVAC Introduces Cross-Platform Bitnet LoRA Framework for On-Device AI Training
-
Snapdragon 8 Elite Gen 5 Hands the Galaxy S26 the AI Upgrade We've Been Waiting For
-
Mamba 3: State Space Model Architecture Optimized for Inference
-
I Switched to a Local LLM for These 5 Tasks and the Cloud Version Hasn't Been Worth It Since
-
You're Using Your Local LLM Wrong If You're Prompting It Like a Cloud LLM
-
Run LLMs Locally with Llama.cpp
-
I Ran Local LLMs on a 'Dead' GPU, and the Results Surprised Me
-
A New Magnetic Material for the AI Era
-
Mistral Small 4 119B Released with NVFP4 Quantisation Support
-
Kimi Introduces Attention Residuals: 1.25x Compute Performance at <2% Overhead
-
Practical Fix for Qwen 3.5 Overthinking in llama.cpp
-
This External GPU Enclosure Tries to Break Cloud Dependence for Local AI Inference
-
AMD Declares 'AI on the PC Has Crossed an Important Line' – Agent Computers as Next Breakthrough
-
Qwen3.5-397B Achieves 282 tok/s on 4x RTX PRO 6000 Blackwell Through Custom CUTLASS Kernel
-
Nvidia's Nemotron 3 Super: Understanding the Significance for Local LLM Deployment
-
Running Qwen3.5-27B Across Multiple GPUs Over LAN Achieves Practical Speed for Local Inference
-
Startup Transforms Mac Mini Into Full-Powered AI Inference System With External GPU
-
AMD Launches Agent System Optimized for Local AI Inference With Ryzen and Radeon
-
Achieving 2000 Tokens Per Second with Qwen 3.5 27B on RTX 5090
-
P-EAGLE: Faster LLM Inference with Parallel Speculative Decoding in vLLM
-
Intel OpenVINO Backend Support Now Available in llama.cpp
-
Memory Should Decay: Implementing Temporal Memory Decay in Local LLM Systems
-
Lemonade v10 Brings Linux NPU Support and Multi-Modal Capabilities
-
Fine-Tuned 14B Model Outperforms Claude Opus 4.6 on Ada Code Generation
-
Runpod Report: Qwen Has Overtaken Meta's Llama As The Most-Deployed Self-Hosted LLM
-
Intel Updates LLM-Scaler-vLLM With Support For More Qwen3/3.5 Models
-
Qwodel – An Open-Source Unified Pipeline for LLM Quantization
-
Nvidia Pushes Jetson as Edge Hub for Open AI Models
-
Nvidia Releases Nemotron 3 Super: 120B MoE Model for Local Deployment
-
The $1,500 Local AI Setup: DeepSeek-R1 on Consumer Hardware
-
Llama.cpp Adds True Reasoning Budget Support
-
Cutile.jl Brings Nvidia CUDA Tile-Based Programming to Julia
-
NVIDIA Jetson Brings Open Models to Life at the Edge
-
Qwen 3.5 Ultra-Compact Models Enable On-Device AI from Watches to Gaming
-
8 Local LLM Settings Most People Never Touch That Fixed My Worst AI Problems
-
Google Delivers On-Device AI Features in New Chromebook Plus Model
-
Fish Audio Open-Sources S2: Expressive Text-to-Speech with Natural Language Control and 100ms Latency
-
Fine-Tuned Qwen SLMs (0.6–8B) Demonstrate Competitive Performance Against Frontier LLMs on Specialized Tasks
-
M5 Max and M5 Ultra Chipsets Demonstrate Significant Bandwidth Improvements for Local LLM Inference
-
Strix Halo (Ryzen AI Max+ 395) Achieves Strong Local Inference Performance with ROCm 7.2
-
When Running Ollama on Your PC for Local AI, One Thing Matters More Than Most
-
Nemotron 9B Powers Large-Scale Local Inference: Patent Classification and Real-Time Applications
-
Samsung Opens Registration for Vision AI QLED and OLED Television Integration
-
Qwen 3.5 27B Achieves Strong Local Inference Performance
-
Mistral AI Prepares Workflows Integration for Le Chat
-
Benchmark: Local Open-Source LLMs Competitive in Real-Time Trading Applications
-
Llama.cpp Prompt Processing Optimization: Ubatch Size Configuration Guide
-
Building PyTorch-Native Support for IBM Spyre Accelerator
-
Mojo: Creating a Programming Language for an AI World with Chris Lattner
-
Alibaba Releases Qwen 3.5 AI Model with On-Device AI Support
-
Show HN: TLDR – Free Chrome Extension for AI-Powered Article Summarization
-
Final Qwen3.5 Unsloth GGUF Update with Improved Size/Quality Tradeoffs
-
HyperExcel Seeks 150 Billion Won Series B to Scale LPU and Verda in Korea
-
Kakao Launches Kanana AI for On-Device Schedule and Recommendation Management
-
SynthesisOS – A Local-First, Agentic Desktop Layer Built in Rust
-
Apple Unveils MacBook Pro With M5 Pro and M5 Max for On-Device AI
-
AMD Launches Copilot+ Desktop Chips to Compete in On-Device AI Market
-
Qwen 3.5 0.8B Successfully Deployed on 7-Year-Old Samsung S10E Using llama.cpp
-
Alibaba's Qwen 3.5 Small Model Runs Directly on iPhone 17
-
Qualcomm Launches Snapdragon Wear Elite for On-Device AI on Wearables
-
Local LLM Performance Improvements: A Year of Progress Since DeepSeek R1 Moment
-
HP ZBook Ultra 14 G1a Workstation Reclaims Local AI Workflows for Professionals
-
RAG-Enterprise – 100% Local RAG System for Enterprise Documents
-
Switch Qwen 3.5 Thinking Mode On/Off Without Model Reload Using setParamsByID
-
Qwen 3.5-35B-A3B Emerges as Efficient Daily Driver, Replacing 120B Models
-
Google Research Finds Longer Chain-of-Thought Correlates Negatively With Accuracy
-
4 Free Tools to Run Powerful AI on Your PC Without a Subscription
-
Bare-Metal LLM Inference: UEFI Application Boots Directly Into LLM Chat
-
Unsloth Dynamic 2.0 GGUFs
-
Qwen3.5-35B RTX 5080 Experiments Confirm KV q8_0 as Free Lunch, Q4_K_M Remains Optimal
-
Qwen3.5-35B Successfully Runs on Raspberry Pi 5 at 3+ Tokens/Second
-
Qwen 3.5-27B Demonstrates Exceptional Performance with Thoughtful Prompt Engineering
-
The ML.energy Leaderboard
-
LLmFit: Terminal Tool for Right-Sizing LLMs to Your Hardware
-
Krasis: Hybrid CPU/GPU MoE Runtime Achieves 3,324 Tokens/Second Prefill on RTX 5080
-
Snapdragon 8 Elite Gen 5 Powers Galaxy S26 Series With Enhanced On-Device AI
-
Extracting 100K Concepts from an 8B LLM
-
Show HN: Caret – Tab to Complete at Any App on Your Mac
-
Arduino, Qualcomm Bring On-Device AI and Robotics Learning to Indian School Systems
-
Qwen 3.5 MoE Delivers 100K Context Window at 40+ TPS on RTX 5060 Ti
-
Qwen3.5 122B Achieves 25 tok/s on 72GB VRAM Setup
-
Researchers Develop Persistent Memory System for Local LLMs—No RAG Required
-
Ollama for JavaScript Developers: Building AI Apps Without API Keys
-
DeepSeek Releases DualPath: Addressing Storage Bandwidth Bottlenecks in Agentic Inference
-
The Complete Developer's Guide to Running LLMs Locally: From Ollama to Production
-
New Era of On-Device AI Driven by High-Speed UFS 5.0 Storage
-
Qwen3.5 Thinking Mode Can Be Disabled for Production Inference Optimization
-
Qwen3.5-35B-A3B Emerges as Game-Changer for Agentic Coding Tasks
-
Qwen3.5-27B Identified as Sweet Spot for Mid-Range Local Deployment
-
PyTorch Foundation Announces New Members as Agentic AI Demand Grows
-
Show HN: Pluckr – LLM-Powered HTML Scraper That Caches Selectors and Auto-Heals
-
Mirai Announces $10M to Advance On-Device AI Performance for Consumer Devices
-
Show HN: 100% LLM Accuracy–No Fine-Tuning, JSON Only
-
Which Web Frameworks Are Most Token-Efficient for AI Agents?
-
Breaking the Speed Limit: Strategies for 17k Tokens/Sec Local Inference
-
South Korea to Launch $687 Million Project to Develop On-Device AI Semiconductors
-
Custom Portable Workstation Optimized for Local AI Inference Builds
-
GPT-OSS 20B Demonstrates Practical Agentic Capabilities Running Fully Locally
-
Show HN: Tickr – AI Project Manager That Lives Inside Slack (Replaces Jira)
-
How Slow Local LLMs Are on My Framework 13 AMD Strix Point
-
At India AI Impact Summit, Intel Showcases AI PCs and Cost-Efficient 'Frugal AI'
-
Show HN: Horizon – My AI-Powered Personal News Aggregator and Summarizer
-
DietPi Releases New Version v10.1
-
Asus ExpertBook B3 G2 with 50 TOPS AI Sets New Enterprise Standard
-
Vellium v0.3.5: Major Writing Mode Overhaul and Native KoboldCpp Support
-
Taalas Etches AI Models onto Transistors to Rocket Boost Inference
-
Strix Halo Performance Benchmarks: Minimax M2.5, Step 3.5 Flash, Qwen3 Coder
-
I Run Local LLMs in One of the World's Priciest Energy Markets, and I Can Barely Tell
-
[Release] Ouro-2.6B-Thinking: ByteDance's Recurrent Model Now Runnable Locally
-
Google Is Exploring Ways to Use Its Financial Might to Take on Nvidia
-
GGML.AI Acquired by Hugging Face
-
Apple Researchers Develop On-Device AI Agent That Interacts With Apps for You
-
TemplateFlow – Build AI Workflows, Not Prompts
-
Qwen3 Coder Next FP8 Demonstrates Exceptional Long-Context Performance on 128GB System
-
The Path to Ubiquitous AI (17k tokens/sec)
-
Free ASIC-Accelerated Llama 3.1 8B Inference at 16,000 Tokens/Second
-
AI Integration in Sublime Text: Practical Local LLM Editor Enhancement
-
Self-Hosted Local LLMs for Document Management with Paperless-ngx
-
Sarvam Brings AI to Feature Phones, Cars, and Smart Glasses
-
Enhanced Quantization Visualization Methods for Understanding LLM Compression Trade-offs
-
Local-First RAG: Vector Search in SQLite with Hamming Distance
-
LayerScale Launches Inference Engine Faster Than vLLM, SGLang, and TRT-LLM
-
Alibaba's Qwen3.5-397B Achieves #3 Position in Open Weights Model Rankings
-
Qualcomm Ventures Positions India as Blueprint for Affordable On-Device AI Infrastructure
-
Cloudflare Releases Agents SDK v0.5.0 with Rust-Powered Infire Engine for Edge Inference
-
Can We Leverage AI/LLMs for Self-Learning?
-
AMD Announces Day 0 Support for Qwen 3.5 LLM on Instinct GPUs
-
Qwen3-Next 80B MoE Achieves 39 Tokens/Second on RTX 5070/5060 Ti Dual-GPU Setup
-
Open-Source Models Now Make Up 4 of the Top 5 Most-Used Endpoints on OpenRouter
-
High Bandwidth Flash Memory Could Alleviate VRAM Constraints in Local LLM Inference
-
Cohere Releases Tiny Aya: Efficient 3.3B Multilingual Model for 70+ Languages
-
Chinese AI Chipmaker Axera Semiconductor Plans $379 Million Hong Kong IPO for Edge Inference Hardware
-
Asus ExpertBook B3 G2 Laptop Features Ryzen AI 9 HX 470 CPU in 1.41kg Ultraportable Form Factor
-
Ask HN: What is the best bang-for-buck budget AI coding setup?
-
GPU-Accelerated DataFrame Library for Local Inference Workloads
-
Alibaba Unveils Major AI Model Upgrade Ahead of DeepSeek Release
-
Switching From Ollama And LM Studio To llama.cpp: A Performance Comparison
-
LLaDA2.1 Introduces Token Editing for Massive Speed Gains in Local Inference
-
First Vibecoded AI Operating System for Local Deployment
-
Ring-1T-2.5 Released with SOTA Deep Thinking Performance
-
The Future of AI Slop Is Constraints – Implications for Local Models
-
ByteDance Releases Seedance 2.0 AI Development Platform
-
Running Mistral-7B on Intel NPU Achieves 12.6 Tokens/Second
-
Use Recursive Language Models to Address Huge Contexts for Local LLMs
-
Mistral AI Debugs Critical Memory Leak in vLLM Inference Engine
-
NAS System Achieves 18 tok/s with 80B LLM Using Only Integrated Graphics
-
Energy-Based Models Compared Against Frontier AI for Sudoku Solving
-
Carmack Proposes Using Long Fiber Lines as L2 Cache for Streaming AI Data