Tagged "memory-optimization"
- KV Cache Quantization Levels Benchmarked on SWE-bench: Practical Trade-offs for Local Inference
- FOMOE: Running 397B Parameter Qwen3.5 MoE at 5-9 tok/s on $2,100 Desktop Hardware
- Ditching Paid AI Services: Building Self-Hosted LLM Solutions as ChatGPT, Claude, and Gemini Alternatives
- Qwen 3.5 122B Uncensored (Aggressive) Released with New K_P Quantisations
- Llama 8B Matches 70B Performance on Multi-Hop QA Using Structured Prompting
- A Small Gap That Will Secure an Autonomous Future for AI Agents
- Running an AI Agent on a 448KB RAM Microcontroller
- MacinAI Local Brings Functional LLM Inference to Classic Macintosh Hardware
- DeepSeek R1 RTX 4090 vs Apple M3 Max: Benchmark & Performance Guide
- Community Converges on Optimal KV Cache Quantization Strategies for Qwen 3.5 Models
- NVIDIA Nemotron Cascade 2 30B Delivers 120B-Class Performance in Compact Form Factor
- LMCache Dramatically Accelerates LLM Inference on Oracle Data Science Platform
- Mamba 3: State Space Model Architecture Optimized for Inference
- Custom GPU Multiplexer Achieves 0.3ms Model Switching on Legacy Hardware
- Mistral Small 4 119B Released with NVFP4 Quantisation Support
- Researcher Discovers Universal "Danger Zone" in Transformer Model Architecture at 50% Depth
- The Moment AI Agents Stopped Being a Feature and Started Becoming a System
- OpenClaw Isn't the Only Raspberry Pi AI Tool—Here Are 4 Others You Can Try This Week
- OmniCoder-9B: Efficient Coding Model for 8GB GPUs
- Open-Source GreenBoost Driver Augments NVIDIA GPU VRAM With System RAM and NVMe Storage
- Memory Should Decay: Implementing Temporal Memory Decay in Local LLM Systems
- Best Local LLMs 2026: Developer Comparison
- 3-Path Agent Memory: 8 KB Recurrent State vs. 156 MB KV Cache at 10K Tokens
- Qwodel – An Open-Source Unified Pipeline for LLM Quantization
- Quantization Explained: Q4_K_M vs AWQ vs FP16 for Local LLMs
- Apple M5 Max 128GB Benchmark Results for Local LLM Inference
- Experiment: 0.8B Model Self-Improvement on MacBook Air Yields Surprising Results
- SK Hynix Develops 1c LPDDR6 DRAM to Boost On-Device AI Performance in Mobile Devices
- Mnemos: Persistent Memory System for Local AI Agents
- 8 Local LLM Settings Most People Never Touch That Fixed My Worst AI Problems
- HP OMEN MAX 16 Review: Is Local AI on a Laptop Viable in 2026?
- FreeBSD 14.4 Released: Implications for Local LLM Deployment
- Qwen 3.5 Derestricted Model Available for Local Deployment
- How to Run Your Own Local LLM — 2026 Edition
- Engram – Open-Source Persistent Memory for AI Agents
- Llama.cpp Prompt Processing Optimization: Ubatch Size Configuration Guide
- Mojo: Creating a Programming Language for an AI World with Chris Lattner
- Show HN: Asterode – Multi-Model AI App with Memory and Power Features
- The Emerging Role of SRAM-Centric Chips in AI Inference
- Final Qwen3.5 Unsloth GGUF Update with Improved Size/Quality Tradeoffs
- How to Run High-Performance LLMs Locally on the Arduino UNO Q
- Nummi – AI Companion with Memory and Daily Guidance
- Unsloth Dynamic 2.0 GGUFs
- Qwen3.5-35B Successfully Runs on Raspberry Pi 5 at 3+ Tokens/Second
- LLmFit: Terminal Tool for Right-Sizing LLMs to Your Hardware
- Krasis: Hybrid CPU/GPU MoE Runtime Achieves 3,324 Tokens/Second Prefill on RTX 5080
- Running LLMs on Raspberry Pi and Edge Devices: A Practical Guide
- Researchers Develop Persistent Memory System for Local LLMs—No RAG Required
- Show HN: Pluckr – LLM-Powered HTML Scraper That Caches Selectors and Auto-Heals
- Advanced Quantization Techniques Show Surprising Performance Gains Over Standard Methods
- What Breaks When AI Agent Frameworks Are Forced Into <1MB RAM and Sub-ms Startup
- Which Web Frameworks Are Most Token-Efficient for AI Agents?
- Breaking the Speed Limit: Strategies for 17k Tokens/Sec Local Inference
- Qwen3's Voice Embeddings Enable Local Voice Cloning and Mathematical Voice Manipulation
- The Complete Stack for Local Autonomous Agents: From GGML to Orchestration
- O-TITANS: Orthogonal LoRA Framework for Gemma 3 with Google TITANS Memory Architecture
- Qwen3 Coder Next FP8 Demonstrates Exceptional Long-Context Performance on 128GB System
- Running Local LLMs and VLMs on Arduino UNO Q with yzma
- Enhanced Quantization Visualization Methods for Understanding LLM Compression Trade-offs
- Local Vision-Language Models for Document OCR and PII Detection in Privacy-Critical Workflows
- LayerScale Launches Inference Engine Faster Than vLLM, SGLang, and TRT-LLM
- InitRunner: YAML-Based AI Agent Framework with RAG and Memory
- Alibaba Unveils Major AI Model Upgrade Ahead of DeepSeek Release
- Scaling llama.cpp On Neoverse N2: Solving Cross-NUMA Performance Issues
- NVIDIA's Dynamic Memory Sparsification Cuts LLM Inference Costs by 8x
- MiniMax Releases M2.5 Model with SOTA Coding and Agent Capabilities
- GPT-OSS 120B Uncensored Model Released in Native MXFP4 Precision
- Context Management Identified as Real Bottleneck in AI-Assisted Coding
- Switching From Ollama and LM Studio to llama.cpp: Performance Benefits
- Ring-1T-2.5 Released with SOTA Deep Thinking Performance
- MiniMax M2.5: 230B Parameter MoE Model Coming to HuggingFace
- Ming-flash-omni-2.0: 100B MoE Omni-Modal Model Released
- Running Your Own AI Assistant for €19/Month: Complete Self-Hosting Guide
- Heaps Do Lie: Debugging a Memory Leak in vLLM
- Energy-Based Models Compared Against Frontier AI for Sudoku Solving
- Carmack Proposes Using Long Fiber Lines as L2 Cache for Streaming AI Data