Tagged "memory-optimization"
- KV Cache Quantization Levels Benchmarked on SWE-bench: Practical Trade-offs for Local Inference
- FOMOE: Running 397B Parameter Qwen3.5 MoE at 5-9 tok/s on $2,100 Desktop Hardware
- Ditching Paid AI Services: Building Self-Hosted LLM Solutions as ChatGPT, Claude, and Gemini Alternatives
- Qwen 3.5 122B Uncensored (Aggressive) Released with New K_P Quantisations
- Llama 8B Matches 70B Performance on Multi-Hop QA Using Structured Prompting
- A Small Gap That Will Secure an Autonomous Future for AI Agents
- Running an AI Agent on a 448KB RAM Microcontroller
- MacinAI Local Brings Functional LLM Inference to Classic Macintosh Hardware
- DeepSeek R1 RTX 4090 vs Apple M3 Max: Benchmark & Performance Guide
- Community Converges on Optimal KV Cache Quantization Strategies for Qwen 3.5 Models
- NVIDIA Nemotron Cascade 2 30B Delivers 120B-Class Performance in Compact Form Factor
- LMCache Dramatically Accelerates LLM Inference on Oracle Data Science Platform
- Mamba 3: State Space Model Architecture Optimized for Inference
- Custom GPU Multiplexer Achieves 0.3ms Model Switching on Legacy Hardware
- Mistral Small 4 119B Released with NVFP4 Quantisation Support
- Researcher Discovers Universal "Danger Zone" in Transformer Model Architecture at 50% Depth
- The Moment AI Agents Stopped Being a Feature and Started Becoming a System
- OpenClaw Isn't the Only Raspberry Pi AI Tool—Here Are 4 Others You Can Try This Week
- OmniCoder-9B: Efficient Coding Model for 8GB GPUs
- Open-Source GreenBoost Driver Augments NVIDIA GPU VRAM With System RAM and NVMe Storage
- Memory Should Decay: Implementing Temporal Memory Decay in Local LLM Systems
- Best Local LLMs 2026: Developer Comparison
- 3-Path Agent Memory: 8 KB Recurrent State vs. 156 MB KV Cache at 10K Tokens
- Qwodel – An Open-Source Unified Pipeline for LLM Quantization
- Quantization Explained: Q4_K_M vs AWQ vs FP16 for Local LLMs
- Apple M5 Max 128GB Benchmark Results for Local LLM Inference
- Experiment: 0.8B Model Self-Improvement on MacBook Air Yields Surprising Results
- SK Hynix Develops 1c LPDDR6 DRAM to Boost On-Device AI Performance in Mobile Devices
- Mnemos: Persistent Memory System for Local AI Agents
- 8 Local LLM Settings Most People Never Touch That Fixed My Worst AI Problems
- HP OMEN MAX 16 Review: Is Local AI on a Laptop Viable in 2026?
- FreeBSD 14.4 Released: Implications for Local LLM Deployment
- Qwen 3.5 Derestricted Model Available for Local Deployment
- How to Run Your Own Local LLM — 2026 Edition
- Engram – Open-Source Persistent Memory for AI Agents
- Llama.cpp Prompt Processing Optimization: Ubatch Size Configuration Guide
- Mojo: Creating a Programming Language for an AI World with Chris Lattner
- Show HN: Asterode – Multi-Model AI App with Memory and Power Features
- The Emerging Role of SRAM-Centric Chips in AI Inference
- Final Qwen3.5 Unsloth GGUF Update with Improved Size/Quality Tradeoffs
- How to Run High-Performance LLMs Locally on the Arduino UNO Q
- Nummi – AI Companion with Memory and Daily Guidance
- Unsloth Dynamic 2.0 GGUFs
- Qwen3.5-35B Successfully Runs on Raspberry Pi 5 at 3+ Tokens/Second
- LLmFit: Terminal Tool for Right-Sizing LLMs to Your Hardware
- Krasis: Hybrid CPU/GPU MoE Runtime Achieves 3,324 Tokens/Second Prefill on RTX 5080
- Running LLMs on Raspberry Pi and Edge Devices: A Practical Guide
- Researchers Develop Persistent Memory System for Local LLMs—No RAG Required
- Show HN: Pluckr – LLM-Powered HTML Scraper That Caches Selectors and Auto-Heals
- Advanced Quantization Techniques Show Surprising Performance Gains Over Standard Methods
- What Breaks When AI Agent Frameworks Are Forced Into <1MB RAM and Sub-ms Startup
- Which Web Frameworks Are Most Token-Efficient for AI Agents?
- Breaking the Speed Limit: Strategies for 17k Tokens/Sec Local Inference
- Qwen3's Voice Embeddings Enable Local Voice Cloning and Mathematical Voice Manipulation
- The Complete Stack for Local Autonomous Agents: From GGML to Orchestration
- O-TITANS: Orthogonal LoRA Framework for Gemma 3 with Google TITANS Memory Architecture
- Qwen3 Coder Next FP8 Demonstrates Exceptional Long-Context Performance on 128GB System
- Running Local LLMs and VLMs on Arduino UNO Q with yzma
- Enhanced Quantization Visualization Methods for Understanding LLM Compression Trade-offs
- Local Vision-Language Models for Document OCR and PII Detection in Privacy-Critical Workflows
- LayerScale Launches Inference Engine Faster Than vLLM, SGLang, and TRT-LLM
- InitRunner: YAML-Based AI Agent Framework with RAG and Memory
- Alibaba Unveils Major AI Model Upgrade Ahead of DeepSeek Release
- Scaling llama.cpp On Neoverse N2: Solving Cross-NUMA Performance Issues
- NVIDIA's Dynamic Memory Sparsification Cuts LLM Inference Costs by 8x
- MiniMax Releases M2.5 Model with SOTA Coding and Agent Capabilities
- GPT-OSS 120B Uncensored Model Released in Native MXFP4 Precision
- Context Management Identified as Real Bottleneck in AI-Assisted Coding
- Switching From Ollama and LM Studio to llama.cpp: Performance Benefits
- Ring-1T-2.5 Released with SOTA Deep Thinking Performance
- MiniMax M2.5: 230B Parameter MoE Model Coming to HuggingFace
- Ming-flash-omni-2.0: 100B MoE Omni-Modal Model Released
- Running Your Own AI Assistant for €19/Month: Complete Self-Hosting Guide
- Heaps Do Lie: Debugging a Memory Leak in vLLM
- Energy-Based Models Compared Against Frontier AI for Sudoku Solving
- Carmack Proposes Using Long Fiber Lines as L2 Cache for Streaming AI Data