Tagged "llama"
-
Picking Your First Local LLM Is Easier Than the Internet Makes It Sound
-
NVIDIA Nemotron 3 Nano Omni Powers Multimodal Agent Reasoning in a Single Efficient Open Model
-
Llama.cpp Runs on SGI Power Challenge from 1995 with MIPS R8000 Kernel
-
Grokfeed: Terminal Feed Reader for HN, Reddit, and Lobste.rs Using Claude Code
-
Local AI Isn't Just Ollama—Here's the Ecosystem That Actually Makes It Useful
-
Hipfire: A Rust-Native AMD Inference Engine That Outperforms llama.cpp
-
An Update on GitHub Availability: Infrastructure Lessons for Hosted LLM Tools
-
Unsloth's Custom Kernels Make LLM Fine-Tuning Viable on Consumer GPUs
-
Linux Crushes Windows on llama.cpp Inference by Double Digits
-
Run a Local LLM Server on Raspberry Pi with Remote Access Capabilities
-
I Replaced My Local LLM With a Model Half Its Size and Got Better Results
-
Using a Local LLM as a Zero-Shot Classifier
-
Building Real-World On-Device AI with LiteRT and NPU
-
Llama 4 Scout on MLX: The Complete Apple Silicon Guide (2026)
-
Intel OpenVINO 2026.1 Integrates llama.cpp with Wildcat Lake and Arc Pro B70
-
Llama.cpp's Auto Fit Feature Quietly Reshapes Local AI Inference on Consumer Hardware
-
The Open-Source AI Ecosystem Keeps Treating llama.cpp Like a Second-Class Citizen
-
Malicious GGUF Models Could Trigger Remote Code Execution on SGLang Servers
-
llama.cpp Merges Speculative Checkpointing for Major Inference Speed Boost
-
Bun v1.3.13
-
AI Quota Inflation Is No Token Effort. It's Baked In
-
LlaMa.cpp Robot Wars
-
Kilo is the VS Code Extension That Actually Works with Every Local LLM
-
Unweight: Lossless MLP Weight Compression for LLM Inference
-
Show HN: I Can't Write Python. It Works Anyway – Local LLM Automation
-
Sorting 1M u64 KV-Pairs in 20ms on i9-13980HX Using Branchless Rust Implementation
-
Kilo Is the VS Code Extension That Actually Works With Every Local LLM I Throw at It
-
The 'Ollama' Tool Has Numerous Problems, and Some Argue That Llama.cpp Is Better
-
ChatMCP – Connect your AI browser chats to your coding agents
-
Project Glasswing and the ASF: Open-Source's Chance to Win the AI Era
-
Dynamic Expert Cache in llama.cpp Achieves 27% Faster Inference on Large MoE Models
-
DotLLM – Building an LLM Inference Engine in C#
-
Sovereign AI: Why the Next GPT Will Be Born in Our Living Rooms
-
Qwen 3.5 Small – On-Device Multimodal Models Released
-
Developer Shares Golden Stack for Local Coding Assistant Integration Directly Inside Code Editors
-
Copilot Rate-Limiting Issues Highlight Cloud AI Service Limitations
-
Speculative Decoding Achieves 29% Speed Boost for Gemma-4 31B
-
Self-Hosted LLM Took Personal Knowledge Management System to the Next Level
-
Qwen3 Audio and Vision Support Now Available in llama.cpp
-
MiniMax M2.7 Open-Sources Globally as Industry's First Self-Improving Model
-
Audio Processing Support Lands in llama.cpp with Gemma-4
-
Running Same Prompts Through Claude and Local LLM Revealed Unexpected Results
-
ASUS Malaysia to Bring UGen300 USB AI Accelerator in Q2 for Portable On-Device AI Inferencing
-
Users Report Significant Performance Improvements After Migrating from Ollama to llama.cpp
-
MiniMax M2.7 Is Now Open Source
-
Intel Arc Pro B70 32GB Achieves 12 Tokens/Sec on Qwen 3.5-27B
-
Google's Gemini Nano 4 Offers Faster, Smarter Local Inference Capabilities
-
Tether Launches QVAC SDK for Cross-Platform Local AI Development
-
Ollama's Limitations for Production Local LLM Deployments
-
Gemma 4 Template Improvements Enhance Tool Use and Dialog Compliance
-
Speculative Decoding Made My Local LLM Actually Usable
-
Ollama is Still the Easiest Way to Start Local LLMs, But It's the Worst Way to Keep Running Them
-
Gemini-CLI, Llama.cpp, and Qwen3.5 Running on NVIDIA Jetson TK1
-
Intel Releases OpenVINO 2026.1 With Backend For Llama.cpp, New Hardware Support
-
Gemma 4 Support Stabilized in Llama.cpp
-
EXAONE 4.5 33B Model Released with Multiple Quantization Formats
-
LiteLLM Integrates with Ollama to Simplify Running 100+ Models Locally
-
MemPalace, the Highest-Scoring AI Memory System Ever Benchmarked
-
TurboQuant-Optimized llama.cpp Fork Delivers GFX906 GPU Acceleration
-
TurboQuant in Llama.cpp Achieves 6X Smaller KV Cache
-
GPU Memory for LLM Inference (Part 1)
-
Google AI Edge Gallery Tops App Store Charts with On-Device Gemma 4
-
Vektor – Local-First Associative Memory for AI Agents
-
Unpaved: Audit Toolkit for AI Developer Tool Bias in Global South Contexts
-
Qwen 3.6 Free Model Available via OpenRouter
-
Ollama Gets Blazing Fast on Macs with Full MLX Support and 2× Speedups
-
Microsoft Quantum Development Kit Ported to Rust: 100x Faster and Smaller
-
Apple Research Shows Self-Distillation Significantly Improves Local Code Generation
-
GPUs vs. TPUs: Decoding the Powerhouses of AI
-
Gemma 4 KV Cache Memory Issues Fixed in llama.cpp
-
OpenUMA – Apple-Style Unified Memory for x86 AI Inference
-
VRAM Optimization Technique Cuts Gemma 4 Memory Usage by 3x
-
Google Gemma 4 Released with GGUF Quantizations
-
Gemma 4 2B Successfully Runs on Raspberry Pi 5
-
SmolLM2-360M Running on Samsung Galaxy Watch 4 with 74% Memory Reduction
-
Apple Silicon Macs Run Local AI Faster with Ollama's New MLX Support
-
Intel's $949 GPU Has 32GB of VRAM for Local AI, but Software is Why Nvidia Keeps Winning
-
Show HN: Extra-Platforms, Python Library to Detect OS, Arch, Shell, CI, AI
-
ROCm Integration in Ubuntu 26.04 Advances Linux GPU Inference
-
Local AI Ecosystem Extends Far Beyond Ollama
-
Llama.cpp Merging TurboQuant Lite (attn-rot) with Major Performance Gains
-
Gemini CLI – Open-Source AI Agent for Terminal Integration
-
Claude Code Source Leaked: Community Extracts Multi-Agent Orchestration Framework
-
PrismML Announces 1-Bit Bonsai: First Commercially Viable 1-Bit LLMs
-
Samsung launches Galaxy Book6 series in India with Nvidia RTX 5070 graphics and on-device AI
-
Intel's $949 GPU has 32GB of VRAM for local AI, but the software is why Nvidia keeps winning
-
Closed Source AI = Neofeudalism
-
DeepSeek V3 Complete Guide: Deploy and Optimize Local AI in 2026
-
Unsloth Studio Beta Ships 50+ New Features for Local Model Training and Inference
-
TurboQuant KV Cache Compression Achieves 22.8% Faster Decoding at 32K Context
-
Introduction to Nyreth v1.0
-
HP Launches Copilot+ PCs in India with On-Device AI Capabilities for Local Inference
-
GLM-5.1 Model Weights Launching Early April for Local Deployment
-
TurboQuant Benchmarked in Llama.cpp: Google's Extreme Compression Research Tested in Practice
-
RotorQuant: 10-19x Faster Quantisation Alternative Using Clifford Algebra
-
Coding Implementation to Run Qwen3.5 Reasoning Models Distilled With Claude-Style Thinking Using GGUF and 4-Bit Quantization
-
Quantization Reveals Outliers Impacting LLM Accuracy
-
Homelab Consolidation: Replacing 3 Models with Single 122B MoE Model on AMD Ryzen AI MAX+
-
Pluggable's TBT5-AI: First Thunderbolt Dock Explicitly Targeting Local LLM Workstations
-
Nota AI and SiMa.ai Partner on Physical AI Technology for Local Deployment
-
Google's TurboQuant: The Unsexy AI Breakthrough Worth Watching
-
Apple Plans Slimmed-Down Gemini Models for Local iPhone AI Features
-
Show HN: Open Agent Spec – Treat AI Agents Like Typed Functions, Not Prompt Chains
-
OmniCoder v2 Released: Improved Code Generation for Local Deployment
-
Private Brain LLM Setup on Windows PC Eliminates Need for Paid Cloud Services
-
Researcher Successfully Runs Local LLMs on Legacy "Dead" GPU With Surprising Results
-
Llama.cpp Benchmark: RTX 5090 vs Enterprise Systems Compared
-
I built Rubric, an open source Sentry for AI. Looking for beta testers
-
Qwen 3.5 Models: Optimal Settings and Reduced Overthinking Configuration
-
Llama.cpp ROCm 7 vs Vulkan Performance Benchmarks on AMD Mi50
-
Ditching Paid AI Services: Building Self-Hosted LLM Solutions as ChatGPT, Claude, and Gemini Alternatives
-
Rust Project Perspectives on AI
-
Qwen 3.5 122B Uncensored (Aggressive) Released with New K_P Quantisations
-
Setting Up a Private AI Brain on Windows: Complete Guide to Local LLM Deployment
-
Nvidia Nemotron Cascade 2 30B Emerges as Powerful Alternative to Qwen Models
-
Llama 8B Matches 70B Performance on Multi-Hop QA Using Structured Prompting
-
ik_llama.cpp Fork Delivers 26x Faster Prompt Processing on Qwen 3.5 27B
-
Careless Whisper – Personal Local Speech to Text
-
Automating Read-It-Later Workflows with Local LLMs for Overnight Summarization
-
Qwen 3.5 397B emerges as top-performing local coding model
-
Qualcomm and Samsung's 30-Year AI Alliance Enters a New Phase as On-Device AI Chip Race Heats Up
-
Apple M5 Max 128GB real-world performance benchmarks for local inference
-
Cursor's Composer 2 model attribution dispute highlights open-source licensing concerns
-
What AI Augmentation Means for Technical Leaders
-
Ultra-Compact 28M Parameter Models Show Promise for Specialized Domain Tasks
-
Qwen 3.5 Emerges as Top Performer for Local Deployment with Extensive Quantization Options
-
Community Converges on Optimal KV Cache Quantization Strategies for Qwen 3.5 Models
-
NVIDIA Nemotron Cascade 2 30B Delivers 120B-Class Performance in Compact Form Factor
-
LMCache Dramatically Accelerates LLM Inference on Oracle Data Science Platform
-
Kilo Is the VS Code Extension That Actually Works With Every Local LLM I Throw At It
-
Unsloth Studio: Open-Source Web UI for Training and Running LLMs Locally
-
On-Device AI: Tether's QVAC Fabric Enables Local Training
-
MiniMax-M2.7: New Compact Model Announced for Local Deployment
-
I Switched to a Local LLM for These 5 Tasks and the Cloud Version Hasn't Been Worth It Since
-
LucidShark – Local-first, open-source quality and security gate
-
You're Using Your Local LLM Wrong If You're Prompting It Like a Cloud LLM
-
Hugging Face Releases One-Liner for Automatic Hardware Detection and Model Selection
-
Run LLMs Locally with Llama.cpp
-
I Ran Local LLMs on a 'Dead' GPU, and the Results Surprised Me
-
Mistral Releases Small 4 Open-Source Model Under Apache 2.0
-
Local Qwen Models Master Browser Automation Through Iterative Replanning
-
How I Used Lima for an AI Coding Agent Sandbox
-
Researcher Discovers Universal "Danger Zone" in Transformer Model Architecture at 50% Depth
-
Kimi Introduces Attention Residuals: 1.25x Compute Performance at <2% Overhead
-
Practical Fix for Qwen 3.5 Overthinking in llama.cpp
-
Qwen 3.5 122B Demonstrates Exceptional Reasoning for Local Deployment
-
Open-Source LLMs Rapidly Displacing Proprietary SOTA Models
-
OmniCoder-9B: Efficient Coding Model for 8GB GPUs
-
NVIDIA Updates Nemotron 3 122B License, Removes Deployment Restrictions
-
This External GPU Enclosure Tries to Break Cloud Dependence for Local AI Inference
-
Apple's On-Device AI Raises Privacy Alarms Across British Parliament
-
AMD Declares 'AI on the PC Has Crossed an Important Line' – Agent Computers as Next Breakthrough
-
Qwen3.5-397B Achieves 282 tok/s on 4x RTX PRO 6000 Blackwell Through Custom CUTLASS Kernel
-
OpenClaw vs Eigent vs Claude Cowork: Comparing Open-Source AI Collaboration Platforms
-
Running Qwen3.5-27B Across Multiple GPUs Over LAN Achieves Practical Speed for Local Inference
-
Open-Source GreenBoost Driver Augments NVIDIA GPU VRAM With System RAM and NVMe Storage
-
AMD Launches Agent System Optimized for Local AI Inference With Ryzen and Radeon
-
Intel OpenVINO Backend Support Now Available in llama.cpp
-
Memory Should Decay: Implementing Temporal Memory Decay in Local LLM Systems
-
How to Run Local LLMs in 2026: The Complete Developer's Guide
-
Fine-Tuned 14B Model Outperforms Claude Opus 4.6 on Ada Code Generation
-
AgentArmor: Open-Source 8-Layer Security Framework for AI Agents
-
3-Path Agent Memory: 8 KB Recurrent State vs. 156 MB KV Cache at 10K Tokens
-
Runpod Report: Qwen Has Overtaken Meta's Llama As The Most-Deployed Self-Hosted LLM
-
Intel Updates LLM-Scaler-vLLM With Support For More Qwen3/3.5 Models
-
Quantization Explained: Q4_K_M vs AWQ vs FP16 for Local LLMs
-
Nvidia Releases Nemotron 3 Super: 120B MoE Model for Local Deployment
-
Comprehensive MoE Backend Benchmarks for Qwen3.5-397B: Real Numbers vs Hype
-
Local AI Coding Assistant: Complete VS Code + Ollama + Continue Setup
-
Llama.cpp Adds True Reasoning Budget Support
-
Cutile.jl Brings Nvidia CUDA Tile-Based Programming to Julia
-
Experiment: 0.8B Model Self-Improvement on MacBook Air Yields Surprising Results
-
SK Hynix Completes Qualification for LPDDR6 Memory Optimized for AI Inference
-
Sarvam Open-Sources 30B and 105B Reasoning Models
-
Simple Layer Duplication Technique Achieves Top Open LLM Leaderboard Performance
-
NVIDIA Jetson Brings Open Models to Life at the Edge
-
LMF – LLM Markup Format
-
Llama.cpp Celebrates Major Milestone: From Leak to Industry Standard
-
Qwen 3.5 Ultra-Compact Models Enable On-Device AI from Watches to Gaming
-
Mnemos: Persistent Memory System for Local AI Agents
-
8 Local LLM Settings Most People Never Touch That Fixed My Worst AI Problems
-
HP OMEN MAX 16 Review: Is Local AI on a Laptop Viable in 2026?
-
FreeBSD 14.4 Released: Implications for Local LLM Deployment
-
Fine-Tuned Qwen SLMs (0.6–8B) Demonstrate Competitive Performance Against Frontier LLMs on Specialized Tasks
-
M5 Max and M5 Ultra Chipsets Demonstrate Significant Bandwidth Improvements for Local LLM Inference
-
Community Survey: AI Content Automation Stacks in 2026
-
Strix Halo (Ryzen AI Max+ 395) Achieves Strong Local Inference Performance with ROCm 7.2
-
Qwen 3.5 Derestricted Model Available for Local Deployment
-
Reverse engineering a DOS game with no source code using Codex 5.4
-
Qwen 3.5 27B Achieves Strong Local Inference Performance
-
OpenSpec: Spec-driven development (SDD) for AI coding assistants
-
Benchmark: Local Open-Source LLMs Competitive in Real-Time Trading Applications
-
Llama.cpp Prompt Processing Optimization: Ubatch Size Configuration Guide
-
HP Refreshes Lineup with AI-Focused Workstations
-
ETH Zurich Research Challenges Context-Length Assumptions in LLM Agents
-
Qwen3-Coder-Next Achieves Top Ranking on SWE-bench at Pass@5
-
Open WebUI Adds Native Terminal Tool Calling with Qwen3.5 35B Support
-
Llama.cpp Merges Automatic Parser Generator to Mainline
-
Turning Your Linux Terminal into a Local AI Assistant
-
llama-swap Emerges as Superior Alternative to Ollama and LM-Studio
-
llama.cpp Merges Agentic Loop and MCP Client Support
-
Apple Unveils MacBook Pro with M5 Pro and M5 Max Featuring On-Device AI
-
Qwen 3.5-35B-A3B Achieves 37.8% on SWE-bench Verified Hard
-
OpenWrt 25.12.0 – Stable Release
-
Quantifying Cost Savings with Local LLMs for Development
-
Apple Unveils MacBook Pro With M5 Pro and M5 Max for On-Device AI
-
Apple M5 Pro and M5 Max: 4× Faster LLM Processing
-
AMD Launches Copilot+ Desktop Chips to Compete in On-Device AI Market
-
ÆTHERYA Core – Deterministic Policy Engine for Governing LLM Actions
-
Qwen 3.5 Small Models Released: 0.8B to 9B Parameters Optimized for On-Device Inference
-
Qwen 3.5 0.8B Successfully Deployed on 7-Year-Old Samsung S10E Using llama.cpp
-
Framework Choice Critical: llama.cpp and vLLM Outperform Ollama for Qwen 3.5 Testing
-
Critical: Qwen 3.5 Requires BF16 KV Cache, Not FP16 for Accurate Inference
-
GitDelivr: A Free CDN for Git Clones Built on Cloudflare Workers and R2
-
C7: Pipe Up-to-Date Library Docs Into Any LLM From the Terminal
-
Huawei's SuperPoD Portfolio Creates New Option for Global Computing at MWC Barcelona 2026
-
4 Free Tools to Run Powerful AI on Your PC Without a Subscription
-
Unsloth Dynamic 2.0 GGUFs
-
5 Useful Docker Containers for Agentic Developers
-
Seco Launches Edge AI System-on-Module at Embedded World 2026
-
Arduino and Qualcomm Bring On-Device AI Learning to Indian Schools
-
Qwen 3.5 MoE Delivers 100K Context Window at 40+ TPS on RTX 5060 Ti
-
Qwen 3.5 Underperforms on Hard Coding Tasks—APEX Benchmark Analysis
-
Qwen3.5 122B Achieves 25 tok/s on 72GB VRAM Setup
-
Researchers Develop Persistent Memory System for Local LLMs—No RAG Required
-
DeepSeek Releases DualPath: Addressing Storage Bandwidth Bottlenecks in Agentic Inference
-
DeepSeek Paper – DualPath: Breaking the Bandwidth Bottleneck in LLM Inference
-
Qwen3.5 Thinking Mode Can Be Disabled for Production Inference Optimization
-
Qwen3.5-27B Identified as Sweet Spot for Mid-Range Local Deployment
-
Mirai Announces $10M to Advance On-Device AI Performance for Consumer Devices
-
How AI is Redefining Price and Performance in Modern Laptops
-
Show HN: A Ground Up TLS 1.3 Client Written in C
-
Enterprise Infrastructure Guide: Running Local LLMs for 70-150 Developers
-
Apple Accelerates U.S. Manufacturing with Mac Mini Production
-
Anthropic Has Never Open-Sourced an LLM: Implications for Local Deployment Strategy
-
Anthropic Reveals Industrial-Scale Distillation Attacks by Chinese AI Labs
-
Comparing Manual vs. AI Requirements Gathering: 2 Sentences vs. 127-Point Spec
-
Show HN: Agora – AI API Pricing Oracle with X402 Micropayments
-
nanollama: Open-Source Framework for Training Llama 3 from Scratch with One-Command GGUF Export
-
Open-Source llama.cpp Finds Long-Term Home at Hugging Face
-
Breaking the Speed Limit: Strategies for 17k Tokens/Sec Local Inference
-
Ouro 2.6B Thinking Model GGUFs Released with Q8_0 and Q4_K_M Quantization
-
Strix Halo Performance Benchmarks: Minimax M2.5, Step 3.5 Flash, Qwen3 Coder
-
I Thought I Needed a GPU to Run AI Until I Learned About These Models
-
Open-Source + AI: ggml Joins Hugging Face, llama.cpp Stays Open—Local AI's Long-Term Home
-
GGML.AI Acquired by Hugging Face
-
SanityBoard Adds 27 New Model Evaluations Including Qwen 3.5 Plus, GLM 5, and Gemini 3.1 Pro
-
PaddleOCR-VL Now Integrated into llama.cpp for Multilingual OCR
-
Kitten TTS V0.8 Released: New State-of-the-Art Super-Tiny TTS Model Under 25 MB
-
Free ASIC-Accelerated Llama 3.1 8B Inference at 16,000 Tokens/Second
-
Enhanced Quantization Visualization Methods for Understanding LLM Compression Trade-offs
-
Kitten TTS V0.8 Released: State-of-the-Art Super-Tiny Text-to-Speech Model Under 25MB
-
Self-Hosted AI: A Complete Roadmap for Beginners
-
Meet Sarvam Edge: India's AI Model That Runs on Phones and Laptops With No Internet
-
Qwen 3.5-397B-A17B Now Available for Local Inference with Aggressive Quantisation
-
Open-Source Models Now Comprise 4 of Top 5 Most-Used Endpoints on OpenRouter
-
Ask HN: What is the best bang for buck budget AI coding?
-
Switching From Ollama And LM Studio To llama.cpp: A Performance Comparison
-
SnowBall Technique Addresses Context Window Limitations in Local LLMs
-
Scaling llama.cpp On Neoverse N2: Solving Cross-NUMA Performance Issues
-
NVIDIA's Dynamic Memory Sparsification Cuts LLM Inference Costs by 8x
-
MiniMax Releases M2.5 Model with SOTA Coding and Agent Capabilities
-
MiniMax-M2.5 230B MoE Model Released with GGUF Support for Local Deployment
-
LLaDA2.1 Introduces Token Editing for Massive Speed Gains in Local Inference
-
GPT-OSS 120B Uncensored Model Released in Native MXFP4 Precision
-
GNOME's AI Assistant Newelle Adds llama.cpp Support and Command Execution
-
Context Management Identified as Real Bottleneck in AI-Assisted Coding
-
Switching From Ollama and LM Studio to llama.cpp: Performance Benefits
-
Optimal llama.cpp Settings Found for Qwen3 Coder Next Loop Issues
-
GitHub Announces Support for Open Source AI Project Maintainers
-
Student Releases Dhi-5B: Multimodal Model Trained for Just $1,200
-
New Header-Only C++ Benchmark Tool for Predictive Models on Raw Binary Streams
-
Developer Switches from Ollama and LM Studio to llama.cpp for Better Performance