Tagged "llm-deployment"
- Self-Hostable AI Agents and Internal Software Framework Released
- Qt 6.11 Released with Enhanced Cross-Platform Deployment Capabilities
- Claude Usage Monitor: Track API Usage with macOS Menu Bar App
- How to Build a Self-Hosted AI Server with LM Studio: Step-by-Step Guide
- Building a Production AI Receptionist: Practical Local LLM Deployment Case Study
- Qwen 3.5 122B Uncensored (Aggressive) Released with New K_P Quantisations
- Developer Builds Fully Local Multi-Agent System Using vLLM and Parallel Inference
- Multi-Token Prediction support coming to MLX-LM for Qwen 3.5
- Apple M5 Max 128GB real-world performance benchmarks for local inference
- What AI Augmentation Means for Technical Leaders
- Llamafile 0.10 Released with GPU Support and Rebuilt Core
- My Dinner with AI
- Mamba 3: State Space Model Architecture Optimized for Inference
- I Switched to a Local LLM for These 5 Tasks and the Cloud Version Hasn't Been Worth It Since
- Auto-retry Claude Code on subscription rate limits (zero deps, tmux-based)
- How I Used Lima for an AI Coding Agent Sandbox
- How AI Agents Should Pay for API Calls: X402 and USDC Verification on Base
- LoKI – Local AI Assistant for Linux and WSL
- Qwen3.5-397B Achieves 282 tok/s on 4x RTX PRO 6000 Blackwell Through Custom CUTLASS Kernel
- Open-Source GreenBoost Driver Augments NVIDIA GPU VRAM With System RAM and NVMe Storage
- AMD Launches Agent System Optimized for Local AI Inference With Ryzen and Radeon
- P-EAGLE: Faster LLM Inference with Parallel Speculative Decoding in vLLM
- Local LLMs on Apple Silicon Mac 2026: M1 M2 M3 Guide
- Show HN: Intake API – An Inbox for AI Coding Agents
- How to Run Local LLMs in 2026: The Complete Developer's Guide
- Show HN: Bots of WallStreet – Multi-Agent Debate and Prediction Framework
- Best Local LLM Models 2026: Developer Comparison
- AgentArmor: Open-Source 8-Layer Security Framework for AI Agents
- Show HN: VmExit – An Experiment in AI-Native Computing
- MeepaChat – Slack for AI Agents (iOS, macOS, Web / Cloud, Self-Hosted)
- The $1,500 Local AI Setup: DeepSeek-R1 on Consumer Hardware
- A Kubernetes Operator That Orchestrates AI Coding Agents
- Qwen 3.5 Ultra-Compact Models Enable On-Device AI from Watches to Gaming
- M5 Max and M5 Ultra Chipsets Demonstrate Significant Bandwidth Improvements for Local LLM Inference
- Qwen 3.5 Derestricted Model Available for Local Deployment
- Nota AI to Showcase End-to-End On-Device AI Optimization at Embedded World 2026
- How to Run Your Own Local LLM — 2026 Edition
- Qwen 3.5 27B Achieves Strong Local Inference Performance
- Show HN: Proxly – Self-hosted tunneling on your own domain in 60 seconds
- Show HN: SimplAI – Build and Deploy AI Agents and Workflows Without Boilerplate
- Llama.cpp Merges Automatic Parser Generator to Mainline
- IBM Granite 4.0 1B Speech Model Released for Multilingual Speech Recognition
- Qwen 3.5-4B Generates Fully Functional OS in Single Prompt
- Qwen 3.5-35B-A3B Achieves 37.8% on SWE-bench Verified Hard
- Qwen 3.5-27B Q4 Quantization Comparison and Analysis
- Apple M5 Pro and M5 Max: 4× Faster LLM Processing
- VibeWhisper – macOS Voice-to-Text with 100% Local Processing Option
- Qwen 3.5 0.8B Successfully Deployed on 7-Year-Old Samsung S10E Using llama.cpp
- GitDelivr: A Free CDN for Git Clones Built on Cloudflare Workers and R2
- Browser Use vs. Claude Computer Use: Comparing Agent Automation Frameworks
- Huawei's SuperPoD Portfolio Creates New Option for Global Computing at MWC Barcelona 2026
- Serve Markdown to LLMs from your Next.js app
- Krasis Hybrid MoE Runtime Achieves 3,324 tok/s Prefill on Single RTX 5080
- 5 Useful Docker Containers for Agentic Developers
- Extracting 100K Concepts from an 8B LLM
- Show HN: AgentGate – Stake-Gated Action Microservice for AI Agents
- Qwen3.5 122B Achieves 25 tok/s on 72GB VRAM Setup
- LM Studio vs Ollama: Complete Comparison
- The Complete Developer's Guide to Running LLMs Locally: From Ollama to Production
- Red Hat Launches AI Enterprise for Hybrid AI Deployments
- Qwen3.5 Thinking Mode Can Be Disabled for Production Inference Optimization
- Qwen3.5 Series Releases Comprehensive Model Lineup Across All Tiers
- The Real AI Competition Is Closed-Source vs Open-Source, Not America vs China
- Enterprise Infrastructure Guide: Running Local LLMs for 70-150 Developers
- How Do You Know Which SKILL.md Is Good?
- Massu: Governance Layer for AI Coding Assistants with 51 MCP Tools
- Gix: Go CLI for AI-Generated Commit Messages
- FORTHought: Self-Hosted AI Stack for Physics Labs Built on OpenWebUI
- The Complete Stack for Local Autonomous Agents: From GGML to Orchestration
- Google Is Exploring Ways to Use Its Financial Might to Take on Nvidia
- The Path to Ubiquitous AI (17k tokens/sec)
- Ollama Production Deployment: Docker-Compose Setup Guide
- Mirai Secures $10M to Optimize On-Device AI Amid Cloud Cost Surge
- Kitten TTS V0.8 Released: New State-of-the-Art Super-Tiny TTS Model Under 25 MB
- Alibaba's Qwen3.5-397B Achieves #3 Position in Open Weights Model Rankings
- Cohere Releases Tiny Aya: Efficient 3.3B Multilingual Model for 70+ Languages
- Switching From Ollama And LM Studio To llama.cpp: A Performance Comparison
- Running Your Own AI Assistant for €19/Month: Complete Self-Hosting Guide
- ByteDance Releases Seedance 2.0 AI Development Platform
- OpenClaw with vLLM Running for Free on AMD Developer Cloud
- Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts