SmolLM2-360M Running on Samsung Galaxy Watch 4 with 74% Memory Reduction


A developer successfully deployed SmolLM2-360M on a Samsung Galaxy Watch 4 with only 380MB of available RAM, cutting peak memory usage by 74% through careful optimization of llama.cpp's memory model. The key breakthrough was eliminating a duplicate in-memory copy of the model weights: the same data was resident once in the mmap page cache and again in llama.cpp's own tensor allocations.

This result pushes edge inference into territory previously considered impractical: wearable devices with meaningful natural-language capabilities. The optimization required a deep understanding of both the operating system's memory management and llama.cpp's tensor allocation patterns, and the same techniques could benefit other constrained environments beyond watches.

For practitioners deploying to IoT and wearable devices, this demonstrates that, with proper optimization, even 360M-parameter models can run viably on extremely memory-constrained platforms. The methodology could apply broadly to edge deployment scenarios where memory pressure is the primary bottleneck.


Source: r/LocalLLaMA · Relevance: 8/10