OfflineAI

A production-grade offline-first iOS app running small language models completely on-device, with privacy-first design and intelligent resource management.

What it is

OfflineAI is a privacy-first iOS app that runs small language models entirely offline: no cloud dependencies, no API costs, and no data ever leaving the device, while still delivering a fast, usable experience on real hardware.

Features

  • 100% offline: all inference runs locally on device.
  • Zero API costs: no cloud dependencies.
  • Complete privacy: AES-256 encryption and no telemetry.
  • Intelligent resource management: dynamic quantisation (4-bit/8-bit), on-demand model downloads, LRU model caching with automatic unloading, memory pressure monitoring, and battery-aware processing.
  • Context management: semantic chunking with embedding-based relevance.
  • Offline-first sync (optional): end-to-end encrypted cloud sync, enabled only when the user opts in.
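The embedding-based context management above can be sketched roughly as follows. This is a minimal illustration, not the app's actual implementation: `Chunk`, `cosineSimilarity`, and `topRelevantChunks` are hypothetical names, and the on-device embedding model that would produce the vectors is assumed.

```swift
import Foundation

// Hypothetical sketch: rank stored conversation chunks by cosine
// similarity between their embeddings and the query embedding.

struct Chunk {
    let text: String
    let embedding: [Float]
}

func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    let dot = zip(a, b).map { $0 * $1 }.reduce(0, +)
    let normA = a.map { $0 * $0 }.reduce(0, +).squareRoot()
    let normB = b.map { $0 * $0 }.reduce(0, +).squareRoot()
    guard normA > 0, normB > 0 else { return 0 }
    return dot / (normA * normB)
}

/// Return the `limit` most relevant chunks for a query embedding.
func topRelevantChunks(query: [Float], chunks: [Chunk], limit: Int) -> [Chunk] {
    Array(
        chunks
            .sorted { cosineSimilarity(query, $0.embedding) > cosineSimilarity(query, $1.embedding) }
            .prefix(limit)
    )
}
```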

Architecture

The app is split into a few core components that keep inference reliable under real-world constraints (memory, battery, and device performance variability).

Model management

Models are lazily loaded into an LRU cache, unloaded automatically under memory pressure, preloaded while the device is idle and charging, and available at multiple quantisation levels.
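The eviction behaviour can be sketched with a small LRU cache. All names here (`LRUModelCache`, `unloadAll`) are illustrative, not the app's real API; in practice `unloadAll()` would be wired to a memory-pressure notification.

```swift
import Foundation

// Minimal LRU cache sketch: least-recently-used models are unloaded
// first once capacity is exceeded.
final class LRUModelCache<Model> {
    private var order: [String] = []          // most recently used last
    private var models: [String: Model] = [:]
    private let capacity: Int

    init(capacity: Int) { self.capacity = capacity }

    func model(for key: String) -> Model? {
        guard let model = models[key] else { return nil }
        touch(key)
        return model
    }

    func insert(_ model: Model, for key: String) {
        models[key] = model
        touch(key)
        while order.count > capacity {
            let evicted = order.removeFirst()
            models[evicted] = nil             // releases the model's memory
        }
    }

    /// Drop everything, e.g. in response to a memory-pressure warning.
    func unloadAll() {
        order.removeAll()
        models.removeAll()
    }

    private func touch(_ key: String) {
        order.removeAll { $0 == key }
        order.append(key)
    }
}
```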

Device profiling

The profiler detects and monitors available memory, tracks battery state, selects an appropriate model automatically, and benchmarks device performance.
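A battery- and memory-aware selection heuristic might look like the sketch below. The thresholds and names are illustrative assumptions, not the app's actual logic; on a real device the inputs would come from `ProcessInfo.processInfo.physicalMemory` and `UIDevice.current.batteryState`.

```swift
import Foundation

// Hedged sketch: pick a model and quantisation from RAM and charging
// state. Memory cut-offs are illustrative, not measured values.
enum Quantisation { case q4_0, q8_0 }

struct ModelChoice {
    let name: String
    let quantisation: Quantisation
}

func chooseModel(physicalMemoryGB: Double, isCharging: Bool) -> ModelChoice {
    switch physicalMemoryGB {
    case ..<4:
        // Low-memory devices fall back to the smaller model.
        return ModelChoice(name: "TinyLlama", quantisation: .q4_0)
    case 4..<6:
        return ModelChoice(name: "Phi-3-Mini", quantisation: .q4_0)
    default:
        // Plenty of RAM: allow the larger 8-bit working set, but only
        // while charging, to stay battery-aware.
        return ModelChoice(name: "Phi-3-Mini",
                           quantisation: isCharging ? .q8_0 : .q4_0)
    }
}
```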

Inference engine

The engine provides streaming token generation, batched prompt processing for efficiency, and embedding generation.
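Streaming can be surfaced to the UI as an `AsyncStream`, so SwiftUI can render partial responses as tokens arrive. This is a sketch under stated assumptions: `TokenStreaming` and `ArrayTokenGenerator` are hypothetical stand-ins for the real engine's interface.

```swift
import Foundation

// A generator hands back one token at a time, nil when finished.
protocol TokenStreaming: AnyObject {
    func nextToken() -> String?
}

// Toy generator used here in place of the real inference engine.
final class ArrayTokenGenerator: TokenStreaming {
    private var tokens: [String]
    init(_ tokens: [String]) { self.tokens = tokens }
    func nextToken() -> String? {
        tokens.isEmpty ? nil : tokens.removeFirst()
    }
}

/// Wrap pull-based generation in an AsyncStream for consumption
/// with `for await token in …` on the UI side.
func streamTokens(from generator: TokenStreaming) -> AsyncStream<String> {
    AsyncStream { continuation in
        Task {
            while let token = generator.nextToken() {
                continuation.yield(token)
            }
            continuation.finish()
        }
    }
}
```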

Data layer

Local persistence uses SwiftData, with AES-256-GCM encryption and a separate encryption key per conversation.
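With CryptoKit, the per-conversation encryption described above reduces to a few lines. The function names below are illustrative, and key storage (e.g. in the Keychain) is deliberately out of scope.

```swift
import CryptoKit
import Foundation

// Sketch of AES-256-GCM message encryption with CryptoKit.
// Each conversation would hold its own SymmetricKey.

func encryptMessage(_ plaintext: String, with key: SymmetricKey) throws -> Data {
    let sealed = try AES.GCM.seal(Data(plaintext.utf8), using: key)
    // `combined` packs nonce + ciphertext + auth tag into one blob,
    // convenient for storing as a single field.
    return sealed.combined!
}

func decryptMessage(_ blob: Data, with key: SymmetricKey) throws -> String {
    let box = try AES.GCM.SealedBox(combined: blob)
    let plaintext = try AES.GCM.open(box, using: key)
    return String(decoding: plaintext, as: UTF8.self)
}

// One 256-bit key per conversation:
let conversationKey = SymmetricKey(size: .bits256)
```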

Supported models

Phi-3-Mini (recommended)

  • Parameters: 3.8B
  • Context: 4096 tokens
  • Quantisations: Q4_0 (~2.1 GB), Q8_0 (~3.9 GB)
  • Best for: modern devices (iPhone 12+)

TinyLlama (fallback)

  • Parameters: 1.1B
  • Context: 2048 tokens
  • Quantisation: Q4_0 (~0.6 GB)
  • Best for: older devices or low memory
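The model tables above can be read as a small registry that the app checks against the device's memory budget before downloading or loading a quantisation. This is a sketch; `ModelSpec`, `fits`, and the 1 GB working headroom are assumptions for illustration.

```swift
import Foundation

// The supported-model tables as data (sizes approximate, from above).
struct ModelSpec {
    let name: String
    let parameters: String
    let contextTokens: Int
    let fileSizeGB: [String: Double]   // quantisation -> approx. size
}

let supportedModels = [
    ModelSpec(name: "Phi-3-Mini", parameters: "3.8B", contextTokens: 4096,
              fileSizeGB: ["Q4_0": 2.1, "Q8_0": 3.9]),
    ModelSpec(name: "TinyLlama", parameters: "1.1B", contextTokens: 2048,
              fileSizeGB: ["Q4_0": 0.6]),
]

/// A quantisation fits if its file plus working headroom stays within
/// the memory budget. The 1 GB headroom is an illustrative guess.
func fits(_ model: ModelSpec, quantisation: String,
          memoryBudgetGB: Double, headroomGB: Double = 1.0) -> Bool {
    guard let size = model.fileSizeGB[quantisation] else { return false }
    return size + headroomGB <= memoryBudgetGB
}
```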

What I would demo

  1. On-device model selection (Phi-3 vs TinyLlama)
  2. Quantisation switching (4-bit/8-bit) based on device profile
  3. Memory pressure handling and LRU cache eviction
  4. Encrypted local storage for conversations

Next