OfflineAI

A production-grade offline-first iOS app running small language models completely on-device, with privacy-first design and intelligent resource management.

What it is

OfflineAI is a privacy-first iOS app that runs small language models entirely offline: no cloud dependencies, no API costs, and no data ever leaving the device, while still delivering a fast, usable experience on real hardware.

Features

  • 100% offline: all inference runs locally on device.
  • Zero API costs: no cloud dependencies.
  • Complete privacy: AES-256 encryption and no telemetry.
  • Intelligent resource management: dynamic quantisation (4-bit/8-bit), on-demand model downloads, LRU model caching with automatic unloading, memory pressure monitoring, and battery-aware processing.
  • Context management: semantic chunking with embedding-based relevance.
  • Offline-first sync (optional): end-to-end encrypted cloud sync, enabled only when the user opts in.
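The embedding-based context management above can be sketched roughly as follows. This is a minimal illustration, not the app's actual implementation: `Chunk`, `cosineSimilarity`, and `topRelevantChunks` are hypothetical names, and the on-device embedding model that would produce the vectors is assumed.

```swift
import Foundation

// Hypothetical sketch: rank stored conversation chunks by cosine
// similarity between their embeddings and the query embedding.

struct Chunk {
    let text: String
    let embedding: [Float]
}

func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    let dot = zip(a, b).map { $0 * $1 }.reduce(0, +)
    let normA = a.map { $0 * $0 }.reduce(0, +).squareRoot()
    let normB = b.map { $0 * $0 }.reduce(0, +).squareRoot()
    guard normA > 0, normB > 0 else { return 0 }
    return dot / (normA * normB)
}

/// Return the `limit` most relevant chunks for a query embedding.
func topRelevantChunks(query: [Float], chunks: [Chunk], limit: Int) -> [Chunk] {
    Array(
        chunks
            .sorted { cosineSimilarity(query, $0.embedding) > cosineSimilarity(query, $1.embedding) }
            .prefix(limit)
    )
}
```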

Architecture

The app is split into a few core components that keep inference reliable under real-world constraints (memory, battery, and device performance variability).

Model management

Models are lazily loaded into an LRU cache, unloaded automatically under memory pressure, preloaded while the device is idle and charging, and available at multiple quantisation levels.
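The eviction behaviour can be sketched with a small LRU cache. All names here (`LRUModelCache`, `unloadAll`) are illustrative, not the app's real API; in practice `unloadAll()` would be wired to a memory-pressure notification.

```swift
import Foundation

// Minimal LRU cache sketch: least-recently-used models are unloaded
// first once capacity is exceeded.
final class LRUModelCache<Model> {
    private var order: [String] = []          // most recently used last
    private var models: [String: Model] = [:]
    private let capacity: Int

    init(capacity: Int) { self.capacity = capacity }

    func model(for key: String) -> Model? {
        guard let model = models[key] else { return nil }
        touch(key)
        return model
    }

    func insert(_ model: Model, for key: String) {
        models[key] = model
        touch(key)
        while order.count > capacity {
            let evicted = order.removeFirst()
            models[evicted] = nil             // releases the model's memory
        }
    }

    /// Drop everything, e.g. in response to a memory-pressure warning.
    func unloadAll() {
        order.removeAll()
        models.removeAll()
    }

    private func touch(_ key: String) {
        order.removeAll { $0 == key }
        order.append(key)
    }
}
```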

Device profiling

The profiler detects and monitors available memory, tracks battery state, selects an appropriate model automatically, and benchmarks device performance.
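A battery- and memory-aware selection heuristic might look like the sketch below. The thresholds and names are illustrative assumptions, not the app's actual logic; on a real device the inputs would come from `ProcessInfo.processInfo.physicalMemory` and `UIDevice.current.batteryState`.

```swift
import Foundation

// Hedged sketch: pick a model and quantisation from RAM and charging
// state. Memory cut-offs are illustrative, not measured values.
enum Quantisation { case q4_0, q8_0 }

struct ModelChoice {
    let name: String
    let quantisation: Quantisation
}

func chooseModel(physicalMemoryGB: Double, isCharging: Bool) -> ModelChoice {
    switch physicalMemoryGB {
    case ..<4:
        // Low-memory devices fall back to the smaller model.
        return ModelChoice(name: "TinyLlama", quantisation: .q4_0)
    case 4..<6:
        return ModelChoice(name: "Phi-3-Mini", quantisation: .q4_0)
    default:
        // Plenty of RAM: allow the larger 8-bit working set, but only
        // while charging, to stay battery-aware.
        return ModelChoice(name: "Phi-3-Mini",
                           quantisation: isCharging ? .q8_0 : .q4_0)
    }
}
```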

Inference engine

The engine provides streaming token generation, batched prompt processing for efficiency, and embedding generation.
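Streaming can be surfaced to the UI as an `AsyncStream`, so SwiftUI can render partial responses as tokens arrive. This is a sketch under stated assumptions: `TokenStreaming` and `ArrayTokenGenerator` are hypothetical stand-ins for the real engine's interface.

```swift
import Foundation

// A generator hands back one token at a time, nil when finished.
protocol TokenStreaming: AnyObject {
    func nextToken() -> String?
}

// Toy generator used here in place of the real inference engine.
final class ArrayTokenGenerator: TokenStreaming {
    private var tokens: [String]
    init(_ tokens: [String]) { self.tokens = tokens }
    func nextToken() -> String? {
        tokens.isEmpty ? nil : tokens.removeFirst()
    }
}

/// Wrap pull-based generation in an AsyncStream for consumption
/// with `for await token in …` on the UI side.
func streamTokens(from generator: TokenStreaming) -> AsyncStream<String> {
    AsyncStream { continuation in
        Task {
            while let token = generator.nextToken() {
                continuation.yield(token)
            }
            continuation.finish()
        }
    }
}
```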

Data layer

Local persistence uses SwiftData, with AES-256-GCM encryption and a separate encryption key per conversation.
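With CryptoKit, the per-conversation encryption described above reduces to a few lines. The function names below are illustrative, and key storage (e.g. in the Keychain) is deliberately out of scope.

```swift
import CryptoKit
import Foundation

// Sketch of AES-256-GCM message encryption with CryptoKit.
// Each conversation would hold its own SymmetricKey.

func encryptMessage(_ plaintext: String, with key: SymmetricKey) throws -> Data {
    let sealed = try AES.GCM.seal(Data(plaintext.utf8), using: key)
    // `combined` packs nonce + ciphertext + auth tag into one blob,
    // convenient for storing as a single field.
    return sealed.combined!
}

func decryptMessage(_ blob: Data, with key: SymmetricKey) throws -> String {
    let box = try AES.GCM.SealedBox(combined: blob)
    let plaintext = try AES.GCM.open(box, using: key)
    return String(decoding: plaintext, as: UTF8.self)
}

// One 256-bit key per conversation:
let conversationKey = SymmetricKey(size: .bits256)
```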

Supported models

Phi-3-Mini (recommended)

  • Parameters: 3.8B
  • Context: 4096 tokens
  • Quantisations: Q4_0 (~2.1 GB), Q8_0 (~3.9 GB)
  • Best for: modern devices (iPhone 12+)

TinyLlama (fallback)

  • Parameters: 1.1B
  • Context: 2048 tokens
  • Quantisation: Q4_0 (~0.6 GB)
  • Best for: older devices or low memory
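The model tables above can be read as a small registry that the app checks against the device's memory budget before downloading or loading a quantisation. This is a sketch; `ModelSpec`, `fits`, and the 1 GB working headroom are assumptions for illustration.

```swift
import Foundation

// The supported-model tables as data (sizes approximate, from above).
struct ModelSpec {
    let name: String
    let parameters: String
    let contextTokens: Int
    let fileSizeGB: [String: Double]   // quantisation -> approx. size
}

let supportedModels = [
    ModelSpec(name: "Phi-3-Mini", parameters: "3.8B", contextTokens: 4096,
              fileSizeGB: ["Q4_0": 2.1, "Q8_0": 3.9]),
    ModelSpec(name: "TinyLlama", parameters: "1.1B", contextTokens: 2048,
              fileSizeGB: ["Q4_0": 0.6]),
]

/// A quantisation fits if its file plus working headroom stays within
/// the memory budget. The 1 GB headroom is an illustrative guess.
func fits(_ model: ModelSpec, quantisation: String,
          memoryBudgetGB: Double, headroomGB: Double = 1.0) -> Bool {
    guard let size = model.fileSizeGB[quantisation] else { return false }
    return size + headroomGB <= memoryBudgetGB
}
```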

What I would demo

  1. On-device model selection (Phi-3 vs TinyLlama)
  2. Quantisation switching (4-bit/8-bit) based on device profile
  3. Memory pressure handling and LRU cache eviction
  4. Encrypted local storage for conversations

Next