If you build a beautiful AI system that only works on a high-speed fibre connection and a big GPU, you haven’t built for Kenyan primary care.
Designing Afya-Yangu AI means designing for the edge: low-power devices, intermittent connectivity, and busy clinics where every second counts.
The Constraints We Have to Respect
In many level 2 and 3 facilities, the technical reality looks like this:
- Power cuts are common; backup options are limited.
- Connectivity is via 3G/4G bundles or shared Wi-Fi, not dedicated fibre.
- Available hardware may be:
  - A modest desktop with 4–8 GB RAM
  - A low-end server donated years ago
  - A rugged tablet shared across several rooms
If our AI assistant depends on a large cloud model and stable internet, clinicians will use it once, get frustrated, and never open it again.
So we made two key design decisions:
- Use a small but capable model (MedGemma-based SLM).
- Perform retrieval and inference locally using FAISS and on-device compute (sketched just below).
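Here is roughly what that loop looks like in code. This is a minimal sketch, not the project's implementation: `all-MiniLM-L6-v2` is a stand-in for whichever embedding model actually ships, and `generate()` is a placeholder for the on-device MedGemma-based SLM call.

```python
# Minimal retrieve-then-generate sketch. The embedder is a stand-in and
# generate() is a stub, not the project's actual MedGemma interface.
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small, CPU-friendly, 384-dim

def generate(prompt: str) -> str:
    # Placeholder for the on-device MedGemma-based SLM call.
    return "[model answer would go here]"

# Embed the guideline chunks once and index them with FAISS.
chunks = [
    "Guideline excerpt: first-line management of ...",
    "Guideline excerpt: referral criteria for ...",
]
vectors = embedder.encode(chunks, convert_to_numpy=True).astype("float32")
faiss.normalize_L2(vectors)                  # cosine similarity via inner product
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

def answer(question: str, k: int = 4) -> str:
    """Retrieve a handful of relevant chunks, then ask the small model."""
    q = embedder.encode([question], convert_to_numpy=True).astype("float32")
    faiss.normalize_L2(q)
    _, ids = index.search(q, min(k, len(chunks)))
    context = "\n\n".join(chunks[i] for i in ids[0])
    return generate(f"Guideline excerpts:\n{context}\n\nQuestion: {question}")
```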
Why Small Models Matter
Large language models are powerful, but they come with trade-offs:
- They need more compute → slower responses on small machines.
- They’re harder to run offline → more dependence on cloud.
A small model, carefully chosen and fine-tuned, gives us:
- Lower latency – answers in seconds, not minutes.
- Feasibility on local hardware – no expensive GPUs required.
- Better control – easier to package, ship, and update.
We’re not chasing flashy benchmarks. We’re optimising for “Does this work reliably in a busy clinic on Tuesday morning?”
FAISS for Fast Local Search
FAISS helps us store our guideline knowledge base in a way that’s:
- Compact enough for local disks.
- Fast enough for real-time search.
Because we only retrieve a handful of relevant chunks for each query, we keep memory and compute usage low—which is exactly what we need on the edge.
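To make “compact enough for local disks” concrete: a flat FAISS index stores the raw vectors and nothing else, so its footprint is easy to predict. A sketch with random stand-in vectors (the file name and corpus size are illustrative):

```python
# Back-of-envelope footprint check with stand-in vectors.
import faiss
import numpy as np

dim = 384
vectors = np.random.rand(10_000, dim).astype("float32")  # ~10k guideline chunks

index = faiss.IndexFlatIP(dim)  # exact search, no training step
index.add(vectors)

# Raw vectors dominate the file: 10_000 * 384 * 4 bytes ≈ 15 MB,
# comfortably within the disk budget of a modest clinic machine.
faiss.write_index(index, "guidelines.faiss")

# At startup, load once and keep in memory; each query then costs a
# single scan over the index, which is fast at this scale.
index = faiss.read_index("guidelines.faiss")
query = np.random.rand(1, dim).astype("float32")
scores, ids = index.search(query, 5)  # top-5 nearest chunks
```

At a few tens of thousands of chunks, exact flat search is typically fast enough that approximate indexes (IVF, HNSW) and their tuning overhead can wait until the corpus genuinely outgrows this.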
Trade-offs: Latency vs Accuracy, Size vs Coverage
Every design choice involves compromise. Some of the trade-offs we’re navigating:
- Smaller vs larger model:
  - Smaller → faster, cheaper, easier to deploy.
  - Larger → potentially more nuanced language understanding.
  Our bias is towards “small enough to run everywhere, good enough to be safe and useful.”
- Latency vs complexity:
  - More retrieval steps and checks could improve answer quality.
  - But each extra step adds time.
  We aim for answers in <5 seconds under normal loads (see the timing sketch after this list).
- On-device vs cloud hybrid:
  - Full offline mode is essential for many sites.
  - But when connectivity exists, we might allow optional cloud enhancements (e.g. syncing logs, model updates).
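The timing sketch mentioned above: one way to keep the latency target honest is to measure it on every call. The decorator, logger name, and threshold below are our own illustration, not a shipped feature.

```python
# Illustrative latency guard: time the answer path and flag responses
# that exceed the 5-second budget, so regressions show up in the logs.
import logging
import time
from functools import wraps

LATENCY_BUDGET_S = 5.0
log = logging.getLogger("afya.latency")

def within_budget(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        if elapsed > LATENCY_BUDGET_S:
            log.warning("%s took %.1fs (budget %.1fs)",
                        fn.__name__, elapsed, LATENCY_BUDGET_S)
        return result
    return wrapper

@within_budget
def answer(question: str) -> str:
    return "stubbed answer"  # retrieval + generation as sketched earlier
```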
Possible Edge Architectures
We’re exploring a few deployment patterns:
- Local Server in the Facility
  - A small box in the records or IT room.
  - Multiple devices on the local network can connect to it via a web interface.
- Rugged Tablet or “Clinic Box”
  - An all-in-one device with the model, FAISS index, and UI.
  - Ideal for facilities without any other computer infrastructure.
- Hybrid Mode
  - Primary inference on-device.
  - Occasional sync with the cloud for updates, analytics, and backup.
The goal is to avoid a brittle system that dies when the internet drops. Afya-Yangu AI should feel like part of the clinic, not a remote service.
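One way to keep all three patterns in a single codebase is to treat the deployment pattern as configuration. The schema below is hypothetical, meant only to show the shape of the idea; none of these field names are finalised.

```python
# Hypothetical deployment configuration covering all three patterns.
from dataclasses import dataclass
from enum import Enum

class Mode(Enum):
    LOCAL_SERVER = "local_server"  # shared box on the clinic LAN
    CLINIC_BOX = "clinic_box"      # all-in-one tablet or device
    HYBRID = "hybrid"              # on-device inference, opportunistic sync

@dataclass
class EdgeConfig:
    mode: Mode
    model_path: str                 # on-disk SLM weights
    index_path: str                 # FAISS index location
    sync_when_online: bool = False  # only meaningful in HYBRID mode

# Example: a facility with one shared server and no reliable internet.
config = EdgeConfig(
    mode=Mode.LOCAL_SERVER,
    model_path="/opt/afya/models/medgemma-slm",
    index_path="/opt/afya/index/guidelines.faiss",
)
```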
Keeping the System Up-to-Date
Offline doesn’t mean frozen.
We’re designing an update pathway where:
- New or revised guidelines are packaged into update bundles.
- These can be:
  - Downloaded when connectivity is available, or
  - Physically distributed on USB drives, if needed.
- The local system:
  - Updates the guideline corpus.
  - Rebuilds the FAISS index.
  - Logs what changed, so we can trace behaviour.
That way, clinicians keep the benefits of offline reliability while staying aligned with evolving national guidance.
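A sketch of what applying one of these bundles could look like, assuming a bundle is simply a directory of revised guideline files plus a JSON manifest. The layout, file names, and `rebuild_index` hook are all illustrative assumptions.

```python
# Illustrative update step: copy revised guidelines in, log what changed,
# then rebuild the index. Bundle layout is an assumption, not a spec.
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

def apply_bundle(bundle_dir: Path, corpus_dir: Path, log_path: Path) -> None:
    """Apply an update bundle to the local guideline corpus."""
    manifest = json.loads((bundle_dir / "manifest.json").read_text())
    for name in manifest["files"]:
        shutil.copy2(bundle_dir / name, corpus_dir / name)
    entry = {
        "applied_at": datetime.now(timezone.utc).isoformat(),
        "bundle_version": manifest.get("version"),
        "files": manifest["files"],
    }
    with log_path.open("a") as log:  # append-only change log for traceability
        log.write(json.dumps(entry) + "\n")
    # rebuild_index(corpus_dir)  # re-embed the corpus and rewrite the FAISS index
```

Because bundles are plain files, the same path works whether they arrive over a 4G connection or on a USB stick.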
Afya-Yangu AI at the edge is a work in progress—but the principle is clear:
If it can’t run where patients are seen, it doesn’t count as “real” clinical AI.

