Context Eats Models for Breakfast
AI models are already good enough — and everything else is the hard part
Apologies for the long hiatus. I have realized that my usual long-form posts take too long to research and create too much friction, making them difficult to ship on a regular cadence. Instead, I will prioritize shorter “field-note” posts like this one, published more frequently. Thanks for your patience and encouragement.
Last month, I sat in on the final demo day of Turn.io’s Health AI Accelerator. Ten organizations — spanning telehealth support in Pakistan, vision screening in South Africa, chronic disease management in Nigeria, and maternal health in Kenya — had spent months building AI-powered health services on WhatsApp. Different countries, different conditions, different languages. All trying to make AI useful for real patients in real-world settings.
After all that work, they converged on the same conclusion: “Prompts, tone, discipline, product constraints, negative prompts >> model cleverness.”
Not a single team said they needed a better model. Every single team said they needed better everything else.
I keep hearing variations of this from every implementer I work with and talk to. And it’s made me realize something that I think the global health AI community needs to reckon with: the model debate is over. The bottleneck has shifted. But many of us haven’t noticed yet.
The myths that won’t die
The AI-for-health conversation is still dominated by questions about model capabilities and externalities. Which model is most clinically accurate? Will they be equitable in low-resource contexts and languages? Shouldn’t we be worried about hallucinations and bias? What about energy and water use? These are reasonable questions, but for the most part, they’re no longer the right ones.
Because, firstly, the evidence is in. A comprehensive landscape review published in January 2026 by a Harvard-Stanford collaboration (the ARISE network) found that frontier LLMs now match or exceed physicians on structured diagnostic tasks across multiple studies. The review used the word “superhuman” to describe AI performance in several clinical domains. OpenEvidence hit 100% on the USMLE! Google’s AMIE matched primary care physicians on multi-visit disease management scenarios.
Secondly, hallucination rates have dropped precipitously with the latest models (0.7–1.5% for GPT-5, according to one source), and are further mitigated by extended thinking modes and the right “harnesses”, i.e. context engineering: RAG, “Skills”, long-term memory, iterative feedback loops, and so on. In a future post, I’ll dig into what these “context harnesses” actually look like in practice, how they might be applied to various global health use cases and contexts, and what my own experience building AI workflows has taught me about where the real leverage is.
Thirdly, the cost and environmental footprint per unit of AI intelligence are falling dramatically with every generation. (33x reduction in energy use in one year | each Gemini prompt is one one-millionth of your daily water use)
tl;dr: The models are good enough, and the externalities are within acceptable limits. What isn’t good enough yet is everything we wrap around them. As the ARISE study itself summarized, prospective studies remain scarce and workflow integration is the binding constraint.
Three stories
Voice-transcription-based reasoning in the Philippines
A well-known tech entrepreneur I spoke with in 2024 worked with Harvard researchers to deploy an AI system for community health workers in the rural Philippines. The setup was simple and elegant: health workers recorded patient encounters as WhatsApp voice memos. Whisper transcribed the audio — nearly flawlessly even at that time, and despite background noise and colloquial Tagalog. GPT-4 analyzed the transcription and flagged clinical errors. The entire codebase was about 3,500 lines. The cost? Data transfer was more expensive than the AI processing itself!
The system was designed to catch the most egregious 25% of clinical errors. It did not try to be comprehensive. It was passive, non-threatening, and focused on the worst mistakes. The team learned through trial and error that processing the local language directly worked better than translating to English first. None of these were model decisions. They were design choices — about where AI enters the workflow, what it’s designed to flag, and how it communicates with health workers who didn’t ask to be monitored.
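The shape of that pipeline is small enough to sketch. What follows is a minimal, hypothetical reconstruction, not the project’s actual 3,500-line codebase: the function names are my own, and the model call is stubbed so the sketch runs without an API key (in the real deployment, Whisper handled transcription and GPT-4 did the review).

```python
# Hypothetical sketch of the pipeline described above:
# voice memo -> transcription -> passive clinical-error flagging.
# `call_llm` is a stub standing in for a hosted-model API call.

REVIEW_PROMPT = (
    "Review this community health worker's patient-encounter transcript. "
    "Flag ONLY the most egregious clinical errors (missed danger signs, "
    "wrong dosing, skipped referrals). Respond in the transcript's own "
    "language; do not translate to English first."
)

def call_llm(system_prompt: str, user_text: str) -> str:
    # Stub: the real deployment would send system_prompt + user_text
    # to a chat-completion API. Stubbed here so the sketch is runnable.
    return f"[model review of {len(user_text)}-char transcript]"

def review_encounter(transcript: str) -> str:
    """Passive review: surface the worst mistakes, never grade the worker."""
    return call_llm(REVIEW_PROMPT, transcript)
```

Note that the design decisions the author highlights — work in the local language, flag only the worst errors, stay passive — live entirely in the prompt and workflow, not in the model.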
Pro-forma templates to guide clinical history taking (West Bengal)
Our grantees, Heal India and the Liver Foundation, are currently deploying an AI clinical decision support tool with nurses in rural West Bengal. After extensive user-centered design, they landed on a workflow that addresses an inherent limitation of large language models: they do not gather patient history comprehensively before offering a diagnostic analysis. To address this, nurses first generate a patient-history template using a top AI model, record the patient interview while following that checklist comprehensively, and then send the recording, with its detailed patient history, to a carefully prompted Gemini “gem”. Within 60 seconds — even from local Bengali dialects, even in noisy clinical settings — they get back a structured case summary, syndrome-based differentials, and a sophisticated treatment plan.
The model was table stakes. What took months was the workflow design. Figuring out that the entry point should be a structured pro-forma. That recording works better than typing. That 60 seconds is an acceptable latency threshold. That nurses, not ASHA workers, are the right users because they have the minimum medical literacy to act on the output. Most remarkably, the nurses have now started using the AI to create their own training materials — snakebite treatment protocols, for example — without being asked. The tool became a platform for local knowledge creation, not just a diagnostic aid.
Jacaranda’s use of contextual data for fine-tuning (Kenya)
Jacaranda Health built Uliza Mama, a maternal health chatbot that serves Kenyan mothers in Swahili, English, and code-mixed language. They built a 3,000-question benchmark from 300,000 real user interactions. And yes, their custom fine-tuned models outperformed general LLMs.
But the lesson here is not that off-the-shelf models are bad. It is that fine-tuning was effective because of context. Jacaranda’s advantage was 300,000 real conversations with Kenyan mothers — their fears, their language patterns, the clinical pathways that actually exist in Kenyan maternal health. Early versions of the chatbot based on off-the-shelf models produced outputs Jacaranda’s team described as “schizophrenic” and “terrifying.” Months of work on safety mechanisms, cultural sensitivity, and balancing medical accuracy with accessible language followed. No model upgrade would have fixed what they were dealing with.
What “context” actually means
When the Turn.io accelerator teams talked about what mattered more than model selection, they kept coming back to the same list:
- Prompts and negative prompts: what not to say is as important as what to say.
- Tone and cultural register: clinical precision doesn’t equal trustworthy communication.
- Local clinical guidelines: not WHO global recommendations, but the actual drugs available and the actual referral pathways.
- Trust architecture: disclaimers, escalation to humans, clear boundaries about what the AI can’t do.
- Workflow design: when does the AI enter the care encounter — before, during, after? Passive or active?
- Safety and evaluation infrastructure: hallucination detection, clinical red lines, ongoing performance monitoring, not just one-time validation.
- Business model alignment: for private providers, the AI has to fit how they actually make money.
This is a long list. And not a single item on it is a model capability question.
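To make two of these items concrete — negative prompts and trust architecture — here is a small, hypothetical illustration. The prompt text, danger-sign list, and `should_escalate` helper are my own examples, not taken from any of the accelerator teams:

```python
# Hypothetical system prompt showing negative prompts, trust
# boundaries, and a human-escalation red line for a WhatsApp
# maternal-health assistant. Illustrative only.

SYSTEM_PROMPT = """You are a maternal-health assistant on WhatsApp.

Negative prompts (never do these):
- Never diagnose; describe possibilities and urge a clinic visit.
- Never name drug doses; defer to the local facility's protocol.
- Never reassure away danger signs (bleeding, severe headache, fits).

Trust boundaries:
- Open every conversation by saying you are an AI, not a nurse.
- If the user describes a danger sign, stop and hand the
  conversation to the on-call human midwife immediately.
"""

DANGER_SIGNS = ("bleeding", "severe headache", "convulsions")

def should_escalate(message: str) -> bool:
    # Clinical red line: route danger-sign messages to a human.
    # Crude keyword match, for illustration only; a real system
    # would use a classifier plus ongoing performance monitoring.
    text = message.lower()
    return any(sign in text for sign in DANGER_SIGNS)
```

The point of separating the red line from the prompt is that escalation then does not depend on the model behaving well — it is enforced in code, outside the model.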
What funders and builders should do about it
If context is indeed the binding constraint, then the field needs more than open-source and downsized models that can perform well on edge devices and offline; it also needs an implementation commons — shared playbooks, real-world contextual benchmarks, failure libraries, and design patterns. At Endless, we have started working on this through our partnership with The Agency Fund (TAF) on federated benchmarking (10 organizations are being funded to publish benchmarks based on real-world operational data), and with CHAI and TAF on stewarding an AI commons with other stakeholders. The Center for Global Development and TAF’s four-stage evaluation framework offers a useful structure for thinking about where in the lifecycle implementers need the most help, and their living playbook is a good design primer. I also wrote extensively about various implementation bottlenecks in this post last year.
But you don’t need centralized platforms to start. AI model companies can expose some of their internal tools for benchmarking and evaluating the entire product harness, not just the models, so that implementers have better tools at their disposal. Implementers can begin by publishing what their prompt libraries, RAG pipelines, and Skill harnesses look like — and what didn’t work; sharing benchmarks built from real user interaction data, the way Jacaranda shared their 3,000 questions; documenting workflow decisions and the reasoning behind them; publishing product cards to complement model cards; open-sourcing the non-model layers; and rigorously evaluating the product, the user experience, and the health impact. Philanthropic funders like us can incentivize this kind of open knowledge creation and sharing as part of our agreements with grantees.
As of 2026, every organization starts its journey with superhuman models that have excellent clinical performance and incredible multi-modal and agentic capabilities (I would argue that OpenAI’s embarrassing public gaffe with the ChatGPT Health product is more of a context failure than a model failure). The organizations that understand their context best will have a decisive advantage. And the faster they share what they’ve learned, the faster the whole field will move forward.
The shifting sands
We’re past the era where AI capability was the binding constraint for healthcare in low-resource settings. We’re now in the era where implementation fidelity and design wisdom are the binding constraints. The models will keep getting better on their own. The context — the prompts, the workflows, the trust, the cultural fit — only gets better when humans do the hard, unglamorous work of deploying these tools and documenting what they learn.
The question is no longer whether AI is ready for global health. It’s whether global health is ready for AI.
If you found this useful, I’d love to hear from you. Are you seeing the same pattern in your work? Where do models still fail? What context challenges have been hardest to solve? Drop a comment or reply to this email.



Rubayat, I love your post, and couldn't agree more with your central thesis.
Not only is context the key thing to get right in order to achieve outstanding system-level performance, but it is also the thing that we as healthcare entrepreneurs can uniquely do. The AI companies are moving quickly and improving model capabilities, but they are unlikely to be in a position anytime soon to have all the relevant context both of the patient and the local health system in order to solve real problems.
This is where health entrepreneurs must focus: on building that harness. As you said, it's not easy, and many of those tools are still being developed. But it is extremely exciting, and I think it bodes very well for improving health globally!
Great write-up, Rubayat. From a technical standpoint it’s valid. But at the same time, in practical settings it’s more about safety and responsibility.
If one of these systems goes wrong and gives incorrect documentation or answers, who takes responsibility? The AI, the model developer, or the app developer?
A recent study found that ChatGPT Health still misses 52% of medical emergencies.
https://www.nature.com/articles/s41591-026-04297-7
In healthcare all it takes is just one scenario to go wrong.