Exactly a year ago, I wrote that health systems were headed for a “singularity” — anticipating that as people harness the remarkable power of LLM-based chatbots for their health needs, health systems would struggle to keep up, build the seamless transitions from Dr. AI to human providers and back, and stay relevant. Yet, I hardly expected this transformation to unfold as swiftly as it has.
The progress of the models in medicine has indeed been astonishing. OpenAI's o1 preview demonstrated “superhuman performance,” surpassing top physicians on clinical reasoning tasks months ago; its successor, o3, is already faster, cheaper, and more accurate. Hallucination rates, among the biggest concerns about using LLMs in medicine, have declined dramatically to negligible levels in cutting-edge models. Google’s recently released MedGemma now offers multimodal clinical reasoning (including dermatology and radiology images) in an open-source model, rivaling state-of-the-art models from mere months ago. Models are progressing so fast that research becomes obsolete soon after being published, and new benchmarks like HealthBench are popping up monthly to move the goalposts. While it is not easy to appreciate the improvements in medical reasoning intuitively, here’s a visual analog: AI video generation models can now turn a single text prompt into HD video clips, with sound, that are indistinguishable from reality—a mind-bending leap from the bizarre and muted clips of Will Smith devouring spaghetti from just two years ago.
AI adoption has similarly accelerated at a pace unprecedented in the history of technology. Extrapolating from OpenAI’s numbers indicates that close to a billion people chat with an LLM every week, with up to 10% of those conversations focused on health. My own parents, aged 73 and 66, told me recently that they're talking to ChatGPT regularly about their health issues (and everything else, like gardening), with zero encouragement from me. I've been doing the same for over two years, and several friends have gotten life-changing benefits managing chronic conditions, their children’s autism, and more. The barriers to self-care have never been lower: voice input in local languages (my parents use Google Keyboard’s amazing voice-to-text feature in Bangla), 24/7 availability, free to use within reasonable constraints, infinite patience and attention to detail. While LLMs are far from infallible, people seem to intuit that the average human provider is likely much worse in accuracy, and certainly can’t compete on speed and availability.
The risks of health delivery organizations — public health non-profits, national public hospitals, or private clinics and diagnostic centers — not keeping up with this revolution are manifold. Health data is being fragmented and walled off inside private non-health organizations like OpenAI and Google without guardrails. National and local clinical guidelines are likely being disregarded. There is no way to monitor and mitigate when LLMs inevitably provide harmful and even life-threatening advice (such as the mental health downward spirals ChatGPT is sometimes causing). Care from a typical pharmacy or community health worker, which is how billions of people across the world receive primary healthcare, looks laughably unsophisticated compared to the advice one can get from an LLM from the comfort of one’s home — as a result, actual providers are further and further distanced from people’s healthcare, and utilization of and trust in them are bound to erode further.
I therefore find it baffling that health systems are not treating this with the seriousness or urgency it deserves, and in most cases are not doing anything differently from what they did pre-ChatGPT. I have the privilege of working with some leading-edge innovators and adopters of AI through my work at Endless, such as projects with Penda Health in Kenya, E-Health Africa in Nigeria, and Dimagi in Zimbabwe, but beyond these early adopters, most non-profits and social enterprises, and especially traditional hospital and public health systems, are scarcely considering the implications of, or reckoning with, this historic disruption.
After spending the last year talking to dozens of model developers, health organizations, hospital systems and funders of AI pilots, I've identified eleven specific (but often interrelated) barriers that explain why AI adoption by our health systems is so far behind the people they aim to serve.
Bottlenecks slowing AI adoption
To make this digestible, I've organized the barriers into three layers, starting with internal ones (psychological and knowledge gaps) and zooming out all the way to systemic issues.
Below is a TL;DR summary table of these bottlenecks, with some pragmatic suggestions on how to overcome them in the near term. I then elaborate on each of them, if you are interested in reading more.
A Call for Stories: I'm collecting implementation case studies from anyone piloting AI in health settings—what you tried, what broke, what worked. If you have a story to share, hit reply or DM me on Substack or LinkedIn.
Let's help each other avoid reinventing the same wheels.
A. INTERNAL BARRIERS
1. Mistrust and fear of AI, and the infamous human ego
AI is poorly defined, and even more poorly understood. Given the myriad doomsday forecasts out there, it is hard to have a focused, constructive conversation with a public health professional or doctor about the true promise of AI in medicine (i.e. the democratization of clinical intelligence), and how to manage the genuine risks that inevitably accompany that superpower.
Many doctors in the West, overburdened for decades by documentation and process-creep, are generally positive about AI’s potential to simplify life and help them focus again on patient care, partly also because they don’t see an imminent threat to their jobs under the onerous regulatory environments they operate in. But many senior clinicians in poorer countries, who also tend to lead Ministries of Health, live in a complacent bubble of overconfidence, skeptical that any computer could ever replace their craft, and hence are often unable to take the disruption and opportunities seriously. As a friend at Harvard Medical School, who was a leading Internal Medicine specialist in India, told me, "Even the free version of 4o is better than most doctors I've encountered in developing countries." Yet decision-makers continue operating from outdated assumptions about AI capabilities.
This is also why they still over-index on AI hallucinations being a dealbreaker (which is a much smaller problem today than it was 2 years ago), declare that AI is not ready after a small pilot with the outdated GPT-3.5, and latch on to any indications that AI is not “true intelligence.” I saw a version of this when I ran Jeeon, where senior doctors in Bangladesh declined to even acknowledge the widely researched fact that 70% of healthcare in the country was delivered by the humble neighborhood drug shop.
To be fair to doctors, however, this inflated ego is a broader problem with human experts in general!
2. Limited AI Literacy and artificial self-imposed constraints
Ego and mistrust naturally prevent people from deeply and objectively engaging with the promise of AI. There are serious limitations in people’s understanding of LLMs and their capabilities, and many clinicians and public health professionals use them (when they do) just as they would Google Search, and no better.
As a result, even when organizations undertake efforts to pilot AI, they tend to get stuck within the “chatbot” paradigm. As Rob Korom from Penda Health told me, “it is very easy to build a cool chatbot by layering a 3-sentence prompt over ChatGPT”, and people tend to stop innovating beyond that novelty. But chatbots alone can rarely solve a real problem in healthcare. I will cover this issue and available design approaches in more detail in points #4 and 5 below.
On top of this, organizations often impose artificial and poorly informed constraints on themselves which lead to failure, such as choosing outdated models like GPT-3.5, or open-source ones, citing cost or data-protection constraints. What they often fail to understand is that costs for an equivalent level of performance are coming down by 10x per year or more. Recently, only months after releasing o3-mini, OpenAI slashed API prices by 80%! Open source (e.g. DeepSeek) is also catching up fast and is probably at most 6 months behind the frontier.
By the time pilots are concluded (ideally with the best frontier models, even closed ones) and funding is secured for scale-up, costs are bound to come down to an acceptable level, and open-source models will be available that can replace the closed models with little or no performance tradeoff. But piloting a substandard model is certain to lead to underwhelming results, dampening excitement and adoption.
B. ORGANIZATIONAL BARRIERS
3. Devices and connectivity
There's an incorrect assumption that smartphones and 5G connectivity are prerequisites for AI deployment. Yet even basic setups—like those demonstrated by Viamo in Nigeria—can effectively utilize AI through innovative design.
This is therefore more of an internal mindset barrier than an actual blocker to progress. AI can be the most democratizing health technology of all time, and it is incumbent on us to take it to as many people as possible, starting with the lowest-hanging fruit. We should start with facilities and providers (hundreds of thousands by this point) who already have devices and connectivity. Billions of people rely on CHWs or pharmacies, which would get an instant boost in care quality if AI were used to upgrade these touch-points, even with phone/IVR/WhatsApp-based solutions like Viamo’s. We should not let perfection be the enemy of the good.
4. Unclear goals and success criteria (“Hammer in search of a nail” problem)
Because AI is all the rage these days, and so much money is available for AI experiments, many organizations jump into pilots without clarity about what they are trying to achieve with AI. As a result, pilots are rarely targeted or well designed to meet a specific goal, and fall flat after some initial excitement over the “cool” factor.
Organizations should start out their AI journeys with a clear understanding of the “nail” — for most, it will usually come down to one or more of these four goals:
Reduce costs (e.g. by automating/simplifying processes)
Increase impact (without massively increasing costs) by solving a bottleneck, e.g. that of scarcity of call center agents, clinicians or radiologists
Increase revenue (e.g. by adding new services or increasing throughput)
Improve patient or provider experience (e.g. care orchestration, ambient scribes, etc.)
5. Workflow Integration & Design Failures
Even when organizations have a clear analysis of the problem(s) to solve, and specific definitions of success, they often lack design processes or expertise to integrate the right AI in the right format to address the problem, learn from failures and iterate towards an optimal solution.
While everyone naturally gravitates towards chatbots, these rarely fit seamlessly into existing health worker workflows. If you start from the problems, however, there are actually an infinite number of ways AI can be put together to solve them, such as:
Ambient scribes that transcribe conversations and populate medical records, solving the key challenge of getting frontline providers to reliably input information into EHRs.
Passive monitoring agents that flag concerning patterns in routine data, helping address critical mistakes or gaps by providers or systems.
Competency-based coaching systems that assess healthcare worker competencies and provide personalized refreshers, helping address knowledge/skill gaps at a low cost.
Different models or instances can also be stitched together into integrated workflows, e.g. taking inbound messages from clients and triaging/categorizing them using Model 1, then passing to specialized sub-agents to answer different categories of queries (Models 2.x), and supporting humans in drafting responses to the few critical or unclear ones that require human involvement (Model 3).
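To make this pattern concrete, here is a minimal sketch of what such a stitched-together workflow could look like in code. It assumes the OpenAI Python SDK purely for illustration; the model name, categories, prompts, and example message are hypothetical placeholders, and a real deployment would wrap every step in guardrails, logging, and evaluation.

```python
# Minimal sketch of the Model 1 -> Models 2.x -> Model 3 workflow described above.
# Assumes the OpenAI Python SDK; model names, categories and prompts are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TRIAGE_PROMPT = (
    "Classify the patient's message into exactly one category: "
    "'emergency', 'medication_question', 'appointment', or 'other'. "
    "Reply with the category only."
)

SUB_AGENT_PROMPTS = {
    "medication_question": "Answer medication questions according to national guidelines.",
    "appointment": "Help the patient book, move, or cancel an appointment.",
}


def ask(system_prompt: str, user_message: str) -> str:
    """One call to a chat model with a system prompt and a single user message."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; use whichever model passed your evaluation
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content.strip()


def handle_inbound(message: str) -> dict:
    # Model 1: triage/categorize the inbound message
    category = ask(TRIAGE_PROMPT, message).lower()

    if category in SUB_AGENT_PROMPTS:
        # Models 2.x: a specialized sub-agent answers routine queries directly
        return {"route": category, "reply": ask(SUB_AGENT_PROMPTS[category], message)}

    # Model 3: for emergencies and unclear cases, draft a response for a human to review
    draft = ask("Draft a response for a clinician to review before it is sent.", message)
    return {"route": "human_review", "draft": draft}


print(handle_inbound("Can I take ibuprofen together with my blood pressure medicine?"))
```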
Organizations therefore need to put on their design thinking hats first and foremost, and co-design with users the AI-enabled solutions that might alleviate their challenges. Early prototyping could even involve human actors pretending to be AI to gauge the right input-output combinations. Only after the problem and solution are conceptualized manually can one explore the right AI models and prompts to produce the right outputs, and design the evaluation metrics and thresholds that the AI has to pass in order to constitute success and deserve scale-up.
6. Resource constraints
Traditional health systems such as hospitals typically lack the in-house expertise to champion and build AI literacy internally, and struggle to attract such resources to their organizations even if they want to. Implementers like health non-profits and social enterprises, for their part, are often too time- and resource-constrained (even more so in the current aid crisis) to think creatively about AI. All of this is compounded by the fact that AI expertise is scarce and extremely expensive in most markets.
One obvious answer to this is to invest in shared resources that can spread that cost across organizations — organizations like Audere, IDInsight and Turn.io are already trying to do this, as are fellowship and acceleration programs at large model developers. We need many more AI and design experts supporting non-profits and health systems to realize the true value and potential of AI, and handhold them through the journey.
7. Inertia and Change Resistance
Most large organizations are set in their ways, and health systems are no exception. Doctors and health professionals who have spent their entire formative years studying and operating within those doctrines are incredibly resistant to change and disruption, especially from something they don’t understand or trust. Even nimble organizations like Penda Health, which have been at the frontier of scaling up AI, have told me that the hardest part was getting providers to trust and use the AI guidance, not the technology innovation itself.
Even when early adopters and internal champions have successfully run pilots, it is therefore extremely important to think through the change management efforts needed for wider scale up and use. This may involve securing buy-in on success criteria and investment requirements from top leadership ahead of time, or mapping against provider pain points and co-designing with them to ensure the intervention does not languish on the proverbial shelves post-deployment.
C. SYSTEMIC BARRIERS
8. Tooling gaps for model orchestration, evaluation and more
The process of exploring, evaluating, selecting, integrating and maintaining model deployments in real-world clinical and health contexts is non-trivial. Currently, in the absence of proper implementer-friendly “DevOps” infrastructure, most implementers are walking this journey solo and unsupported, and hence often recreating rudimentary “wheels” to get the job done.
Here are some basic questions every implementer inevitably faces at some point in their AI adoption journey:
What are the available models, tools and development strategies for my context?
Do I build on open or closed models? What are the trade-offs I should consider (accuracy, flexibility, latency, cost, data safety, long-term adaptability, etc.)?
What approach should I use to adapt and specialize the model for my use case (RAG, fine-tuning, prompting techniques, etc.)?
How do I evaluate which model is the right one for my specific language, dialect, context, use case, etc.? What is the right size and structure for an internal benchmark dataset?
How do I orchestrate different models/instances to get the job done, and maintain context between them?
Are models performing according to my expectations post-deployment, and how consistently? Is model performance degrading over time?
How happy are users with the responses, and how do I identify and fix errors made by LLMs in production?
When and how do I upgrade to the newest models, and how much effort would that entail in redesigning the prompts and debugging problems?
For example, Penda's WhatsApp system needs to classify user intent, route to appropriate agents, maintain conversation context, and escalate complex cases to humans. But as Rob explained, "You can't build all of that context and flow into a single prompt. You need tools that define how the bot should behave, categorize user intent, and maintain context between agents. The models aren’t really the biggest gap; the software that helps you put the model pieces together and evaluate them is the biggest gap."
Implementer after implementer I spoke to echoed similar frustrations with the tools that exist today. For example, they need tools to evaluate "Maternal health triage for ASHA workers in Gujarati" or "TB adherence coaching for Nairobi slum dwellers in Sheng"—combinations of language, domain, and use case that no centralized benchmark can anticipate or represent.
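To illustrate the kind of “wheel” implementers end up rebuilding, here is a minimal sketch of a homegrown benchmark run against a candidate model. The dataset file, labels, model name, and pass threshold are all hypothetical; a real evaluation would rely on clinician-reviewed rubrics and error analysis rather than simple exact-match accuracy.

```python
# Minimal sketch of an internal benchmark run; dataset, labels, model and threshold are hypothetical.
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# e.g. a few hundred clinician-reviewed cases in the local language, one JSON object per line,
# each with a "presentation" text and an "expected_label"
with open("maternal_triage_benchmark_gujarati.jsonl") as f:
    cases = [json.loads(line) for line in f]


def predict(case: dict) -> str:
    """Ask the candidate model to triage one case and return its label."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # the candidate model under evaluation
        messages=[
            {
                "role": "system",
                "content": "Triage the case as 'refer_now', 'refer_soon', or 'home_care'. "
                           "Reply with the label only.",
            },
            {"role": "user", "content": case["presentation"]},
        ],
    )
    return response.choices[0].message.content.strip().lower()


correct = sum(predict(case) == case["expected_label"] for case in cases)
accuracy = correct / len(cases)
print(f"Accuracy on {len(cases)} cases: {accuracy:.1%}")

# A threshold agreed with clinicians ahead of time turns the run into a go/no-go decision
assert accuracy >= 0.90, "Candidate model is below the agreed threshold; do not deploy."
```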
At Endless, we are coordinating an effort with some leading benchmarking organizations and ecosystem partners (e.g. the Harvard-EPFL LiGHT Laboratory, Audere, CHAI, Qure.ai, IDInsight, Agency Fund) to fill some of these tooling gaps, but this inevitably requires a system-wide effort and significant resources.
9. Coordination failures and “tragedy of the commons”
The above is one example of a broader set of coordination failures across the ecosystem. Despite AI being a ‘blue ocean’ opportunity, current incentive structures, funding crises, and a prevailing “scarcity” mindset are preventing open sharing of successes, failures, and best practices, which slows down progress for the entire ecosystem.
For example, multiple NGOs are independently collecting and labeling hundreds of hours of audio in the same languages, when this could be done far more efficiently at the ecosystem level. Organizations that should collaborate are instead competing.
While efforts like PATH’s community of practice around AI for Health are a good step in the right direction, we need many more shared platforms and coordination mechanisms, and, even more importantly, conducive incentive and funding architectures to address these market failures.
10. Regulatory & Liability Hurdles
Few countries are getting ahead of the AI disruption with enabling regulation. The US is notorious for its litigious legal environment, which discourages any meaningful experimentation in patient-impacting AI prior to lengthy FDA approval processes. Even poorer countries sometimes have unfounded and unhelpful requirements, such as not allowing health data to leave the country’s borders, despite local data centers incurring much greater security risks than an AWS cloud server. There are, of course, exceptions to this, most notably Rwanda within the African continent. But as a general point, there are path dependencies, based on decisions made in the 2010s digital health era, that are hindering experimentation with and adoption of AI in many countries.
Even non-technology-related policies, such as rules defining what each cadre of health worker can and cannot do, must be revised to take proper advantage of AI to task-shift responsibilities to lower levels. One possible solution is to create “sandbox” environments in which governments relax existing laws and rules so that AI innovators and implementers can experiment freely, in exchange for more intense oversight and evaluation requirements. Once these experiments prove successful, policymakers can decide how to update the laws and regulations to allow those innovations to scale.
11. Misaligned and Perverse Incentives
There are of course numerous perverse incentives and interest groups in each ecosystem that will resist AI disruption. Most notable among these are probably doctors’ associations, which tend to be extremely powerful lobbies in many countries. Ministries of Health, which are often run by doctors, are particularly susceptible to such lobbies. During my work with Jeeon in Bangladesh, we repeatedly encountered doctors' associations resistant to any new idea (e.g. delivering care through pharmacies) that was deemed to undermine their authority and status.
As a mentor recently pointed out to me, innovations often succeed by “going around” existing interests rather than trying to co-opt them. Mobile money succeeded by working with telecoms, not banks. Medical abortion technology scaled by working with pharmacies and community health workers, not doctors. Similarly, AI in healthcare may find its biggest wins by empowering non-physician providers—pharmacists, paramedics, community health workers—rather than trying to convince doctors to change their workflows. While health ministries might be reluctant, finance ministries and insurance companies have stronger incentives to invest in cost-effective prevention and early treatment, and as a result might be bigger champions of AI if the ROI case can be made to them effectively.
Looking forward
As should be amply clear by now, the bottleneck is no longer "Can AI do X?" The question is: "Can we organize ourselves fast enough to use what AI can already do?"
In other words, it is not about models anymore — it's about mindsets, workflows, financing, and politics. The tools to close that gap are knowable, and likely vastly cheaper than the billions being spent on better and better model development.
What we need now is the collective will to build shared infrastructure, align incentives, and get serious about execution. But first, we must recognize that our old mental model of health systems is headed for a dead end, and we must reimagine our role in this brave new world.