
In this week’s blog we highlight one aspect of using artificial intelligence (AI) responsibly. The capabilities of current AI models like ChatGPT and Gemini are undeniably jaw-dropping, but it is equally clear that they exhibit unresolved problems. Let’s dive in and explore the implications for healthcare solutions.

Executive order on safe, secure AI

The executive order (EO) promulgated by the Biden Administration in October 2023 is a sweeping attempt to articulate a series of goals for developing and deploying safe, secure and trustworthy AI. Section 8(b) of the order states:

To help ensure the safe, responsible deployment and use of AI in the healthcare, public-health, and human-services sectors:

 (i)    Within 90 days of the date of this order, the Secretary of HHS shall …  establish an HHS AI Task Force that shall, … develop a strategic plan that includes policies and frameworks — possibly including regulatory action, as appropriate — on responsible deployment and use of AI and AI-enabled technologies in the health and human services sector.

The task force has indeed been established, within the 90 days specified in the EO. The same EO also details a series of areas the task force should consider for regulation and rulemaking, including predictive AI, safety monitoring, equity and privacy. One point emphasized is building a framework to identify and track AI-related clinical errors.

The National Academy of Medicine (NAM) recently proposed a broad range of principles that healthcare applications need to adhere to. It has dubbed this the “AI code of conduct” and is seeking stakeholders’ commitment. Of course, the task force and NAM efforts are just getting started, and regulations and guidance are at least a year away.

Stanford Medicine and the Stanford Institute for Human-Centered Artificial Intelligence have just established a new initiative, dubbed Responsible AI for Safe and Equitable Health (RAISE Health), and held an inaugural symposium on this front a few weeks ago. The symposium raised more questions than answers, though. Its main insight is that multidisciplinary collaboration is required to create trustworthy AI in healthcare.

Let’s investigate the strengths and weaknesses of the generative AI technology that has become the central focus of a myriad of healthcare applications.

Generative AI and clinical documentation

One core application area that has become a focus for a range of startups and established companies is automating clinical documentation. The amount of time clinicians spend documenting patient interactions skyrocketed after the adoption of electronic health records (EHRs). Over the past decade, solutions that addressed the documentation issue focused on speech recognition, where the physician dictates the clinical note, but this takes time after the patient visit. Another solution popular with physicians is human scribes, but the use of human labor with specialized skillsets can be expensive.

The arrival of generative AI and large language models (LLMs) with emergent properties in 2022 completely changed the dynamics of what was possible. ChatGPT powered by GPT-4 can take sample doctor-patient conversations and generate clinical notes, and the capability it demonstrates out of the box is indeed impressive. However, in the year since ChatGPT was released, an interesting array of issues has been identified with the use of LLMs.

The most obvious might be related to the factual correctness of generated clinical notes. LLMs are trained to create likely completions to any given textual context, or “prompt.” For example, if this textual context is a partial sentence, LLMs will produce a sensible and grammatically correct completion of that sentence. The factuality of this completion is of no concern to the LLM, other than that one might expect a “correct” statement to be more likely than an incorrect one given the complex statistical model of word patterns inherent to the LLM. Similarly, if the prompt is a question, one would hope that the factually correct answer is the most likely completion. However, LLMs have no concept of factual correctness, or even of facts. In fact, LLMs don’t necessarily have a concept of “answering” a question at all; they simply extend the prompt they are given, and if that prompt happens to be a question, an answer is a likely extension.
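
To illustrate the point, here is a deliberately tiny toy sketch in Python of a model that only continues text with statistically likely next words. It is not how GPT-4 works internally, and the miniature “corpus” is invented for demonstration, but it shows why a likely continuation and a factually correct one are not the same thing.

```python
# Toy illustration (not a real LLM): a model that continues text by picking
# the statistically most likely next word, with no notion of factual truth.
# The corpus and counts below are invented for demonstration only.
from collections import Counter, defaultdict

corpus = (
    "the patient reports chest pain . "
    "the patient reports mild headache . "
    "the patient denies chest pain . "
).split()

# Count which word tends to follow each word (a bigram model).
next_word_counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_word_counts[current][following] += 1

def complete(prompt_words, length=4):
    """Greedily extend the prompt with the most frequent next word."""
    words = list(prompt_words)
    for _ in range(length):
        candidates = next_word_counts.get(words[-1])
        if not candidates:
            break
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

# The continuation reflects word statistics, not what a patient actually said.
print(complete(["the", "patient"]))
```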

In the case of ambient documentation, the prompt would contain an explanation of the desired output (e.g., “Generate a professional clinical note for the following physician-patient dialog”) followed by a transcript of the care dialog. Generative AI models show an impressive level of performance generating physician notes this way, but beyond clever prompt design, the generation process itself just creates a likely output given the prompt, without explicit controls to ensure factually correct output. A recent study titled “Factuality of Large Language Models in the Year 2024” points to surveys that document an interesting array of problems that LLMs display.
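
As a purely illustrative sketch of how such a prompt might be assembled, consider the Python snippet below. The call_llm function, the instruction wording and the short transcript are placeholders of our own, not any vendor’s actual interface.

```python
# Sketch of how an ambient-documentation prompt might be assembled.
# `call_llm` is a hypothetical stand-in for whatever model endpoint a vendor
# uses; the instruction wording and transcript are illustrative only.

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a generative model call."""
    return "(model-generated clinical note would appear here)"

instruction = (
    "Generate a professional clinical note for the following "
    "physician-patient dialog."
)

transcript = """\
Physician: What brings you in today?
Patient: I've had a dull headache for three days.
Physician: Any nausea or vision changes?
Patient: No, neither.
"""

# The prompt is simply the task description followed by the care dialog.
prompt = f"{instruction}\n\n{transcript}"
note = call_llm(prompt)
print(note)
```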

What does this mean for creating clinical notes directly from doctor-patient interactions? When generating clinical notes, LLMs can miss important facts, misstate facts or completely fabricate facts (the latter referred to as the hallucination problem).

There are strategies to reduce this problem on the generative AI side. For example, AWS HealthScribe links every sentence of the generated note back to the part of the care dialog that substantiates it. While such methods are effective, the problem remains in principle.
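
The sketch below illustrates the general grounding idea in a few lines of Python: match each note sentence to the transcript turn with the greatest word overlap and flag sentences that nothing in the dialog appears to substantiate. It is a toy illustration of the technique, not a description of how AWS HealthScribe or any other product implements it.

```python
# For each sentence of a generated note, find the transcript turn with the
# highest word overlap and flag sentences with no apparent support.

def words(text: str) -> set:
    return {w.strip(".,?!").lower() for w in text.split()}

def ground_note(note_sentences, transcript_turns, threshold=0.3):
    results = []
    for sentence in note_sentences:
        s_words = words(sentence)
        # Overlap of the sentence's words with each transcript turn.
        scores = [
            (len(s_words & words(turn)) / max(len(s_words), 1), turn)
            for turn in transcript_turns
        ]
        best_score, best_turn = max(scores)
        results.append({
            "sentence": sentence,
            "evidence": best_turn if best_score >= threshold else None,
            "flag_for_review": best_score < threshold,
        })
    return results

transcript = [
    "Patient: I've had a dull headache for three days.",
    "Patient: No nausea or vision changes.",
]
note = [
    "Patient reports a dull headache of three days' duration.",
    "Patient denies nausea and visual changes.",
    "Patient was advised to start antibiotics.",  # unsupported by the dialog
]
for row in ground_note(note, transcript):
    print(row["flag_for_review"], "-", row["sentence"])
```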

Years ago, when we began introducing speech recognition to the clinical documentation process, we faced similar problems. Speech recognition makes mistakes, and we rely on the attentiveness of the user to catch and correct mistakes where they happen. With speech recognition systems, many of the errors are obvious, sometimes downright comical. Still, it is easy for busy physicians to overlook errors, and the impact on clinical documentation quality is well studied.

Unlike speech recognition errors, the output of LLMs tends to look credible even when it is completely wrong. Placing the burden of review solely on the authoring physician is likely to create quality issues. Fundamentally, deploying AI solutions requires careful evaluation of their outputs. At enterprise scale, provider organizations will need to consider quality assurance workflows to measure and address such issues.
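
One possible shape such a QA workflow could take is sketched below: every note flagged by automated checks is routed to a human reviewer, along with a random audit sample of unflagged notes so overall quality can still be measured. The 5 percent audit rate and the data layout are illustrative assumptions only.

```python
# Route flagged notes plus a random audit sample of unflagged notes to review.
import random

def route_for_review(notes, audit_rate=0.05, seed=0):
    rng = random.Random(seed)
    review_queue = []
    for note in notes:
        if note["flagged"] or rng.random() < audit_rate:
            review_queue.append(note["note_id"])
    return review_queue

notes = [
    {"note_id": "n001", "flagged": True},
    {"note_id": "n002", "flagged": False},
    {"note_id": "n003", "flagged": False},
]
print(route_for_review(notes))
```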

This raises a second, less obvious issue when introducing a centralized quality assurance (QA) process to manage note quality: what qualifies as a good note? We touched on factual correctness, which is arguably the most straightforward aspect to capture in a quantifiable quality measure; take a look at our published work on this front. Another seemingly obvious criterion is “completeness,” i.e., whether all important information discussed between the physician and the patient is included in the note.

But what information is important? Outside of some critical information directly impacting care and reimbursement, this is not well defined and is physician- and use-case-dependent. Erring on the overly inclusive side when generating notes carries a cost: it adds to the amount of text the authoring physician needs to review and to the workload of the physician reading the note later.

Thus, excessive verbosity can compromise both the quality of a note and its usability for communicating the salient facts of an encounter. Other criteria could include word choice and professionalism of language, repetitiveness, writing style and format. Finally, the usability of a note for downstream processes, including reimbursement and quality measure reporting, needs to be considered.
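
To make the completeness-versus-verbosity tension concrete, here is a toy sketch that scores a note against a reviewer-supplied list of facts that should appear and computes a crude verbosity ratio. Both definitions are illustrative assumptions, not established measures.

```python
# Completeness: fraction of reviewer-specified facts present in the note.
# Verbosity: length of the note relative to the dialog it summarizes.

def completeness(note: str, required_facts: list) -> float:
    note_lower = note.lower()
    found = sum(1 for fact in required_facts if fact.lower() in note_lower)
    return found / len(required_facts)

def verbosity_ratio(note: str, transcript: str) -> float:
    return len(note.split()) / max(len(transcript.split()), 1)

transcript = "Patient: I've had a dull headache for three days. No nausea."
note = ("Patient reports a dull headache of three days' duration. "
        "Patient denies nausea. Extensive boilerplate text could go here...")
required = ["headache", "three days", "nausea"]

print("completeness:", completeness(note, required))
print("verbosity:", round(verbosity_ratio(note, transcript), 2))
```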

Evaluating the quality of documentation generated by physicians has a long history. In 2008, Stetson et al. created the Physician Documentation Quality Instrument (PDQI), a list of 22 dimensions used to analyze documentation quality. This list was pared down to a smaller set of nine factors, the PDQI-9, in 2012. There was also a range of studies that explored the problem of note bloat due to the clinician practice of copying and pasting sections of clinical notes from previous notes. These factors did not anticipate the generative AI revolution and the problems discussed above, and vendors developing solutions in this space have adapted these metrics for the realities of AI-generated notes. One preliminary evaluation of clinical note quality in a practical setting was recently published by Kaiser Permanente, which adapted the PDQI-9 to incorporate hallucinations and bias, making it a 10-factor quality evaluation. The study also showed the empirical impact of adopting the technology on ‘pajama time,’ the time clinicians spend documenting outside of normal business hours.
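
A rubric of this kind is straightforward to represent programmatically. The sketch below uses dimension names that approximate the published PDQI-9 plus a single added, AI-specific dimension; it is an illustration of the approach, with an assumed 1-5 rating scale, not a reproduction of the Kaiser Permanente instrument.

```python
# A PDQI-style rubric as a simple data structure with an averaged score.
# Dimension names approximate the published PDQI-9; the last entry is an
# illustrative AI-specific addition.

PDQI_LIKE_DIMENSIONS = [
    "up to date", "accurate", "thorough", "useful", "organized",
    "comprehensible", "succinct", "synthesized", "internally consistent",
    "free of hallucinated content",   # added, AI-specific dimension
]

def score_note(ratings: dict) -> float:
    """Average a reviewer's 1-5 ratings across all rubric dimensions."""
    missing = set(PDQI_LIKE_DIMENSIONS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    return sum(ratings[d] for d in PDQI_LIKE_DIMENSIONS) / len(PDQI_LIKE_DIMENSIONS)

example_ratings = {d: 4 for d in PDQI_LIKE_DIMENSIONS}
example_ratings["succinct"] = 2   # e.g., a reviewer penalizing note bloat
print(score_note(example_ratings))  # 3.8
```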

Finally, there is also the issue of data use and retention. Ambient documentation inherently requires recording the dialog between the patient and a caregiver. How long, though, is this data retained, and what is it used for? We discussed that what makes for a good note depends on the physician’s specialty, the care setting and even the type of visit (e.g., care providers might be more verbose in capturing information on a new patient than for a routine follow-up).

One way of addressing customization is through adaptation of the AI system to the individual user, a facility or provider organization, or even of the base system used across providers. Adaptation, though, requires the storage and retention of sensitive information. It also carries the risk that the AI system might improperly memorize aspects of an episode and leak information across patient or even organizational boundaries. This phenomenon is currently at the forefront of the conversation around image-generating AI: such image generators can be tricked into creating pictures resembling copyrighted training material. The same problem in principle exists for LLMs adapted to perform ambient documentation tasks, and it needs to be closely managed in the training process and measured in the resulting adapted system.
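
One simple way to probe for such memorization, sketched below, is to search model output for long verbatim word spans that also occur in the sensitive adaptation data. The eight-word window and the placeholder data are arbitrary, illustrative choices rather than an established standard.

```python
# Look for long verbatim n-gram overlaps between model output and the
# sensitive adaptation data used to customize the model.

def ngrams(text: str, n: int = 8) -> set:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def leaked_spans(model_output: str, training_texts: list, n: int = 8) -> set:
    """Return n-word spans of the output that appear verbatim in training data."""
    output_ngrams = ngrams(model_output, n)
    training_ngrams = set()
    for text in training_texts:
        training_ngrams |= ngrams(text, n)
    return output_ngrams & training_ngrams

training_texts = ["... sensitive transcript of a prior patient encounter ..."]
model_output = "(output from the adapted model for a new, unrelated prompt)"
print(leaked_spans(model_output, training_texts))  # ideally an empty set
```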

At Solventum, we have four decades of experience providing tools that automate clinical documentation workflows, first using speech recognition, then virtual scribe workflows. Now, using ambient devices, we field hybrid and direct-to-provider workflows that automate the generation of clinical documents from doctor-patient conversations. Deploying such solutions, however, requires careful attention to a myriad of factors, particularly in the healthcare context. Patient safety and care are paramount.

“Juggy” Jagannathan, PhD, is an AI evangelist with four decades of experience in AI and computer science research. 

Detlef Koll is vice president of global R&D, health information systems, Solventum.

About the authors

Detlef Koll

Vice president global R&D

V. Juggy Jagannathan

AI evangelist