The earliest signs of cognitive decline often appear not in a formal diagnosis, but in subtle traces hidden in health care providers’ notes.
A new study published Jan. 7 in the journal npj Digital Medicine suggests artificial intelligence (AI) can help identify these early signals – such as memory and thinking problems or changes in behavior – by scanning doctors’ notes for patterns of concern. These may include repeated mentions of cognitive changes or confusion by the patient, or concerns raised by family members who attend appointments with their loved one.
“The goal is not to replace clinical judgment, but to act as a screening aid,” study co-author Dr. Lidia Moura, an associate professor of neurology at Massachusetts General Hospital, told Live Science. By highlighting such patients, she said, the system could help doctors decide which people to follow up with, especially in settings where there is a shortage of specialists.
Whether this kind of screening actually helps patients depends on how it is used, said Julia Adler-Milstein, a health informatics researcher at the University of California, San Francisco, who was not involved in the study. “If the flags are accurate, go to the right person on the care team, and are actionable, meaning they lead to a clear next step, then yes, they can be easily integrated into the clinical workflow,” she told Live Science in an email.
A team of AI agents, not just one
To create their new artificial intelligence system, the researchers used what they call an “agent” approach. The term refers to a coordinated set of AI programs—five in this case—each with a specific role and checking each other’s work. Together, these collaborative agents iteratively improved how the system interpreted clinical notes without human intervention.
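The study describes this architecture at a high level rather than as code. As a rough, hypothetical sketch of the general pattern (not the authors’ actual pipeline), a few role-specific agents can share one underlying model, with a reviewer agent checking another agent’s output and triggering another pass when it disagrees. The role names, prompts, and `call_llm` stub below are illustrative assumptions only:

```python
# Hypothetical sketch of an "agentic" review loop: several role-specific
# prompts share one underlying language model, and a reviewer agent checks
# the labeler's output before a final label is accepted.
# The roles, prompts, and call_llm() stub are illustrative, not the study's code.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    name: str
    system_prompt: str

def run_agent(agent: Agent, text: str, call_llm: Callable[[str, str], str]) -> str:
    """Send one clinical note (plus the agent's role instructions) to the model."""
    return call_llm(agent.system_prompt, text)

def classify_note(note: str, call_llm: Callable[[str, str], str], max_rounds: int = 3) -> str:
    extractor = Agent("extractor", "List any phrases in this clinical note that suggest "
                                   "memory loss, confusion, or behavioral change.")
    labeler = Agent("labeler", "Given the extracted phrases, answer YES or NO: does this "
                               "note document a cognitive concern?")
    reviewer = Agent("reviewer", "Check the label against the note. Reply AGREE, or explain "
                                 "what was missed.")

    for _ in range(max_rounds):
        evidence = run_agent(extractor, note, call_llm)
        label = run_agent(labeler, f"Note:\n{note}\n\nEvidence:\n{evidence}", call_llm)
        critique = run_agent(reviewer, f"Note:\n{note}\n\nLabel: {label}", call_llm)
        if critique.strip().upper().startswith("AGREE"):
            return label  # reviewer accepted the label
        note = f"{note}\n\nReviewer feedback: {critique}"  # retry with the feedback attached
    return label  # fall back to the last label if no consensus is reached
```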
The researchers built the system on Meta’s Llama 3.1 and gave it three years of medical notes to study, including clinic visits, progress notes and discharge summaries. These came from the hospital’s patient registry and had already been reviewed by clinicians, who noted whether cognitive problems were present in the patient’s chart.
The team first showed the AI a balanced set of patient notes, half with documented cognitive concerns and half without, and let it learn from its mistakes as it tried to match how doctors had labeled those notes. At the end of this process, the system agreed with the doctors in 91% of cases.
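To make that setup concrete, here is a minimal, hypothetical sketch of the bookkeeping involved: sampling a balanced development set from labeled notes and scoring agreement as the fraction of notes on which the system matches the clinicians’ labels. The data and labels below are invented for illustration:

```python
import random

# Invented example data: each note is paired with a clinician label,
# 1 = documented cognitive concern, 0 = no concern.
notes = [(f"note {i} ...", random.randint(0, 1)) for i in range(1000)]

# Balance the development set: equal numbers of positive and negative notes.
positives = [n for n in notes if n[1] == 1]
negatives = [n for n in notes if n[1] == 0]
k = min(len(positives), len(negatives))
balanced = random.sample(positives, k) + random.sample(negatives, k)

def agreement(system_labels, clinician_labels):
    """Fraction of notes on which the system and the clinicians assign the same label."""
    matches = sum(s == c for s, c in zip(system_labels, clinician_labels))
    return matches / len(clinician_labels)

# e.g. agreement([1, 0, 1, 0], [1, 0, 0, 0]) == 0.75
```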
The completed system was then tested on a separate subset of data that it had not seen before, but which was pulled from the same three-year data set. This second data set was intended to reflect real-world care, in which only about one-third of records had been flagged by clinicians as showing cognitive concerns.
In this test, the system’s sensitivity dropped to around 62%, meaning it missed nearly four out of ten cases that doctors had labeled positive for signs of cognitive decline.
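Sensitivity is the share of clinician-flagged positives that the system also flags, which is not the same quantity as overall agreement. A worked example with invented counts (not the study’s figures) shows how a system can miss many positives while still agreeing with clinicians on most notes:

```python
# Hypothetical confusion-matrix counts for an imbalanced test set
# (roughly one-third positives, as in the held-out data described above;
# the numbers themselves are invented for illustration).
true_pos  = 62   # clinician said "concern", system agreed
false_neg = 38   # clinician said "concern", system missed it
true_neg  = 180  # clinician said "no concern", system agreed
false_pos = 20   # clinician said "no concern", system flagged it anyway

sensitivity = true_pos / (true_pos + false_neg)  # 0.62 -> misses ~4 in 10 positives
agreement = (true_pos + true_neg) / (true_pos + false_neg + true_neg + false_pos)  # ~0.81

print(f"sensitivity = {sensitivity:.2f}, overall agreement = {agreement:.2f}")
```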
At first glance, the drop in performance looked like a failure, until the researchers rechecked the medical records that the AI and human reviewers had classified differently.
Clinical experts reviewed these cases by re-reading the medical records without knowing whether the classification came from the doctors or the AI. In 44% of cases, these reviewers ended up siding with the system rating rather than the original physician chart rating.
“That was one of the most surprising findings,” said study co-author Hossein Estiri, associate professor of neurology at Massachusetts General Hospital.
In many of those cases, he said, the AI applied clinical definitions more conservatively than the doctors had, declining to flag concerns when the notes did not directly describe memory problems, confusion or other changes in how the patient thought, even if a diagnosis of cognitive decline was listed elsewhere in the record. The AI was trained to prioritize mentions of potential cognitive problems that doctors may not always consider important.
The results highlight the limits of manual chart review by doctors, Moura said. “When the signals are obvious, everyone sees them,” she said. “When they’re subtle, that’s where humans and machines can diverge.”
Karin Verspoor, an artificial intelligence and health technology researcher at RMIT University who was not involved in the study, said the system was evaluated against a carefully curated set of medical notes that had been reviewed by a doctor. But because the data came from a single hospital network, she cautioned, its accuracy may not translate to settings where documentation practices differ.
The system’s view, she said, is limited by the quality of the notes it reads, a limitation that can only be addressed by optimizing the system across different clinical environments.
The system is not yet used in clinical practice. For now, Estiri explained, it is intended to run quietly in the background of routine doctor visits and surface potential concerns along with an explanation of how it arrived at them.
“The idea is not that doctors are sitting there using AI tools,” he said, “but that the system provides insight — what we’re seeing and why — as part of the clinical record itself.”
Tian, J., Fard, P., Cagan, C. et al. An autonomous agent workflow for the clinical detection of cognitive problems using large language models. npj Digit. Med. 9, 51 (2026). https://doi.org/10.1038/s41746-025-02324-4
