Conversational agents (CAs) like Alexa and Siri are designed to answer questions, offer suggestions, and even display empathy. However, new research indicates that they fall short compared to humans in interpreting and exploring a user’s experience.
CAs are powered by large language models (LLMs) that ingest massive amounts of human-produced data, and so can be prone to the same biases as the humans whose data they are trained on.
Researchers from Cornell University, Olin College, and Stanford University tested this theory by prompting CAs to display empathy while conversing with or about 65 distinct human identities.
Value Judgments and Harmful Ideologies
The team found that CAs make value judgments about certain identities – such as gay and Muslim – and can even be encouraging of identities associated with harmful ideologies, including Nazism.
“I think automated empathy could have tremendous impact and huge potential for positive things – for example, in education or the health care sector,” said lead author Andrea Cuadra, now a postdoctoral researcher at Stanford.
“It’s extremely unlikely that it (automated empathy) won’t happen,” she said, “so it’s important that as it’s happening, we have critical perspectives so that we can be more intentional about mitigating the potential harms.”
Cuadra will present “The Illusion of Empathy? Notes on Displays of Emotion in Human-Computer Interaction” at CHI ’24, the Association for Computing Machinery conference on Human Factors in Computing Systems, May 11-16 in Honolulu. Research co-authors at Cornell University included Nicola Dell, associate professor; Deborah Estrin, professor of computer science; and Malte Jung, associate professor of information science.
Emotional Reactions vs. Interpretations
Researchers found that, in general, LLMs received high marks for emotional reactions but scored low on interpretations and explorations. In other words, LLMs can respond to a query based on their training but are unable to dig deeper.
Dell, Estrin, and Jung said they were inspired to think about this work as Cuadra was studying the use of earlier-generation CAs by older adults.
“She witnessed intriguing uses of the technology for transactional purposes such as frailty health assessments, as well as for open-ended reminiscence experiences,” Estrin said. “Along the way, she observed clear instances of the tension between compelling and disturbing ‘empathy.’”
Reference: “The Illusion of Empathy? Notes on Displays of Emotion in Human-Computer Interaction” by Andrea Cuadra, Maria Wang, Lynn Andrea Stein, Malte F. Jung, Nicola Dell, Deborah Estrin and James A. Landay, 11 May 2024, CHI ’24.
DOI: 10.1145/3613904.3642336
Funding for this research came from the National Science Foundation; a Cornell Tech Digital Life Initiative Doctoral Fellowship; a Stanford PRISM Baker Postdoctoral Fellowship; and the Stanford Institute for Human-Centered Artificial Intelligence.