
Neuro-Symbolic Models of Human Moral Judgment: LLMs as Automatic Feature Extractors
Joseph Kwon · Sydney Levine · Josh Tenenbaum

As AI systems gain prominence in society, concerns about their safety become crucial to address. There have been repeated calls to align powerful AI systems with human morality. However, attempts to do this have used black-box systems that cannot be interpreted or explained. In response, we introduce a methodology that leverages the natural language processing abilities of large language models (LLMs) and the interpretability of symbolic models to form competitive neuro-symbolic models for predicting human moral judgment. Our method uses LLMs to extract morally-relevant features from a stimulus and then passes those features through a cognitive model that predicts human moral judgment. This approach achieves state-of-the-art performance on the MoralExceptQA benchmark, improving on the previous F1 score by 20 points and accuracy by 18 points, while also enhancing model interpretability by exposing all key features in the model's computation. We also run an experiment verifying that the features identified as important by the LLM genuinely drive its judgments: we create counterfactual scenarios in which the feature values are varied and ask the LLM for zero-shot moral acceptability judgments. We propose future directions for harnessing LLMs to develop more capable and interpretable neuro-symbolic models, emphasizing the critical role of interpretability in facilitating the safe integration of AI systems into society.
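The two-stage pipeline the abstract describes — an LLM scoring morally-relevant features, followed by an interpretable symbolic model mapping those features to a judgment — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature names, weights, and the stand-in extractor are all hypothetical, and a real system would replace `extract_features` with an actual LLM prompt.

```python
import math

def extract_features(scenario: str) -> dict:
    """Stand-in for an LLM call that scores morally-relevant features.

    In the actual method this step would prompt an LLM; here we return
    fixed illustrative scores in [0, 1] so the sketch is runnable.
    """
    return {"harm_caused": 0.2, "rule_violation": 0.9, "benefit_to_others": 0.7}

def moral_acceptability(features: dict, weights: dict, bias: float = 0.0) -> float:
    """Interpretable symbolic stage: a linear model over named features.

    Every feature's contribution (weight * value) is visible, which is
    what makes the overall prediction explainable.
    """
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))  # probability the action is judged acceptable

# Illustrative weights; a real cognitive model would fit these to human data.
weights = {"harm_caused": -2.0, "rule_violation": -1.5, "benefit_to_others": 2.5}
feats = extract_features("Cutting in line to help someone in urgent need.")
p = moral_acceptability(feats, weights)
print(round(p, 3))
```

The counterfactual verification experiment follows naturally from this structure: vary one feature value at a time (e.g. set `rule_violation` to 0) and check whether the LLM's zero-shot acceptability judgment shifts in the direction the symbolic model predicts.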

Author Information

Joseph Kwon (MIT)
Sydney Levine (Allen Institute for AI)
Josh Tenenbaum (MIT)

Joshua Brett Tenenbaum is Professor of Cognitive Science and Computation at the Massachusetts Institute of Technology. He is known for contributions to mathematical psychology and Bayesian cognitive science. He previously taught at Stanford University, where he was the Wasow Visiting Fellow from October 2010 to January 2011. Tenenbaum received his undergraduate degree in physics from Yale University in 1993, and his Ph.D. from MIT in 1999. His work primarily focuses on analyzing probabilistic inference as the engine of human cognition and as a means of advancing machine learning.
