Poster in Workshop: “Could it have been different?” Counterfactuals in Minds and Machines
Neuro-Symbolic Models of Human Moral Judgment: LLMs as Automatic Feature Extractors
Joseph Kwon · Sydney Levine · Josh Tenenbaum
As AI systems gain prominence in society, addressing concerns about their safety becomes crucial. There have been repeated calls to align powerful AI systems with human morality. However, attempts to do so have relied on black-box systems that cannot be interpreted or explained. In response, we introduce a methodology that combines the natural language processing abilities of large language models (LLMs) with the interpretability of symbolic models to form competitive neuro-symbolic models for predicting human moral judgment. Our method uses an LLM to extract morally relevant features from a stimulus and then passes those features through a cognitive model that predicts human moral judgment. This approach achieves state-of-the-art performance on the MoralExceptQA benchmark, improving on the previous F1 score by 20 points and accuracy by 18 points, while also enhancing interpretability by exposing all key features in the model's computation. We also run an experiment to verify that the features the LLM identifies as important are actually important to the LLM's own computation: we create counterfactual scenarios in which the feature values are varied and ask the LLM for zero-shot moral acceptability judgments. We propose future directions for harnessing LLMs to develop more capable and interpretable neuro-symbolic models, emphasizing the critical role of interpretability in facilitating the safe integration of AI systems into society.
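The two-stage pipeline described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the feature names (`harm`, `benefit`), the weights, the logistic form of the symbolic model, and the keyword-based `stub_extractor`, which stands in for an LLM call so the example runs offline. None of these are taken from the authors' implementation. The `counterfactual_effect` helper only mirrors the shape of the counterfactual check (compare judgments on an original and a feature-varied scenario); in the work itself the judge is the LLM queried zero-shot, whereas here the sketch's own `predict` is used as a placeholder judge.

```python
import math
from typing import Callable, Dict

Features = Dict[str, float]


def stub_extractor(scenario: str) -> Features:
    """Hypothetical stand-in for an LLM feature extractor.

    A real system would prompt an LLM to rate each feature; here a
    keyword check keeps the example self-contained.
    """
    text = scenario.lower()
    return {
        "harm": 1.0 if "harm" in text else 0.0,
        "benefit": 1.0 if "help" in text else 0.0,
    }


# Illustrative weights only (assumed, not fitted to any data).
WEIGHTS: Features = {"harm": -2.0, "benefit": 1.5}
BIAS = 0.0


def acceptability(features: Features) -> float:
    """Interpretable symbolic stage: logistic model over named features."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def predict(
    scenario: str,
    extractor: Callable[[str], Features] = stub_extractor,
) -> float:
    """Full pipeline: extract features, then score them symbolically."""
    return acceptability(extractor(scenario))


def counterfactual_effect(
    judge: Callable[[str], float], original: str, varied: str
) -> float:
    """Change in the judge's acceptability score when a scenario is varied."""
    return judge(varied) - judge(original)


if __name__ == "__main__":
    base = "Cutting the line would help a sick child."
    varied = "Cutting the line would help a sick child but harm others."
    print("features:", stub_extractor(varied))
    print("base score:", round(predict(base), 3))
    print("effect of adding harm:", round(counterfactual_effect(predict, base, varied), 3))
```

Because every feature and weight is named, a prediction can be traced back to the features that produced it, which is the interpretability property the abstract emphasizes. Swapping `stub_extractor` for a function that queries an actual LLM would leave the symbolic stage unchanged.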