Poster
in
Workshop: Neural Conversational AI Workshop - What’s left to TEACH (Trustworthy, Enhanced, Adaptable, Capable and Human-centric) chatbots?

Can Large Language Models Reason Algorithmically in an Interactive Environment?

Siwei Yang · Yi Xu · Shitong Xu · Zhongkai Zhao · Bingchen Zhao


Abstract:

We propose a novel benchmark to evaluate whether a large language model can reason by following a given algorithmic procedure, such as depth-first search (DFS). Our evaluation protocol is interactive: in DFS, for example, the edges connected to a node become available to the tested model only after the model has reached that node. To carry out the DFS procedure, the model must therefore maintain a memory of which nodes have been visited and reason about which node to visit next. We construct such interactive environments for three algorithms: binary search, depth-first search, and breadth-first search. We evaluate the algorithmic reasoning ability of six models on our proposed benchmark and find that a significant gap remains between the open-source Vicuna-13B and the GPT-3.5 model. We hope our benchmark and experimental findings inspire future work on algorithmic reasoning in large language models.
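The interactive protocol described above can be sketched in a minimal form: an environment that hides the graph and reveals a node's edges only when that node is visited, paired with an agent that must keep its own visited-set and stack. All class and function names below are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of the interactive evaluation protocol: the
# environment reveals a node's edges only after the agent visits that
# node, so the agent must maintain its own memory of visited nodes.

class InteractiveGraphEnv:
    """Holds the graph privately; exposes edges one visit at a time."""

    def __init__(self, adjacency):
        self._adj = adjacency  # hidden from the agent

    def visit(self, node):
        # Reveal the edges of `node` only upon visiting it.
        return list(self._adj[node])


def dfs_agent(env, start):
    """A reference DFS agent: tracks visited nodes and a stack itself,
    mirroring the reasoning a tested language model must perform."""
    visited, stack, order = set(), [start], []
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        # Push unvisited neighbors in reverse so earlier-listed
        # neighbors are explored first.
        for nbr in reversed(env.visit(node)):
            if nbr not in visited:
                stack.append(nbr)
    return order


env = InteractiveGraphEnv({0: [1, 2], 1: [3], 2: [], 3: []})
print(dfs_agent(env, 0))  # → [0, 1, 3, 2]
```

A tested model would take the place of `dfs_agent`, receiving each `visit` result as an observation and emitting the next node to visit.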
