Cross-Language Evaluation of Prompt Inversion: Similarity Metrics, Decoding Strategies, and Prefix Sensitivity in Japanese and English
Abstract
Prompt inversion, also called prompt reconstruction, has recently emerged as a method for analyzing machine-generated text: a prompt is inferred from an observed output, and text is then regenerated from that inferred prompt. Despite its utility, the empirical behavior of this process remains poorly understood across languages, similarity metrics, decoding strategies, and model families. We present a systematic evaluation of prompt inversion on English and Japanese LLM-generated text drawn from Alpaca-style instruction-response datasets. Our analysis spans five similarity metrics, multiple decoding configurations, prefix augmentation settings, and two primary generator architectures (T5-style and GPT-style). We find that English scores consistently higher than Japanese on BLEU-4, ROUGE-Lsum, METEOR, and BERTScore F1, whereas Japanese exhibits higher Sentence-BERT cosine similarity. Moreover, the optimal decoding strategy varies by language: beam search performs best for English T5 reconstruction, while hybrid decoding performs best for Japanese. We also observe that prefix augmentation substantially benefits English T5 reconstruction but degrades Japanese performance. Finally, we show that reconstruction quality differs significantly between T5 and GPT architectures depending on the metric used, suggesting that no single score can fully characterize performance. These results indicate that prompt inversion is a language- and metric-sensitive framework rather than a universally stable method.
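To make the infer-then-regenerate loop concrete, the following is a minimal sketch and not the implementation used in this work. The names `invert`, `generate`, `lcs_f1`, and `inversion_score` are hypothetical, and the metric is a simplified ROUGE-L-style F1 over whitespace tokens rather than any of the five metrics evaluated here. Whitespace splitting is an assumption that suits English only; Japanese text would need a morphological tokenizer before a lexical metric of this kind is meaningful.

```python
from typing import Callable, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_f1(reference: str, candidate: str) -> float:
    """ROUGE-L-style F1 over whitespace tokens (a stand-in, not ROUGE-Lsum)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def inversion_score(
    observed: str,
    invert: Callable[[str], str],    # hypothetical: observed text -> inferred prompt
    generate: Callable[[str], str],  # hypothetical: prompt -> regenerated text
    metric: Callable[[str, str], float] = lcs_f1,
) -> float:
    """Infer a prompt from the observed text, regenerate, and score the result."""
    inferred_prompt = invert(observed)
    regenerated = generate(inferred_prompt)
    return metric(observed, regenerated)
```

Because `metric` is a parameter, the same loop can be rerun with a lexical score and with an embedding-based score, which is the kind of comparison in which the two families of metrics can rank languages differently.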