The AI Disguise: Applying Advanced Tokenization and POS Techniques for Text Authenticity
Andrew Zhang
Abstract
We propose an enhancement to the framework originally introduced by Liang (2024). By incorporating varying phrase lengths and parts of speech, our improved model more accurately differentiates between AI-generated and human-generated text at the corpus level. Testing on Liang's original validation datasets revealed three models that outperform the baseline error rate of 1.97\%: a 1-word model focusing on adjectives (1.63\%), a 2-word model focusing on adverbs (1.42\%), and a 1-word model focusing on verbs (0.47\%). These results highlight LLMs' biases towards specific tokens, and the enhanced models provide a more reliable method for distinguishing AI-generated content from human-generated content.
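To make the idea of phrase length combined with part of speech concrete, the following is a minimal sketch of how n-word phrases tied to one part of speech might be extracted and counted per document. It is not the paper's implementation. The toy lexicon `TOY_POS`, the helper names `pos_phrases` and `phrase_document_frequency`, and the rule that a phrase "focuses on" a part of speech when at least one of its words carries that tag are all assumptions made for illustration. A real pipeline would use a trained POS tagger instead of a lookup table.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for a real POS tagger (assumption).
TOY_POS = {
    "quickly": "ADV",
    "very": "ADV",
    "intricate": "ADJ",
    "novel": "ADJ",
    "delve": "VERB",
    "explore": "VERB",
    "we": "PRON",
    "the": "DET",
    "results": "NOUN",
}


def pos_phrases(tokens, n, target_pos, lexicon=TOY_POS):
    """Return n-word phrases containing at least one token tagged target_pos.

    The "at least one token" rule is an assumption, not taken from the paper.
    """
    phrases = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if any(lexicon.get(tok) == target_pos for tok in window):
            phrases.append(" ".join(window))
    return phrases


def phrase_document_frequency(docs, n, target_pos):
    """Count, for each qualifying phrase, how many documents contain it."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        # Use a set so each phrase is counted at most once per document.
        counts.update(set(pos_phrases(tokens, n, target_pos)))
    return counts
```

Under these assumptions, a "1-word model focusing on verbs" would correspond to `phrase_document_frequency(docs, 1, "VERB")`, and a "2-word model focusing on adverbs" to `phrase_document_frequency(docs, 2, "ADV")`. Comparing such frequencies between a human-written corpus and an AI-generated corpus is one way the token biases mentioned above could be surfaced.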