Poster in Workshop: Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models
The AI Disguise: Applying Advanced Tokenization and POS Techniques for Text Authenticity
Andrew Zhang
Abstract:
We propose an enhancement to the framework originally introduced by Liang (2024). By incorporating different phrase lengths and parts of speech, our improved model more accurately differentiates between AI-generated and human-generated text at the corpus level. Testing on Liang's original validation datasets identified three models that improve on the baseline error rate of 1.97%: a 1-word model focusing on adjectives (1.63%), a 2-word model focusing on adverbs (1.42%), and a 1-word model focusing on verbs (0.47%). These results highlight LLMs' biases toward specific tokens and provide a more reliable method for distinguishing AI-generated content from human-generated content.
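To make the idea of a corpus-level estimate over a restricted vocabulary concrete, the following is a minimal sketch, not the authors' implementation. It assumes the corpus is modeled as a mixture of a human token distribution and an AI token distribution, and that the AI fraction (called `alpha` here) is found by maximum likelihood over a simple grid. The function names, the add-one smoothing, the grid search, and the hand-picked verb list standing in for a part-of-speech filter are all illustrative assumptions; a real pipeline would obtain the vocabulary from a POS tagger and could use 2-word phrases as tokens instead of single words.

```python
import math
from collections import Counter


def token_distribution(tokens, vocab, smoothing=1.0):
    """Smoothed relative frequency of each vocabulary item in `tokens`.

    Tokens outside `vocab` are ignored, which is how this sketch mimics
    restricting the model to one part of speech.
    """
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values()) + smoothing * len(vocab)
    return {w: (counts[w] + smoothing) / total for w in vocab}


def estimate_ai_fraction(tokens, human_dist, ai_dist, steps=1000):
    """Grid-search MLE of alpha in the mixture alpha*AI + (1-alpha)*human."""
    counts = Counter(t for t in tokens if t in human_dist)
    best_alpha, best_ll = 0.0, -math.inf
    for i in range(steps + 1):
        alpha = i / steps
        # Log-likelihood of the observed counts under the mixture.
        ll = sum(
            c * math.log(alpha * ai_dist[w] + (1 - alpha) * human_dist[w])
            for w, c in counts.items()
        )
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha


# Hypothetical verb vocabulary and toy reference corpora (illustration only).
VERBS = {"delve", "use", "show", "underscore"}
human_ref = ["use", "show", "use", "show", "use", "delve"]
ai_ref = ["delve", "underscore", "delve", "show", "underscore", "use"]

human_dist = token_distribution(human_ref, VERBS)
ai_dist = token_distribution(ai_ref, VERBS)

# Target corpus whose AI fraction we want to estimate.
target = ["use", "delve", "show", "underscore", "use", "show"]
alpha_hat = estimate_ai_fraction(target, human_dist, ai_dist)
print(f"estimated AI fraction: {alpha_hat:.3f}")
```

The estimate is only as good as the two reference distributions, so in this sketch the choice of vocabulary (which part of speech, which phrase length) directly determines which token biases the estimator can exploit.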