Measuring and Evaluating the Performance of Generative AI Models for Scam Detection
Online scams cause substantial financial and personal harm, yet existing machine learning approaches to scam detection often struggle to generalize beyond their training data. In this work, we investigate whether Large Language Models (LLMs), with their strong capabilities in understanding intent, context, and reasoning, can effectively detect scams across diverse scenarios without task-specific fine-tuning. We curate and release a comprehensive benchmark dataset of real-world scams spanning multiple formats and topics. We evaluate nine LLMs of varying sizes and architectures, examining their performance under different prompting strategies and comparing them to a fine-tuned BERT-based classifier. Our results show that while larger LLMs generally outperform smaller ones, effective prompting substantially boosts the performance of smaller models. Moreover, LLMs generalize better to unseen scams than fine-tuned models do, suggesting that pre-trained knowledge contributes meaningfully to scam detection. We release our dataset and evaluation framework to facilitate future research in robust scam detection using language models.
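
To make the prompt-based evaluation setup concrete, the following is a minimal Python sketch of zero-shot scam classification. It assumes a generic llm_generate helper as a hypothetical stand-in for whichever model API is queried; the prompt wording and label parsing shown here are illustrative and may differ from the released framework.

    # Minimal sketch of prompt-based scam detection. llm_generate is a
    # hypothetical placeholder for a call to the LLM under test; the prompt
    # and label parsing are illustrative, not the paper's exact setup.

    ZERO_SHOT_PROMPT = (
        "You are a scam-detection assistant. Read the message below and "
        "answer with exactly one word: SCAM or LEGITIMATE.\n\n"
        "Message:\n{message}\n\nAnswer:"
    )


    def llm_generate(prompt: str) -> str:
        """Hypothetical stand-in: replace with a call to the model being evaluated."""
        raise NotImplementedError


    def classify_message(message: str) -> bool:
        """Return True if the model labels the message as a scam."""
        reply = llm_generate(ZERO_SHOT_PROMPT.format(message=message))
        return reply.strip().upper().startswith("SCAM")


    def accuracy(examples: list[tuple[str, bool]]) -> float:
        """Fraction of (message, is_scam) pairs classified correctly."""
        correct = sum(classify_message(msg) == label for msg, label in examples)
        return correct / len(examples)

Other prompting strategies fit the same loop: a few-shot or chain-of-thought variant would only replace ZERO_SHOT_PROMPT with a template containing labeled examples or a reasoning instruction, while the fine-tuned BERT baseline would replace classify_message with a call to a trained classifier.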