Free Sample Episode

Simple techniques to bypass GenAI text detectors: implications for inclusive education

Today's article comes from Springer Open's International Journal of Educational Technology in Higher Education. The authors are Perkins et al., from British University Vietnam, in Hanoi.

DOI: 10.1186/s41239-024-00487-w

For as long as there have been classrooms, there have been students in those classrooms figuring out ways to cheat. And as long as students have been cheating, there have been teachers and administrators trying to catch them in the act. Nothing about this cat-and-mouse game is new.

What is new, however, is the power of the AI tools that students now have available to them. Foundation Models exposed through interfaces like ChatGPT can produce a decent term paper in seconds, and this has drastically changed the playing field. So what’s a teacher or professor to do? How can they be expected to identify which papers were actually written by their students, and which were copied-and-pasted from a GenAI tool?

There are a number of “detectors” on the market that claim to have this problem solved. Professors can, supposedly, simply upload a student’s essay and the detector will tell them the likelihood that it’s AI-generated. Sounds simple and incredibly useful, right? The question is: do these detectors actually work, or are they easily outsmarted? Or, worse yet, is the widespread use of these detectors causing actual harm, falsely flagging genuine essays (written by human beings) as AI-generated and leading professors to accuse honest students of cheating?

That’s the topic of today’s article. In this paper, the authors examine the problem from three angles, in order to answer three related questions.

Angle 1: The Foundation Models:
Students use them to generate text content. How effective are they at tricking the Detectors, and are some Foundation Models more effective at cheating than others?

Angle 2: The Detectors:
Faculty and administrators use them to determine if student-submissions are AI-generated. How reliable are these detectors, and how do they compare to each other?

Angle 3: The Adversarial Techniques (prompts):
Students use these techniques when interacting with the Foundation Models in order to deliberately trick the Detectors. Is every technique equally effective against each detector? Or are some better than others?

Here’s how they ran their experiment. First, they chose three Generative A.I. tools:

  • GPT-4, by OpenAI
  • Bard (now known as Gemini), by Google
  • Claude 2, by Anthropic

And for each of these three tools, they provided five prompts. Each prompt was designed to elicit one of five responses from the model (a sketch of this generation step follows the list):

  • A short-form essay
  • A blog post for a professional audience
  • A cover letter to apply for an internship
  • A middle-school level comparative analysis task
  • A magazine article for a broad audience
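
The generation step boils down to a small loop over models and task types. Here’s a minimal sketch in Python of how you might script it. The authors worked through each tool’s own interface, so the generate() helper, prompt wordings, and labels below are placeholders of my own, not their actual prompts or setup:

```python
from itertools import product

MODELS = ["GPT-4", "Bard", "Claude 2"]

# Illustrative prompt stubs; the paper's actual prompts differ.
TASK_PROMPTS = {
    "short_essay": "Write a short-form essay on ...",
    "blog_post": "Write a blog post for a professional audience on ...",
    "cover_letter": "Write a cover letter applying for an internship in ...",
    "comparative_analysis": "Write a middle-school level comparative analysis of ...",
    "magazine_article": "Write a magazine article for a broad audience about ...",
}

def generate(model: str, prompt: str) -> str:
    # Placeholder: in practice this would go through the tool's own interface or API.
    return f"[{model} response to: {prompt}]"

raw_responses = {
    (model, task): generate(model, prompt)
    for model, (task, prompt) in product(MODELS, TASK_PROMPTS.items())
}
assert len(raw_responses) == 15  # 3 models x 5 task types
```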

After running the prompts for all three tools, they had 15 raw/unmodified AI-generated responses. Now they passed each of the 15 responses through six different adversarial prompt-engineering techniques (sketched in code after the list):

  1. Add spelling errors to the content
  2. Increase "burstiness" in the content
  3. Increase complexity in the content
  4. Decrease complexity in the content
  5. Rewrite the content as a Non-Native English Speaker (NNES)
  6. Paraphrase the content
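
Continuing the sketch above, the adversarial step just feeds each raw response back through the same model with a rewrite instruction. The instruction wordings and the rewrite() helper below are illustrative assumptions on my part, not the authors’ exact prompts:

```python
# Continues the previous sketch: raw_responses maps (model, task) -> generated text.

# Illustrative paraphrases of the six techniques; the paper's exact prompts differ.
ADVERSARIAL_TECHNIQUES = {
    "spelling_errors": "Rewrite this text, introducing occasional spelling errors.",
    "burstiness": "Rewrite this text with more 'burstiness': mix short and long sentences.",
    "increase_complexity": "Rewrite this text using more complex vocabulary and syntax.",
    "decrease_complexity": "Rewrite this text using simpler vocabulary and syntax.",
    "nnes": "Rewrite this text as a non-native English speaker might write it.",
    "paraphrase": "Paraphrase this text while preserving its meaning.",
}

def rewrite(model: str, instruction: str, text: str) -> str:
    # Placeholder: in practice the original text goes back through the model
    # along with the adversarial instruction.
    return f"[{model} | {instruction} | {text[:30]}...]"

adversarial_responses = {
    (model, task, technique): rewrite(model, instruction, text)
    for (model, task), text in raw_responses.items()
    for technique, instruction in ADVERSARIAL_TECHNIQUES.items()
}
assert len(adversarial_responses) == 90  # 15 raw responses x 6 techniques
# Together with the 15 raw responses, that's 105 AI-generated texts.
```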

After running each of the 15 raw responses through the 6 adversarial techniques they had a total of 90 adversarial responses. When added to the original 15 raw responses, they had a total of 105.

Then the co-authors wrote 10 more essays by hand (without the use of AI) to use as a control group. They added these essays to the mix, bringing the total to 115.

Next, they ran each of these 115 essays through 7 different Detectors (a sketch of this scoring loop follows the list):

  1. Turnitin AI Detector
  2. GPTZero
  3. ZeroGPT
  4. Copyleaks
  5. Crossplag
  6. GPT-2 Output Detector
  7. GPTKit
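
The scoring step is a straightforward nested loop: every essay goes to every detector, and each detector returns a verdict. A minimal harness might look like the sketch below; the study used each detector’s own interface, so the placeholder_detector wrapper here is an assumption standing in for those real checks:

```python
from typing import Callable, Dict, Tuple

def placeholder_detector(text: str) -> bool:
    # Stand-in for a call to a real detector's web interface or API;
    # a real wrapper would return True when the text is flagged as AI-generated.
    return False

# The seven detectors from the study, each wrapped as text -> flagged?
DETECTORS: Dict[str, Callable[[str], bool]] = {
    name: placeholder_detector
    for name in [
        "Turnitin AI Detector", "GPTZero", "ZeroGPT", "Copyleaks",
        "Crossplag", "GPT-2 Output Detector", "GPTKit",
    ]
}

def run_all_tests(texts: Dict[str, str]) -> Dict[Tuple[str, str], bool]:
    """Score every text with every detector: 115 texts x 7 detectors = 805 tests."""
    return {
        (detector_name, text_id): detect(text)
        for detector_name, detect in DETECTORS.items()
        for text_id, text in texts.items()
    }
```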

So in total, they ran 805 tests (7 x 115). One erratum to note: in the paper, the authors mention that one of the 90 adversarial responses was faulty (leaving 89), so I believe they actually ran 798 tests (7 x 114), not 805. But at any rate, they ran a whole bunch. Here’s what they found:

For the control group (the human-written essays): The detectors correctly classified them as human-written only 67% of the time, wrongly flagging them as AI-generated the remaining 33% of the time.

For the baseline of 15 raw responses:
In total, the detectors flagged only 39.5% of the texts as AI-generated. The highest-performing detector caught ~65% of them, and the worst caught only ~26%. As a whole, the detectors caught Bard (Gemini) most easily, nearly 77% of the time, while struggling to catch GPT-4 and Claude at ~24% and ~18% respectively.

For the adversarial responses:
Five of the six adversarial techniques were successful at significantly decreasing the likelihood that the detectors would flag the essay. Adding-spelling-errors was the most effective technique, decreasing the detectors’ accuracy by 27%. This was followed by "burstiness" at 24%, paraphrasing at 21%, decreasing-complexity at 19%, and writing-as-an-NNES at 12%. The only technique that wasn’t particularly effective was increasing-complexity, which dropped detector accuracy by just 2%. It’s worth noting that the accuracy drops were not consistent across content from all the Foundation Models. The majority of the drop came from content generated by Bard (Gemini). In other words: while Bard was the most easily detected in the baseline tests, applying adversarial techniques dropped the detectors’ ability to flag Bard content by almost 39%, whereas content from the other Foundation Models saw drops of less than 10%.

Overall, the results don’t paint a particularly flattering picture of the Detectors. Let’s say you’re a teacher and you have a class of 30 students. In your class, 15 of the students write their essays by hand (let’s call these the “honest” students), and the other 15 use Foundation Models and adversarial techniques to generate their essays for them (let’s call these the “cheaters”). This study shows that if you use these detectors and take their output at face value, you may end up incorrectly accusing 5 of the 15 honest students of cheating, while missing between 6 and 14 of the 15 cheaters.
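
To see roughly where those classroom numbers come from, here’s a quick back-of-the-envelope check using the rates reported above. It’s an illustration of my own, not the authors’ calculation, and it treats the 27% drop from the spelling-error technique as percentage points off the 39.5% baseline average:

```python
# Back-of-the-envelope check of the classroom example, using rates reported above.
# Assumption: the 27% accuracy drop is read as percentage points off the 39.5% baseline.

honest_students = 15
cheaters = 15

false_positive_rate = 0.33              # human essays wrongly flagged as AI
baseline_detection = 0.395              # average share of raw AI texts that got flagged
adversarial_detection = 0.395 - 0.27    # after the most effective technique (spelling errors)

falsely_accused = round(false_positive_rate * honest_students)            # ~5 honest students
missed_at_baseline = round((1 - baseline_detection) * cheaters)           # ~9 cheaters slip through
missed_after_adversarial = round((1 - adversarial_detection) * cheaters)  # ~13 slip through

print(falsely_accused, missed_at_baseline, missed_after_adversarial)  # 5 9 13
```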

More broadly, what can we take away from this? Well, this research has implications far beyond academia. Teachers aren’t the only people who need to know if a given piece of content is A.I. generated or not. Search engines need to know this information, social networks need to know, news agencies need to know, book publishers need to know. And in a very real sense, we all individually will need to know. Even the companies that train the Foundation Models themselves need to know, so that they don’t accidentally train their next generation of models on the previous generations’ output.

So for all of our sakes it sure would be nice if these detectors worked. It would be awesome if a simple three-sentence prompt couldn’t outwit the most sophisticated detectors in the world. But we need to face reality. According to the results in this paper (which you’re welcome to replicate yourself), relying on the detectors to flag content for you may not be a winning strategy, or even a sound one.

Let’s take a step back and think about the risks of these tools: when a journalist or an author is falsely accused of using generative AI, it can be career-ending. When a student is falsely accused, it can be disillusioning, heartbreaking, and academically devastating. So if we’re going to use these detectors at all, we need to be prepared for those consequences. Hopefully detectors, as a technology, will grow more and more sophisticated and accurate over time. Until they do, please exercise extreme caution.

If you’d like to replicate any of these results yourself, please download the paper. It provides a step-by-step breakdown of exactly how the authors went about their analysis, and their data availability statement links to everything you need to crunch the numbers on your own.
