VFWA: Voice-First Written Assessment
This post was created in true VFWA style. I practice what I preach.
When we set a written assessment, what are we actually trying to find out?
I think we want to know a number of things. What students understand. How they reason. How they argue. How they evaluate information. In short, we’re trying to get at their thinking.
But we can’t observe thinking directly. We have to infer it from language. And that’s where the problem starts, because generative AI is, fundamentally, a language technology. These are large language models. They’re very good at language. And the kind of language they’re best at? Academic English: the prestige register we’ve spent decades telling students is the only acceptable way to present their ideas.
So we’ve got a machine that can produce academic prose instantly, and we’ve been treating academic prose as evidence of thinking. I hope you can see the issue.
The Problem Was Already There
This problem didn’t start with generative AI. ChatGPT made it worse, but the system was already broken.
In my previous role teaching English for Academic Purposes, I had students, many from Nigeria, India, and other parts of the world, whose first language was English. Just not the “right” English for academic work. These were students with insightful ideas whose thinking was obscured simply because their language didn’t match what academic work is socially expected to look like.
I found that students were so focused on producing a grammatically and lexically perfect essay that they neglected the intellectual skills I actually wanted them to develop: argumentation, synthesis, using evidence, evaluating evidence. We were marking the product and ignoring the process. And fair enough: the product was what I was grading. But the incentive was wrong.
Then along came the academic register machine. ChatGPT. A tool that could produce that prestige language instantly. The thing we’d been treating as evidence of thinking was now cheap and easy to generate.
Even if the link between academic prose and thinking was shaky before, it’s completely broken now.
So What Do We Do?
The instinct has been to try to detect AI use after the fact. But that doesn’t lead anywhere good.
AI detection software isn’t accurate enough to justify accusing students. Human “vibe checks” on writing style are unreliable and discriminate against the very groups who were already disadvantaged. Declarations from students aren’t evidence; how would we verify them? And traffic-light or levels-of-use systems all rely on detection or declarations for enforcement, so they inherit the same problems.
It essentially becomes binary. If you want to ban AI, the assessment has to be fully secure — you have to be watching what the student does. Or, if it’s done at home and completely AI-unrestricted, then there’s no cheating to find. It’s the messy bit in the middle that causes all the trouble: saying AI can be used for some things but not others, then having no reliable way to tell the difference.
We can’t detect it. We can’t enforce it. These are baseline facts. And we can’t begin to solve the problem if we don’t accept them.
So instead of policing, we must redesign.
Voice-First Written Assessment
VFWA is a two-stage assessment model that shifts the focus from trying to detect whether a student used AI to capturing direct evidence of their thinking.
The core idea is simple: anything you don’t want students using AI for has to happen where you can see it. Everything else, you accept AI might be involved — and you design for that.
Stage 1: Capture the Thinking
Stage 1 happens under observed conditions. Students respond to an unseen question, something they can’t prepare with AI in advance. And here’s where it gets interesting: they get to choose how they express themselves.
If they think best through writing, they write. If they think best through speech, they speak. If bullet points or an argument map captures their thinking better than polished prose, that’s fine too. The point is to capture reasoning, not register.
This stage is ungraded. It’s a sufficiency check — did the student demonstrate enough engagement with the task to provide usable evidence of their thinking? That’s it. No marks riding on it. This matters because it lowers the pressure, gives students space to take intellectual risks, and reduces the incentive to perform rather than think.
It’s also where the “voice-first” comes in, and I mean that in two ways. First, their written authorial voice — their real way of expressing ideas, not a performance of academic language. That might be informal English, a non-standard dialect, or even a different language entirely if that’s how a student thinks most effectively. Second, literally their spoken voice, if that’s how they function best cognitively.
The principle is straightforward: if we allow more linguistic diversity at the point where we capture evidence of reasoning, we also capture more diversity of thought. You speak differently to your grandparents than you do when you’re writing an essay. The boundaries between these ways of using language are socially enforced, not inherent. And in assessment, those boundaries can get in the way of capturing what students actually know.
I have ADHD. My thoughts go at a million miles an hour. They are rarely linear. Writing slows my thinking down, especially when I’m also trying to make it sound formal because I know people will judge my capability if I don’t. But if I can speak my ideas, I can keep up with the speed of my own brain. Writing isn’t thinking for me — it’s just one way of communicating my thoughts. For others, writing absolutely is thinking. That’s fine. The student gets to choose.
You Don’t Have to Go All the Way
I want to be clear about something because I know the linguistic diversity element might sound daunting: you don’t need to implement the full version of this for it to work. The more linguistic flexibility you can offer, the better the validity — but it’s a spectrum, not all or nothing.
At the minimal end, Stage 1 could just be handwritten or typed, in English, but you tell students they don’t need to write formally. They can use bullet points, incomplete sentences, informal language. You’re already increasing validity because you’re reducing the noise that comes from students trying to perform academic register under time pressure rather than just showing you what they’re thinking.
If you can go further — allowing spoken responses, accepting different varieties of English, even multilingual responses where your infrastructure supports translation — then you capture even more of the signal. But start where you can. A version of VFWA that just says “show me your thinking in whatever way gets it out of your head most effectively, and don’t worry about making it sound academic” is already a significant improvement on what most of us are doing now.
Stage 2: Refine and Develop
We keep a copy of their Stage 1 work, and students take a copy home. They develop it, refine it, add to it. They can use AI tools — or not. It’s optional. But they’re working from their own ideas, captured in Stage 1, and building on them.
The thing being assessed here is evaluative judgement: what they chose to keep, what they chose to change, how they used sources and tools, and whether the final submission reflects sound reasoning. If they use AI and it produces mediocre output that they submit unchanged, the problem isn’t really the AI — it’s that they didn’t have the skill to recognise it wasn’t good enough. That’s still assessable.
And because contemporary writing conditions involve digital tools (we had spellcheck and Grammarly before ChatGPT — this is a continuum, not a cliff edge), Stage 2 reflects the actual world students are preparing for. We should be teaching them how to use these tools well, not pretending the tools don’t exist.
The Link Between Stages
The key is that you can put Stage 1 and Stage 2 side by side. You can see: what were the student’s ideas to begin with? How have they developed? Is there a plausible connection?
If the student has completely changed direction and there’s no visible link between what they did in observed conditions and what they’ve handed in — that triggers a safeguard. Not a punishment. A short viva, focused on whether they can explain and justify the changes they made. Did they change their mind because they found a source that convinced them? That’s legitimate intellectual development. Can they not explain it at all? That’s a different situation.
Crucially, this isn’t a language performance test. It’s about reasoning, evidence, and judgement.
And the viva is conditional. You don’t have to do one for every student. Only when the linkage is weak. That keeps it scalable.
Why This Works
When I first prototyped a version of this in my teaching, three things happened.
First, validity improved. I was actually getting evidence of the thing I wanted to assess — whether students could argue, synthesise, and use evidence — rather than just evidence of whether they could write in academic English.
Second, misconduct dropped. The previous term, half the cohort had been flagged for AI use. With this model, there was no way to use AI unfairly, so I didn’t have to go looking for it.
Third, it reduced my workload. No misconduct investigations. No agonising over whether an essay “felt” AI-generated. And the people who took over that module two years later are still using this assessment — still no issues.
The Bigger Picture
I think what I’m proposing here is not just about AI. It’s about something we should have addressed a long time ago.
The research on linguistic diversity in education goes back to the early 1990s at least. People have been arguing for decades that we need to stop treating prestige academic English as the only legitimate form of knowledge expression. That the way someone speaks or writes doesn’t determine the quality of their thinking. That if you’re born into a family that already speaks something close to academic English, you have a head start — and if you’re not, you face a barrier that has nothing to do with your intelligence.
We’ve known this (or, well, sociolinguists have). We just haven’t done much about it in assessment.
Generative AI has forced the issue. The machine was trained on trillions of words of standard written language, and now it can produce that language effortlessly. If we were using prestige prose as a proxy for thinking, that proxy is gone. We need direct evidence instead.
And if we’re going to redesign assessment to get direct evidence of thinking, we might as well do it in a way that’s more accessible and more equitable too. That’s not lowering standards. It’s aligning our evidence with what we’re actually trying to measure.
What You Can Do With This
VFWA is designed to be flexible across disciplines. I’ve worked through examples in law, STEM, and English literature, and the core model adapts to each.
The key elements stay the same: an observed stage anchored by an unseen question, choice of mode of expression, an ungraded sufficiency check, then a refinement stage under contemporary writing conditions, with an auditable link between the two.
If you’re interested in trying this, a good starting point is to ask yourself: what do I actually want evidence that my students can do? And then: can I get that evidence from a final essay alone, knowing what I know about how it might have been produced?
If the answer to the second question is no — and I think for most of us it is — then something needs to change structurally. Not more rules. Not better detection.
Better design.
Find out more at the website.
The full academic paper is forthcoming. I’ll share more detail on implementation in future posts. In the meantime, if you’re working on this problem, I’d love to hear from you.
*In true VFWA style: Stage 1 was this Compassionate Assessment webinar, where I talked through these ideas. The transcript was then used to create this post, which is Stage 2.*

And the million-dollar question is: did you use AI to write this post, or was it designed out of the process? 🤔 😆