AI Conversational Placement Test

A minimalist interface for the 'GenkiJACS' language learning app. In the center, a blob-like mascot appears above a box labeled 'Click this to listen to my question again,' accompanied by a play icon. At the bottom of the screen is a large 3D emoji of a yellow hand holding a silver microphone, indicating voice input is expected. The top header includes the GenkiJACS logo, a progress bar, and icons for navigation and user settings.

I built an AI placement test that automates the complex oral assessment performed by a senior teacher. It takes a 2-week scheduling bottleneck and compresses it into a 30-minute, on-demand assessment.

1. Context: The 2-Week Bottleneck

Language schools cannot enrol a student until they verify their level. Currently, this requires a live 20-minute interview with a senior teacher.

This manual step creates a massive bottleneck. Between timezones and teacher availability, students often wait 1–2 weeks just to take a placement test. Teachers, meanwhile, burn hours on repetitive beginner assessments instead of teaching.

I needed an asynchronous solution that was rigorous enough to replace the human expert.

2. Solution: A Virtual Examiner

I recognised early on that I lacked the domain expertise to design a language test. Building the logic on my own assumptions would have resulted in a flawed product.

To fix this, I interviewed a senior teacher and analysed hours of audio recordings from real placement tests. I discovered that experts don't just "chat." They follow a structured method to assess proficiency.

How It Works

A school configures a test by linking questions to target grammar patterns (like "Question 1 checks for Past Tense"). The AI's goal is to elicit that specific pattern from the student.

I formalised this behaviour into a state machine:

  1. Ask: Pose the configured question.

  2. Hint: If the target grammar is missing, ask a specific follow-up to elicit it (like "What did you do yesterday?").

  3. Explicit: If they still miss it, explicitly instruct them to use the pattern.

  4. Give Up: If they fail, move to the next question.

A flow diagram representing 'Question 1' in a language learning interface. It begins with the system asking, 'What did you do yesterday?' alongside the label 'State: Ask'. The student responds incorrectly with, 'I eat an apple,' marked as 'Fail'. The system then provides a follow-up prompt: 'You eat it every day? Or did you eat it yesterday?' labeled as 'State: Hint'. The student corrects their response to, 'I ate an apple,' which is finally marked as 'Pass'.

This ensures every student gets a standardised chance to prove their ability, unlike a free-form LLM chat, which might drift off-topic.
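To make the flow concrete, here is a minimal TypeScript sketch of the per-question loop. The names and the example question object are illustrative (drawn from the flow diagram above), not the production code:

```typescript
// Minimal sketch of the per-question examiner loop (illustrative, not production code).
type Stage = "ask" | "hint" | "explicit";

interface Question {
  prompt: string;        // configured by the school
  targetPattern: string; // e.g. "Past Tense"
  hint: string;          // follow-up used to elicit the pattern
}

const question1: Question = {
  prompt: "What did you do yesterday?",
  targetPattern: "Past Tense",
  hint: "You eat it every day? Or did you eat it yesterday?",
};

// Decide what the examiner says next, given whether the target grammar
// appeared in the student's last answer. Returning null means "move on":
// either the student passed, or we give up after the explicit stage.
function nextPrompt(
  stage: Stage,
  patternPresent: boolean,
  q: Question
): { stage: Stage; line: string } | null {
  if (patternPresent) return null; // Pass
  if (stage === "ask") return { stage: "hint", line: q.hint }; // Hint
  if (stage === "hint") {
    return { stage: "explicit", line: `This time, please answer using the ${q.targetPattern.toLowerCase()}.` }; // Explicit
  }
  return null; // Give up: next question
}
```

Keeping this escalation explicit, rather than letting the model improvise, is what makes the test identical for every student who misses the same pattern.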

3. Challenge I: Measuring "Correctness"

The hardest technical challenge was assessing the student's language level. Getting an AI to reliably detect specific grammar patterns, like the "Conditional Ba-Form", in a noisy, 3-minute transcript of learner speech is difficult.

I solved this by treating the assessment as a precise signal-extraction problem: the system must hear every mistake and validate every grammar pattern individually.

  1. Live Interview: During the test, I use Gemini 2.5 Pro to drive the dialogue. I leverage its native multimodal capabilities to process the student's audio directly. This captures the student's speech verbatim, preserving misconjugations and particle errors that standard STT models typically auto-correct.

  2. Asynchronous Grading: Once the interview concludes, the pipeline performs an exhaustive search on every student response:

    • Parallel Extraction: The system scans each answer against ~270 grammar patterns.
    • Verification: Any pattern flagged as "present" triggers a secondary LLM to strictly confirm it meets the usage rules.
A pipeline diagram illustrating the search-then-verify flow for the pattern "Past Tense": given the answer "I went to the store. I buy milk. It was good.", the search stage finds the candidates "went", "buy", and "was", each of which is then passed to the verification stage individually.

The Eval Suite

To validate this architecture, I created a dataset of ~600 manual annotations. I benchmark the system on Strict Recall. A sample only passes if the AI identifies 100% of the target grammar instances.
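Strict Recall is deliberately all-or-nothing: a sample contributes nothing if even one target instance is missed. Roughly, it is computed like this (field names are illustrative):

```typescript
// All-or-nothing recall: a sample counts only if every annotated target
// pattern was found by the pipeline. Field names are illustrative.
interface EvalSample {
  expected: string[]; // patterns a human annotator marked as present
  found: string[];    // patterns the pipeline extracted and verified
}

function strictRecall(samples: EvalSample[]): number {
  const passed = samples.filter((s) =>
    s.expected.every((p) => s.found.includes(p))
  ).length;
  return passed / samples.length;
}
```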

Eval Complete Summary:
──────────────────────────────────────────────────
Total Samples: 573
Successful Samples: 405
Strict Recall: 70.68%
Recall: 72.56%
Precision: 84.34%
Average: 78.30%
F1 Score: 77.88%
Total Cost: $0.2417
Average Cost: $0.000422
──────────────────────────────────────────────────

This infrastructure transforms prompt engineering from guesswork into a deterministic process. I can tweak the prompt, run the suite, and immediately see if performance improved or regressed.

Why Extraction, Not Scoring?

I avoided having the AI directly assign a level (like "N4") because AI scoring is subjective and prone to drift. Instead, I treat the AI as a pattern extractor, not a judge: its only job is to flag evidence of grammar usage. The final score is calculated using deterministic math based on the Minna no Nihongo curriculum, which keeps the assessment logic transparent and easy to visualise. Instead of a black-box label, the system plots performance across Chapters 0–50, allowing teachers to instantly spot the exact point where a student's ability drops off.

A vertical bar chart. The x-axis is labeled with numbers from 1 to 50, but data is only populated for the first 20 positions; the rest of the chart (21-50) is empty. The y-axis ranges from 0 to 12. The data shows an irregular distribution with the highest peak at position 4 (approximately 11) and other high points at positions 1 and 6 (approximately 10). There are visible gaps with no data bars at positions 11 and 13.
X-Axis: Curriculum Chapter (0–50). Y-Axis: Verified Grammar Patterns.
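The scoring itself is plain counting: each verified pattern maps to the curriculum chapter that introduces it, and the per-chapter totals are what the chart above plots. A simplified sketch, where the pattern-to-chapter mapping is only an illustrative excerpt:

```typescript
// Deterministic scoring sketch: bucket verified patterns by the curriculum
// chapter that introduces them. The mapping below is an illustrative excerpt.
const PATTERN_TO_CHAPTER: Record<string, number> = {
  "Past Tense": 4,
  "Te-Form": 14,
  "Conditional Ba-Form": 35,
};

// Count verified evidence per chapter (0–50); this is what the bar chart shows.
function chapterProfile(verifiedPatterns: string[]): Map<number, number> {
  const counts = new Map<number, number>();
  for (const pattern of verifiedPatterns) {
    const chapter = PATTERN_TO_CHAPTER[pattern];
    if (chapter === undefined) continue;
    counts.set(chapter, (counts.get(chapter) ?? 0) + 1);
  }
  return counts;
}
```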

4. Challenge II: Architecture at Scale

I designed this system to work for any language school, not just my pilot partner. The challenge was building a single platform that adapts to different business needs without code changes.

I solved this by building a multi-tenant architecture on Next.js:

A user interface configuration screen for GenkiJACS titled 'Collect Student Details.' The 'Collect Student Details' radio button is selected, revealing a form with several toggle switches. The 'Email,' 'Name,' 'Start Date,' and 'Branch' toggles are active, while the 'Agency' toggle is inactive. The 'Name' and 'Start Date' fields are marked with checked 'Required' boxes. The 'Branch' section includes a text input field and a list of removable tags: Tokyo, Fukuoka, Nagoya, Kyoto, and Indonesia. At the bottom, an unselected option reads 'Do Not Collect Student Details.'

This enables a self-service workflow, allowing a new school to sign up and configure a privacy-compliant intake flow instantly with zero infrastructure changes.
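Under the hood, each school's intake flow is just configuration data. The shape below is a simplified sketch of the idea (not the actual schema), populated to match the screenshot above:

```typescript
// Simplified sketch of a per-tenant intake configuration (not the actual schema).
interface FieldRule {
  collect: boolean;
  required?: boolean;
}

interface TenantIntakeConfig {
  collectStudentDetails: boolean; // false = "Do Not Collect Student Details"
  fields: {
    email: FieldRule;
    name: FieldRule;
    startDate: FieldRule;
    branch: FieldRule & { options: string[] };
    agency: FieldRule;
  };
}

// Example matching the configuration screen shown above.
const genkiJacsIntake: TenantIntakeConfig = {
  collectStudentDetails: true,
  fields: {
    email: { collect: true },
    name: { collect: true, required: true },
    startDate: { collect: true, required: true },
    branch: { collect: true, options: ["Tokyo", "Fukuoka", "Nagoya", "Kyoto", "Indonesia"] },
    agency: { collect: false },
  },
};
```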

5. Learnings & Roadmap

The defining product decision was decoupling extraction from scoring. Teachers need consistency, not creativity. By restricting the AI to evidence extraction, I eliminated score drift and gave teachers a metric they could actually observe and trust.

If I were to extend this platform, I would focus on:

  1. Data-Driven Pruning: The current system checks ~270 patterns per student response, which costs ~$0.50. I plan to analyse the pilot data to remove low-signal patterns and reduce costs.
  2. Reducing Voice Latency: Prioritising quality by using Gemini 2.5 Pro and Gemini 2.5 TTS introduced noticeable delays. I want to explore streaming architectures and hybrid model orchestration to make the conversation feel natural and instant, without reverting to low-fidelity STT.

Overall, this project was a lesson in system design, balancing the messy reality of spoken audio with the strict requirements of academic grading.