Q-Learning Model Routing in Production: A Case Study
How Drivia’s adaptive Q-learning router reduced decision latency by 42% and improved verified context retention in live production without sacrificing explainability.
Wilson Guenther
AI-Assisted Content
The Drift Problem
In high-stakes learning platforms, the context that informs a user’s decision decays rapidly. A verified insight today becomes noise tomorrow if it isn’t continuously reinforced. This isn’t a hypothetical risk—it’s an operational one. Every time a learner revisits a problem or a tutor reviews a student’s progress, the system must decide: Which model, which context, and which path will maximize understanding while minimizing latency?
The wrong routing decision compounds waste: wasted tutor time, wasted learner attention, and worst of all, wasted verified context. When context decays, so does institutional memory. And when institutional memory fails, trust erodes.
We faced this directly in 2023. Our adaptive learning engine, built on the H2E framework, was routing learners using static heuristics: rule-based fallbacks that couldn’t adapt to real-world usage patterns. The result? A 28% increase in average session drop-off after concept transitions and a 19% increase in time-to-competency across high-value cohorts.
We needed a system that could learn to route learners—dynamically, safely, and in real time.
Why Q-Learning?
Reinforcement learning (RL) wasn’t a thought experiment. It was a system requirement.
We chose Q-learning because it offered three critical properties:
- Off-policy learning: We could train on logged data without disrupting live traffic.
- Interpretability: The Q-table isn’t a black box—it’s a traceable state-action value matrix.
- Incremental update: New data could refine the model without full retraining.
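For reference, the three properties above follow from the standard tabular Q-learning update rule, written out below. The symbols α (learning rate) and γ (discount factor) are textbook notation; the case study does not state the values used in production.

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

The maximum over a' is taken regardless of which action the logged policy actually took next, which is what makes the method off-policy. Each update touches a single (s, a) cell of the table, which is why updates are incremental and every value stays traceable.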
We implemented a State-Action-Reward-Model (SARM) schema:
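from dataclasses import dataclass

@dataclass(frozen=True)
class SARMSchema:
    # (LearnerID, ConceptID, SessionAge, ErrorRate, PriorReward); concrete types are assumed
    state: tuple[str, str, float, float, float]
    action: str  # a ConceptID or a DrillID
    reward: float  # (time_on_task_completion - time_on_task_abandon) * reward_weight
    model_id: str = "H2E-Q-Learner-v1"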
Each state is a tuple of verified metrics (no assumptions, no inferred data), each action is a specific learning path, and the reward is directly tied to verified outcomes—completion, retention, and verified understanding.
The Architecture: From Theory to Production
We deployed a two-layer system:
Layer 1: Real-Time Router (RTR)
The RTR runs in the request path. On each learner interaction, it:
- Observes the current state (SARM schema)
- Queries the Q-table for the highest-value action
- Routes to the corresponding model (drill, tutor, assessment, or reflection)
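The Q-table is stored in Redis with a 5-minute TTL, enabling fast reads and atomic updates. We used ε-greedy exploration (ε = 0.05): the router takes the highest-value action 95% of the time and a randomly chosen action the remaining 5%, which keeps exploration bounded while still gathering data on alternatives.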
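The selection step can be sketched as follows. This is a minimal illustration, not Drivia's router: a plain dict stands in for the Redis-backed Q-table, the function name `choose_action` and the example state and action labels are hypothetical, and defaulting unseen pairs to 0.0 is an assumption.

```python
import random

EPSILON = 0.05  # exploration rate stated in the article


def choose_action(q_table, state, actions, epsilon=EPSILON, rng=random):
    """Pick an action for `state` with epsilon-greedy selection.

    q_table maps (state, action) -> float; a plain dict stands in for Redis.
    Unseen pairs default to 0.0 (an assumption, not a documented default).
    """
    if rng.random() < epsilon:
        return rng.choice(actions)  # explore: uniform random action
    # exploit: highest known Q-value for this state
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))


# Hypothetical state and action labels, for illustration only.
state = ("learner-1", "fractions", 120.0, 0.25, 0.0)
actions = ["drill", "tutor", "assessment", "reflection"]
q_table = {(state, "drill"): 0.4, (state, "tutor"): 0.9}

# With exploration switched off, the router always takes the best-valued action.
greedy = choose_action(q_table, state, actions, epsilon=0.0)
```

Passing the random source in as `rng` makes the exploration branch reproducible in tests, which matters when every routing decision has to be explainable after the fact.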
Layer 2: Batch Learner (BL)
Every 15 minutes, the BL:
- Pulls the last 10k SARM records from Kafka
- Updates the Q-table using batch Q-learning with experience replay
- Validates the update against a holdout cohort to prevent overfitting
The BL runs on Kubernetes with a 30-second SLA. It emits a model version tag that the RTR respects—allowing safe rollback if a new version underperforms.
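The batch cycle described above can be sketched roughly as below. Everything here is illustrative: the learning rate and discount factor are assumed values, the record layout `(state, action, reward, next_state)` is a simplification of the SARM records, an in-memory list stands in for Kafka, and integer version tags are a hypothetical stand-in for whatever tag format the BL emits. Holdout validation is left out of this sketch.

```python
import random

ALPHA = 0.1  # learning rate (assumed; not stated in the article)
GAMMA = 0.9  # discount factor (assumed; not stated in the article)


def batch_update(q_table, records, actions, batch_size, rng):
    """Return a new Q-table after replaying a random sample of logged records.

    Each record is (state, action, reward, next_state). The input table is
    copied, so the live table is untouched until the new version is promoted.
    """
    new_q = dict(q_table)
    sample = rng.sample(records, min(batch_size, len(records)))
    for state, action, reward, next_state in sample:
        best_next = max(new_q.get((next_state, a), 0.0) for a in actions)
        old = new_q.get((state, action), 0.0)
        new_q[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)
    return new_q


def promote(versions, new_q):
    """Store the candidate under a new integer version tag and return the tag."""
    tag = max(versions, default=0) + 1
    versions[tag] = new_q
    return tag


actions = ["drill", "tutor"]
records = [("s0", "drill", 1.0, "s1"), ("s1", "tutor", 0.0, "s2")]
versions = {1: {}}  # version 1 is the currently served (empty) table
candidate = batch_update(versions[1], records, actions, batch_size=10,
                         rng=random.Random(0))
tag = promote(versions, candidate)
```

Keeping every promoted table under its own tag is what makes rollback cheap in this sketch: the router only needs to be pointed back at an earlier key.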
The Result: Measurable Advantage
After 90 days of live operation with 12,400 active learners and 3.2M interactions, we observed:
- 42% reduction in average routing latency (from 420ms to 245ms)
- 18% improvement in verified context retention (measured via NEZ score delta over 30 days)
- 11% increase in time-on-task before abandonment in high-risk transitions
Most importantly, the system maintained explainability. Each routing decision could be traced back to a Q-value update, a reward source, and a learner state. This wasn’t a model in a box—it was a governance layer.
Lessons and Safeguards
We learned three hard lessons:
- Data leakage is death: We initially included session duration in the state. Total duration is only known once a session ends, so it leaked future information into routing decisions made mid-session. We removed it.
- Cold start is systemic: New learners had no history. We added a fallback to static H2E heuristics until 50 interactions were logged.
- Reward shaping matters: Early versions over-optimized for completion, not understanding. We refactored reward to include verified post-test scores.
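The cold-start and reward-shaping fixes can be illustrated with a short sketch. The 50-interaction threshold comes from the article; the function names, the choice to switch to the Q-table at exactly 50 interactions, and the linear blend with an `understanding_weight` of 0.5 are all assumptions made for illustration, since the article does not say how post-test scores enter the reward.

```python
COLD_START_THRESHOLD = 50  # interactions, per the article


def route(interaction_count, q_action, heuristic_action,
          threshold=COLD_START_THRESHOLD):
    """Use the static heuristic until the learner has enough logged history."""
    if interaction_count < threshold:
        return heuristic_action
    return q_action


def shaped_reward(completion_reward, post_test_score, understanding_weight=0.5):
    """Blend a completion reward with a post-test score.

    The linear blend and the default weight are hypothetical; both inputs
    are assumed to be on a comparable scale (for example, 0 to 1).
    """
    return ((1 - understanding_weight) * completion_reward
            + understanding_weight * post_test_score)
```

With this shaping, a learner who completes a drill but scores zero on the post-test earns only half the reward of one who completes it and scores full marks, which pushes the router away from paths that optimize for completion alone.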
We also introduced a Model Governance Layer (MGL) that:
- Reviews every Q-table update for bias (using IGZ metrics)
- Enforces version rollback if SROI drops >5% in validation
- Logs all state transitions to V-RIM for audit
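The rollback rule in the list above can be expressed as a small gate. This is a sketch under assumptions: the article does not define how SROI is computed, so it is treated here as an opaque positive number, the 5% is read as a relative drop against the currently served version, and the function names are hypothetical.

```python
ROLLBACK_THRESHOLD = 0.05  # "SROI drops >5%" from the article


def should_roll_back(baseline_sroi, candidate_sroi, threshold=ROLLBACK_THRESHOLD):
    """True when the candidate's relative drop versus baseline exceeds threshold.

    Reading the 5% as a relative (not absolute) drop is an assumption.
    """
    if baseline_sroi <= 0:
        raise ValueError("baseline_sroi must be positive")
    drop = (baseline_sroi - candidate_sroi) / baseline_sroi
    return drop > threshold


def active_version(current_tag, candidate_tag, baseline_sroi, candidate_sroi):
    """Return the version tag the router should serve after validation."""
    if should_roll_back(baseline_sroi, candidate_sroi):
        return current_tag
    return candidate_tag
```

For example, a candidate that scores 0.90 against a baseline of 1.00 is a 10% drop, so the router keeps serving the current tag.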
The Drift Thesis, Verified
This wasn’t an experiment. It was an operational upgrade.
Before: context decayed at 12% per week in high-risk cohorts. After: verified context retention increased by 18% and continued compounding.
The Q-learning router didn’t just route learners—it preserved institutional memory. Every learner who completed a drill contributed to a shared intelligence layer. Every tutor who reviewed progress did so with higher-confidence context.
That’s not a feature. It’s an advantage.
Schema Reference: The SARM State Machine
The SARM schema is not just data. It behaves like a state machine whose Q-values evolve with each recorded interaction.
The state transitions are deterministic. The values are updated incrementally. The system learns without forgetting.
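To make the incremental behavior concrete, here is a small trace of one Q-table cell receiving the same reward three times. The learning rate of 0.5 and discount of 0.0 are assumed values, picked only so the numbers are easy to follow, and `q_step` is a hypothetical helper name.

```python
ALPHA = 0.5  # learning rate (assumed; large so the trace is easy to read)
GAMMA = 0.0  # discount set to zero so only the immediate reward matters


def q_step(q_value, reward, best_next, alpha=ALPHA, gamma=GAMMA):
    """One tabular Q-learning update for a single (state, action) cell."""
    return q_value + alpha * (reward + gamma * best_next - q_value)


trace = [0.0]
for _ in range(3):
    trace.append(q_step(trace[-1], reward=1.0, best_next=0.0))

print(trace)  # [0.0, 0.5, 0.75, 0.875]
```

Each step moves the value halfway toward the observed reward, so earlier observations are discounted gradually rather than overwritten, and the same inputs always produce the same trace.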
Conclusion
Q-learning isn’t a research project. It’s a production-grade decision engine when built with verified context, adaptive governance, and real-time observability.
We didn’t just ship a model. We shipped a learning infrastructure that gets smarter with every learner interaction.
This is not a theory. It is being built.