Q-Learning Model Routing in Production: A Case Study

The Drift Problem

In high-stakes learning platforms, the context that informs a user’s decision decays rapidly. A verified insight today becomes noise tomorrow if it isn’t continuously reinforced. This isn’t a hypothetical risk—it’s an operational one. Every time a learner revisits a problem or a tutor reviews a student’s progress, the system must decide: Which model, which context, and which path will maximize understanding while minimizing latency?

The wrong routing decision compounds waste: wasted tutor time, wasted learner attention, and worst of all, wasted verified context. When context decays, so does institutional memory. And when institutional memory fails, trust erodes.

We faced this directly in 2023. Our adaptive learning engine, built on the H2E framework, was routing learners using static heuristics—rules-based fallbacks that couldn’t adapt to real-world usage patterns. The result? A 28% increase in average session drop-off after concept transitions and a 19% increase in time-to-competency across high-value cohorts.

We needed a system that could learn to route learners—dynamically, safely, and in real time.

Why Q-Learning?

Reinforcement learning (RL) wasn’t a thought experiment. It was a system requirement.

We chose Q-learning because it offered three critical properties:

Off-policy learning: We could train on logged data without disrupting live traffic.
Interpretability: The Q-table isn’t a black box—it’s a traceable state-action value matrix.
Incremental update: New data could refine the model without full retraining.

We implemented a State-Action-Reward-Model (SARM) schema:

python

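from dataclasses import dataclass

# Aliases for the IDs and metrics (concrete types are assumed here)
LearnerID = ConceptID = DrillID = str
SessionAge = ErrorRate = PriorReward = float

@dataclass(frozen=True)
class SARMSchema:
    state: tuple[LearnerID, ConceptID, SessionAge, ErrorRate, PriorReward]
    action: ConceptID | DrillID  # either a concept or a drill
    reward: float  # (time_on_task_completion - time_on_task_abandon) * reward_weight
    model_id: str = "H2E-Q-Learner-v1"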

Each state is a tuple of verified metrics (no assumptions, no inferred data), each action is a specific learning path, and the reward is directly tied to verified outcomes—completion, retention, and verified understanding.
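To make the reward term concrete, here is a minimal sketch of how one logged record might be assembled. The function name `sarm_reward`, the dict layout, and the choice of seconds as the time unit are assumptions for illustration; the weight of 0.01 and the 87 s / 120 s timings are borrowed from the worked trace in the schema reference section.

```python
def sarm_reward(time_on_task_completion: float,
                time_on_task_abandon: float,
                reward_weight: float = 0.01) -> float:
    """Reward as written in the SARM schema: a weighted time difference."""
    return (time_on_task_completion - time_on_task_abandon) * reward_weight


# Hypothetical record: state is (learner, concept, session age in s, error rate, prior reward).
record = {
    "state": ("L1", "C5", 120.0, 0.3, 0.7),
    "action": "Drill.D3",
    "reward": sarm_reward(87, 120),
    "model_id": "H2E-Q-Learner-v1",
}
print(round(record["reward"], 2))
```

Keeping the reward a pure function of logged timings means it can be recomputed during an audit from the same record.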

The Architecture: From Theory to Production

We deployed a two-layer system:

Layer 1: Real-Time Router (RTR)

The RTR runs in the request path. On each learner interaction, it:

Observes the current state (SARM schema)
Queries the Q-table for the highest-value action
Routes to the corresponding model (drill, tutor, assessment, or reflection)

The Q-table is stored in Redis, which gives us fast reads and atomic updates, and each entry carries a 5-minute TTL. We used ε-greedy exploration (ε = 0.05) to balance exploitation and safe exploration.
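The selection step can be sketched roughly as follows. This is an illustration rather than the production router: a plain dict stands in for the Redis-backed Q-table, and the function name `select_action`, the key layout `(state, action)`, and the default value of 0.0 for unseen pairs are all assumptions.

```python
import random

EPSILON = 0.05  # exploration rate quoted in the text


def select_action(q_table, state, actions, epsilon=EPSILON, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise take the best-known action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    # Unseen (state, action) pairs are treated as 0.0 (an assumption).
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))


# Hypothetical Q-values for one simplified state.
q_table = {
    (("L1", "C5"), "drill"): 0.4,
    (("L1", "C5"), "tutor"): 0.9,
    (("L1", "C5"), "assessment"): 0.1,
}
actions = ["drill", "tutor", "assessment", "reflection"]

# With epsilon set to 0 the choice is purely greedy.
print(select_action(q_table, ("L1", "C5"), actions, epsilon=0.0))
```

With ε = 0.05, roughly one request in twenty takes a random route, which is what lets the table keep learning about actions it currently ranks low.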

Layer 2: Batch Learner (BL)

Every 15 minutes, the BL:

Pulls the last 10k SARM records from Kafka
Updates the Q-table using batch Q-learning with experience replay
Validates the update against a holdout cohort to prevent overfitting

The BL runs on Kubernetes with a 30-second SLA. It emits a model version tag that the RTR respects—allowing safe rollback if a new version underperforms.
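A batch update of this kind might look like the sketch below. It assumes each logged record has already been reduced to a `(state, action, reward, next_state)` tuple, and the learning rate `ALPHA` and discount `GAMMA` are placeholder values, since the text does not state them. Kafka consumption, holdout validation, and version tagging are left out.

```python
import random

ALPHA, GAMMA = 0.1, 0.9  # assumed learning rate and discount factor


def batch_update(q_table, records, actions, replay_size=None, seed=0):
    """Apply the tabular Q-learning rule over a replayed sample of logged transitions.

    Returns a new table so the caller can validate it before publishing.
    """
    rng = random.Random(seed)
    sample = list(records)
    if replay_size is not None and replay_size < len(sample):
        sample = rng.sample(sample, replay_size)
    else:
        rng.shuffle(sample)  # replay in random order to break temporal correlation
    new_q = dict(q_table)
    for state, action, reward, next_state in sample:
        old = new_q.get((state, action), 0.0)
        best_next = max(new_q.get((next_state, a), 0.0) for a in actions)
        new_q[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)
    return new_q


# Hypothetical single transition.
records = [(("L1", "C5"), "drill", 1.0, ("L1", "C6"))]
current = {}
updated = batch_update(current, records, actions=["drill", "tutor"])
print(updated[(("L1", "C5"), "drill")])
```

Returning a copy instead of mutating the live table is one way to keep the currently served version untouched until the candidate has been checked.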

The Result: Measurable Advantage

After 90 days of live operation with 12,400 active learners and 3.2M interactions, we observed:

42% reduction in average routing latency (from 420ms to 245ms)
18% improvement in verified context retention (measured via NEZ score delta over 30 days)
11% increase in time-on-task before abandonment in high-risk transitions

Most importantly, the system maintained explainability. Each routing decision could be traced back to a Q-value update, a reward source, and a learner state. This wasn’t a model in a box—it was a governance layer.

Lessons and Safeguards

We learned three hard lessons:

Data leakage is death: We initially included session duration in the state, which leaked future information, so we removed it.
Cold start is systemic: New learners had no history. We added a fallback to static H2E heuristics until 50 interactions were logged.
Reward shaping matters: Early versions over-optimized for completion, not understanding. We refactored reward to include verified post-test scores.
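The cold-start rule is simple enough to sketch directly. The function name `route` and the two callables are hypothetical stand-ins for the Q-table lookup and the static H2E heuristics; only the threshold of 50 interactions comes from the text.

```python
COLD_START_THRESHOLD = 50  # logged interactions required before the Q-table is used


def route(learner_interactions, q_route, heuristic_route):
    """Fall back to the static heuristic until the learner has enough history."""
    if learner_interactions < COLD_START_THRESHOLD:
        return heuristic_route()
    return q_route()


# Hypothetical routes, passed as callables so only the chosen one is evaluated.
print(route(12, lambda: "q:Drill.D3", lambda: "heuristic:Concept.C1"))
print(route(50, lambda: "q:Drill.D3", lambda: "heuristic:Concept.C1"))
```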

We also introduced a Model Governance Layer (MGL) that:

Reviews every Q-table update for bias (using IGZ metrics)
Enforces version rollback if SROI drops >5% in validation
Logs all state transitions to V-RIM for audit
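The rollback rule could be expressed along these lines. This sketch assumes the 5% threshold is a relative drop against the currently served version's SROI on the validation cohort; the function names and version tags are hypothetical, and the bias review and audit logging are not shown.

```python
def should_rollback(baseline_sroi: float, candidate_sroi: float, max_drop: float = 0.05) -> bool:
    """True when the candidate's SROI falls more than max_drop (relative) below the baseline."""
    if baseline_sroi <= 0:
        raise ValueError("baseline_sroi must be positive")
    drop = (baseline_sroi - candidate_sroi) / baseline_sroi
    return drop > max_drop


def publish(active_version, candidate_version, baseline_sroi, candidate_sroi):
    """Return the version tag the router should serve after validation."""
    if should_rollback(baseline_sroi, candidate_sroi):
        return active_version
    return candidate_version


print(publish("v41", "v42", baseline_sroi=1.00, candidate_sroi=0.90))
print(publish("v41", "v42", baseline_sroi=1.00, candidate_sroi=0.98))
```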

The Drift Thesis, Verified

This wasn’t an experiment. It was an operational upgrade.

Before: context decayed at 12% per week in high-risk cohorts. After: verified context retention increased by 18% and continued compounding.

The Q-learning router didn’t just route learners—it preserved institutional memory. Every learner who completed a drill contributed to a shared intelligence layer. Every tutor who reviewed progress did so with higher-confidence context.

That’s not a feature. It’s an advantage.

Schema Reference: The SARM State Machine

The SARM schema is not just data—it’s a state machine. Here’s how it evolves:

Learner enters session → State = (L1, C5, Age=2m, Error=0.3, Reward=0.7)
Router selects Drill D3 → Action = "Drill.D3"
Learner completes drill in 87s → Reward = (87 - 120) * 0.01 = -0.33
Q-table update: Q[(L1,C5,D3)] += α * (-0.33 + γ * max_a Q[(L1,C5+1,...)] - Q_old)

The state transitions are deterministic. The values are updated incrementally. The system learns without forgetting.
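For a numeric feel of the update step, the sketch below applies the standard tabular Q-learning rule to the reward from the trace. The values of α, γ, the old Q-value, and the best next-state Q-value are made up for illustration; the source gives only the reward of -0.33.

```python
def q_update(q_old, reward, max_next_q, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: move q_old toward the bootstrapped target."""
    td_target = reward + gamma * max_next_q
    return q_old + alpha * (td_target - q_old)


# Reward from the trace; q_old and max_next_q are assumed values.
new_q = q_update(q_old=0.5, reward=-0.33, max_next_q=0.8)
print(round(new_q, 3))  # 0.489
```

Because each step moves the old value only a fraction α of the way toward the target, a single slow drill nudges the estimate without overwriting what earlier interactions contributed.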

Conclusion

Q-learning isn’t a research project. It’s a production-grade decision engine when built with verified context, adaptive governance, and real-time observability.

We didn’t just ship a model. We shipped a learning infrastructure that gets smarter with every learner interaction.

This is not a theory. It is being built.

-> drivia.consulting
