How are you weighting human review against synthetic evals?
Looking for production examples where qualitative review and automated scoring disagree, especially on multi-turn support agents.
38 replies
1.9k views
A focused board for model evaluation, retrieval systems, agent reliability, inference operations, and field notes from production AI systems.
A postmortem on chunk boundaries, metadata filters, and why the staging replay set missed the issue.
Patterns for exposing plans, intermediate artifacts, and audit trails while keeping the user interface calm.
Comparing queue depth, warm pools, cache pressure, and batch sizing across common deployment setups.
Shared notes, paper links, and discussion prompts for this month's applied AI safety reading group.