AI AUTOMATION — HITL Assisted
Agents + operators, thresholds & exceptions, audit logs, tests, and cost telemetry
8/19/2025 · 3 min read
Executive summary
Human‑in‑the‑Loop (HITL) is the safest path from prototypes to production. Agents assist operators; humans handle exceptions and sign off on sensitive steps while metrics, costs and risks are monitored. In this phase we define thresholds, set up exception queues, instrument audit logs and observability, build a robust test suite, and enforce cost telemetry with hard budgets. We only graduate to Co‑pilot once quality and stability meet agreed gates.
1) Objectives & scope
• Make value visible without giving up safety: assisted automation with human approval on risk-prone steps.
• Create reliable exception handling with clear thresholds, SLAs and ownership.
• Instrument observability: audit logs, traces, evaluator scores, and cost telemetry per workflow.
• Establish quality gates and Go/No‑Go criteria for the next phase (Co‑pilot).
2) Operating model — agents + operators
• Agents propose actions; operators approve/modify/reject with rationale captured in the log.
• Queue-based work: items flow into an assisted queue, prioritisation rules are applied, and work is assigned to operator roles.
• Feedback loop: operator decisions feed back into prompts/policies and evaluation datasets.
• Dual control for risky actions (payments, PII exposure, external sends); a minimal sketch of the approval flow follows this list.
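As an illustration of the queue-plus-approval pattern, here is a minimal Python sketch. The `WorkItem` and `Decision` types and the `RISKY_ACTIONS` set are hypothetical, not tied to any specific framework; the point is that a risky action is only released once two distinct approvers have signed off.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Literal

# Action classes that require dual control per the operating model above.
RISKY_ACTIONS = {"payment", "pii_exposure", "external_send"}

@dataclass
class Decision:
    actor: str                      # operator or supervisor ID
    verdict: Literal["approve", "modify", "reject"]
    rationale: str                  # captured in the audit log
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class WorkItem:
    item_id: str
    action_type: str                # e.g. "external_send"
    proposed_output: str            # the agent's proposed action or content
    decisions: list[Decision] = field(default_factory=list)

    def requires_dual_control(self) -> bool:
        return self.action_type in RISKY_ACTIONS

    def is_released(self) -> bool:
        """Execute only after the required number of distinct approvers have signed off."""
        approvers = {d.actor for d in self.decisions if d.verdict == "approve"}
        needed = 2 if self.requires_dual_control() else 1
        return len(approvers) >= needed
```

In practice each `Decision` would also be persisted as an event in the audit log described in section 4.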
3) Thresholds & exceptions
Define measurable triggers for human review and the handling pattern for each class of exception; a routing sketch follows the table.
| Signal/Threshold | Trigger example | Action | Owner | SLA | Notes |
| --- | --- | --- | --- | --- | --- |
| Confidence < min | Eval score < 0.78 | Send to assisted queue | Operator | 4 business hours | Re-evaluate after edits |
| Sensitive topic | PII/PHI detected | Mask/redact; require approval | Senior operator | Same day | Audit reason code |
| Budget exceeded | Token cost > daily cap | Pause; alert; switch route | Ops lead | 1 hour | Fall back to smaller model |
| External send | Email/SMS to customer | Dual approval | Operator + Supervisor | 2 business hours | Templated sign-off |
| Drift detected | Quality delta > 10% | Open investigation; revert policy | MLE/Owner | 24 hours | Trigger retraining window |
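As an illustration, the triggers above can be encoded as a single routing function. The sketch below is ours: the signal names (`eval_score`, `contains_pii`, `is_external_send`, `daily_cost_usd`, `quality_delta`) and the daily cap value are assumptions; the thresholds come from the table.

```python
from dataclasses import dataclass

# Thresholds from the table above; tune per workflow.
MIN_EVAL_SCORE = 0.78
DAILY_COST_CAP_USD = 50.0      # hypothetical cap, set per workflow
MAX_QUALITY_DELTA = 0.10

@dataclass
class ItemSignals:
    eval_score: float
    contains_pii: bool
    is_external_send: bool
    daily_cost_usd: float      # running spend for the workflow today
    quality_delta: float       # rolling quality drop vs. baseline

def route(signals: ItemSignals) -> list[str]:
    """Return the exception queues an item must pass through before execution."""
    queues = []
    if signals.daily_cost_usd > DAILY_COST_CAP_USD:
        queues.append("pause_and_alert")             # Ops lead, 1h SLA, fallback model
    if signals.contains_pii:
        queues.append("redact_then_senior_review")   # same-day SLA, audit reason code
    if signals.is_external_send:
        queues.append("dual_approval")               # operator + supervisor, 2h SLA
    if signals.eval_score < MIN_EVAL_SCORE:
        queues.append("assisted_review")             # operator, 4 business hours
    if signals.quality_delta > MAX_QUALITY_DELTA:
        queues.append("drift_investigation")         # MLE/owner, 24h, retraining window
    return queues or ["auto_approve_candidate"]
```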
4) Audit logs & traceability
Minimum fields per event: timestamp, actor (agent/operator/service), item ID, input hash, output hash, model/route, tokens, cost, policy version, decision, rationale, attachments. A minimal event-record sketch follows the table.
| Event type | Required fields | Retention | Access |
| --- | --- | --- | --- |
| Proposed action | actor, input hash, output preview, eval score, route | 12–24 months | Ops, Security |
| Approval/Reject | approver, reason code, deltas, final output | 24 months | Ops, Audit |
| Send/Execute | channel/endpoint, payload checksum, result | 24 months | Ops, Security |
| Policy change | before/after diff, owner, ticket ref | 36 months | Security, Audit |
| Model route | model ID, temp/top‑p, tokens, latency, cost | 12 months | Ops, Finance |
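A minimal sketch of building and appending one such audit event, assuming a JSON-lines sink and the field names listed above; hashes stand in for raw inputs and outputs so no sensitive text is stored in the log itself.

```python
import hashlib
import json
from datetime import datetime, timezone

def sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def audit_event(event_type: str, actor: str, item_id: str,
                input_text: str, output_text: str, *,
                model_route: str, tokens: int, cost_usd: float,
                policy_version: str, decision: str, rationale: str) -> dict:
    """Build one append-only audit record; inputs/outputs are stored as hashes."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event_type": event_type,          # proposed_action, approval, send, ...
        "actor": actor,                    # agent / operator / service
        "item_id": item_id,
        "input_hash": sha256(input_text),
        "output_hash": sha256(output_text),
        "model_route": model_route,
        "tokens": tokens,
        "cost_usd": cost_usd,
        "policy_version": policy_version,
        "decision": decision,
        "rationale": rationale,
    }

def append(event: dict, path: str = "audit.jsonl") -> None:
    """Append the event to a JSON-lines file (stand-in for the real log store)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event, ensure_ascii=False) + "\n")
```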
5) Test strategy & quality gates
Blend unit tests, regression suites and evaluator-based quality checks, and tie the gates to promotion rules; a gate-check sketch follows the table.
| Test/Gate | Purpose | Method | Acceptance criteria |
| --- | --- | --- | --- |
| Non‑regression | Protect critical paths | Fixed fixtures; deterministic checks | 100% pass on fixtures |
| Evaluator score | Semantic quality | Task-specific evaluators | ≥ 0.80 median; IQR ≤ 0.1 |
| Human review | Edge cases | Random 5–10% sampling | ≥ 95% approval |
| Safety checks | Policy violations | Red‑team prompts & scanners | 0 critical; ≤ 1 minor |
| Latency/Cost | SLO adherence | Load tests with routing | p95 latency & budget within caps |
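A sketch of how the evaluator and human-review gates might be checked in code; the thresholds mirror the table, while the function names and the way the other gates are passed in are illustrative, not a prescribed API.

```python
import statistics

def evaluator_gate(scores: list[float],
                   min_median: float = 0.80, max_iqr: float = 0.10) -> bool:
    """Pass if the median evaluator score is high enough and the spread is tight."""
    q = statistics.quantiles(scores, n=4)   # [Q1, Q2, Q3]; needs at least 2 scores
    median, iqr = q[1], q[2] - q[0]
    return median >= min_median and iqr <= max_iqr

def human_review_gate(approved: int, sampled: int, min_rate: float = 0.95) -> bool:
    """Pass if at least 95% of randomly sampled items were approved."""
    return sampled > 0 and approved / sampled >= min_rate

def promote(scores: list[float], approved: int, sampled: int,
            fixtures_passed: bool, safety_criticals: int) -> bool:
    """All gates must hold before a route or policy is promoted."""
    return (fixtures_passed
            and safety_criticals == 0
            and evaluator_gate(scores)
            and human_review_gate(approved, sampled))
```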
6) Cost telemetry & budgets
Track tokens and cost per item, enforce caps, and alert early; a metering sketch follows the table.
| Metric | Definition | Budget/Cap | Alert |
| --- | --- | --- | --- |
| Tokens/item (p50/p95) | Input + output tokens per item | ≤ 1.2k / 3.5k | Warn at 80% of cap |
| Cost/item | Total $ per processed item | ≤ $0.09 | Warn at $0.07 |
| Daily spend | Sum by workflow | $X/day (per workflow) | Warn at 85% of cap |
| Route mix | % by model/engine | N/A | Drift ±10% vs policy |
| Cache hit rate | % retrieved from cache | ≥ 35% | Warn if < 25% |
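A sketch of a per-workflow cost meter that applies the caps and warning levels above. The table leaves the daily cap as $X/day per workflow, so the value in the code is a hypothetical placeholder; the class and alert strings are our own.

```python
from dataclasses import dataclass, field

@dataclass
class CostMeter:
    """Per-workflow token/cost telemetry with caps and early warnings."""
    cost_cap_per_item: float = 0.09     # $ per item, from the table above
    warn_per_item: float = 0.07
    daily_cap: float = 100.0            # hypothetical stand-in for $X/day
    daily_spend: float = 0.0
    token_counts: list[int] = field(default_factory=list)

    def record(self, tokens_in: int, tokens_out: int, cost_usd: float) -> list[str]:
        """Record one processed item and return any alerts it triggers."""
        alerts = []
        self.token_counts.append(tokens_in + tokens_out)
        self.daily_spend += cost_usd
        if cost_usd > self.cost_cap_per_item:
            alerts.append("item_over_cap: switch route / fall back to smaller model")
        elif cost_usd > self.warn_per_item:
            alerts.append("item_cost_warning")
        if self.daily_spend > self.daily_cap:
            alerts.append("daily_cap_breached: pause workflow")
        elif self.daily_spend >= 0.85 * self.daily_cap:
            alerts.append("daily_spend_85pct: alert ops lead")
        return alerts
```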
7) Roles & RACI
| Role | Responsibilities | RACI |
| --- | --- | --- |
| Operator | Approve/reject; corrections; notes; SLA adherence | R, C, I |
| Supervisor | Dual control; escalations; coaching | A, C, I |
| MLE/Engineer | Routing; evaluators; tests; rollbacks | R, C, I |
| Ops Lead | Queues; staffing; capacity; SLAs | R, A, C, I |
| Security/Legal | Policies; data handling; audits | A, R, I |
| Finance | Budget caps; anomaly review | A, C, I |
8) Runbooks — day/week/month
Daily: check queues, p95 latency and exception rate, review the spend dashboard, triage alerts, sample 5% of items.
Weekly: tune thresholds, update evaluator datasets, review quality/incident report, rotate red-team prompts.
Monthly: policy review, cost plan rebase, retraining window (if needed), restore drill from backups.
9) Metrics & KPIs
| KPI | Definition | Target (HITL phase) | Graduation target (Co‑pilot) |
| --- | --- | --- | --- |
| Exception rate | % items needing human review | ≤ 30% | ≤ 15% |
| Approval accuracy | % approved without edits | ≥ 85% | ≥ 92% |
| Cycle time | Queue wait + handling | −20–35% vs baseline | −35–50% |
| First‑pass yield | % correct on first try | ≥ 80% | ≥ 90% |
| Cost/item | All‑in cost per processed item | Within budget caps | −15–25% vs baseline |
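As an illustration, the first three KPIs can be derived directly from the audit events sketched in section 4. The event-type strings and the `edited` flag below are assumptions about how decisions are recorded, not a fixed schema.

```python
def kpis(events: list[dict]) -> dict:
    """Derive headline KPIs from audit events (field names as in the section 4 sketch)."""
    proposed = [e for e in events if e["event_type"] == "proposed_action"]
    reviews = [e for e in events if e["event_type"] == "approval"]
    clean = [e for e in reviews if e["decision"] == "approve" and not e.get("edited")]
    total = len(proposed) or 1                               # avoid division by zero
    return {
        "exception_rate": len(reviews) / total,              # target ≤ 30%
        "approval_accuracy": len(clean) / (len(reviews) or 1),  # target ≥ 85%
        "first_pass_yield": len(clean) / total,              # target ≥ 80%
    }
```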
10) Safety & security
• RBAC with least privilege; secrets in a vault; private networking where possible.
• PII handling: detection, masking, purpose limitation, retention and access logs.
• Approval rules for external sends and irreversible actions; rollbacks always available.
11) Incident management
Severity levels: Sev1 (customer impact), Sev2 (internal impact), Sev3 (degraded KPIs).
For Sev1: freeze routes, revert the policy, drain queues, notify stakeholders, complete root-cause analysis within 24h, and run a postmortem with follow-up actions; a minimal response sketch follows.
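A minimal sketch of the Sev1 first-response steps as code, with placeholder hooks (`freeze_routes`, `revert_policy`, `drain_queue`, `notify`) standing in for whatever routing, policy, and alerting systems are actually wired up.

```python
# Hypothetical hooks; wire these to the real routing, policy, and alerting systems.
def freeze_routes(workflow: str) -> None: ...
def revert_policy(workflow: str, to_version: str) -> None: ...
def drain_queue(workflow: str) -> None: ...
def notify(channel: str, message: str) -> None: ...

def handle_sev1(workflow: str, last_good_policy: str) -> None:
    """Sev1 playbook: stop the bleeding first, then investigate."""
    freeze_routes(workflow)                       # no new agent actions
    revert_policy(workflow, last_good_policy)     # roll back to known-good config
    drain_queue(workflow)                         # operators clear in-flight items
    notify("#incidents", f"Sev1 on {workflow}: routes frozen, policy reverted")
    # Root-cause analysis within 24h and the postmortem are tracked in the ticketing system.
```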
12) Tooling & integration
• Observability: tracing, metrics, logs; dashboards per workflow; alerting into Slack/Teams.
• Ticketing: incident and change management connected to audit logs (Jira/ServiceNow).
• CI/CD: config-as-code, tests in pipeline, canary routes, feature flags.
13) Acceptance & Go/No‑Go to 03 — Co‑pilot
Graduation requires:
• KPIs: meet or exceed graduation targets for 2 consecutive weeks.
• Stability: no Sev1 incidents; Sev2 below threshold; mean time to recovery within SLO.
• Cost: within budget for 3 consecutive weeks; anomaly rate below 2%.
• Documentation: runbooks complete; audit logs sampled; training of operators complete.
14) Risks & mitigations
• Operator overload → manage queue size, cap WIP, prioritise by business value.
• False confidence → keep random sampling even when metrics look good; rotate evaluators.
• Cost creep → enforce hard caps and auto‑routing to cheaper models; review prompts monthly.
• Policy drift → version policies; require change tickets; run red‑team tests on every change.
15) FAQ
Q: How long do we stay in HITL?
A: Typically 2–6 weeks depending on complexity and traffic. We move when gates are met.
Q: Can operators train the system?
A: Yes. Edits and reasons feed evaluation datasets; we schedule retraining windows.
Q: Can this run on‑prem?
A: Yes. Same patterns apply with sovereign stacks and private vector stores.
Book a 30‑minute ROI check‑in
Email: contact@smartonsteroids.com — we’ll review metrics, tune thresholds and plan graduation to Co‑pilot.
© 2025 Smart On Steroids — AI Automation Studio → Platform