Evaluate Every AI Output
Before It Ships
The Gate API sits between your AI and your users. Send any input/output pair, get back a decision in 1–4 seconds. Python and Node.js SDKs with zero external dependencies.
Quick Start
Install the SDK, create a client, evaluate your first AI output. Three steps.
pip install bookbag
npm install @bookbag/sdk
from bookbag import BookbagClient

client = BookbagClient(api_key="bk_gate_xxx")

result = client.gate.evaluate(
    input="What is my refund policy?",
    output="You can get a full refund within 90 days, no questions asked.",
    context={"channel": "support_chat"},
    metadata={"session_id": "abc123", "model": "gpt-4o-mini"}
)

if result.policy_action == "block":
    fallback_response()     # Critical issue — don't send
elif result.policy_action == "review":
    send_with_flag(result)  # Minor issue — flag for review
else:
    send_response(output)   # Safe to send

const { BookbagClient } = require('@bookbag/sdk')
const client = new BookbagClient({ apiKey: 'bk_gate_xxx' })

const result = await client.gate.evaluate({
  input: 'What is my refund policy?',
  output: 'You can get a full refund within 90 days, no questions asked.',
  context: { channel: 'support_chat' },
  metadata: { session_id: 'abc123', model: 'gpt-4o-mini' }
})

if (result.policy_action === 'block') fallbackResponse()
else if (result.policy_action === 'review') sendWithFlag(result)
else sendResponse(output)

Response Object
Every evaluation returns a structured response with the decision, scores, and audit trail.
{
  "decision": "flag",               // allow | flag | block | queued
  "risk": "medium",                 // low | medium | high
  "flags": ["hallucination"],       // triggered failure categories
  "policy_action": "review",        // allow | review | block | require_sme
  "enforced": false,                // advisory or enforced mode
  "audit_id": "gate_eval_456_789",  // unique compliance trail ID
  "task_id": 456,                   // visible in admin dashboard
  "confidence": 0.82,               // 0.0 — 1.0
  "scores": {                       // rubric scores from your taxonomy
    "correctness": 3,
    "tone": 5,
    "safety": 5
  },
  "rationale": "The stated refund timeframe (90 days) is not supported by policy documents.",
  "evaluation_ms": 2340             // wall-clock latency
}

decision
What the AI evaluator concluded. allow = safe, flag = minor issues, block = critical problems, queued = awaiting human review (Human mode).
policy_action
What your app should do. Policy rules can override the raw decision — always act on this field.
confidence
AI evaluator confidence (0–1). Below 0.7 often routes to human review. Use for routing decisions.
flags
Specific failure categories triggered. Defined by your taxonomy — hallucination, compliance, tone, safety, etc.
scores
Per-dimension rubric scores from your taxonomy template. Track quality trends across dimensions over time.
audit_id
Unique evaluation ID for compliance. Log alongside your request ID for full traceability.
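Putting these fields together, the routing pattern from the quick start can be sketched as a small helper that acts on policy_action first and then demotes low-confidence allows to review. The 0.7 threshold, the route_evaluation name, and the returned action strings are illustrative, not part of the SDK:

```python
REVIEW_CONFIDENCE_THRESHOLD = 0.7  # illustrative; tune per project

def route_evaluation(result: dict) -> str:
    """Map a Gate response dict to an app-level action.

    Always acts on policy_action (policy rules can override the raw
    decision), then demotes low-confidence allows to review.
    """
    action = result["policy_action"]
    if action == "block":
        return "fallback"       # critical; don't send
    if action == "review":
        return "send_flagged"   # send, but queue for human review
    if result["confidence"] < REVIEW_CONFIDENCE_THRESHOLD:
        return "send_flagged"   # allowed, but the evaluator wasn't sure
    return "send"

# The sample response above (review, confidence 0.82) routes to:
print(route_evaluation({"policy_action": "review", "confidence": 0.82}))  # send_flagged
```

Logging result["audit_id"] next to whatever your app chose here gives you the full trace from evaluation to delivery.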
Evaluation Depth
Choose how many AI passes run per evaluation. More passes improve accuracy but increase latency and cost.
Single Pass
One LLM call. Fast triage against your taxonomy.
Two Pass
Stage 1 annotates, Stage 2 verifies. Catches what Stage 1 missed.
Three Pass
Full pipeline with expert review on disagreements and high-severity findings.
Per-Stage Model Selection
Each stage can use a different model. Optimize cost vs quality.
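As a sketch of what per-stage selection could look like, here is a plain dict mapping each pass to a model. The key names (depth, stage_models) and the stage labels are hypothetical; the real configuration lives in your project settings, not in this exact shape:

```python
# Hypothetical configuration shape, for illustration only.
gate_config = {
    "depth": "three_pass",            # single_pass | two_pass | three_pass
    "stage_models": {
        "annotate": "gpt-4o-mini",    # Stage 1: fast, cheap triage
        "verify": "gpt-4o",           # Stage 2: verifies Stage 1's findings
        "expert_review": "gpt-4o",    # Stage 3: disagreements, high severity
    },
}
```

The usual trade-off is a cheap model early and stronger models late: most outputs exit at Stage 1, so the expensive passes only run on the hard cases.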
Review Modes
Configure per project. Switch between modes without code changes.
Automated
Full AI, real-time. Decision returned synchronously. No human involvement. Best for high-volume screening.
Assisted
AI returns decision immediately. Flagged items queued for async human review. Best for production + continuous improvement.
Human
Task queued for expert reviewer. Returns decision: "queued". Best for high-stakes domains.
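In Human mode (and for flagged items in Assisted mode) your code must handle the queued decision, since no verdict exists yet. A minimal sketch on a plain response dict, where the held: marker stands in for however your app parks a response while it waits on a poll or webhook:

```python
def handle_decision(result: dict) -> str:
    """Dispatch on the evaluator decision, including the queued
    state returned in Human review mode."""
    decision = result["decision"]
    if decision == "queued":
        # No verdict yet: hold the response and resolve it later,
        # keyed on audit_id so the human verdict can be matched up.
        return f"held:{result['audit_id']}"
    if decision == "block":
        return "blocked"
    return "sent"  # allow and flag both ship the response

print(handle_decision({"decision": "queued", "audit_id": "gate_eval_456_789"}))
# held:gate_eval_456_789
```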
Error Handling
Typed exceptions for every failure mode. Choose fail-open or fail-closed.
AuthenticationError
Invalid or expired API key. HTTP 401.
RateLimitError
Too many requests. HTTP 429. Includes reset_time.
InsufficientCreditsError
Account out of credits. Top up in billing settings.
BookbagError
Base exception. Catch-all for server errors and network issues.
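For example, RateLimitError's reset_time can drive a simple wait-and-retry loop. The stand-in exception class below only mirrors the documented attribute so the sketch runs standalone; in real code, import the class from the SDK instead:

```python
import time

class RateLimitError(Exception):
    """Stand-in for the SDK's RateLimitError (HTTP 429).
    Assumes reset_time is an epoch timestamp; check the SDK docs."""
    def __init__(self, reset_time: float):
        super().__init__("rate limited")
        self.reset_time = reset_time

def evaluate_with_retry(evaluate, max_retries: int = 2):
    """Call `evaluate` (e.g. a client.gate.evaluate closure),
    sleeping until reset_time whenever the API rate-limits us."""
    for attempt in range(max_retries + 1):
        try:
            return evaluate()
        except RateLimitError as err:
            if attempt == max_retries:
                raise  # out of retries; surface the 429
            time.sleep(max(0.0, err.reset_time - time.time()))
```

Usage: `evaluate_with_retry(lambda: client.gate.evaluate(...))`. Authentication and credit errors are not retried here on purpose; retrying won't fix a bad key or an empty balance.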
Fail Open
If Gate is down, send the response anyway. Availability over safety.
try:
    result = client.gate.evaluate(...)
except BookbagError:
    send_response(output)  # Gate down — send anyway

Fail Closed
If Gate is down, block the response. Safety over availability.
try:
    result = client.gate.evaluate(...)
except BookbagError:
    fallback_response()  # Gate down — don't risk it

Start Evaluating in Minutes
Install the SDK, get an API key, make your first evaluation call. Developer tier is free — 100 credits/month, no credit card required.