Bookbag
Gate API + SDK

Evaluate Every AI Output
Before It Ships

The Gate API sits between your AI and your users. Send any input/output pair, get back a decision in 1–4 seconds. Python and Node.js SDKs with zero external dependencies.

Quick Start

Install the SDK, create a client, evaluate your first AI output. Three steps.

Install
Python
pip install bookbag
Node.js
npm install @bookbag/sdk
Python
from bookbag import BookbagClient

client = BookbagClient(api_key="bk_gate_xxx")

user_input = "What is my refund policy?"
ai_output = "You can get a full refund within 90 days, no questions asked."

result = client.gate.evaluate(
    input=user_input,
    output=ai_output,
    context={"channel": "support_chat"},
    metadata={"session_id": "abc123", "model": "gpt-4o-mini"}
)

if result.policy_action == "block":
    fallback_response()          # Critical issue — don't send
elif result.policy_action == "review":
    send_with_flag(result)       # Minor issue — flag for review
else:
    send_response(ai_output)     # Safe to send
Node.js
const { BookbagClient } = require('@bookbag/sdk')

const client = new BookbagClient({ apiKey: 'bk_gate_xxx' })

const userInput = 'What is my refund policy?'
const aiOutput = 'You can get a full refund within 90 days, no questions asked.'

const result = await client.gate.evaluate({
    input: userInput,
    output: aiOutput,
    context: { channel: 'support_chat' },
    metadata: { session_id: 'abc123', model: 'gpt-4o-mini' }
})

if (result.policy_action === 'block') fallbackResponse()
else if (result.policy_action === 'review') sendWithFlag(result)
else sendResponse(aiOutput)

Response Object

Every evaluation returns a structured response with the decision, scores, and audit trail.

JSON Response
{
  "decision": "flag",              // allow | flag | block | queued
  "risk": "medium",                // low | medium | high
  "flags": ["hallucination"],      // triggered failure categories
  "policy_action": "review",       // allow | review | block | require_sme
  "enforced": false,               // advisory or enforced mode
  "audit_id": "gate_eval_456_789", // unique compliance trail ID
  "task_id": 456,                  // visible in admin dashboard
  "confidence": 0.82,              // 0.0 — 1.0
  "scores": {                      // rubric scores from your taxonomy
    "correctness": 3,
    "tone": 5,
    "safety": 5
  },
  "rationale": "The stated refund timeframe (90 days) is not supported by policy documents.",
  "evaluation_ms": 2340            // wall-clock latency
}

decision

What the AI evaluator concluded. allow = safe, flag = minor issues, block = critical problems, queued = awaiting human review.

policy_action

What your app should do. Policy rules can override the raw decision — always act on this field.

confidence

AI evaluator confidence (0–1). Below 0.7 often routes to human review. Use for routing decisions.

flags

Specific failure categories triggered. Defined by your taxonomy — hallucination, compliance, tone, safety, etc.

scores

Per-dimension rubric scores from your taxonomy template. Track quality trends across dimensions over time.

audit_id

Unique evaluation ID for compliance. Log alongside your request ID for full traceability.
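The fields above translate directly into a routing helper. A minimal sketch operating on the raw JSON dict; the 0.7 confidence threshold and the function name are illustrative choices, not SDK defaults:

```python
# Sketch: route on policy_action first, then confidence.
# Field names match the example response above.

def route(evaluation: dict) -> str:
    """Return the action your app should take for one evaluation."""
    action = evaluation["policy_action"]   # always act on policy_action
    if action == "block":
        return "fallback"
    # Low-confidence allows can still be worth a human look.
    if action == "review" or evaluation["confidence"] < 0.7:
        return "human_review"
    return "send"

evaluation = {"policy_action": "review", "confidence": 0.82}
print(route(evaluation))  # human_review
```

Log `audit_id` next to your own request ID at whichever branch actually sends, so the compliance trail covers the final action, not just the evaluation.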

Evaluation Depth

Choose how many AI passes run per evaluation. More passes improve accuracy but increase latency and cost.

Single Pass

One LLM call. Fast triage against your taxonomy.

Latency: 500–2000ms
Cost: 1x Stage 1 model
Best for: High volume

Two Pass

Stage 1 annotates, Stage 2 verifies. Catches what Stage 1 missed.

Latency: 1500–4000ms
Cost: 1–2x (conditional)
Best for: Production AI

Three Pass

Full pipeline with expert review on disagreements and high-severity findings.

Latency: 3000–10000ms
Cost: 1–3x (conditional)
Best for: Compliance-critical
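The trade-offs above can be sketched as a selection helper. This is a minimal sketch: the thresholds mirror the latency figures in the tables, and the depth names are illustrative labels, not documented SDK values:

```python
# Sketch: pick an evaluation depth from the latency/cost trade-offs above.
# Depth labels are illustrative; check your project settings for the real ones.

def choose_depth(latency_budget_ms: int, compliance_critical: bool) -> str:
    if compliance_critical:
        return "three_pass"        # expert review on disagreements
    if latency_budget_ms >= 4000:
        return "two_pass"          # Stage 2 verifies Stage 1
    return "single_pass"           # fast triage, high volume

print(choose_depth(2000, False))   # single_pass
```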

Per-Stage Model Selection

Each stage can use a different model. Optimize cost vs quality.

Stage 1 — Annotation
gpt-4o-mini (0.10 credits/call)
Fast, cheap — handles the bulk
Stage 2 — QA Verification
gpt-4o (0.25 credits/call)
Smarter — catches edge cases
Stage 3 — Expert Review
o3 (1.00 credits/call)
Best available — hardest cases
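Because Stage 2 and Stage 3 only run conditionally, spend depends on your escalation rates. A back-of-envelope estimator using the per-call prices listed above; the escalation rates passed in are illustrative, so plug in your own observed figures:

```python
# Sketch: estimated credit spend per 1,000 evaluations.
# Per-call prices come from the stage list above.

STAGE_CREDITS = {"stage1": 0.10, "stage2": 0.25, "stage3": 1.00}

def credits_per_1000(stage2_rate: float, stage3_rate: float) -> float:
    per_call = (STAGE_CREDITS["stage1"]
                + stage2_rate * STAGE_CREDITS["stage2"]
                + stage3_rate * STAGE_CREDITS["stage3"])
    return round(per_call * 1000, 2)

# e.g. 20% of calls escalate to Stage 2, 5% reach Stage 3
print(credits_per_1000(0.2, 0.05))  # 200.0
```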

Review Modes

Configure per project. Switch between modes without code changes.

Automated

Full AI, real-time. Decision returned synchronously. No human involvement. Best for high-volume screening.

Assisted

AI returns decision immediately. Flagged items queued for async human review. Best for production + continuous improvement.

Human

Task queued for expert reviewer. Returns decision: "queued". Best for high-stakes domains.
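In Human mode your code must handle the "queued" decision rather than a final verdict. A minimal sketch; how you later retrieve the expert's result (polling, webhook) is an assumption to verify against your dashboard, and the helper name is illustrative:

```python
# Sketch: hold responses whose evaluation is pending expert review.
# Field names match the response object; the flow around them is assumed.

def handle(evaluation: dict) -> str:
    if evaluation["decision"] == "queued":
        # Expert review pending: hold the response, track it by task_id.
        return f"pending:{evaluation['task_id']}"
    return evaluation["policy_action"]

print(handle({"decision": "queued", "task_id": 456}))  # pending:456
```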

Error Handling

Typed exceptions for every failure mode. Choose fail-open or fail-closed.

AuthenticationError

Invalid or expired API key. HTTP 401.

RateLimitError

Too many requests. HTTP 429. Includes reset_time.

InsufficientCreditsError

Account out of credits. Top up in billing settings.

BookbagError

Base exception. Catch-all for server errors and network issues.
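Since RateLimitError carries a reset_time, 429s can be retried instead of failed. A minimal sketch: RateLimitError is stubbed here so the snippet runs standalone (in real code, import it from the SDK), and treating reset_time as a Unix timestamp is an assumption to verify:

```python
import time

# Sketch: retry after a 429 using reset_time.
# The stubbed exception below stands in for the SDK's RateLimitError.

class RateLimitError(Exception):
    def __init__(self, reset_time: float):
        self.reset_time = reset_time

def evaluate_with_retry(call, attempts: int = 3):
    for _ in range(attempts):
        try:
            return call()
        except RateLimitError as err:
            # Sleep until the window resets, then try again.
            time.sleep(max(0.0, err.reset_time - time.time()))
    raise RuntimeError("rate limited after retries")

print(evaluate_with_retry(lambda: "ok"))  # ok
```

In production, `call` would be a closure over `client.gate.evaluate(...)`; let AuthenticationError and InsufficientCreditsError propagate, since retrying cannot fix them.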

Fail Open

If Gate is down, send the response anyway. Availability over safety.

from bookbag import BookbagError

try:
    result = client.gate.evaluate(...)
except BookbagError:
    send_response(output)  # Gate down — send anyway

Fail Closed

If Gate is down, block the response. Safety over availability.

from bookbag import BookbagError

try:
    result = client.gate.evaluate(...)
except BookbagError:
    fallback_response()  # Gate down — don't risk it

Start Evaluating in Minutes

Install the SDK, get an API key, make your first evaluation call. Developer tier is free — 100 credits/month, no credit card required.