Bookbag

Every AI Output.
Evaluated Before It Ships.

Bookbag is the evaluation platform that sits between your models and your users. Real-time quality gates via API. Multi-stage AI pipeline with human oversight. Complete audit trails and training data — from chatbots to medical AI.

allow
flag
block
require_sme
Python
from bookbag import BookbagClient
client = BookbagClient(api_key="...")
result = client.gate.evaluate(input, output)
SOC 2 Type II
Zero-Dependency SDKs
1–4s Evaluation Latency
Python + Node.js
Developer-First

Ship Safer AI with Three Lines of Code

The Gate API evaluates every AI output against your taxonomy in real time. Python and Node.js SDKs with zero external dependencies. Advisory or enforced mode. Fail-open or fail-closed. Returns a decision in 1–4 seconds.

Python
from bookbag import BookbagClient

client = BookbagClient(api_key="bk_gate_xxx")

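user_input = "What is my refund policy?"
ai_output = "Full refund within 90 days."

# Evaluate the model's answer before it reaches the user
result = client.gate.evaluate(input=user_input, output=ai_output)

# fallback_response and send_response are your application's own handlers
if result.policy_action == "block":
    fallback_response()  # Critical issue: serve a safe fallback instead
else:
    send_response(ai_output)  # Safe to ship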
Node.js
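const { BookbagClient } = require('@bookbag/sdk')
const client = new BookbagClient({ apiKey: 'bk_gate_xxx' })

// await needs an async function when the module is loaded with require()
async function gateResponse(input, output) {
    const result = await client.gate.evaluate({ input, output })

    // fallbackResponse and sendResponse are your application's own handlers
    if (result.policy_action === 'block') return fallbackResponse()
    return sendResponse(output)
}

gateResponse('What is my refund policy?', 'Full refund within 90 days.')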
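
Fail-open versus fail-closed comes down to what your app does when the evaluation call itself fails. A minimal sketch, assuming a hypothetical `evaluate` callable that returns a policy action string and raises on a timeout or network error. The function and parameter names here are illustrative and are not part of the Bookbag SDK:

```python
from typing import Callable


def gate_with_fallback(
    evaluate: Callable[[str, str], str],
    user_input: str,
    ai_output: str,
    fail_open: bool = True,
) -> str:
    """Return a policy action, choosing a default if evaluation fails."""
    try:
        return evaluate(user_input, ai_output)
    except Exception:
        # Fail-open lets the response through; fail-closed blocks it.
        return "allow" if fail_open else "block"
```

Fail-open favors availability. Fail-closed favors safety, which tends to suit higher-risk outputs.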

One Platform. Every AI Output Covered.

From real-time API gates to batch analysis, from AI-only evaluation to full human review — the infrastructure to evaluate, annotate, improve, and audit every AI output your organization produces.

Real-Time Quality Gates

Gate API + SDK. Every AI response evaluated in 1–4 seconds. Allow, flag, or block before it reaches users.

Multi-Stage Evaluation Pipeline

Fast (1-pass), Standard (2-pass), Deep (3-pass). Per-stage model selection — cheap model for triage, smart model for edge cases.
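
One way to picture per-stage model selection is a small configuration that pairs each pass with a model. This is a sketch under assumptions: the stage names are borrowed from the stage descriptions on this page, the model names are placeholders, and none of it reflects Bookbag's actual configuration format:

```python
from dataclasses import dataclass
from typing import List

# Pass counts per depth: Fast, Standard, Deep.
PASSES = {"fast": 1, "standard": 2, "deep": 3}
# Assumed stage names; the real identifiers may differ.
STAGE_NAMES = ["triage", "qa_verification", "expert_review"]


@dataclass(frozen=True)
class Stage:
    name: str
    model: str


def build_pipeline(depth: str, models: List[str]) -> List[Stage]:
    """Pair each pass of the chosen depth with the model that runs it."""
    passes = PASSES[depth]
    if len(models) < passes:
        raise ValueError(f"{depth} needs {passes} model(s), got {len(models)}")
    return [Stage(STAGE_NAMES[i], models[i]) for i in range(passes)]


# Cheap model for triage, stronger model for the second pass (placeholder names).
pipeline = build_pipeline("standard", ["small-model", "large-model"])
```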

Three Review Modes

AI-only for speed. Human-only for gold standard. Hybrid for production with continuous improvement.

Customizable Taxonomies

Define what matters — hallucination, compliance, tone, safety. Rubric templates for any domain. Version-stamped for audit.

Training Data Generation

Every correction becomes SFT, DPO, or ranking data. Export to fine-tune your models. Close the feedback loop.
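
To make the correction-to-training-data idea concrete, here is a hedged sketch of how one reviewer correction could become an SFT example and a DPO preference pair. The field names (`prompt`, `completion`, `chosen`, `rejected`) follow a convention common in open-source fine-tuning tools. Bookbag's own export schema is not shown on this page, so treat the shape as an assumption:

```python
from typing import Dict, Tuple


def correction_to_records(
    prompt: str, model_output: str, corrected_output: str
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Turn one human correction into an SFT example and a DPO pair."""
    # SFT: train directly on the corrected answer.
    sft = {"prompt": prompt, "completion": corrected_output}
    # DPO: prefer the corrected answer over the original model output.
    dpo = {"prompt": prompt, "chosen": corrected_output, "rejected": model_output}
    return sft, dpo


sft_example, dpo_pair = correction_to_records(
    "What is my refund policy?",
    "Full refund within 90 days.",
    "Full refund within 30 days of purchase.",
)
```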

Complete Audit Trail

Every evaluation logged with full provenance. Who reviewed, when, which rubric, what decision. Compliance-ready from day one.
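
As an illustration of what "full provenance" might look like in your own logs, the sketch below captures the four facts named above: who reviewed, when, which rubric, and what decision. The class and field names are hypothetical and do not describe Bookbag's stored format:

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    reviewer: str        # who reviewed
    reviewed_at: str     # when, as an ISO 8601 UTC timestamp
    rubric_version: str  # which rubric
    decision: str        # what decision


def make_record(reviewer: str, rubric_version: str, decision: str) -> AuditRecord:
    """Stamp a review decision with the current UTC time."""
    now = datetime.now(timezone.utc).isoformat()
    return AuditRecord(reviewer, now, rubric_version, decision)


def to_json_line(record: AuditRecord) -> str:
    """Serialize a record as one JSON line for an append-only log."""
    return json.dumps(asdict(record))
```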

How It Works

From API call to decision in seconds. Your AI generates → Bookbag evaluates → Your app enforces.

01

Your AI generates

Chatbot response, copilot suggestion, agent action, content draft — any AI output headed to users.

02

SDK evaluates

One API call sends input + output to the evaluation pipeline. Python, Node.js, or REST.

03

Multi-stage scoring

Stage 1: fast triage. Stage 2: QA verification. Stage 3: expert review. Each stage uses the model you choose.

04

Policy decides

Your rules translate scores into actions: allow, flag, block, or require SME. Advisory or enforced.

05

Your app enforces

Act on the decision. Full audit trail persisted automatically. Every evaluation searchable and exportable.
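
Steps 04 and 05 can be sketched as two small functions: one maps a score to an action, the other acts on it. Everything specific here is an assumption made for illustration: scores run from 0 to 1 with higher meaning better, the thresholds are arbitrary, and advisory mode is taken to mean the decision is recorded but the output is still sent. Action names follow the allow, flag, block, and require_sme labels shown at the top of this page:

```python
def decide(score: float, needs_expert: bool = False) -> str:
    """Map an evaluation score to a policy action (hypothetical thresholds)."""
    if needs_expert:
        return "require_sme"
    if score < 0.3:
        return "block"
    if score < 0.7:
        return "flag"
    return "allow"


def enforce(action: str, output: str, fallback: str, enforced: bool = True) -> str:
    """Return the text to send to the user for a given action."""
    # Enforced mode withholds blocked or expert-gated output.
    # Advisory mode (assumed semantics) sends the output regardless.
    if enforced and action in {"block", "require_sme"}:
        return fallback
    return output
```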

Three Ways to Evaluate

Choose the review mode that fits your risk profile, volume, and quality requirements. Switch modes per project.

Automated

Full AI evaluation, real-time. Decision returned synchronously via Gate API. No human involvement. Results in 1–4 seconds.

Best for: High-volume screening, chatbots, content generation

Assisted

AI evaluates and returns a decision immediately. Flagged items are queued for human review in the background. Best of both worlds.

Best for: Production AI with continuous human oversight
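
The Assisted flow can be pictured as "answer now, review later". A minimal sketch, assuming a hypothetical in-process queue stands in for the human review backlog. The function name and record shape are illustrative only:

```python
import queue

# Stand-in for the background human review backlog.
review_queue = queue.Queue()


def assisted_decision(item_id: str, action: str) -> str:
    """Return the AI decision immediately; queue flagged items for review."""
    if action == "flag":
        review_queue.put({"item_id": item_id, "action": action})
    return action
```

The caller never waits on a reviewer: the action comes back right away, and only flagged items accumulate for humans to work through.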

Human

Expert human review on every item. Three-tier workflow: annotator, QA reviewer, subject matter expert. Gold-standard quality.

Best for: Healthcare, legal, finance, training data creation

Built For Teams Deploying AI

Whether you're shipping a chatbot, operating in a regulated industry, or building AI products — Bookbag provides the evaluation infrastructure you need.

AI-First Companies

Deploying chatbots, copilots, or AI agents? Gate every response. Catch hallucinations before users do. Build trust with systematic evaluation.

Chatbots, copilots, AI agents, content systems

Regulated Industries

Healthcare, finance, legal, government. Every AI decision needs an audit trail. Every evaluation needs documented human oversight.

FinServ, healthcare, legal, insurance, government

AI Vendors & Platforms

Build quality into your product. Ship evaluation as a feature. Unblock enterprise deals with audit trails and governance built in.

AI platforms, SaaS tools, OEM partners

Enterprise ML Teams

Systematic evaluation across models. Generate SFT, DPO, and ranking datasets from every correction. Close the feedback loop between production and training.

Model evaluation, training data, fine-tuning

Credit-Based Pricing That Scales

Developer tier: Free — 100 credits/month with Gate API access.

Paid plans from $6,000/month with advanced evaluation depth, workforce management, and enterprise features.

Ready to evaluate your AI?

Join the teams shipping safer AI with real-time evaluation, audit trails, and continuous improvement.