Two Ways to Integrate
Real-time API for production systems. Batch upload for audits and training data. Use both.
Multi-Stage Evaluation
Choose evaluation depth per project. Fast screening for high-volume, deep analysis for high-stakes. Per-stage model selection optimizes cost and quality.
Single Pass
Fast screening with one evaluation stage. Best for high-volume content where speed matters most.
Two Pass
Balanced depth. First pass triages, second pass evaluates flagged items in detail. Best quality-to-cost ratio.
Three Pass
Maximum depth. Three evaluation stages with escalating model capability. For regulated and high-stakes decisions.
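The three pipeline depths can be pictured as escalation configs. This is a minimal sketch, assuming each stage pairs a model tier with a confidence threshold and an item escalates only when the current stage can't clear its bar; the stage names, model labels, and thresholds here are illustrative, not Bookbag's actual schema.

```python
# Illustrative per-stage model selection (not Bookbag's real API).
PIPELINES = {
    "single_pass": [{"model": "fast-screen", "threshold": 0.90}],
    "two_pass": [
        {"model": "fast-screen", "threshold": 0.70},  # triage
        {"model": "deep-eval",   "threshold": 0.90},  # detailed review of flagged items
    ],
    "three_pass": [
        {"model": "fast-screen", "threshold": 0.60},
        {"model": "deep-eval",   "threshold": 0.80},
        {"model": "expert-eval", "threshold": 0.95},  # regulated / high-stakes
    ],
}

def run_pipeline(pipeline_name, score_fn):
    """Escalate through stages until one clears its confidence threshold."""
    for stage in PIPELINES[pipeline_name]:
        score = score_fn(stage["model"])
        if score >= stage["threshold"]:
            return ("allow", stage["model"], score)
    # No stage was confident enough: surface the item for review.
    return ("flag", stage["model"], score)
```

The cost/quality trade-off falls out of the structure: most items exit at the cheap first stage, and only hard cases reach the expensive models.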
Three Review Modes
Choose the level of oversight that fits your risk profile. Switch modes per project.
Automated
Full AI evaluation, synchronous response. Decision returned via Gate API in 1-4 seconds. No human involvement.
Assisted
AI evaluates and returns a decision immediately. Flagged items are queued for human review in the background. Best of both worlds.
Human
Expert human review on every item. Three-tier workflow: annotator, QA reviewer, subject matter expert. Gold-standard quality.
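The three modes differ only in what happens after the AI's decision. A minimal routing sketch, assuming the mode is a per-project setting (the function and return strings below are illustrative, not Bookbag's API):

```python
# Sketch of the three review modes as routing logic.
def route(decision, mode):
    """Decide what happens to an evaluated item under each review mode."""
    if mode == "automated":
        return decision                      # AI decision is final
    if mode == "assisted":
        # AI decision returned immediately; non-allowed items also queue for humans
        return decision if decision == "allow" else f"{decision}+human_queue"
    if mode == "human":
        return "human_review"                # every item goes to a reviewer
    raise ValueError(f"unknown mode: {mode}")
```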
Integrate in Minutes
Install the SDK, create a client, and evaluate your first AI output. Python and Node.js with zero external dependencies. Advisory or enforced mode.
View full API documentation

from bookbag import BookbagClient

client = BookbagClient(
    api_key="bk_gate_xxx"
)

result = client.gate.evaluate(
    input="What is my refund policy?",
    output="Full refund within 90 days."
)

# result.decision: allow | flag | block
# result.scores, result.flags, result.audit_id

What Comes Back With Every Evaluation
Every evaluation — whether via real-time API or batch upload — returns a structured data package. Not just a label.
- Failure Analysis: hallucination, factual error, policy violation, tone issue, over-promising, each with severity and business impact
- Rubric Scores: scored 1-5 on correctness, tone, personalization, policy compliance, and confidence
- Gold-Standard Rewrites: corrected responses with explanations; for blocked items, SME rationale and evidence citations
- Training Data Export: SFT pairs, DPO preference data, and ranking signals, structured for model fine-tuning
- Complete Audit Trail: who evaluated, when, which taxonomy version, what decision; immutable and searchable
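In application code, that package drives the action you take on each output. The field names (decision, scores, flags, audit_id) come from the Gate API example above; the exact shapes of scores and flags, and the handler logic, are assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class GateResult:
    decision: str                                 # "allow" | "flag" | "block"
    scores: dict = field(default_factory=dict)    # rubric scores, 1-5
    flags: list = field(default_factory=list)     # failure categories
    audit_id: str = ""

def handle(result: GateResult) -> str:
    """Illustrative consumer: act on the structured evaluation package."""
    if result.decision == "block":
        return f"withheld; audit {result.audit_id}"
    # Even allowed items can be escalated on a weak rubric score.
    if result.decision == "flag" or min(result.scores.values(), default=5) < 3:
        return "queued for human review"
    return "delivered"
```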
Customizable Taxonomies
Define what matters for your domain. Configure rubrics, failure categories, and policies per project. Version-stamped for audit compliance.
Project-Level Configuration
Each project has its own rubrics, failure categories, and evaluation criteria. Switch configurations per campaign, client, or domain.
Version-Stamped Policies
Every evaluation is logged with the exact taxonomy version used. Trace back to the policy in effect at the time.
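Version stamping amounts to recording the taxonomy version next to every decision. A sketch under assumed field names (not Bookbag's stored schema):

```python
import datetime

def audit_record(decision: str, taxonomy_version: str) -> dict:
    """Stamp each evaluation with the taxonomy version in effect."""
    return {
        "decision": decision,
        "taxonomy_version": taxonomy_version,  # e.g. "refunds-v7" (hypothetical)
        "evaluated_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
```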
Built-in Templates
Start with pre-built templates across 10 AI QA categories. Customize for your domain, or test your skills with 50 interactive quizzes.
Your AI Gets Better Over Time
Every correction becomes training data. Export in standard ML formats to retrain your models.
- Input → approved output pairs for fine-tuning your base model
- Preference pairs: chosen vs. rejected outputs for RLHF training
- Multiple outputs ranked by quality for reward model training
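The three export shapes above can be sketched as one JSONL record each. The keys follow common fine-tuning conventions (SFT prompt/completion, DPO chosen/rejected); Bookbag's actual export schema may differ.

```python
import json

# SFT pair: input plus the approved output.
sft = {"prompt": "What is my refund policy?",
       "completion": "Full refund within 90 days."}

# DPO preference pair: chosen vs. rejected output for the same prompt.
dpo = {"prompt": "What is my refund policy?",
       "chosen": "Full refund within 90 days.",
       "rejected": "Refunds are handled case by case, maybe."}

# Ranking record: multiple outputs ordered by quality for reward modeling.
ranking = {"prompt": "What is my refund policy?",
           "outputs": ["Full refund within 90 days.",
                       "Refunds within 90 days for most items.",
                       "No idea, sorry."],
           "ranks": [1, 2, 3]}

lines = "\n".join(json.dumps(r) for r in (sft, dpo, ranking))  # JSONL export
```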
Human Review Workflow
When items are flagged or queued for human review, Bookbag's 3-tier workflow routes them to the right expertise.
Annotator Review
First-pass evaluation against your rubrics. Fast, structured workflows for high-volume review.
- Task-based queue
- Rubric-guided evaluation
- Quick approve / reject / escalate
QA Review
Rewrite, approve, or escalate. Corrections become gold-standard examples and training data.
- Edit and approve workflow
- Create approved templates
- Export training data
SME Approval
Subject matter experts make final calls on high-risk items. Full provenance and evidence trails.
- Blocked-only items
- Requires rationale + evidence
- Audit-ready recordkeeping
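The escalation path through the three tiers is strictly one-directional. A tiny sketch (tier names from the workflow above; the routing rule itself is illustrative):

```python
TIERS = ["annotator", "qa", "sme"]

def escalate(current: str) -> str:
    """Move an item one tier up; SME is the final stop."""
    i = TIERS.index(current)
    return TIERS[min(i + 1, len(TIERS) - 1)]
```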
Get Started
Two paths to production. Choose API-first for real-time integration, or batch upload for audits and training data.
What you can launch in 2 weeks
Ready to evaluate your AI?
Join the teams shipping safer AI with real-time evaluation, audit trails, and continuous improvement.