# Data sources

> Everything your agent knows comes from data sources: website crawls, file uploads, pasted text, and Q&A pairs. Learn each type, how training (extract, chunk, embed, index) works, how Q&A pairs take priority, and how to keep knowledge fresh with retraining.

A Bookbag agent only answers from what you teach it. **Data sources** are how you teach it. Each source you add is extracted, split into chunks, embedded into a vector index, and made retrievable — so when a customer asks a question, the agent pulls the most relevant pieces of *your* content and answers from those.

You manage sources from an agent's **Data sources** tab. This page covers the four source types, what happens when a source trains, how Q&A pairs take priority, and how to keep everything current.

> **WHERE DATA SOURCES LIVE:** Sources are scoped to a single agent. Two agents in the same workspace each have their own knowledge — add a source to the agent that should answer from it. Manage them at [app.bookbag.ai](https://app.bookbag.ai) under your agent's **Data sources** tab.

## The four source types

Bookbag supports four kinds of data source. Most stores use a mix: a website crawl for the bulk of content, files for catalogs and policy PDFs, and a handful of Q&A pairs to pin the answers that must be exact.

| Type | What it ingests | Best for |
| --- | --- | --- |
| Website | A bounded multi-page crawl that follows links from a starting URL and extracts the readable text of each page. | Help centers, FAQ pages, policy and shipping pages. |
| File | An uploaded document (PDF, doc, spreadsheet, text). Bookbag extracts and chunks the text. | Product catalogs, manuals, policy documents, exported macros. |
| Text | A snippet you paste directly into a text box. | A policy that isn't written down anywhere public yet. |
| Q&A | An exact question paired with the exact answer you want returned. | High-stakes answers: refund window, warranty terms, shipping cut-offs. |

### Website crawl

Enter a URL and Bookbag crawls outward from it, following links and extracting the readable content of each page it finds — stripping navigation, footers, and boilerplate. A single website source can produce many documents (one per page) and many chunks.
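
The bounded crawl described above can be pictured as a breadth-first traversal with a page cap. The following is a minimal sketch, not Bookbag's actual crawler: the in-memory `SITE` link graph stands in for real HTTP fetching, and `bounded_crawl` and `max_pages` are hypothetical names chosen for illustration.

```python
from collections import deque

# Hypothetical site: each URL maps to the links found on that page.
SITE = {
    "/help": ["/help/shipping", "/help/returns"],
    "/help/shipping": ["/help", "/help/returns"],
    "/help/returns": ["/help/warranty"],
    "/help/warranty": [],
    "/orphan": [],  # nothing links here, so a crawl from /help never reaches it
}

def bounded_crawl(start: str, max_pages: int = 50) -> list[str]:
    """Breadth-first crawl from `start`, visiting at most `max_pages` pages."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)  # a real crawler would extract readable text here
        for link in SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

pages = bounded_crawl("/help")
print(pages)
```

In this sketch each visited page would become one document, and `/orphan` is never visited because no page reachable from the starting URL links to it.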

> **POINT THE CRAWLER AT CONTENT, NOT YOUR HOMEPAGE:** Your homepage is mostly marketing and navigation. Start the crawl at your help center or FAQ — it is the highest-signal content for support and produces far better retrieval.

### File upload

Upload a document and Bookbag extracts the text, chunks it, and embeds it. Good for content that lives in PDFs or spreadsheets — product specs, return-policy documents, internal macros you want the agent to draw on.

### Text snippet

Paste text straight in. This is the fastest way to capture a policy or process that lives in someone's head and isn't published anywhere the crawler can reach.

### Q&A pairs

A Q&A pair is an exact question and the exact answer you want the agent to give. Unlike crawled or uploaded text — which the model paraphrases — Q&A pairs are treated as **authoritative** and take priority during retrieval.

> **PIN THE ANSWERS THAT MUST BE EXACT:** Anything touching money, eligibility, or legal commitments belongs in a Q&A pair. A handful of well-chosen pairs (refund window, shipping times, warranty terms) eliminates the most damaging category of wrong answers.

## How training works

When you add or retrain a source, Bookbag runs it through an ingestion pipeline. The same four steps run for every non-Q&A source:

1. **Extract** — Bookbag pulls the readable text out of the source — crawling pages for a website, parsing a file, or reading your pasted text.
2. **Chunk** — The text is split into smaller passages sized for retrieval. Tight, single-topic chunks retrieve better than one giant page.
3. **Embed** — Each chunk is turned into a vector with the agent's embedding model, capturing its meaning so similar questions match it.
4. **Index** — The vectors are stored in the agent's vector index, ready to be searched at query time.
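
The four steps above can be sketched end to end in a few lines. This is an illustrative toy under stated assumptions, not Bookbag's implementation: the bag-of-words `embed` function stands in for a real embedding model, the fixed word-count `chunk` function is one simple chunking strategy among many, and the index is a plain Python list.

```python
import math
import re
from collections import Counter

def chunk(text: str, max_words: int = 40) -> list[str]:
    """Split text into passages of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def train(extracted_text: str, max_words: int = 40) -> list[tuple[str, Counter]]:
    """Extracted text -> chunks -> vectors -> index (a plain list here)."""
    return [(c, embed(c)) for c in chunk(extracted_text, max_words)]

def search(index: list[tuple[str, Counter]], question: str, top_k: int = 1) -> list[str]:
    """Return the `top_k` chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

# Hypothetical extracted text; tiny chunks keep the example readable.
extracted = (
    "Orders ship within two business days. "
    "Returns are accepted within thirty days of delivery."
)
index = train(extracted, max_words=6)
print(search(index, "When are returns accepted?"))
```

The search returns the chunk about returns because it shares the most terms with the question. A real embedding model matches on meaning rather than exact words, but the chunk, embed, index, and search flow has the same shape.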

Q&A pairs follow a shorter path: the **question** is embedded and stored alongside the approved answer, so a matching question returns that answer directly.

For the full picture of how indexed sources turn into trustworthy, cited answers at query time, see [Response quality](/docs/getting-started/response-quality).

### Source status

Each source shows a status as it moves through the pipeline:

| Status | Meaning |
| --- | --- |
| Queued | The source is waiting to be processed. |
| Processing | Bookbag is extracting, chunking, and embedding it now. |
| Trained | Done — the agent can answer from this source. |
| Error | Ingestion failed. The source shows the reason (for example, no extractable text or no crawlable pages). |

> **CHECK:** When a source reaches **Trained**, its content is live in the playground and on every connected channel immediately.

> **WHEN A SOURCE ERRORS:** The most common causes are a page with no extractable text (an image-only PDF, or a JavaScript-rendered page the crawler can't read) and a starting URL with no crawlable links. Fix the source or paste the content as a Text source instead, then retrain.

## Q&A priority and how retrieval chooses

At query time Bookbag first checks your Q&A pairs. If the customer's question closely matches a pair, that exact answer is returned and the agent skips paraphrasing entirely. Only when no Q&A pair is a strong match does Bookbag fall back to searching your chunked sources and grounding the model's reply in the top results.
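
The two-stage lookup described above can be sketched as follows. Everything here is an assumption made for illustration: the sample data, the `QA_THRESHOLD` value, and the use of `difflib.SequenceMatcher` and word overlap as stand-ins for vector similarity. It shows only the ordering (Q&A first, chunks as fallback), not how Bookbag scores matches.

```python
import re
from difflib import SequenceMatcher

# Hypothetical data for illustration only.
QA_PAIRS = {
    "What is your refund window?": "Refunds are available within 30 days of delivery.",
}
CHUNKS = [
    "We ship to Canada and the United States.",
    "Gift cards never expire.",
]
QA_THRESHOLD = 0.8  # assumed cut-off for a "strong match"

def similarity(a: str, b: str) -> float:
    """Stand-in for vector similarity: character-level match ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def words(text: str) -> set[str]:
    """Lowercased word set, used for a crude overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def answer(question: str) -> tuple[str, list[str]]:
    # Step 1: check Q&A pairs first and return the pinned answer on a strong match.
    best = max(QA_PAIRS, key=lambda q: similarity(question, q))
    if similarity(question, best) >= QA_THRESHOLD:
        return ("qa", [QA_PAIRS[best]])
    # Step 2: no strong match, so fall back to the chunked sources.
    ranked = sorted(CHUNKS, key=lambda c: len(words(question) & words(c)), reverse=True)
    return ("chunks", ranked[:1])

print(answer("what is your refund window?"))
print(answer("Do you ship to Canada?"))
```

The first question matches the stored pair, so the pinned answer comes back verbatim. The second has no strong Q&A match, so the sketch returns the best-matching chunk, which in the real system would ground the model's reply.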

This is why Q&A is the right tool for precision and crawls/files are the right tool for coverage. Use crawls and files to give the agent broad knowledge; use Q&A to lock down the specific answers you cannot afford to get wrong.

## Keeping knowledge fresh

Stale data is one of the most common causes of wrong answers. When a policy, price, or shipping timeline changes, update the source and retrain it.

### Retraining a source

Use **Retrain** on a source to re-run ingestion. Retraining is idempotent: Bookbag clears the source's prior documents, chunks, and Q&A data first, then re-extracts and re-indexes from scratch — so a re-crawled page never leaves stale chunks behind.

1. **Update the underlying content** — Edit the page, re-export the file, or rewrite the Q&A answer.
2. **Retrain the source** — For a website source, retraining re-crawls the site; for a file, re-upload it; for text or Q&A, edit and save.
3. **Confirm it reaches Trained** — The status returns to Trained and the new content is immediately live.
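
The clear-then-rebuild behavior is what makes retraining idempotent. Here is a minimal sketch, assuming the index is a flat list of `(source_id, chunk)` rows and that chunks are simply sentences; the `retrain` function and the sample sources are hypothetical.

```python
def retrain(index: list[tuple[str, str]], source_id: str, text: str) -> list[tuple[str, str]]:
    """Drop every row for `source_id`, then re-index its current text from scratch."""
    kept = [row for row in index if row[0] != source_id]  # other sources are untouched
    fresh = [(source_id, s.strip()) for s in text.split(".") if s.strip()]
    return kept + fresh

index: list[tuple[str, str]] = []
index = retrain(index, "policy", "Warranty lasts one year.")
index = retrain(index, "faq", "Returns within 14 days. Free shipping on large orders.")

# Retraining with unchanged content yields the same index (idempotent).
again = retrain(index, "faq", "Returns within 14 days. Free shipping on large orders.")
print(again == index)

# Retraining with updated content leaves no stale chunks behind.
updated = retrain(index, "faq", "Returns within 30 days.")
print(updated)
```

Because the old rows for the source are removed before the new ones are added, the outdated "14 days" chunk cannot linger next to the new "30 days" chunk.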

> **SCHEDULED RETRAINING:** On Standard and higher plans, website sources can retrain on a schedule so a changing help center stays current without manual re-crawls. See [Plans & billing](/docs/workspace/billing) for which plans include scheduled retraining.

## Turning real conversations into better data

Your customers tell you where your knowledge has gaps. Bookbag surfaces this two ways:

- **Suggestions** on the Data sources tab flag low-confidence answers (a likely missing-content gap) and thumbs-down answers, with the original question — so you know exactly what source or Q&A pair to add.
- **Improve answer** lets you edit a reply and save it as a high-priority Q&A pair, so the corrected answer is retrieved first next time.
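
As a rough picture of how suggestions like these could be derived from conversation logs, consider the sketch below. The `LoggedAnswer` fields, the 0 to 1 confidence score, and the `LOW_CONFIDENCE` cut-off are all assumptions for illustration; the source does not describe how Bookbag computes confidence.

```python
from dataclasses import dataclass

@dataclass
class LoggedAnswer:
    question: str
    confidence: float  # hypothetical 0-1 score attached to each reply
    thumbs_down: bool

LOW_CONFIDENCE = 0.5  # assumed cut-off, not a documented Bookbag value

def suggestions(log: list[LoggedAnswer]) -> list[str]:
    """Return the questions whose answers were low-confidence or thumbs-downed."""
    return [a.question for a in log if a.confidence < LOW_CONFIDENCE or a.thumbs_down]

log = [
    LoggedAnswer("Do you ship to Canada?", 0.92, False),
    LoggedAnswer("Can I return a gift card?", 0.31, False),
    LoggedAnswer("How long is the warranty?", 0.88, True),
]
print(suggestions(log))
```

Each flagged question points at a source or Q&A pair worth adding.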

Review these alongside [Activity & chat logs](/docs/agents/activity) on a regular cadence, such as weekly, and the agent's answers improve as each gap is closed.

## Embedding models and the vector index

Every agent has one embedding model, and all of its chunks are embedded with that model. This matters because retrieval can only compare vectors produced by the same model: vectors from different models can differ in dimension, and even when the dimensions match, their values are not meaningfully comparable. Bookbag pins the embedding model per agent so vectors from different models never mix. If you change an agent's embedding model, retrain its sources so the index is rebuilt consistently. For the trade-offs between embedding models, see [Models & model choice](/docs/agents/models).
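
One way to picture the per-agent pin is an index that rejects any vector not produced by its own model. This is a sketch under assumptions: `AgentIndex`, its methods, and the model names `model-a` and `model-b` are hypothetical and do not reflect Bookbag's internals.

```python
from dataclasses import dataclass, field

@dataclass
class AgentIndex:
    embedding_model: str
    dimensions: int
    vectors: list[list[float]] = field(default_factory=list)

    def add(self, vector: list[float], model: str) -> None:
        """Store a vector only if it came from this agent's pinned model."""
        if model != self.embedding_model or len(vector) != self.dimensions:
            raise ValueError(
                f"expected {self.embedding_model} ({self.dimensions} dims), "
                f"got {model} ({len(vector)} dims)"
            )
        self.vectors.append(vector)

    def change_model(self, model: str, dimensions: int) -> None:
        """Switching models invalidates stored vectors, so the index is emptied."""
        self.embedding_model = model
        self.dimensions = dimensions
        self.vectors.clear()  # sources must be retrained to repopulate it

idx = AgentIndex("model-a", 3)
idx.add([0.1, 0.2, 0.3], "model-a")

try:
    idx.add([0.1, 0.2], "model-b")
except ValueError as err:
    print("rejected:", err)

idx.change_model("model-b", 2)
print(len(idx.vectors))
```

Emptying the index on a model change mirrors the advice above: after switching embedding models, every source needs retraining before the agent can retrieve from it again.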

## Common questions

**How many sources can I add?**

It depends on your plan — each plan sets a maximum number of sources per agent. See [Plans & billing](/docs/workspace/billing) for the limits. The Free plan is intended for trying things out; paid plans raise the cap substantially.

**Does adding sources cost credits?**

No. Credits are spent on AI replies, not on training. You can add and retrain as many sources as your plan allows without spending credits. See [Credits & usage](/docs/agents/credits).

**A page on my site isn't being picked up by the crawl. Why?**

The crawler follows links and reads server-rendered text. Pages that cannot be reached by following links from the starting URL, or content that only renders via JavaScript, may not be captured. Add the page as its own website source, or paste its content as a Text source.

**The agent gave a slightly wrong answer. Should I fix the prompt?**

Usually not. Most wrong answers come from a missing or unclear source, not a prompt problem. Add a Q&A pair with the exact correct answer: it pins the wording and bypasses paraphrasing.

**Can two agents share knowledge?**

Not directly — sources are per-agent. If two agents need the same knowledge, add the source to each. This keeps each agent's retrieval clean and scoped to what it should answer.

## What's next

- [Test in the playground](/docs/agents/playground) — Chat with your agent and inspect the exact sources behind each answer.
- [Response quality](/docs/getting-started/response-quality) — How retrieval, citations, and Q&A priority produce trustworthy answers.
- [Models & model choice](/docs/agents/models) — Pick the model and embedding model that fit your agent.
- [Best practices](/docs/getting-started/best-practices) — How to structure knowledge for accurate, on-brand answers.
