Add a crawl
Paste the URL
The top-level domain works best: https://acme.com. You can also start from a sub-section (/docs) to scope the crawl.
What gets crawled
- Pages on the same domain.
- Content accessible to anonymous visitors (logged-in areas are not fetched).
- HTML — not PDFs linked from pages. Upload those as file uploads instead.
What gets ignored
- Navigation, footers, cookie banners.
- JavaScript-rendered content that’s not prerendered.
- Pages blocked by robots.txt or noindex.
- Off-domain links.
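Because the crawler does not execute JavaScript, a quick way to check whether a page is usable is to look at the raw HTML it would fetch. A minimal sketch — `check_prerendered`, the sample URL, and the sample phrase are illustrative, not part of the product:

```shell
# Does the raw HTML (what the crawler sees, with no JavaScript executed)
# contain a phrase you expect on the page?
check_prerendered() {
  # $1 = HTML content, $2 = phrase you know appears on the rendered page
  case "$1" in
    *"$2"*) echo "prerendered" ;;
    *)      echo "js-rendered-or-missing" ;;
  esac
}

# In practice, feed it live HTML, e.g.:
#   check_prerendered "$(curl -s https://acme.com/pricing)" "Pricing plans"
html='<html><body><div id="root"></div></body></html>'  # typical empty JS shell
check_prerendered "$html" "Pricing plans"   # prints "js-rendered-or-missing"
```

If the check reports `js-rendered-or-missing` for a page you need, upload its content as a file instead.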
Page limits
| Plan | Pages per source |
|---|---|
| Starter | 50 |
| Growth | 500 |
| Business | 5,000 |
| Scale | Unlimited |
Watching progress
On the Knowledge list, the source status cycles queued → syncing → synced. Click the row to see individual pages, their status, and word count. Click View pages to inspect what was extracted.
Keeping content fresh
A crawl is a snapshot taken at that moment. Two ways to keep it current:
- Recrawl manually — open the source → Recrawl. Re-fetches and re-indexes.
- Schedule auto-recrawl (Business plan and up) — daily or weekly.
Tips
- Start with your top-level domain. Scope down only if the crawl pulls in irrelevant content (blog posts, legal boilerplate).
- After the crawl, chat with the agent (Test) and look for bad answers. Those point at missing or stale pages — patch with a Q&A pair rather than rewriting the site.
- If a specific page shouldn’t be indexed, add it to your robots.txt.
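For example, a robots.txt rule that hides one page from crawlers — the path here is hypothetical, and if the product documents its own user-agent string, scope the rule to that agent instead of `*`:

```
# Block a single page from all crawlers.
User-agent: *
Disallow: /internal/legacy-pricing.html
```

Remember that robots.txt applies to every well-behaved crawler, not just this one.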
Troubleshooting
| Issue | Try |
|---|---|
| Crawl status stuck on syncing for >1 hour | Open the source → Retry. If still stuck, contact support. |
| Pages have garbled text | The page is likely JS-rendered. Use a file upload of the content instead. |
| Crawl captured 0 pages | Check the URL scheme (https://), check robots.txt. |
| Answer uses outdated price | The page was crawled before the change. Recrawl the source. |
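Two of these rows often come down to robots.txt. As a rough sketch of how a Disallow rule blocks a path — `is_disallowed` is illustrative; real matching also honours Allow lines, wildcards, and per-agent groups:

```shell
# Minimal check: does any Disallow rule prefix-match the given path?
is_disallowed() {
  # $1 = robots.txt content, $2 = URL path to test
  printf '%s\n' "$1" | grep -i '^Disallow:' | sed 's/^[Dd]isallow:[[:space:]]*//' |
  while read -r rule; do
    case "$2" in "$rule"*) echo "blocked" ;; esac
  done
}

robots='User-agent: *
Disallow: /private/'
is_disallowed "$robots" "/private/page.html"   # prints "blocked"
is_disallowed "$robots" "/docs/intro"          # prints nothing
```

If a page you expect in the crawl matches a Disallow prefix, remove or narrow that rule before recrawling.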