Web Crawler
The Web Crawler lets you crawl any public website and import the extracted content as MemorySync memories. No OAuth is required — it uses your existing API key. Submit a URL, the crawler fetches and parses pages, and you import the results as memories with a single API call.
What it does
The Web Crawler is a fully managed crawling service accessible via the API at /api/v1/integrations/web-crawler. It handles:
- URL validation & SSRF protection: every URL is validated before crawling. Private, loopback, and link-local IP ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 127.0.0.0/8, 169.254.0.0/16) are blocked to prevent Server-Side Request Forgery.
- HTML content extraction: pages are parsed and cleaned. Script, style, nav, header, footer, form, and similar non-content elements are stripped, and clean text is extracted from the body or article container.
- Metadata extraction: page title, description, author, publish date, and keywords are extracted from meta tags.
- Link discovery: same-origin links are extracted for multi-page crawling when max_depth > 0.
- YouTube support: YouTube URLs are detected and handled with a specialized transcript extraction pipeline.
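The link discovery step can be illustrated with a minimal sketch using only the Python standard library. This is not the crawler's actual implementation; `LinkCollector` and `same_origin_links` are hypothetical names, and "same origin" is assumed here to mean an identical scheme and network location (host and port).

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_origin_links(page_url, html):
    """Return absolute, de-duplicated links that share page_url's scheme and netloc."""
    base = urlsplit(page_url)
    collector = LinkCollector()
    collector.feed(html)
    seen, links = set(), []
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        target = urlsplit(absolute)
        if (target.scheme, target.netloc) != (base.scheme, base.netloc):
            continue  # different origin (or a non-HTTP scheme such as mailto:)
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links
```

With max_depth set to 0, a step like this would simply be skipped, since only the submitted page is crawled.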
How to start a crawl
Create a crawl job by sending a POST request. The job runs in the background and you can poll its status:
POST /api/v1/integrations/web-crawler/crawl
Content-Type: application/json
X-API-Key: ms_live_...
{
  "url": "https://example.com/docs",
  "crawl_type": "page",
  "settings": {
    "max_depth": 2,
    "max_pages": 50
  }
}

The response returns the job ID and status. Poll GET /api/v1/integrations/web-crawler/jobs/{job_id} to track progress.
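As a rough illustration, the same request could be built from Python with the standard library. `BASE_URL` is a placeholder (the host is not given in this page), and `build_crawl_request` and `send` are hypothetical helper names. The shape of the response body is not documented here, so the sketch returns the decoded JSON without assuming any field names.

```python
import json
import urllib.request

# Placeholder host: substitute the base URL of your MemorySync API.
BASE_URL = "https://api.example.com"
CRAWLER_PATH = "/api/v1/integrations/web-crawler"


def build_crawl_request(api_key, url, crawl_type="page", max_depth=0, max_pages=10):
    """Build (but do not send) the POST request that creates a crawl job."""
    body = {
        "url": url,
        "crawl_type": crawl_type,
        "settings": {"max_depth": max_depth, "max_pages": max_pages},
    }
    return urllib.request.Request(
        BASE_URL + CRAWLER_PATH + "/crawl",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def send(request):
    """Send a request and decode the JSON response (performs network I/O)."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it easy to inspect or log the payload before any network call is made.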
Crawl settings
| Setting | Default / Range | Description |
|---|---|---|
| crawl_type | page | page (full page), article (clean article extraction), or document (structured document). |
| max_depth | 0 (range: 0–5) | Maximum link depth. 0 = single page only. Higher values follow internal links. |
| max_pages | 10 (range: 1–100) | Maximum number of pages to crawl in a single job. |
| store_html | false | Whether to store the original HTML alongside extracted text. |
| include_images | false | Whether to include image URLs in the extracted metadata. |
System-level limits: 10 MB per page, 100 MB total per crawl, 60 minutes maximum duration, 1.5 second delay between requests.
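A client can check settings against the documented defaults and ranges before submitting a job. The sketch below is a hypothetical client-side helper (`validate_settings` is not part of the API), and it assumes that store_html and include_images live in the same settings object as max_depth and max_pages, which the request example does not show.

```python
CRAWL_TYPES = {"page", "article", "document"}
DEFAULTS = {"max_depth": 0, "max_pages": 10, "store_html": False, "include_images": False}
RANGES = {"max_depth": (0, 5), "max_pages": (1, 100)}


def validate_settings(crawl_type="page", **overrides):
    """Merge overrides into the documented defaults and reject out-of-range values."""
    if crawl_type not in CRAWL_TYPES:
        raise ValueError(f"unknown crawl_type: {crawl_type!r}")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown settings: {sorted(unknown)}")
    settings = {**DEFAULTS, **overrides}
    for name, (low, high) in RANGES.items():
        value = settings[name]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not low <= value <= high:
            raise ValueError(f"{name} must be an integer in {low}..{high}, got {value!r}")
    for name in ("store_html", "include_images"):
        if not isinstance(settings[name], bool):
            raise ValueError(f"{name} must be true or false, got {settings[name]!r}")
    return settings
```

The server remains the authority on limits (see GET /config); a local check like this only catches obvious mistakes earlier.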
Content extraction
Extraction behavior depends on the crawl type, and the fetcher applies fallback and retry rules when a page cannot be retrieved directly:
- Page mode: extracts all body text after stripping script, style, nav, header, footer, aside, noscript, iframe, and form elements. Multiple blank lines are collapsed.
- Article mode: uses readability-like heuristics. Tries selectors in order: <article>, <main>, [role="main"], .post-content, .article-content, .entry-content, .content, #content, .story-body. Falls back to body if none match.
- 403 fallback: when a page returns HTTP 403 (blocked by anti-bot rules), the crawler automatically retries through a readable mirror endpoint to attempt text extraction.
- Retries: HTTP 429, 500, 502, 503, and 504 responses are retried up to 3 times with exponential backoff (2s, 5s delays).
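The retry rule can be sketched as follows. This is an illustration rather than the crawler's code: it assumes that "up to 3 times" means three attempts in total, separated by the 2 s and 5 s waits listed above, and `fetch_with_retries` and its `fetch` callback are hypothetical.

```python
import time

RETRYABLE_STATUSES = {429, 500, 502, 503, 504}
BACKOFF_DELAYS = (2, 5)  # seconds; assumed to separate three attempts in total


def fetch_with_retries(fetch, url, sleep=time.sleep):
    """Call fetch(url) -> (status, body), retrying while the status is retryable.

    `fetch` is any callable that performs the HTTP request; `sleep` is injectable
    so the backoff can be tested without waiting.
    """
    status, body = fetch(url)
    for delay in BACKOFF_DELAYS:
        if status not in RETRYABLE_STATUSES:
            break
        sleep(delay)
        status, body = fetch(url)
    return status, body
```

Non-retryable statuses such as 404 are returned after the first attempt, and a response that is still retryable after the last wait is returned as-is for the caller to record as a page failure.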
Import to memories
Once a crawl job completes, import the extracted content as memories:
POST /api/v1/integrations/web-crawler/jobs/{job_id}/import
Content-Type: application/json
X-API-Key: ms_live_...
{
  "tags": ["docs", "reference"],
  "content_ids": null
}

Each successfully crawled page becomes a memory with source: "web_crawler". The metadata includes:
- crawl_job_id, domain, url, title: from the crawl job.
- content_size_bytes, word_count: content statistics.
- author, publish_date, description, keywords: from page meta tags (when available).
Pass a list of content_ids to import specific pages, or leave it null or omit it to import all successfully crawled pages. Content over 50,000 characters is truncated before import.
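A small sketch of building the import body and previewing the 50,000-character limit locally. `build_import_body` and `preview_truncation` are hypothetical helpers, and the plain slice is an assumption: the page does not say exactly how the service truncates (for example, whether it cuts at a word boundary).

```python
import json

MAX_IMPORT_CHARS = 50_000  # documented truncation limit


def build_import_body(tags, content_ids=None):
    """Return the JSON body for the import call.

    content_ids=None is serialized as null, which imports every
    successfully crawled page.
    """
    return json.dumps({"tags": list(tags), "content_ids": content_ids})


def preview_truncation(text):
    """Approximate what would remain after truncation (a plain slice is assumed)."""
    return text[:MAX_IMPORT_CHARS]
```

Checking page lengths through GET /jobs/{id}/content before importing shows which pages would lose content, so very long pages can be handled separately.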
Security safeguards
| Safeguard | Details |
|---|---|
| SSRF protection | DNS resolution is checked before fetching. Private IPs (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 127.0.0.0/8, 169.254.0.0/16) are blocked. |
| Protocol enforcement | Only http and https protocols are allowed. File, FTP, and other schemes are rejected. |
| Rate limiting | 1.5 second delay between requests. Maximum 5 concurrent crawl jobs per organization. |
| Content size limits | 10 MB per page, 100 MB total per crawl job. |
| Robots.txt | The crawler respects robots.txt rules. Pages blocked by robots.txt are not fetched and are marked with a skipped status. |
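
The SSRF and protocol checks in the table can be approximated with Python's `ipaddress` module. This is a sketch under assumptions, not the service's implementation: `is_blocked_ip`, `resolve_host`, and `is_url_allowed` are hypothetical names, and only the IPv4 ranges listed above are covered (the page does not describe IPv6 handling).

```python
import ipaddress
import socket
from urllib.parse import urlsplit

# The blocked ranges listed in the table above (IPv4 only).
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
                 "127.0.0.0/8", "169.254.0.0/16")
]
ALLOWED_SCHEMES = {"http", "https"}


def is_blocked_ip(address):
    """True if the address falls inside any blocked network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in BLOCKED_NETWORKS)


def resolve_host(host):
    """Resolve a hostname to its IP addresses (performs a DNS lookup)."""
    # Strip any IPv6 scope suffix such as "%eth0" before parsing.
    return {info[4][0].split("%")[0] for info in socket.getaddrinfo(host, None)}


def is_url_allowed(url, resolve=resolve_host):
    """Check the scheme first, then every address the host resolves to."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
        return False
    addresses = resolve(parts.hostname)
    return bool(addresses) and not any(is_blocked_ip(a) for a in addresses)
```

Checking every resolved address, rather than only the first, matters because a hostname can resolve to a mix of public and private addresses.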
Job lifecycle
A crawl job is always in exactly one of these statuses:
| Status | Meaning |
|---|---|
| pending | Job created, waiting to start crawling. |
| processing | Actively crawling pages. |
| completed | All pages crawled successfully. Ready for import. |
| partial | Completed with some page failures. Successfully crawled pages can still be imported. |
| failed | Crawl failed entirely (e.g., root URL unreachable). |
| cancelled | Cancelled by user via POST /jobs/{id}/cancel. |
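
A polling loop over these statuses might look like the sketch below. `wait_for_job` and its `get_status` callback are hypothetical. The callback is expected to call the job endpoint and return the status string, since the response field names are not documented on this page. The default of 720 polls at 5-second intervals is an assumption chosen to cover the 60-minute maximum crawl duration.

```python
import time

TERMINAL_STATUSES = {"completed", "partial", "failed", "cancelled"}
IMPORTABLE_STATUSES = {"completed", "partial"}  # per the table above


def wait_for_job(get_status, job_id, interval=5.0, max_polls=720, sleep=time.sleep):
    """Poll get_status(job_id) until the job reaches a terminal status.

    Returns the terminal status, or raises TimeoutError after max_polls checks.
    """
    for _ in range(max_polls):
        status = get_status(job_id)
        if status in TERMINAL_STATUSES:
            return status
        sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

A caller would typically import only when the returned status is in `IMPORTABLE_STATUSES`.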
Additional endpoints (paths are relative to /api/v1/integrations/web-crawler): GET /jobs (list all jobs), GET /jobs/{id}/content (view crawled content), GET /jobs/{id}/statistics (pages crawled, bytes, timing), DELETE /jobs/{id} (delete a completed job), GET /config (view system limits), GET /active (list running crawls).