Web Crawler

The Web Crawler lets you crawl any public website and import the extracted content as MemorySync memories. No OAuth is required; it uses your existing API key. You submit a URL, the crawler fetches and parses the pages, and you then import the results as memories with a single API call.

What it does

The Web Crawler is a fully managed crawling service accessible via the API at /api/v1/integrations/web-crawler. It handles:

URL validation & SSRF protection — every URL is validated before crawling. Private and loopback IP ranges (10.x, 172.16–31.x, 192.168.x, 127.x) are blocked to prevent Server-Side Request Forgery.
HTML content extraction — pages are parsed and cleaned. Script, style, nav, header, footer, and form elements are stripped. Clean text is extracted from the body or article container.
Metadata extraction — page title, description, author, publish date, and keywords are extracted from meta tags.
Link discovery — same-origin links are extracted for multi-page crawling when depth > 0.
YouTube support — YouTube URLs are detected and handled with a specialized transcript extraction pipeline.
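Link discovery can be illustrated with a short sketch. The code below is not the crawler's implementation; it is a minimal example, using only the Python standard library, of collecting same-origin links from a page. The function name `same_origin_links` is hypothetical, and "same origin" is assumed here to mean an identical scheme and host:port string.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class _LinkCollector(HTMLParser):
    """Collects raw href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_origin_links(page_url: str, html: str) -> list[str]:
    """Return absolute, de-duplicated links that share page_url's origin."""
    base = urlsplit(page_url)
    base_origin = (base.scheme, base.netloc.lower())

    collector = _LinkCollector()
    collector.feed(html)
    collector.close()

    seen: set[str] = set()
    links: list[str] = []
    for href in collector.hrefs:
        # Resolve relative links and drop #fragments so each page is listed once.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlsplit(absolute)
        # Assumption: same origin = same scheme and same host:port string.
        if (parts.scheme, parts.netloc.lower()) != base_origin:
            continue
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links
```

This sketch treats `example.com` and `example.com:443` as different origins; a stricter comparison would normalize default ports first.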

How to start a crawl

Create a crawl job by sending a POST request. The job runs in the background and you can poll its status:

HTTP

POST /api/v1/integrations/web-crawler/crawl
Content-Type: application/json
X-API-Key: ms_live_...

{
  "url": "https://example.com/docs",
  "crawl_type": "page",
  "settings": {
    "max_depth": 2,
    "max_pages": 50
  }
}

The response returns the job ID and status. Poll GET /api/v1/integrations/web-crawler/jobs/{job_id} to track progress.
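As a rough illustration, the request above could be sent from Python with the standard library. The host in `BASE_URL` is a placeholder, and the function names are hypothetical; the path, headers, and body fields come from the example request. The response is returned as parsed JSON without assuming any field names.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host, replace with your API host
CRAWL_PATH = "/api/v1/integrations/web-crawler/crawl"


def build_crawl_request(api_key, url, crawl_type="page", max_depth=0, max_pages=10):
    """Build (but do not send) the POST request that creates a crawl job."""
    body = {
        "url": url,
        "crawl_type": crawl_type,
        "settings": {"max_depth": max_depth, "max_pages": max_pages},
    }
    return urllib.request.Request(
        BASE_URL + CRAWL_PATH,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def start_crawl(api_key, url, **options):
    """Send the request and return the parsed JSON response."""
    request = build_crawl_request(api_key, url, **options)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Separating request construction from sending keeps the payload easy to inspect or log before any network call is made.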

Crawl settings

Setting	Default / Range	Description
`crawl_type`	`page`	`page` (full page), `article` (clean article extraction), or `document` (structured document).
`max_depth`	0 (range: 0–5)	Maximum link depth. 0 = single page only. Higher values follow internal links.
`max_pages`	10 (range: 1–100)	Maximum number of pages to crawl in a single job.
`store_html`	`false`	Whether to store the original HTML alongside extracted text.
`include_images`	`false`	Whether to include image URLs in the extracted metadata.

System-level limits: 10 MB per page, 100 MB total per crawl, a maximum duration of 60 minutes, and a 1.5-second delay between requests.
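Checking the documented ranges on the client side can catch mistakes before a job is submitted. The sketch below is one possible way to do that; `CrawlSettings` and `crawl_body` are hypothetical names, and placing `store_html` and `include_images` inside the `settings` object is an assumption based on the example request, which only shows `max_depth` and `max_pages` there.

```python
from dataclasses import asdict, dataclass
from typing import Optional

CRAWL_TYPES = ("page", "article", "document")


@dataclass(frozen=True)
class CrawlSettings:
    """Crawl settings with the defaults and ranges from the table above."""

    max_depth: int = 0
    max_pages: int = 10
    store_html: bool = False
    include_images: bool = False

    def __post_init__(self) -> None:
        if not 0 <= self.max_depth <= 5:
            raise ValueError(f"max_depth must be between 0 and 5, got {self.max_depth}")
        if not 1 <= self.max_pages <= 100:
            raise ValueError(f"max_pages must be between 1 and 100, got {self.max_pages}")


def crawl_body(url: str, crawl_type: str = "page",
               settings: Optional[CrawlSettings] = None) -> dict:
    """Return a JSON-serializable request body for a crawl job."""
    if crawl_type not in CRAWL_TYPES:
        raise ValueError(f"crawl_type must be one of {CRAWL_TYPES}, got {crawl_type!r}")
    # Assumption: all four settings travel inside the "settings" object.
    return {
        "url": url,
        "crawl_type": crawl_type,
        "settings": asdict(settings or CrawlSettings()),
    }
```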

Content extraction

Extraction behavior depends on the crawl type, and the crawler also handles blocked or failing responses:

Page mode — extracts all body text after stripping script, style, nav, header, footer, aside, noscript, iframe, and form elements. Multiple blank lines are collapsed.
Article mode — uses readability-like heuristics. Tries selectors in order: <article>, <main>, [role="main"], .post-content, .article-content, .entry-content, .content, #content, .story-body. Falls back to body if none match.
403 fallback — when a page returns HTTP 403 (blocked by anti-bot rules), the crawler automatically retries through a readable mirror endpoint to attempt text extraction.
Retries — HTTP 429, 500, 502, 503, and 504 responses are retried up to 3 times with increasing backoff delays (2s, then 5s).
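To make page mode concrete, here is a minimal sketch of the same idea using Python's built-in `html.parser`. It is an approximation for illustration, not the crawler's actual code: it skips the elements listed for page mode, additionally skips `<head>` as a simple way to limit output to body text (an assumption of this sketch), and collapses runs of blank lines. The name `extract_page_text` is hypothetical.

```python
import re
from html.parser import HTMLParser

# Elements stripped in page mode, per the description above.
STRIPPED = {"script", "style", "nav", "header", "footer",
            "aside", "noscript", "iframe", "form"}
# Assumption of this sketch: also skip <head> so only body text remains.
SKIPPED = STRIPPED | {"head"}


class _PageText(HTMLParser):
    """Collects text that is not inside any skipped element."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_page_text(html: str) -> str:
    """Return visible text with skipped elements removed and blank lines collapsed."""
    parser = _PageText()
    parser.feed(html)
    parser.close()
    lines = [line.strip() for line in "".join(parser.chunks).splitlines()]
    # Collapse three or more consecutive newlines into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()
```

The sketch assumes reasonably well-formed HTML. An unclosed skipped element would suppress the text that follows it, and adjacent block elements are not separated by inserted newlines.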

Import to memories

Once a crawl job completes, import the extracted content as memories:

HTTP

POST /api/v1/integrations/web-crawler/jobs/{job_id}/import
Content-Type: application/json
X-API-Key: ms_live_...

{
  "tags": ["docs", "reference"],
  "content_ids": null
}

Each successfully crawled page becomes a memory with source: "web_crawler". The metadata includes:

crawl_job_id, domain, url, title — from the crawl job.
content_size_bytes, word_count — content statistics.
author, publish_date, description, keywords — from page meta tags (when available).

Pass content_ids to import specific pages, or omit it (or set it to null, as in the example above) to import all successfully crawled pages. Content over 50,000 characters is truncated before import.
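A small sketch of the import call, under a few stated assumptions: the host in `BASE_URL` is a placeholder, the function names are hypothetical, and `content_ids` is assumed to be a list of string IDs. The helper `exceeds_import_limit` only flags text that would be truncated at the documented 50,000-character limit; the truncation itself happens on the server.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://api.example.com"  # placeholder host, replace with your API host
MAX_IMPORT_CHARS = 50_000  # documented truncation limit


def build_import_request(api_key, job_id, tags, content_ids=None):
    """Build (but do not send) the POST request that imports a job's content."""
    body = {
        "tags": list(tags),
        # None is serialized as JSON null, which imports every crawled page.
        "content_ids": list(content_ids) if content_ids is not None else None,
    }
    path = f"/api/v1/integrations/web-crawler/jobs/{quote(job_id, safe='')}/import"
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def exceeds_import_limit(text: str) -> bool:
    """Return True if the text is long enough to be truncated on import."""
    return len(text) > MAX_IMPORT_CHARS
```

The request can be sent with `urllib.request.urlopen(request)`; the response fields are not documented here, so the sketch does not parse them.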

Security safeguards

Safeguard	Details
SSRF protection	The hostname is resolved via DNS and the resulting addresses are checked before fetching. Private, loopback, and link-local ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 127.0.0.0/8, 169.254.0.0/16) are blocked.
Protocol enforcement	Only `http` and `https` protocols are allowed. File, FTP, and other schemes are rejected.
Rate limiting	1.5 second delay between requests. Maximum 5 concurrent crawl jobs per organization.
Content size limits	10 MB per page, 100 MB total per crawl job.
Robots.txt	The crawler respects `robots.txt` rules. Pages blocked by robots.txt are skipped with a `skipped` status.
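The SSRF and protocol checks in the table can be sketched with Python's `ipaddress` module. This is an illustrative approximation rather than the service's code: the function names are hypothetical, only the IPv4 ranges listed above are covered (a production check would likely also handle IPv6 equivalents), and the resolver is injectable so the logic can be exercised without network access.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
                 "127.0.0.0/8", "169.254.0.0/16")
]


def is_blocked_ip(address: str) -> bool:
    """Return True if the address falls inside any blocked range."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in BLOCKED_NETWORKS)


def resolve_host(host: str) -> list[str]:
    """Resolve a hostname to its unique IP addresses via DNS."""
    return sorted({info[4][0] for info in socket.getaddrinfo(host, None)})


def check_url(url: str, resolver=resolve_host) -> list[str]:
    """Validate scheme and resolved addresses; return the addresses if allowed."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"scheme {parts.scheme!r} is not allowed")
    if not parts.hostname:
        raise ValueError("URL has no hostname")
    addresses = resolver(parts.hostname)
    if not addresses:
        raise ValueError("hostname did not resolve")
    for address in addresses:
        if is_blocked_ip(address):
            raise ValueError(f"{address} is in a blocked range")
    return addresses
```

Every resolved address is checked, not just the first. To guard against DNS rebinding, a fetcher should connect to one of the validated addresses instead of resolving the hostname a second time.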

Job lifecycle

Every crawl job transitions through these statuses:

Status	Meaning
`pending`	Job created, waiting to start crawling.
`processing`	Actively crawling pages.
`completed`	All pages crawled successfully. Ready for import.
`partial`	Completed with some page failures. Successfully crawled pages can still be imported.
`failed`	Crawl failed entirely (e.g., root URL unreachable).
`cancelled`	Cancelled by user via `POST /jobs/{job_id}/cancel`.

Additional endpoints (paths relative to /api/v1/integrations/web-crawler):

Endpoint	Description
`GET /jobs`	List all jobs.
`GET /jobs/{job_id}/content`	View crawled content.
`GET /jobs/{job_id}/statistics`	Pages crawled, bytes, and timing.
`DELETE /jobs/{job_id}`	Delete a completed job.
`GET /config`	View system limits.
`GET /active`	List running crawls.
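The job statuses lend themselves to a simple polling loop. The sketch below is one way to write it, with several assumptions: the job object is a dict with a `status` key (the response field name is not documented here), `fetch_job` is a caller-supplied function that performs the GET request for the job, and the one-hour default timeout mirrors the documented 60-minute maximum crawl duration. All names are hypothetical.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "partial", "failed", "cancelled"}
IMPORTABLE_STATUSES = {"completed", "partial"}


def wait_for_job(
    fetch_job: Callable[[], dict],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll until the job reaches a terminal status, then return it."""
    deadline = clock() + timeout_seconds
    while True:
        job = fetch_job()
        # Assumption: the status is exposed under the "status" key.
        if job["status"] in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError("crawl job did not finish before the timeout")
        sleep(poll_seconds)


def can_import(job: dict) -> bool:
    """Return True if the job finished with content that can be imported."""
    return job["status"] in IMPORTABLE_STATUSES
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real delays.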
