Web Crawler

The Web Crawler lets you crawl any public website and import the extracted content as MemorySync memories. No OAuth is required; requests authenticate with your existing API key. You submit a URL, the crawler fetches and parses the pages, and you then import the results as memories with a single API call.

What it does

The Web Crawler is a fully managed crawling service accessible via the API at /api/v1/integrations/web-crawler. It handles:

  • URL validation & SSRF protection — every URL is validated before crawling. Private IP ranges (10.x, 172.16.x, 192.168.x, 127.x) are blocked to prevent Server-Side Request Forgery.
  • HTML content extraction — pages are parsed and cleaned. Script, style, nav, header, footer, and form elements are stripped. Clean text is extracted from the body or article container.
  • Metadata extraction — page title, description, author, publish date, and keywords are extracted from meta tags.
  • Link discovery — same-origin links are extracted for multi-page crawling when depth > 0.
  • YouTube support — YouTube URLs are detected and handled with a specialized transcript extraction pipeline.
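The link discovery step can be illustrated with a minimal sketch. It is not the crawler's actual implementation: it assumes "same-origin" means an identical scheme and host (including port), and the names LinkCollector and same_origin_links are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects raw href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def same_origin_links(page_url, html):
    """Return absolute, de-duplicated links that share page_url's origin."""
    base = urlsplit(page_url)
    origin = (base.scheme, base.netloc)  # assumption: origin = scheme + host[:port]

    collector = LinkCollector()
    collector.feed(html)

    seen, links = set(), []
    for href in collector.hrefs:
        # Resolve relative links and drop #fragments so duplicates collapse.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlsplit(absolute)
        if (parts.scheme, parts.netloc) == origin and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links
```

With max_depth set to 0 no links would be followed; at higher depths, each returned URL would become a candidate for the next level of the crawl.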

How to start a crawl

Create a crawl job by sending a POST request. The job runs in the background and you can poll its status:

HTTP
POST /api/v1/integrations/web-crawler/crawl
Content-Type: application/json
X-API-Key: ms_live_...

{
  "url": "https://example.com/docs",
  "crawl_type": "page",
  "settings": {
    "max_depth": 2,
    "max_pages": 50
  }
}

The response returns the job ID and status. Poll GET /api/v1/integrations/web-crawler/jobs/{job_id} to track progress.
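As a rough client-side sketch of the create-then-poll flow, the following uses only the Python standard library. The host in BASE_URL is a placeholder, and the assumption that the job response is a JSON object with a "status" field is not confirmed by this page; the function names are hypothetical.

```python
import json
import time
import urllib.request

# Hypothetical host; only the path prefix comes from the documentation.
BASE_URL = "https://api.memorysync.example/api/v1/integrations/web-crawler"

# Statuses after which polling can stop (see "Job lifecycle").
TERMINAL_STATUSES = {"completed", "partial", "failed", "cancelled"}


def build_crawl_request(api_key, url, crawl_type="page", max_depth=0, max_pages=10):
    """Build the POST /crawl request without sending it."""
    body = {
        "url": url,
        "crawl_type": crawl_type,
        "settings": {"max_depth": max_depth, "max_pages": max_pages},
    }
    return urllib.request.Request(
        f"{BASE_URL}/crawl",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def send_json(request):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_for_job(api_key, job_id, fetch=send_json, interval=5.0,
                 max_polls=120, sleep=time.sleep):
    """Poll the job endpoint until the job reaches a terminal status."""
    request = urllib.request.Request(
        f"{BASE_URL}/jobs/{job_id}",
        headers={"X-API-Key": api_key},
        method="GET",
    )
    for _ in range(max_polls):
        job = fetch(request)
        # Assumption: the response carries the job state in a "status" field.
        if job.get("status") in TERMINAL_STATUSES:
            return job
        sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

The fetch and sleep parameters are injectable so the polling loop can be exercised without network access or real delays.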

Crawl settings

| Setting | Default / Range | Description |
| --- | --- | --- |
| crawl_type | page | page (full page), article (clean article extraction), or document (structured document). |
| max_depth | 0 (range: 0–5) | Maximum link depth. 0 = single page only. Higher values follow internal links. |
| max_pages | 10 (range: 1–100) | Maximum number of pages to crawl in a single job. |
| store_html | false | Whether to store the original HTML alongside extracted text. |
| include_images | false | Whether to include image URLs in the extracted metadata. |

System-level limits: 10 MB per page, 100 MB total per crawl, 60 minutes maximum duration, 1.5 second delay between requests.
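Validating settings on the client before submitting a job can catch out-of-range values early. The sketch below encodes the documented defaults and ranges. The CrawlSettings class is hypothetical, and placing store_html and include_images inside the "settings" object is an assumption, since the request example on this page only shows max_depth and max_pages there.

```python
from dataclasses import dataclass

CRAWL_TYPES = ("page", "article", "document")
REQUEST_DELAY_SECONDS = 1.5  # documented delay between requests


@dataclass(frozen=True)
class CrawlSettings:
    crawl_type: str = "page"
    max_depth: int = 0        # documented range: 0-5
    max_pages: int = 10       # documented range: 1-100
    store_html: bool = False
    include_images: bool = False

    def __post_init__(self):
        if self.crawl_type not in CRAWL_TYPES:
            raise ValueError(f"crawl_type must be one of {CRAWL_TYPES}")
        if not 0 <= self.max_depth <= 5:
            raise ValueError("max_depth must be between 0 and 5")
        if not 1 <= self.max_pages <= 100:
            raise ValueError("max_pages must be between 1 and 100")

    def to_request_body(self, url):
        """Build the JSON body for POST /crawl."""
        return {
            "url": url,
            "crawl_type": self.crawl_type,
            "settings": {
                "max_depth": self.max_depth,
                "max_pages": self.max_pages,
                # Assumption: these two flags live in "settings" as well.
                "store_html": self.store_html,
                "include_images": self.include_images,
            },
        }

    def min_delay_seconds(self):
        """Lower bound on time spent waiting if max_pages pages are fetched.

        Assumes the 1.5 s delay applies between consecutive requests only.
        """
        return (self.max_pages - 1) * REQUEST_DELAY_SECONDS
```

Under that assumption, a job that reaches the 100-page cap spends at least 148.5 seconds on inter-request delays alone, well inside the 60-minute limit.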

Content extraction

How text is extracted depends on the crawl type, and blocked or failing responses get special handling:

  • Page mode — extracts all body text after stripping script, style, nav, header, footer, aside, noscript, iframe, and form elements. Multiple blank lines are collapsed.
  • Article mode — uses readability-like heuristics. Tries selectors in order: <article>, <main>, [role="main"], .post-content, .article-content, .entry-content, .content, #content, .story-body. Falls back to body if none match.
  • 403 fallback — when a page returns HTTP 403 (blocked by anti-bot rules), the crawler automatically retries through a readable mirror endpoint to attempt text extraction.
  • Retries — HTTP 429, 500, 502, 503, and 504 responses are retried up to 3 times with exponential backoff (2s, 5s delays).
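To make the page-mode rules concrete, here is a minimal sketch built on Python's standard html.parser. It approximates the described behavior (drop the listed elements, keep body text, collapse runs of blank lines) and is not the crawler's real parser; PageTextExtractor and extract_page_text are hypothetical names.

```python
import re
from html.parser import HTMLParser

# Elements whose content page mode discards, per the list above.
STRIPPED_TAGS = {"script", "style", "nav", "header", "footer",
                 "aside", "noscript", "iframe", "form"}


class PageTextExtractor(HTMLParser):
    """Collects text inside <body>, skipping stripped elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_body = False
        self._skip_depth = 0   # > 0 while inside any stripped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True
        elif tag in STRIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag in STRIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_body and self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        lines = [line.strip() for line in "".join(self._chunks).splitlines()]
        # Collapse three or more newlines into a single blank line.
        return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


def extract_page_text(html):
    parser = PageTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

This sketch relies on whitespace already present in the markup to separate blocks, and an unclosed stripped element would swallow the rest of the page; a production extractor would need to handle both cases.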

Import to memories

Once a crawl job completes, import the extracted content as memories:

HTTP
POST /api/v1/integrations/web-crawler/jobs/{job_id}/import
Content-Type: application/json
X-API-Key: ms_live_...

{
  "tags": ["docs", "reference"],
  "content_ids": null
}

Each successfully crawled page becomes a memory with source: "web_crawler". The metadata includes:

  • crawl_job_id, domain, url, title — from the crawl job.
  • content_size_bytes, word_count — content statistics.
  • author, publish_date, description, keywords — from page meta tags (when available).

Pass content_ids to import specific pages, or omit it to import all successfully crawled pages. Content over 50,000 characters is truncated before import.
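The import call and the 50,000-character limit can be sketched as follows. The host in BASE_URL is a placeholder, the function names are hypothetical, and the truncation helper assumes the service simply keeps the first 50,000 characters, which this page does not spell out.

```python
import json
import urllib.request

# Hypothetical host; only the path prefix comes from the documentation.
BASE_URL = "https://api.memorysync.example/api/v1/integrations/web-crawler"
MAX_IMPORT_CHARS = 50_000  # documented truncation threshold


def build_import_request(api_key, job_id, tags, content_ids=None):
    """Build the POST /jobs/{job_id}/import request without sending it.

    content_ids=None mirrors the "content_ids": null example above.
    """
    body = {"tags": list(tags), "content_ids": content_ids}
    return urllib.request.Request(
        f"{BASE_URL}/jobs/{job_id}/import",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def truncate_for_import(text, limit=MAX_IMPORT_CHARS):
    """Return (text cut to the limit, whether anything was cut).

    Assumption: truncation keeps the leading characters unchanged.
    """
    return text[:limit], len(text) > limit
```

Checking crawled content with truncate_for_import before importing makes it possible to flag pages whose memories would be incomplete.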

Security safeguards

| Safeguard | Details |
| --- | --- |
| SSRF protection | DNS resolution is checked before fetching. Private IPs (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 127.0.0.0/8, 169.254.0.0/16) are blocked. |
| Protocol enforcement | Only http and https protocols are allowed. File, FTP, and other schemes are rejected. |
| Rate limiting | 1.5 second delay between requests. Maximum 5 concurrent crawl jobs per organization. |
| Content size limits | 10 MB per page, 100 MB total per crawl job. |
| Robots.txt | The crawler respects robots.txt rules. Pages blocked by robots.txt are skipped with a skipped status. |
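The SSRF and protocol checks can be approximated with the standard ipaddress and socket modules. This is an illustrative sketch rather than the service's actual validator: it covers only the IPv4 ranges listed in the table, and the names is_blocked_address, resolve_host, and check_url are hypothetical.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

# Ranges listed in the safeguards table. IPv6 ranges are not documented,
# so this sketch does not block them.
BLOCKED_NETWORKS = [ipaddress.ip_network(cidr) for cidr in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
    "127.0.0.0/8", "169.254.0.0/16",
)]
ALLOWED_SCHEMES = {"http", "https"}


def is_blocked_address(address):
    """True if the IP address falls in any blocked range."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in BLOCKED_NETWORKS)


def resolve_host(host):
    """Resolve a hostname to the set of IP addresses it maps to."""
    infos = socket.getaddrinfo(host, None)
    return sorted({info[4][0] for info in infos})


def check_url(url, resolver=resolve_host):
    """Return (allowed, reason) for a candidate crawl URL."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        return False, f"scheme {parts.scheme!r} is not allowed"
    if not parts.hostname:
        return False, "URL has no host"
    # Check every resolved address, since a host can map to several IPs.
    for address in resolver(parts.hostname):
        if is_blocked_address(address):
            return False, f"{parts.hostname} resolves to blocked address {address}"
    return True, "ok"
```

Passing a stub resolver, for example `check_url("http://internal.test/", resolver=lambda host: ["10.1.2.3"])`, exercises the check without real DNS lookups.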

Job lifecycle

Every crawl job transitions through these statuses:

| Status | Meaning |
| --- | --- |
| pending | Job created, waiting to start crawling. |
| processing | Actively crawling pages. |
| completed | All pages crawled successfully. Ready for import. |
| partial | Completed with some page failures. Successfully crawled pages can still be imported. |
| failed | Crawl failed entirely (e.g., root URL unreachable). |
| cancelled | Cancelled by user via POST /jobs/{id}/cancel. |
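Client code often needs to know when to stop polling and when an import is worth attempting. One way to model the statuses is sketched below. Treating only completed and partial as importable follows the table's wording, but whether a cancelled job exposes importable pages is not stated, so that part is an assumption; the names JobStatus, is_terminal, and can_import are hypothetical.

```python
from enum import Enum


class JobStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    PARTIAL = "partial"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Statuses after which the job no longer changes.
TERMINAL = frozenset({JobStatus.COMPLETED, JobStatus.PARTIAL,
                      JobStatus.FAILED, JobStatus.CANCELLED})

# Statuses the table describes as having content ready for import.
# Assumption: failed and cancelled jobs are treated as not importable.
IMPORTABLE = frozenset({JobStatus.COMPLETED, JobStatus.PARTIAL})


def is_terminal(status):
    """True once polling can stop. Raises ValueError on unknown statuses."""
    return JobStatus(status) in TERMINAL


def can_import(status):
    """True if the import endpoint is worth calling for this job."""
    return JobStatus(status) in IMPORTABLE
```

Parsing the raw string through JobStatus makes an unrecognized status fail loudly instead of being silently treated as still running.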

Additional endpoints (paths are relative to /api/v1/integrations/web-crawler):

  • GET /jobs (list all jobs)
  • GET /jobs/{id}/content (view crawled content)
  • GET /jobs/{id}/statistics (pages crawled, bytes, timing)
  • DELETE /jobs/{id} (delete completed job)
  • GET /config (view system limits)
  • GET /active (list running crawls)