Overview

Scrapester converts web pages into clean, structured formats ideal for data processing and LLM applications.
  • Smart Handling: Manages proxies, rate limits, and JavaScript-rendered content
  • Multiple Formats: Outputs clean markdown, structured JSON, raw HTML, or screenshots
  • Intelligent Extraction: Extract specific data using schemas or natural language prompts

Basic Scraping

Installation

npm install scrapester

Usage

import { ScrapesterApp } from 'scrapester';

const app = new ScrapesterApp('sk-YOUR_API_KEY');

// Scrape a website
const result = await app.scrapeUrl('example.com', {
    formats: ['markdown', 'html']
});
console.log(result);

Response Format

{
  "success": true,
  "data": {
    "markdown": "# Welcome to Example\nThis is the main content...",
    "html": "<!DOCTYPE html><html><body>...</body></html>",
    "metadata": {
      "title": "Example Website",
      "description": "An example website description",
      "language": "en",
      "sourceURL": "https://example.com",
      "statusCode": 200
    }
  }
}

Extraction Without Schema

Extract data using natural language prompts:
curl -X POST https://api.scapester.com/v1/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer sk-YOUR_API_KEY' \
    -d '{
      "url": "https://example.com",
      "formats": ["extract"],
      "extract": {
        "prompt": "Extract the product pricing and features from the page."
      }
    }'

Advanced Features

Screenshots

Capture visual snapshots of web pages:
result = app.scrape_url('example.com', 
    params={
        'formats': ['screenshot'],
        'screenshot': {
            'fullPage': True,
            'type': 'jpeg',
            'quality': 80
        }
    }
)

Custom Headers

Add authentication or custom headers to your requests:
result = app.scrape_url('example.com', 
    params={
        'formats': ['markdown'],
        'headers': {
            'Authorization': 'Bearer token123',
            'Cookie': 'session=abc123'
        }
    }
)

Error Handling

Scrapester provides detailed error information when something goes wrong:
{
  "success": false,
  "error": {
    "code": "INVALID_URL",
    "message": "The provided URL is not valid",
    "details": {
      "url": "not-a-valid-url"
    }
  }
}
Common error codes include:
  • INVALID_URL: The provided URL is not properly formatted
  • TIMEOUT: The request timed out
  • ACCESS_DENIED: Could not access the URL (403)
  • NOT_FOUND: URL not found (404)
  • RATE_LIMITED: Too many requests to the target website

Best Practices

  1. Rate Limiting: Implement appropriate delays between requests
  2. Error Handling: Always handle potential errors in your code
  3. Selective Extraction: Only extract the data you need
  4. Cache Results: Store and reuse results when possible
  5. Respect Robots.txt: Check if scraping is allowed for the target URL
For more details about available parameters and options, refer to our API Reference.