Python + Scrapy Agent Rules
Project Context
You are building web scrapers with the Scrapy framework. Scrapy's asynchronous Twisted engine crawls multiple pages concurrently. The design priorities are: polite crawling, clean data extraction through item loaders, and robust error handling for production-grade reliability.
Code Style & Structure
- Follow PEP 8. Use `ruff` for linting and formatting. Add type hints to all spider attributes and method signatures.
- Document every spider class with a docstring: target site, data extracted, expected output format, and known limitations.
- Define URL constants and CSS/XPath selectors as class-level attributes, not inline strings.
- Use `logging.getLogger(__name__)` for all logging. Never use bare `print()` statements.
Project Structure
```
project/
    spiders/
        products.py       # scrapy.Spider or CrawlSpider subclasses
        listings.py
    items.py              # Item dataclasses or scrapy.Item subclasses
    loaders.py            # ItemLoader subclasses with field processors
    pipelines.py          # Validation, dedup, storage, export pipelines
    middlewares.py        # Downloader and Spider middleware
    settings/
        base.py
        development.py
        production.py
    tests/
        fixtures/         # Saved HTML response files for unit tests
        test_spiders.py
        test_pipelines.py
```
Spider Patterns
- Inherit from `scrapy.Spider` for targeted scraping. Use `CrawlSpider` with `Rule` + `LinkExtractor` for recursive link following.
- Use `start_requests()` instead of `start_urls` when URLs need dynamic construction or authentication headers.
- Attach an `errback` to every `yield Request(url, callback=self.parse, errback=self.handle_error)` call. Log and handle failures explicitly.
- Never store mutable per-request state on `self` — concurrent requests share the spider instance. Pass context with `cb_kwargs` or `meta`.
- Set `custom_settings` on individual spiders to override concurrency and delay for that domain only.
- Implement pagination by yielding the next-page Request from the parse callback, not via a loop.
Item Loaders
- Use `ItemLoader` with field-specific `input_processor` and `output_processor` for every field.
- Apply `MapCompose(remove_tags, str.strip)` (with `remove_tags` from `w3lib.html`) as the default `input_processor` for text fields — strip tags first so the whitespace trim runs on the final text.
- Use `TakeFirst()` as the default `output_processor` for single-value fields.
- Normalize at the loader level: dates to ISO 8601, prices to `Decimal`, relative URLs to absolute with `urljoin`.
- Validate required fields in a validation pipeline, not inside the spider's parse method.
Pipelines
- Order pipelines explicitly by their `ITEM_PIPELINES` priority value: validation (100), dedup (200), cleaning (300), storage (400), export (500).
- Implement a validation pipeline that raises `DropItem(f'Missing {field}')` for items lacking required fields.
- Implement a deduplication pipeline using a `set()` of seen identifiers or a Bloom filter for large crawls.
- Implement `open_spider(self, spider)` and `close_spider(self, spider)` for resource management (database connections, file handles).
- Use Scrapy's `ImagesPipeline` or `FilesPipeline` for media downloads. Set `IMAGES_STORE` / `FILES_STORE` to a cloud bucket path.
Middleware Configuration
- Enable `AutoThrottle` in production: `AUTOTHROTTLE_ENABLED = True`, `AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0`, `AUTOTHROTTLE_MAX_DELAY = 30`.
- Set `DOWNLOAD_DELAY = 1.0` as a floor. AutoThrottle adjusts upward based on server response times.
- Rotate User-Agent strings with `scrapy-fake-useragent` or a custom `RandomUserAgentMiddleware`.
- Use `RetryMiddleware` with `RETRY_TIMES = 3` and make sure `503` and `429` are in `RETRY_HTTP_CODES` (recent Scrapy versions include both by default; verify against your version's defaults).
- Keep `ROBOTSTXT_OBEY = True` by default. Override only on explicitly approved targets.
- Set a descriptive `USER_AGENT` that includes project name and contact email.
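The settings above collected into one production fragment. The bot name and contact address are placeholders:

```python
# settings/production.py — politeness and middleware settings from the rules above.
BOT_NAME = "project"
USER_AGENT = "project-crawler/1.0 (+mailto:ops@example.com)"  # placeholder contact
ROBOTSTXT_OBEY = True

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
AUTOTHROTTLE_MAX_DELAY = 30

DOWNLOAD_DELAY = 1.0            # floor; AutoThrottle adjusts upward from here
RANDOMIZE_DOWNLOAD_DELAY = True
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DOWNLOAD_TIMEOUT = 30

RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
```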
Error Handling
- Handle Twisted errors in errbacks: `from twisted.internet.error import TimeoutError, ConnectionRefusedError`.
- Log failed request URLs with `spider.name`, HTTP status, and error type for post-crawl review.
- Use spider signals `spider_error` and `item_error` for centralized error aggregation.
- Log crawl stats at the end of every run via `spider_closed` signal. Alert if `item_scraped_count == 0`.
- Set `DOWNLOAD_TIMEOUT = 30` to prevent indefinite hangs on unresponsive servers.
Rate Limiting & Politeness
- Set `CONCURRENT_REQUESTS_PER_DOMAIN = 4` as the default. Lower for sensitive targets.
- Use `DOWNLOAD_DELAY` combined with `RANDOMIZE_DOWNLOAD_DELAY = True` to add jitter.
- Respect `Retry-After` headers from rate-limit responses (429) in a custom retry middleware.
- Never scrape user-generated content at rates that could disrupt the target service.
Testing
- Write spider unit tests with `scrapy.http.HtmlResponse(url, body=Path('fixtures/page.html').read_bytes())` — avoid leaving file handles open in test one-liners.
- Test item loaders independently: instantiate the loader with a mock response, call `load_item()`, assert field values.
- Test pipelines by calling `pipeline.process_item(item, mock_spider)` directly. Test `DropItem` paths.
- Use `betamax` or `vcrpy` to record and replay real HTTP interactions for integration tests.
- Validate scraped output against a Pydantic schema in CI to catch selector breakage early.