Python + Scrapy Agent Rules
Project Context
You are building web scrapers with the Scrapy framework. Scrapy's asynchronous Twisted engine crawls multiple pages concurrently. The design priorities are: polite crawling, clean data extraction through item loaders, and robust error handling for production-grade reliability.
Code Style & Structure
- Follow PEP 8. Use `ruff` for linting and formatting. Add type hints to all spider attributes and method signatures.
- Document every spider class with a docstring: target site, data extracted, expected output format, and known limitations.
- Define URL constants and CSS/XPath selectors as class-level attributes, not inline strings.
- Use `logging.getLogger(__name__)` for all logging. Never use bare `print()` statements.
Project Structure
```
project/
    spiders/
        products.py       # scrapy.Spider or CrawlSpider subclasses
        listings.py
    items.py              # Item dataclasses or scrapy.Item subclasses
    loaders.py            # ItemLoader subclasses with field processors
    pipelines.py          # Validation, dedup, storage, export pipelines
    middlewares.py        # Downloader and Spider middleware
    settings/
        base.py
        development.py
        production.py
    tests/
        fixtures/         # Saved HTML response files for unit tests
        test_spiders.py
        test_pipelines.py
```
Spider Patterns
- Inherit from `scrapy.Spider` for targeted scraping. Use `CrawlSpider` with `Rule` + `LinkExtractor` for recursive link following.
- Use `start_requests()` instead of `start_urls` when URLs need dynamic construction or authentication headers.
- Attach an `errback` to every `yield Request(url, callback=self.parse, errback=self.handle_error)` call. Log and handle failures explicitly.
- Never store mutable per-request state on `self` — concurrent requests share the spider instance. Pass context with `cb_kwargs` or `meta`.
- Set `custom_settings` on individual spiders to override concurrency and delay for that domain only.
- Implement pagination by yielding the next-page Request from the parse callback, not via a loop.
Item Loaders
- Use `ItemLoader` with field-specific `input_processor` and `output_processor` for every field.
- Apply `MapCompose(remove_tags, str.strip)` (with `remove_tags` from `w3lib.html`) as the default `input_processor` for text fields — strip tags first so the whitespace trim runs on the final text.
- Use `TakeFirst()` as the default `output_processor` for single-value fields.
- Normalize at the loader level: dates to ISO 8601, prices to `Decimal`, relative URLs to absolute with `urljoin`.
- Validate required fields in a validation pipeline, not inside the spider's parse method.
Pipelines
- Order pipelines explicitly by their `ITEM_PIPELINES` priority value: validation (100), dedup (200), cleaning (300), storage (400), export (500).
- Implement a validation pipeline that raises `DropItem(f'Missing {field}')` for items lacking required fields.
- Implement a deduplication pipeline using a `set()` of seen identifiers or a Bloom filter for large crawls.
- Implement `open_spider(self, spider)` and `close_spider(self, spider)` for resource management (database connections, file handles).
- Use Scrapy's `ImagesPipeline` or `FilesPipeline` for media downloads. Set `IMAGES_STORE` / `FILES_STORE` to a cloud bucket path.
Middleware Configuration
- Enable `AutoThrottle` in production: `AUTOTHROTTLE_ENABLED = True`, `AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0`, `AUTOTHROTTLE_MAX_DELAY = 30`.
- Set `DOWNLOAD_DELAY = 1.0` as a floor. AutoThrottle adjusts upward based on server response times.
- Rotate User-Agent strings with `scrapy-fake-useragent` or a custom `RandomUserAgentMiddleware`.
- Use `RetryMiddleware` with `RETRY_TIMES = 3` and make sure `503` and `429` are in `RETRY_HTTP_CODES` (recent Scrapy versions include both by default; verify against your version's defaults).
- Keep `ROBOTSTXT_OBEY = True` by default. Override only on explicitly approved targets.
- Set a descriptive `USER_AGENT` that includes project name and contact email.
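The settings above collected into one production fragment. The bot name and contact address are placeholders:

```python
# settings/production.py — politeness and middleware settings from the rules above.
BOT_NAME = "project"
USER_AGENT = "project-crawler/1.0 (+mailto:ops@example.com)"  # placeholder contact
ROBOTSTXT_OBEY = True

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
AUTOTHROTTLE_MAX_DELAY = 30

DOWNLOAD_DELAY = 1.0            # floor; AutoThrottle adjusts upward from here
RANDOMIZE_DOWNLOAD_DELAY = True
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DOWNLOAD_TIMEOUT = 30

RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
```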
Error Handling
- Handle Twisted errors in errbacks: `from twisted.internet.error import TimeoutError, ConnectionRefusedError`.
- Log failed request URLs with `spider.name`, HTTP status, and error type for post-crawl review.
- Use spider signals `spider_error` and `item_error` for centralized error aggregation.
- Log crawl stats at the end of every run via `spider_closed` signal. Alert if `item_scraped_count == 0`.
- Set `DOWNLOAD_TIMEOUT = 30` to prevent indefinite hangs on unresponsive servers.
Rate Limiting & Politeness
- Set `CONCURRENT_REQUESTS_PER_DOMAIN = 4` as the default. Lower for sensitive targets.
- Use `DOWNLOAD_DELAY` combined with `RANDOMIZE_DOWNLOAD_DELAY = True` to add jitter.
- Respect `Retry-After` headers from rate-limit responses (429) in a custom retry middleware.
- Never scrape user-generated content at rates that could disrupt the target service.
Testing
- Write spider unit tests with `scrapy.http.HtmlResponse(url, body=Path('fixtures/page.html').read_bytes())` — avoid leaving file handles open in test one-liners.
- Test item loaders independently: instantiate the loader with a mock response, call `load_item()`, assert field values.
- Test pipelines by calling `pipeline.process_item(item, mock_spider)` directly. Test `DropItem` paths.
- Use `betamax` or `vcrpy` to record and replay real HTTP interactions for integration tests.
- Validate scraped output against a Pydantic schema in CI to catch selector breakage early.