Scrapy downloader middleware that uses SeleniumBase's pure CDP mode to make requests, allowing it to bypass most anti-bot protections (e.g. Cloudflare).
Using SeleniumBase's pure CDP mode also makes the middleware more platform-independent, as no WebDriver is required.
- Installation
- Configuration
- Usage
- Error handling
- Enabling debug logs
- Tips for headless Linux environments
- Architecture
- License
## Installation

```shell
pip install scrapy-seleniumbase-cdp
```
## Configuration

- Add the `SeleniumBaseAsyncCDPMiddleware` to the downloader middlewares:

  ```python
  DOWNLOADER_MIDDLEWARES = {
      'scrapy_seleniumbase_cdp.SeleniumBaseAsyncCDPMiddleware': 800
  }
  ```
- If needed, configuration can be provided to the SeleniumBase browser instance. For example, to enable the built-in ad blocker (blocks 30+ ad and tracking domains via CDP):

  ```python
  SELENIUMBASE_BROWSER_OPTIONS = {
      'ad_block': True,
  }
  ```
## Usage

To have SeleniumBase handle requests, use `scrapy_seleniumbase_cdp.SeleniumBaseRequest` instead of Scrapy's built-in `Request`:

```python
from scrapy_seleniumbase_cdp import SeleniumBaseRequest

async def start(self):
    yield SeleniumBaseRequest(url=url, callback=self.parse_result)
```

The `scrapy_seleniumbase_cdp.SeleniumBaseRequest` accepts additional arguments. They are executed in the order presented below:
Maximum number of seconds to wait for both the HTTP response and the page load
event before proceeding. If the timeout is reached, a warning is logged but the
request continues. Defaults to 10.
### Captcha handling

After navigating to a page, the middleware waits for both the HTTP response status and the page load event. It then attempts to solve any captcha present on the page using SeleniumBase's built-in solver, retrying up to a configurable maximum number of attempts.

The delay before the first solve attempt and between retries depends on the HTTP status code:
- 2xx responses: wait `captcha_delay` seconds (default `0`)
- Blocked responses (status in `captcha_blocked_codes`): wait `captcha_blocked_delay` seconds (default `4`)
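The delay selection described above can be sketched as a small helper (a simplified model for illustration, not the middleware's actual code; defaults match the documented ones):

```python
def captcha_solve_delay(status,
                        captcha_delay=0,
                        captcha_blocked_delay=4,
                        captcha_blocked_codes=(403, 429, 503)):
    """Return the seconds to wait before a captcha solve attempt,
    mirroring the rules described above (illustrative model only)."""
    if status in captcha_blocked_codes:
        return captcha_blocked_delay
    return captcha_delay

print(captcha_solve_delay(200))  # 0: successful response
print(captcha_solve_delay(429))  # 4: blocked response
```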
```python
yield SeleniumBaseRequest(
    url=url,
    callback=self.parse_result,
    captcha_delay=1,
    captcha_blocked_delay=5,
    captcha_blocked_codes=[403, 429, 503],
    captcha_max_attempts=5)
```

Available captcha configuration:

- `captcha_delay`: Seconds to wait before solving on a successful response. Defaults to `0`.
- `captcha_blocked_delay`: Seconds to wait before solving on a blocked response. Defaults to `4`.
- `captcha_blocked_codes`: List of HTTP status codes treated as blocked. Defaults to `[403, 429, 503]`.
- `captcha_max_attempts`: Maximum number of solve attempts. Defaults to `3`. After exhausting all attempts the middleware continues normally but logs a warning.
When used, SeleniumBase will wait for the element with the given CSS selector to appear. The default timeout is 10 seconds but can be changed if needed. If the element is not found within the timeout, a full-page error screenshot is captured and stored in `request.meta['error_screenshot']`, then the request is skipped (Scrapy's `IgnoreRequest` is raised). The screenshot image format is taken from the request's screenshot configuration if set; otherwise it defaults to PNG.

The error screenshot is accessible in the request's errback via `failure.request.meta['error_screenshot']`:
```python
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TimeoutError

yield SeleniumBaseRequest(
    url=url,
    callback=self.parse_result,
    errback=self.handle_error,
    wait_for_element='h1.some-class',
    element_timeout=5)

def handle_error(self, failure):
    screenshot = failure.request.meta.get('error_screenshot')
    if screenshot:
        with open('error.png', 'wb') as f:
            f.write(screenshot)
```

If needed, it is possible to provide a callback to interact with the browser instance and/or its tabs. The return value of the async callback is stored in `response.meta['callback']`.
```python
async def start(self):
    async def maximize_window(browser: Browser):
        await browser.main_tab.maximize()

    yield SeleniumBaseRequest(…, browser_callback=maximize_window)
```

When used, SeleniumBase will execute the provided JavaScript code.
```python
yield SeleniumBaseRequest(
    # …
    script='window.scrollTo(0, document.body.scrollHeight)')
```

If the script returns a Promise, it is possible to await its result:
```python
yield SeleniumBaseRequest(
    # …
    script={
        'await_promise': True,
        'script': '''
            document.getElementById('onetrust-accept-btn-handler').click()
            new Promise(resolve => setTimeout(resolve, 1000))
        '''
    })
```

The result of the JavaScript code is stored in `response.meta['script']`.
When used, SeleniumBase will take a screenshot of the page and the binary data will be stored in `response.meta['screenshot']`:

```python
yield SeleniumBaseRequest(url=url, callback=self.parse_result, screenshot=True)

def parse_result(self, response):
    # …
    with open('image.png', 'wb') as image_file:
        image_file.write(response.meta['screenshot'])
```

You can also specify additional configuration options:

```python
yield SeleniumBaseRequest(…, screenshot={'format': 'jpg', 'full_page': False})
```

Or provide a path to automatically save the screenshot (in this case, the image data is not stored in the response):

```python
yield SeleniumBaseRequest(…, screenshot={'path': 'output/image.png'})
```

Available configuration keys:

- `path`: File path where the screenshot will be saved. Use `auto` for SeleniumBase's default path. Leave empty to return the data in the response `meta`.
- `format`: Image format; defaults to `png`, `jpg` is also available.
- `full_page`: Capture the full page or just the viewport; defaults to `True`.
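How these keys combine can be sketched as a small normalization helper (an illustrative model of the documented defaults, not the middleware's actual code):

```python
def screenshot_options(screenshot):
    """Normalize a 'screenshot' argument into (path, format, full_page),
    applying the documented defaults (illustrative model only)."""
    opts = {} if screenshot is True else dict(screenshot)
    return (opts.get('path'),          # None -> data returned in response meta
            opts.get('format', 'png'),
            opts.get('full_page', True))

print(screenshot_options(True))                                   # (None, 'png', True)
print(screenshot_options({'format': 'jpg', 'full_page': False}))  # (None, 'jpg', False)
```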
## Error handling

The middleware checks the HTTP status code right after loading the page to determine captcha-solving behaviour (see Captcha handling above).

- `wait_for_element` timeout: if the expected element is not found within `element_timeout` seconds, a full-page error screenshot is captured and stored in `request.meta['error_screenshot']`, then `IgnoreRequest` is raised, causing Scrapy to skip the request. The screenshot is accessible in the request's `errback` via `failure.request.meta['error_screenshot']` (see `wait_for_element` for an example).
## Tips for headless Linux environments

When running Scrapy with this middleware in headless mode using Xvfb on Linux, you may want to record or visually inspect browser sessions for debugging purposes. The examples below assume an Xvfb display at `:1001`; adjust to match your setup.
Use ffmpeg to capture the virtual display as a video file:

```shell
ffmpeg -f x11grab -r 30 -s 1440x900 -i :1001 \
    -codec:v libx264 -preset ultrafast -pix_fmt yuv420p \
    /home/user/session_$(date +%Y%m%d_%H%M%S).mp4
```

Key flags:

- `-f x11grab`: capture from an X11 display
- `-r 30`: frame rate (30 fps)
- `-s 1440x900`: resolution (must match your Xvfb geometry)
- `-i :1001`: X display to capture
Use x11vnc to expose the virtual display over VNC for live inspection:

```shell
x11vnc -display :1001 -passwd secret -forever -xkb
```

Then connect from any VNC client to `<host>:5900`. Key flags:

- `-display :1001`: X display to share
- `-passwd secret`: VNC password
- `-forever`: keep the server running after the first client disconnects
- `-xkb`: use the XKEYBOARD extension for better keyboard handling
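For completeness, starting such a virtual display before the crawl might look like this (the display number and geometry are assumptions matching the examples above; `my_spider` is a placeholder name):

```shell
# Start a virtual display at :1001 with a 1440x900 geometry
# (matching the ffmpeg -s flag above), then run the crawl on it.
Xvfb :1001 -screen 0 1440x900x24 &
export DISPLAY=:1001
scrapy crawl my_spider  # placeholder spider name
```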
## Enabling debug logs

The middleware logs operational details (page load events, captcha attempts, screenshot captures, etc.) at the DEBUG level. Log messages are emitted under the `scrapy_seleniumbase_cdp.middleware_async` logger name. Warnings and errors (page load timeouts, element wait timeouts, max captcha attempts reached) use higher log levels and are always visible.

To see all debug output, set Scrapy's global log level in your `settings.py`:

```python
LOG_LEVEL = 'DEBUG'
```

If you prefer to keep Scrapy's own output at a higher level and only enable debug logging for this middleware, configure the parent logger directly:
```python
# settings.py or spider __init__
import logging
logging.getLogger('scrapy_seleniumbase_cdp').setLevel(logging.DEBUG)
```

This works because the middleware uses a module-level logger named `scrapy_seleniumbase_cdp.middleware_async`, which inherits settings from the `scrapy_seleniumbase_cdp` parent logger.
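That inheritance can be verified with a quick standard-library check, independent of Scrapy:

```python
import logging

# Setting the parent logger's level...
logging.getLogger('scrapy_seleniumbase_cdp').setLevel(logging.DEBUG)

# ...is picked up by the child logger the middleware uses, because
# loggers with no explicit level delegate to their nearest ancestor.
child = logging.getLogger('scrapy_seleniumbase_cdp.middleware_async')
print(child.getEffectiveLevel() == logging.DEBUG)  # True
```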
You can also use Scrapy's per-module log configuration via the `LOG_CATEGORIES` setting (Scrapy ≥ 2.8):

```python
LOG_CATEGORIES = {
    'scrapy_seleniumbase_cdp': 'DEBUG',
}
```

## Architecture

See docs/ARCHITECTURE.md for a detailed overview of the middleware internals, including a sequence diagram of the request processing pipeline.
## License

This project is licensed under the MIT License. It is a fork of Quartz-Core/scrapy-seleniumbase, which was originally released under the WTFPL.