Update dependency scrapy to v2.14.2 [SECURITY]#137
renovate[bot] wants to merge 1 commit into master from
This PR contains the following updates:
scrapy: `==2.4.0` → `==2.14.2`

Scrapy HTTP authentication credentials potentially leaked to target websites
CVE-2021-41125 / GHSA-jwqp-28gf-p498
More information
Details
Impact
If you use `HttpAuthMiddleware` (i.e. the `http_user` and `http_pass` spider attributes) for HTTP authentication, all requests will expose your credentials to the request target.

This includes requests generated by Scrapy components, such as `robots.txt` requests sent by Scrapy when the `ROBOTSTXT_OBEY` setting is set to `True`, as well as requests reached through redirects.

Patches
Upgrade to Scrapy 2.5.1 and use the new `http_auth_domain` spider attribute to control which domains are allowed to receive the configured HTTP authentication credentials.

If you are using Scrapy 1.8 or a lower version, and upgrading to Scrapy 2.5.1 is not an option, you may upgrade to Scrapy 1.8.1 instead.
Workarounds
If you cannot upgrade, set your HTTP authentication credentials on a per-request basis, using for example the `w3lib.http.basic_auth_header` function to convert your credentials into a value that you can assign to the `Authorization` header of your request, instead of defining your credentials globally using `HttpAuthMiddleware`.

For more information
If you have any questions or comments about this advisory:
Severity
CVSS:4.0/AV:N/AC:L/AT:N/PR:L/UI:P/VC:H/VI:N/VA:N/SC:N/SI:N/SA:N

References
This data is provided by the GitHub Advisory Database (CC-BY 4.0).
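The patched approach and the per-request workaround from the advisory above can be sketched together. The spider class, credentials, and URL below are illustrative placeholders, and `basic_auth_header` is reimplemented here with the standard library to show what `w3lib.http.basic_auth_header` computes.

```python
import base64

def basic_auth_header(username: str, password: str) -> bytes:
    """Build an HTTP Basic ``Authorization`` header value, as
    ``w3lib.http.basic_auth_header`` does."""
    token = base64.b64encode(f"{username}:{password}".encode("ISO-8859-1"))
    return b"Basic " + token

# Patched approach (Scrapy >= 2.5.1): restrict credentials to one domain.
class AuthSpider:  # sketch; a real spider subclasses scrapy.Spider
    name = "auth_example"
    http_user = "user"
    http_pass = "pass"
    http_auth_domain = "example.com"  # credentials only sent to this domain

# Workaround on unpatched versions: set the header on individual requests
# instead of using HttpAuthMiddleware, e.g.:
# yield scrapy.Request(url, headers={"Authorization": basic_auth_header("user", "pass")})
```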
Incorrect Authorization and Exposure of Sensitive Information to an Unauthorized Actor in scrapy
CVE-2022-0577 / GHSA-cjvr-mfj7-j4j8
More information
Details
Impact
If you manually define cookies on a `Request` object, and that `Request` object gets a redirect response, the new `Request` object scheduled to follow the redirect keeps those user-defined cookies, regardless of the target domain.

Patches
Upgrade to Scrapy 2.6.0, which resets cookies when creating `Request` objects to follow redirects¹, and drops a manually-defined `Cookie` header if the redirect target URL domain name does not match the source URL domain name².

If you are using Scrapy 1.8 or a lower version, and upgrading to Scrapy 2.6.0 is not an option, you may upgrade to Scrapy 1.8.2 instead.
¹ At that point the original, user-set cookies have been processed by the cookie middleware into the global or request-specific cookiejar, with their domain restricted to the domain of the original URL, so when the cookie middleware processes the new (redirect) request it will incorporate those cookies into the new request as long as the domain of the new request matches the domain of the original request.
² This prevents cookie leaks to unintended domains even if the cookies middleware is not used.
Workarounds
If you cannot upgrade, set your cookies using a list of dictionaries instead of a single dictionary, as described in the `Request` documentation, and set the right domain for each cookie.

Alternatively, you can disable cookies altogether, or limit target domains to domains that you trust with all your user-set cookies.
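The list-of-dictionaries workaround above can be sketched as follows; the cookie name, value, and URL are illustrative placeholders.

```python
# Workaround sketch: define cookies as a list of dictionaries, pinning an
# explicit domain (and path) on each cookie, instead of passing a single
# name→value dictionary that would follow the request across redirects.
cookies = [
    {"name": "session", "value": "abc123", "domain": "example.com", "path": "/"},
]
# In a spider, the list is then passed to the request:
# yield scrapy.Request("https://example.com/account", cookies=cookies)
```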
References
For more information
If you have any questions or comments about this advisory:
Severity
CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:H/I:N/A:N

References
This data is provided by the GitHub Advisory Database (CC-BY 4.0).
Scrapy cookie-setting is not restricted based on the public suffix list
GHSA-mfjm-vh54-3f96
More information
Details
Impact
Responses from domain names whose public domain name suffix contains 1 or more periods (e.g. responses from example.co.uk, given its public domain name suffix is co.uk) are able to set cookies that are included in requests to any other domain sharing the same domain name suffix.

Patches
Upgrade to Scrapy 2.6.0, which restricts cookies with their domain set to any of those in the public suffix list.
If you are using Scrapy 1.8 or a lower version, and upgrading to Scrapy 2.6.0 is not an option, you may upgrade to Scrapy 1.8.2 instead.
Workarounds
The only workaround for unpatched versions of Scrapy is to disable cookies altogether, or limit target domains to a subset that does not include domain names with one of the public domain suffixes affected (those with 1 or more periods).
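Disabling cookies altogether amounts to a one-line settings change; this is a sketch of a project `settings.py` fragment.

```python
# settings.py fragment: disable cookie handling entirely, the only
# workaround for the public-suffix issue on unpatched Scrapy versions.
COOKIES_ENABLED = False
```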
References
For more information
If you have any questions or comments about this advisory:
Severity
Medium
References
This data is provided by the GitHub Advisory Database (CC-BY 4.0).
Scrapy before 2.6.2 and 1.8.3 vulnerable to one proxy sending credentials to another
GHSA-9x8m-2xpf-crp3
More information
Details
Impact
When the built-in HTTP proxy downloader middleware processes a request with `proxy` metadata, and that `proxy` metadata includes proxy credentials, the built-in HTTP proxy downloader middleware sets the `Proxy-Authorization` header, but only if that header is not already set.

There are third-party proxy-rotation downloader middlewares that set different `proxy` metadata every time they process a request.

Because of request retries and redirects, the same request can be processed by downloader middlewares more than once, including both the built-in HTTP proxy downloader middleware and any third-party proxy-rotation downloader middleware.

These third-party proxy-rotation downloader middlewares could change the `proxy` metadata of a request to a new value, but fail to remove the `Proxy-Authorization` header from the previous value of the `proxy` metadata, causing the credentials of one proxy to be leaked to a different proxy.

If you rotate proxies from different proxy providers, and any of those proxies requires credentials, you are affected, unless you are handling proxy rotation as described under Workarounds below. If you use a third-party downloader middleware for proxy rotation, the same applies to that downloader middleware, and installing a patched version of Scrapy may not be enough; patching that downloader middleware may be necessary as well.
Patches
Upgrade to Scrapy 2.6.2.
If you are using Scrapy 1.8 or a lower version, and upgrading to Scrapy 2.6.2 is not an option, you may upgrade to Scrapy 1.8.3 instead.
Workarounds
If you cannot upgrade, make sure that any code that changes the value of the `proxy` request meta also removes the `Proxy-Authorization` header from the request if present.

For more information
If you have any questions or comments about this advisory:
Severity
Medium
References
This data is provided by the GitHub Advisory Database (CC-BY 4.0).
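The workaround in the advisory above can be sketched as a helper that any proxy-rotation code could call. The function name is illustrative, and `headers`/`meta` stand in for `request.headers` and `request.meta`.

```python
def rotate_proxy(headers: dict, meta: dict, new_proxy: str) -> None:
    """Change the ``proxy`` request meta value, dropping any
    ``Proxy-Authorization`` header left over from the previous proxy so
    that one proxy's credentials are never sent to another proxy."""
    if meta.get("proxy") != new_proxy:
        headers.pop("Proxy-Authorization", None)
    meta["proxy"] = new_proxy
```

Any middleware that rewrites `proxy` metadata would call this (or equivalent logic) instead of assigning `meta["proxy"]` directly.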
Scrapy vulnerable to ReDoS via XMLFeedSpider
CVE-2024-1892 / GHSA-cc65-xxvf-f7r9
More information
Details
Impact
The following parts of the Scrapy API were found to be vulnerable to a ReDoS attack:

- The `XMLFeedSpider` class, or any subclass that uses the default node iterator, `iternodes`, as well as direct uses of the `scrapy.utils.iterators.xmliter` function.
- Scrapy 2.6.0 to 2.11.0: the `open_in_browser` function for a response without a base tag.

Handling a malicious response could cause extreme CPU and memory usage during the parsing of its content, due to the use of vulnerable regular expressions for that parsing.
Patches
Upgrade to Scrapy 2.11.1.
If you are using Scrapy 1.8 or a lower version, and upgrading to Scrapy 2.11.1 is not an option, you may upgrade to Scrapy 1.8.4 instead.
Workarounds
For `XMLFeedSpider`, switch the node iterator to `xml` or `html`.

For `open_in_browser`, before using the function, either manually review the response content to rule out a ReDoS attack or manually define the base tag to avoid its automatic definition by `open_in_browser` later.

Acknowledgements
This security issue was reported by @nicecatch2000 through huntr.com.
Severity
CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H

References
This data is provided by the GitHub Advisory Database (CC-BY 4.0).
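The `XMLFeedSpider` workaround above amounts to one attribute change. The spider below is a sketch with placeholder names; a real spider would subclass `scrapy.spiders.XMLFeedSpider`.

```python
class SafeFeedSpider:  # sketch; a real spider subclasses scrapy.spiders.XMLFeedSpider
    name = "safefeed"
    # Switch from the vulnerable default iterator, "iternodes", to the
    # lxml-based "xml" iterator (or "html") to avoid the ReDoS-prone
    # regular-expression parsing path.
    iterator = "xml"
    itertag = "item"
```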
Scrapy authorization header leakage on cross-domain redirect
CVE-2024-3574 / GHSA-cw9j-q3vf-hrrv
More information
Details
Impact
When you send a request with the `Authorization` header to one domain, and the response asks to redirect to a different domain, Scrapy's built-in redirect middleware creates a follow-up redirect request that keeps the original `Authorization` header, leaking its content to that second domain.

The right behavior would be to drop the `Authorization` header in this scenario.

Patches
Upgrade to Scrapy 2.11.1.
If you are using Scrapy 1.8 or a lower version, and upgrading to Scrapy 2.11.1 is not an option, you may upgrade to Scrapy 1.8.4 instead.
Workarounds
If you cannot upgrade, make sure that you are not using the `Authorization` header, either directly or through some third-party plugin.

If you need to use that header in some requests, add `"dont_redirect": True` to the `request.meta` dictionary of those requests to disable following redirects for them.

If you need to keep (same-domain) redirect support on those requests, make sure you trust the target website not to redirect your requests to a different domain.
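The `dont_redirect` workaround looks like this in a spider; the URL and token are illustrative placeholders.

```python
# Workaround sketch: for requests that carry credentials, disable redirect
# following per request so the Authorization header can never be carried
# across a redirect to another domain.
meta = {"dont_redirect": True}
headers = {"Authorization": "Bearer PLACEHOLDER_TOKEN"}
# yield scrapy.Request("https://api.example.com/data", headers=headers, meta=meta)
```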
Acknowledgements
This security issue was reported by @ranjit-git through huntr.com.
Severity
CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:N/A:N

References
This data is provided by the GitHub Advisory Database (CC-BY 4.0).
Scrapy decompression bomb vulnerability
CVE-2024-3572 / GHSA-7j7m-v7m3-jqm7
More information
Details
Impact
Scrapy limits allowed response sizes by default through the `DOWNLOAD_MAXSIZE` and `DOWNLOAD_WARNSIZE` settings.

However, those limits were only being enforced during the download of the raw, usually-compressed response bodies, and not during decompression, making Scrapy vulnerable to decompression bombs.
A malicious website being scraped could send a small response that, on decompression, could exhaust the memory available to the Scrapy process, potentially affecting any other process sharing that memory, and affecting disk usage in case of uncompressed response caching.
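For reference, these are the two settings involved, shown in a sketch of a `settings.py` fragment with Scrapy's documented default values; on unpatched versions they were only enforced on the compressed body.

```python
# settings.py fragment: response size limits (Scrapy defaults shown).
DOWNLOAD_MAXSIZE = 1073741824   # 1 GiB: larger downloads are aborted
DOWNLOAD_WARNSIZE = 33554432    # 32 MiB: larger downloads log a warning
```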
Patches
Upgrade to Scrapy 2.11.1.
If you are using Scrapy 1.8 or a lower version, and upgrading to Scrapy 2.11.1 is not an option, you may upgrade to Scrapy 1.8.4 instead.
Workarounds
There is no easy workaround.
Disabling HTTP decompression altogether is impractical, as HTTP compression is a rather common practice.
However, it is technically possible to manually backport the 2.11.1 or 1.8.4 fix, replacing the corresponding components of an unpatched version of Scrapy with patched versions copied into your own code.
Acknowledgements
This security issue was reported by @dmandefy through huntr.com.
Severity
CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H

References
This data is provided by the GitHub Advisory Database (CC-BY 4.0).
Scrapy leaks the authorization header on same-domain but cross-origin redirects
CVE-2024-1968 / GHSA-4qqq-9vqf-3h3f
More information
Details
Impact
Since version 2.11.1, Scrapy drops the `Authorization` header when a request is redirected to a different domain. However, it keeps the header if the domain remains the same but the scheme (http/https) or the port changes, all scenarios where the header should also be dropped.

In the context of a man-in-the-middle attack, this could be used to gain access to the value of that `Authorization` header.

Patches
Upgrade to Scrapy 2.11.2.
Workarounds
There is no easy workaround for unpatched versions of Scrapy. You can replace the built-in redirect middlewares with custom ones patched for this issue, but you have to patch them yourself, manually.
References
This security issue was reported and fixed by @szarny at https://huntr.com/bounties/27f6a021-a891-446a-ada5-0226d619dd1a/.
Severity
CVSS:3.1/AV:N/AC:H/PR:N/UI:N/S:U/C:H/I:N/A:N

References
This data is provided by the GitHub Advisory Database (CC-BY 4.0).
Scrapy's redirects ignoring scheme-specific proxy settings
GHSA-jm3v-qxmh-hxwv
More information
Details
Impact
When using system proxy settings, which are scheme-specific (i.e. specific to `http://` or `https://` URLs), Scrapy was not accounting for scheme changes during redirects.

For example, an HTTP request would use the proxy configured for HTTP and, when redirected to an HTTPS URL, the new HTTPS request would still use the proxy configured for HTTP instead of switching to the proxy configured for HTTPS, and the same the other way around.

If you have different proxy configurations for HTTP and HTTPS in your system for security reasons (e.g. you may not want one of your proxy providers to be aware of the URLs that you visit with the other one), this would be a security issue.
Patches
Upgrade to Scrapy 2.11.2.
Workarounds
Replace the built-in redirect middlewares (`RedirectMiddleware` and `MetaRefreshMiddleware`) and the `HttpProxyMiddleware` middleware with custom ones that implement the fix from Scrapy 2.11.2, and verify that they work as intended.

References
This security issue was reported by @redapp in https://github.com/scrapy/scrapy/issues/767.
Severity
CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:L/I:N/A:N

References
This data is provided by the GitHub Advisory Database (CC-BY 4.0).
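Swapping in patched middlewares, as the workaround above describes, is a settings change along these lines. The `myproject.middlewares` paths are hypothetical placeholders; the priorities match Scrapy's documented defaults for the built-in middlewares being replaced.

```python
# settings.py sketch: disable the built-in middlewares and register
# patched local copies at the same default priorities.
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": None,
    "scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware": None,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": None,
    "myproject.middlewares.PatchedRedirectMiddleware": 600,
    "myproject.middlewares.PatchedMetaRefreshMiddleware": 580,
    "myproject.middlewares.PatchedHttpProxyMiddleware": 750,
}
```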
Scrapy allows redirect following in protocols other than HTTP
GHSA-23j4-mw76-5v7h
More information
Details
Impact
Scrapy was following redirects regardless of the URL protocol, so redirects were working for `data://`, `file://`, `ftp://`, `s3://`, and any other scheme defined in the `DOWNLOAD_HANDLERS` setting. However, HTTP redirects should only work between URLs that use the `http://` or `https://` schemes.

A malicious actor, given write access to the start requests (e.g. the ability to define `start_urls`) of a spider and read access to the spider output, could exploit this vulnerability to:

- Redirect to a URL with the `file://` scheme to read its contents.
- Redirect to the `ftp://` URL of a malicious FTP server to obtain the FTP username and password configured in the spider or project.
- Redirect to an `s3://` URL to read its content using the S3 credentials configured in the spider or project.

For `file://` and `s3://`, how the spider implements its parsing of input data into an output item determines what data would be vulnerable. A spider that always outputs the entire contents of a response would be completely vulnerable, while a spider that extracted only fragments from the response could significantly limit vulnerable data.

Patches
Upgrade to Scrapy 2.11.2.
Workarounds
Replace the built-in redirect middlewares (`RedirectMiddleware` and `MetaRefreshMiddleware`) with custom ones that implement the fix from Scrapy 2.11.2, and verify that they work as intended.

References
This security issue was reported by @mvsant in https://github.com/scrapy/scrapy/issues/457.
Severity
CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:H/I:N/A:N

References
This data is provided by the GitHub Advisory Database (CC-BY 4.0).
Scrapy is vulnerable to a denial of service (DoS) attack due to flaws in brotli decompression implementation
CVE-2025-6176 / GHSA-2qfp-q593-8484
More information
Details
Scrapy versions up to 2.13.3 are vulnerable to a denial-of-service (DoS) attack due to a flaw in their brotli decompression implementation. The protection mechanism against decompression bombs fails to mitigate the brotli variant, allowing remote servers to crash clients with less than 80 GB of available memory. This occurs because brotli can achieve extremely high compression ratios for zero-filled data, leading to excessive memory consumption during decompression. Mitigating this vulnerability requires the security enhancement added in brotli v1.2.0.
Severity
CVSS:3.0/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H

References
This data is provided by the GitHub Advisory Database (CC-BY 4.0).
Scrapy: Arbitrary Module Import via Referrer-Policy Header in RefererMiddleware
GHSA-cwxj-rr6w-m6w7
More information
Details
Impact
Since version 1.4.0, Scrapy respects the `Referrer-Policy` response header to decide whether and how to set a `Referer` header on follow-up requests.

If the header value looked like a valid Python import path, Scrapy would import the referenced object and call it, assuming it referred to a referrer policy class (for example, `scrapy.spidermiddlewares.referer.DefaultReferrerPolicy`) and attempting to instantiate it to handle the `Referer` header.

A malicious site could exploit this by setting `Referrer-Policy` to a path such as `sys.exit`, causing Scrapy to import and execute it and potentially terminate the process.

Patches
Upgrade to Scrapy 2.14.2 (or later).
Workarounds
If you cannot upgrade to Scrapy 2.14.2, consider the following mitigations.

- If you do not need the `Referer` header on follow-up requests, set `REFERER_ENABLED` to `False`.
- If only some requests need `Referer`, disable the middleware and set the header explicitly on the requests that require it.
- Use `referrer_policy` in request metadata: if disabling the middleware is not viable, set the `referrer_policy` request meta key on all requests to prevent evaluating preceding responses' `Referrer-Policy`. Instead of editing requests individually, you can use a spider middleware that sets the `referrer_policy` meta key.

If you want to continue respecting legitimate `Referrer-Policy` headers while protecting against malicious ones, disable the built-in referrer policy middleware by setting it to `None` in `SPIDER_MIDDLEWARES` and replace it with the fixed implementation from Scrapy 2.14.2. If the Scrapy 2.14.2 implementation is incompatible with your project (for example, because your Scrapy version is older), copy the corresponding middleware from your Scrapy version, apply the same patch, and use that as a replacement.
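The per-request mitigation above can be sketched as follows. `"same-origin"` is one of the standard policy names Scrapy accepts for this meta key, and the spider-middleware class is a hypothetical way to apply the key to every request.

```python
# Workaround sketch: pin the referrer policy on each request so the
# response's Referrer-Policy header is never looked up or imported.
meta = {"referrer_policy": "same-origin"}
# yield scrapy.Request("https://example.com/", meta=meta)

# To avoid editing every request, a small spider middleware (hypothetical
# class name) could set the same meta key on all outgoing requests:
class ForceReferrerPolicyMiddleware:
    def process_spider_output(self, response, result, spider):
        for item_or_request in result:
            request_meta = getattr(item_or_request, "meta", None)
            if request_meta is not None:
                # Only requests have a meta dict; items pass through untouched.
                request_meta.setdefault("referrer_policy", "same-origin")
            yield item_or_request
```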
Severity
CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H

References
This data is provided by the GitHub Advisory Database (CC-BY 4.0).
Release Notes
scrapy/scrapy (scrapy)
v2.14.2 Compare Source

Security fix: objects referenced by the `Referrer-Policy` header of HTTP responses are no longer executed as Python callables. See the cwxj-rr6w-m6w7 security advisory for details.

Full Changelog
v2.14.1 Compare Source

- `maybeDeferred_coro()`
- `{open,close}_spider()`

Full Changelog
v2.14.0 Compare Source

- `DownloaderAwarePriorityQueue`

Full changelog
v2.13.4 Compare Source

Fix for the CVE-2025-6176 security issue: improved protection against decompression bombs in `HttpCompressionMiddleware` for responses compressed using the `br` and `deflate` methods. Requires `brotli >= 1.2.0`.

Full changelog
v2.13.3 Compare Source

Changed the default values of `DOWNLOAD_DELAY` (from `0` to `1`) and `CONCURRENT_REQUESTS_PER_DOMAIN` (from `8` to `1`) in the default project template.

See the full changelog
v2.13.2 Compare Source
See the full changelog
v2.13.1 Compare Source
See the full changelog
v2.13.0 Compare Source

- Replaced `start_requests()` (sync) with `start()` (async) and changed how it is iterated.
- `allow_offsite` request meta key

See the full changelog
v2.12.0 Compare Source

- `start_requests` can now yield items
- `scrapy.http.JsonResponse`
- `CLOSESPIDER_PAGECOUNT_NO_ITEM` setting

See the full changelog.
v2.11.2 Compare Source
Mostly bug fixes, including security bug fixes.
See the full changelog.
v2.11.1 Compare Source
See the full changelog.
v2.11.0 Compare Source

Settings can now be modified in `from_crawler` methods, e.g. based on spider arguments.

See the full changelog.
v2.10.1 Compare Source

Marked `Twisted >= 23.8.0` as unsupported.

v2.10.0 Compare Source
See the full changelog.
v2.9.0 Compare Source
See the full changelog.
v2.8.0 Compare Source
This is a maintenance release, with minor features, bug fixes, and cleanups.
See the full changelog.
v2.7.1 Compare Source

The `Proxy-Authorization` header can again be set explicitly in certain cases, restoring compatibility with scrapy-zyte-smartproxy 2.1.0 and older.

See the full changelog
v2.7.0 Compare Source
See the full changelog
v2.6.3 Compare Source

Makes `pip install Scrapy` work again.

It required making changes to support pyOpenSSL 22.1.0. We had to drop support for SSLv3 as a result.
We also upgraded the minimum versions of some dependencies.
See the changelog.
v2.6.2 Compare Source
Fixes a security issue around HTTP proxy usage, and addresses a few regressions introduced in Scrapy 2.6.0.
See the changelog.
v2.6.1 Compare Source
Fixes a regression introduced in 2.6.0 that would unset the request method when following redirects.
v2.6.0 Compare Source

Support for `pathlib.Path` output paths and per-feed item filtering and post-processing in feed exports.

See the full changelog
Security bug fixes
When a `Request` object with cookies defined gets a redirect response causing a new `Request` object to be scheduled, the cookies defined in the original `Request` object are no longer copied into the new `Request` object.

If you manually set the `Cookie` header on a `Request` object and the domain name of the redirect URL is not an exact match for the domain of the URL of the original `Request` object, your `Cookie` header is now dropped from the new `Request` object.

The old behavior could be exploited by an attacker to gain access to your cookies. Please see the cjvr-mfj7-j4j8 security advisory for more information.
Note: it is still possible to enable the sharing of cookies between different domains with a shared domain suffix (e.g. `example.com` and any subdomain) by defining the shared domain suffix (e.g. `example.com`) as the cookie domain when defining your cookies. See the documentation of the `Request` class for more information.

When the domain of a cookie, either received in the `Set-Cookie` header of a response or defined in a `Request` object, is set to a public suffix (https://publicsuffix.org/), the cookie is now ignored unless the cookie domain is the same as the request domain.

The old behavior could be exploited by an attacker to inject cookies from a controlled domain into your cookiejar that could be sent to other domains not controlled by the attacker. Please see the mfjm-vh54-3f96 security advisory for more information.
v2.5.1 Compare Source
Security bug fix:
If you use `HttpAuthMiddleware` (i.e. the `http_user` and `http_pass` spider attributes) for HTTP authentication, any request exposes your credentials to the request target.

To prevent unintended exposure of authentication credentials to unintended domains, you must now additionally set a new spider attribute, `http_auth_domain`, and point it to the specific domain to which the authentication credentials must be sent.

If the `http_auth_domain` spider attribute is not set, the domain of the first request will be considered the HTTP authentication target, and authentication credentials will only be sent in requests targeting that domain.

If you need to send the same HTTP authentication credentials to multiple domains, you can use `w3lib.http.basic_auth_header` instead to set the value of the `Authorization` header of your requests.

If you really want your spider to send the same HTTP authentication credentials to any domain, set the `http_auth_domain` spider attribute to `None`.

Finally, if you are a user of scrapy-splash, know that this version of Scrapy breaks compatibility with scrapy-splash 0.7.2 and earlier. You will need to upgrade scrapy-splash to a greater version for it to continue to work.
v2.5.0 Compare Source
See the full changelog
v2.4.1 Compare Source
- Fixed feed exports overwrite support
- Fixed the asyncio event loop handling, which could make code hang
- Fixed the IPv6-capable DNS resolver `CachingHostnameResolver` for download handlers that call `reactor.resolve`
- Fixed the output of the `genspider` command showing placeholders instead of the import path of the generated spider module (issue 4874)

Configuration
📅 Schedule: (UTC)
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.