Refactor subscription polling to deduplicate URLs before testing #767
Merged
Fixes #626.
Subscriptions often share the same proxy addresses. The previous approach called process_line (which includes a network connectivity check) once per address per subscription, so duplicates were tested multiple times.
New two-phase approach in poll_subscriptions:
1. Fetch all subscriptions and normalize each line into a SIP002 URL, collecting results into a single set, so deduplication is free.
2. Run connectivity tests (get_proxy_location) in parallel only against the deduplicated set of unknown URLs.
process_line is preserved (now delegates to the two new helpers) so existing call sites and tests continue to work unchanged.
Also adds ExtractSip002UrlTest covering the new extract_sip002_url helper.
https://claude.ai/code/session_01CfdJ8qPDaFewUmo8SpgCXw
extract_sip002_url now only validates/normalizes the ss:// format — it no
longer accepts or checks all_urls. In poll_subscriptions the three steps
are now clearly separated:
1. Collect all valid SIP002 URLs across all subscriptions into a set
(the set collapses duplicates automatically).
2. Subtract addresses already in the database (candidate_urls -= all_urls)
so only truly new addresses proceed.
3. Connectivity-test the remaining unique, new-only URLs in parallel.
process_line retains its all_urls parameter and now performs the DB check
explicitly (url in all_urls) after calling extract_sip002_url.
Tests updated to match the new no-param signature of extract_sip002_url;
the "already known URL" case is moved to ProcessLineRejectsKnownUrlTest.
https://claude.ai/code/session_01CfdJ8qPDaFewUmo8SpgCXw
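The three steps described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `poll` function, its parameters, the `check` callback (standing in for get_proxy_location), and the simplified `extract_sip002_url` body are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Optional


def extract_sip002_url(line: str) -> Optional[str]:
    """Simplified stand-in: accept only trimmed ss:// lines (assumption)."""
    line = line.strip()
    return line if line.startswith("ss://") else None


def poll(
    subscriptions: Iterable[Iterable[str]],
    all_urls: set[str],
    check: Callable[[str], Optional[str]],
) -> list[tuple[str, str]]:
    # Step 1: collect every valid URL into a set; duplicates collapse here.
    candidate_urls: set[str] = set()
    for lines in subscriptions:
        for line in lines:
            url = extract_sip002_url(line)
            if url is not None:
                candidate_urls.add(url)

    # Step 2: drop addresses that are already stored in the database.
    candidate_urls -= all_urls

    # Step 3: test what is left in parallel, one call per unique URL.
    ordered = sorted(candidate_urls)
    with ThreadPoolExecutor(max_workers=8) as executor:
        locations = list(executor.map(check, ordered))

    # Keep only the URLs for which the check returned a location.
    return [(url, loc) for url, loc in zip(ordered, locations) if loc]
```

With two overlapping lists and one known address, `check` runs once per unique new URL, regardless of how many subscriptions repeat it.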
…B filtering
PollSubscriptionsDeduplicationTest mocks two subscriptions (one PLAIN,
one BASE64, matching subscriptions.json fixtures):
List 1 (PLAIN): unique-one, shared-password, <DB address>
List 2 (BASE64): shared-password, shared-password (dup), unique-two, <DB address>
Verifies that after running poll_subscriptions:
- The address already in the database (proxies.json pk=6) is never
passed to get_proxy_location (DB subtraction step works).
- shared-password appears across both lists and twice within the BASE64
list, but get_proxy_location is called for it exactly once (set
deduplication works).
- Exactly 3 new proxies are saved (unique-one, shared-password,
unique-two); the existing DB proxy count stays at 1.
update_proxy_status is mocked with a side-effect that sets a non-empty
location so the post-save signal recursion terminates cleanly.
https://claude.ai/code/session_01CfdJ8qPDaFewUmo8SpgCXw
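The assertions this test makes can be illustrated without Django. The sketch below is an assumption-laden stand-in: `check_new_urls` is a hypothetical helper that mimics only the deduplication and DB-subtraction steps, the URLs are placeholders, and a `Mock` replaces get_proxy_location.

```python
from unittest.mock import Mock


def check_new_urls(lists, db_urls, get_proxy_location):
    """Hypothetical stand-in for the dedup and DB-subtraction steps."""
    candidates = {url for urls in lists for url in urls} - db_urls
    return {url: get_proxy_location(url) for url in sorted(candidates)}


# Placeholder addresses mirroring the two mocked subscriptions.
plain_list = ["ss://unique-one", "ss://shared-password", "ss://db-address"]
base64_list = [
    "ss://shared-password",
    "ss://shared-password",  # duplicate within the same list
    "ss://unique-two",
    "ss://db-address",
]
db_urls = {"ss://db-address"}

get_proxy_location = Mock(return_value="NL")
tested = check_new_urls([plain_list, base64_list], db_urls, get_proxy_location)

# Collect the first positional argument of every recorded call.
called_with = [call.args[0] for call in get_proxy_location.call_args_list]
```

Inspecting `called_with` afterwards shows whether the known address was skipped and whether the shared address was tested only once, which are the two properties the real test verifies.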
Replace substring checks ("githubusercontent.com" in url) with exact
equality checks (url == "https://...") so CodeQL's incomplete URL
sanitization rule is satisfied.
https://claude.ai/code/session_01CfdJ8qPDaFewUmo8SpgCXw
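To see why a substring check is flagged, compare it with an exact match. The sketch below is illustrative only: `ALLOWED_URL` is a hypothetical address, and the hostname-based variant is an alternative shown for comparison, not something this commit uses.

```python
from urllib.parse import urlsplit

# Hypothetical subscription URL, used only for this example.
ALLOWED_URL = "https://raw.githubusercontent.com/example/repo/main/list.txt"


def matches_substring(url: str) -> bool:
    # Flagged pattern: the host name may appear anywhere in the string.
    return "githubusercontent.com" in url


def matches_exact(url: str) -> bool:
    # Pattern adopted in this commit: compare the whole URL.
    return url == ALLOWED_URL


def matches_host(url: str) -> bool:
    # Alternative (not from the PR): compare the parsed host name.
    return urlsplit(url).hostname == "raw.githubusercontent.com"


# A URL on another host that merely contains the trusted name in its path.
spoofed = "https://evil.example/githubusercontent.com/list.txt"
```

The spoofed URL passes the substring check but fails both the exact and the hostname comparison.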
Extract three helpers to make the two phases readable at a glance:
_decode_subscription_lines(r, subscription) -> list[str]
Decodes an HTTP response into proxy lines based on the
subscription kind (PLAIN or BASE64).
_collect_candidate_urls(subscriptions) -> set[str]
Fetches every enabled subscription, extracts and deduplicates
valid SIP002 addresses, and persists alive/error state.
_test_candidate_urls(candidate_urls) -> list[Proxy | None]
Connectivity-tests the deduplicated, new-only addresses in
parallel via ThreadPoolExecutor.
poll_subscriptions is now a short orchestrator: load DB URLs,
collect candidates, subtract DB, test, save.
https://claude.ai/code/session_01CfdJ8qPDaFewUmo8SpgCXw
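As a rough idea of what the decoding helper might look like, here is a simplified sketch. It is not the PR's implementation: it takes raw bytes and a kind string instead of the real (r, subscription) arguments, and the handling of missing padding and blank lines is an assumption.

```python
import base64


def decode_subscription_lines(body: bytes, kind: str) -> list[str]:
    """Turn a response body into non-empty proxy lines (illustrative)."""
    if kind == "BASE64":
        # Drop whitespace, then restore any padding the server omitted.
        compact = b"".join(body.split())
        padded = compact + b"=" * (-len(compact) % 4)
        body = base64.b64decode(padded)
    text = body.decode("utf-8", errors="replace")
    # One proxy address per line; skip blank lines.
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Both kinds end up as the same list of strings, so the collection step does not need to know how a subscription was encoded.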
- Remove flatten(): never called after the proxies_lists refactor
- Remove Iterator import: was only used by flatten
- Remove process_line(): no production callers; its scheme-rejection behaviour is already covered by ExtractSip002UrlTest
- Remove ProcessLineTest and ProcessLineRejectsKnownUrlTest
- Flatten save_proxies signature: accepts list[Proxy | None] directly instead of a list-of-lists (it was always called with a single-element outer list)
https://claude.ai/code/session_01CfdJ8qPDaFewUmo8SpgCXw
When ProcessLineTest was removed, three protocols fell out of coverage:
hysteria://, hy2://, and tuic:// (present in NON_SS_SCHEMES but never
explicitly tested). Also split test_rejects_http into separate cases for
http:// and https://.
https://claude.ai/code/session_01CfdJ8qPDaFewUmo8SpgCXw
The check was dead code: lines reaching it had already passed
startswith("ss://"), so they could never match any NON_SS_SCHEMES
prefix. The constant is now unused and removed as well.
https://claude.ai/code/session_01CfdJ8qPDaFewUmo8SpgCXw
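The reasoning can be checked with a small sketch. The tuple below is an assumed subset of NON_SS_SCHEMES, and `classify` is a hypothetical reconstruction of the old control flow, written only to show that the second branch cannot be reached.

```python
# Assumed subset of the removed constant; the real contents may differ.
NON_SS_SCHEMES = ("hysteria://", "hy2://", "tuic://", "http://", "https://")


def classify(line: str) -> str:
    if not line.startswith("ss://"):
        return "rejected"
    # Any line that reaches this point begins with "ss://", so it cannot
    # also begin with one of the other prefixes.
    if line.startswith(NON_SS_SCHEMES):
        return "dead branch"
    return "accepted"


samples = ["ss://abc@host:8388", "hy2://x", "tuic://y", "https://z", ""]
results = [classify(line) for line in samples]
```

No input makes `classify` return "dead branch", which is why the check and the constant could be dropped without changing behaviour.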
Summary
Restructured the poll_subscriptions() task to separate concerns into two distinct phases: fetching and parsing subscriptions, then testing connectivity for deduplicated URLs. This improves efficiency by avoiding redundant connectivity tests for duplicate proxy addresses across subscriptions.
Key Changes
- Two-phase architecture: Subscriptions are fetched and parsed first; connectivity is then tested only for the deduplicated URLs
- Deduplication before testing: Candidate URLs are deduplicated using a set and filtered against existing database entries before any connectivity testing occurs, reducing unnecessary network calls
- Extracted helper functions:
  - extract_sip002_url(): Validates and normalizes SIP002 URL format without DB checks or connectivity testing
  - test_and_create_proxy(): Tests connectivity for a single URL and returns a Proxy object if reachable
  - process_line(): Refactored to use the new helpers while maintaining backward compatibility
- Removed ThreadPoolExecutor nesting: Eliminated the nested executor.map() calls that were creating intermediate iterators; now uses a single executor.map() call on the deduplicated candidate set
- Improved logging: Added phase-specific log messages and count tracking for better observability
- Enhanced test coverage: Added ExtractSip002UrlTest class to validate URL format extraction independently, and ProcessLineRejectsKnownUrlTest to verify duplicate rejection
Implementation Details
The refactoring maintains the same error handling and subscription state management while improving the execution model. The separation of URL extraction from connectivity testing allows for cleaner, more testable code and better resource utilization when processing subscriptions with overlapping proxy lists.
https://claude.ai/code/session_01CfdJ8qPDaFewUmo8SpgCXw