
Refactor subscription polling to deduplicate URLs before testing#767

Merged
jadolg merged 10 commits into main from
claude/fix-issue-626-CeawT
Apr 9, 2026

Conversation


@jadolg jadolg commented Apr 9, 2026

Summary

Restructured the poll_subscriptions() task to separate concerns into two distinct phases: fetching and parsing subscriptions, then testing connectivity for deduplicated URLs. This improves efficiency by avoiding redundant connectivity tests for duplicate proxy addresses across subscriptions.

Key Changes

  • Two-phase architecture:

    • Phase 1 collects all candidate URLs from enabled subscriptions
    • Phase 2 tests connectivity only for new URLs not already in the database
  • Deduplication before testing: Candidate URLs are deduplicated using a set and filtered against existing database entries before any connectivity testing occurs, reducing unnecessary network calls

  • Extracted helper functions:

    • extract_sip002_url(): Validates and normalizes SIP002 URL format without DB checks or connectivity testing
    • test_and_create_proxy(): Tests connectivity for a single URL and returns a Proxy object if reachable
    • process_line(): Refactored to use the new helpers while maintaining backward compatibility
  • Removed ThreadPoolExecutor nesting: Eliminated the nested executor.map() calls that were creating intermediate iterators; now uses a single executor.map() call on the deduplicated candidate set

  • Improved logging: Added phase-specific log messages and count tracking for better observability

  • Enhanced test coverage: Added ExtractSip002UrlTest class to validate URL format extraction independently, and ProcessLineRejectsKnownUrlTest to verify duplicate rejection

Implementation Details

The refactoring maintains the same error handling and subscription state management while improving the execution model. The separation of URL extraction from connectivity testing allows for cleaner, more testable code and better resource utilization when processing subscriptions with overlapping proxy lists.

https://claude.ai/code/session_01CfdJ8qPDaFewUmo8SpgCXw

claude added 3 commits April 9, 2026 07:43
Fixes #626. Subscriptions often share the same proxy addresses.
The previous approach called process_line (which includes a network
connectivity check) once per address per subscription, so duplicates
were tested multiple times.

New two-phase approach in poll_subscriptions:
1. Fetch all subscriptions and normalize each line into a SIP002 URL,
   collecting results into a single set — deduplication is free.
2. Run connectivity tests (get_proxy_location) in parallel only against
   the deduplicated set of unknown URLs.

process_line is preserved (now delegates to the two new helpers) so
existing call sites and tests continue to work unchanged.

Also adds ExtractSip002UrlTest covering the new extract_sip002_url helper.

https://claude.ai/code/session_01CfdJ8qPDaFewUmo8SpgCXw
extract_sip002_url now only validates/normalizes the ss:// format — it no
longer accepts or checks all_urls.  In poll_subscriptions the three steps
are now clearly separated:

  1. Collect all valid SIP002 URLs across all subscriptions into a set
     (the set collapses duplicates automatically).
  2. Subtract addresses already in the database (candidate_urls -= all_urls)
     so only truly new addresses proceed.
  3. Connectivity-test the remaining unique, new-only URLs in parallel.

process_line retains its all_urls parameter and now performs the DB check
explicitly (url in all_urls) after calling extract_sip002_url.

Tests updated to match the new no-param signature of extract_sip002_url;
the "already known URL" case is moved to ProcessLineRejectsKnownUrlTest.

https://claude.ai/code/session_01CfdJ8qPDaFewUmo8SpgCXw
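
A minimal sketch of the split this commit describes, assuming `extract_sip002_url` returns a normalized URL or `None` and that normalization drops the fragment tag; the project's real validation is richer than this:

```python
from urllib.parse import urlparse

def extract_sip002_url(line):
    # Validate/normalize an ss:// line; no DB access, no connectivity test.
    line = line.strip()
    parsed = urlparse(line)
    if parsed.scheme != "ss" or not parsed.hostname:
        return None
    # Drop the fragment (tag) so duplicate addresses compare equal.
    return line.split("#", 1)[0]

def process_line(line, all_urls):
    # The DB membership check now happens explicitly after extraction.
    url = extract_sip002_url(line)
    if url is None or url in all_urls:
        return None
    return url
```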
…B filtering

PollSubscriptionsDeduplicationTest mocks two subscriptions (one PLAIN,
one BASE64, matching subscriptions.json fixtures):

  List 1 (PLAIN):  unique-one, shared-password, <DB address>
  List 2 (BASE64): shared-password, shared-password (dup), unique-two, <DB address>

Verifies that after running poll_subscriptions:
  - The address already in the database (proxies.json pk=6) is never
    passed to get_proxy_location (DB subtraction step works).
  - shared-password appears across both lists and twice within the BASE64
    list, but get_proxy_location is called for it exactly once (set
    deduplication works).
  - Exactly 3 new proxies are saved (unique-one, shared-password,
    unique-two); the existing DB proxy count stays at 1.

update_proxy_status is mocked with a side-effect that sets a non-empty
location so the post-save signal recursion terminates cleanly.

https://claude.ai/code/session_01CfdJ8qPDaFewUmo8SpgCXw
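
The assertion pattern the test relies on can be illustrated outside Django with `unittest.mock`; the addresses below are illustrative stand-ins for the fixture data, and `get_proxy_location` is a plain `Mock` rather than the project's function:

```python
from unittest.mock import Mock

# Stand-in for the mocked connectivity check.
get_proxy_location = Mock(return_value="NL")

# The two mocked lists from the test description (addresses are illustrative).
list_one = ["ss://unique-one@h:1", "ss://shared-password@h:2", "ss://in-db@h:3"]
list_two = ["ss://shared-password@h:2", "ss://shared-password@h:2",
            "ss://unique-two@h:4", "ss://in-db@h:3"]
db_urls = {"ss://in-db@h:3"}  # the address already in the database

# Set union collapses duplicates; subtraction drops the known DB address.
candidates = (set(list_one) | set(list_two)) - db_urls
for url in candidates:
    get_proxy_location(url)

calls = [c.args[0] for c in get_proxy_location.call_args_list]
assert "ss://in-db@h:3" not in calls                  # DB subtraction works
assert calls.count("ss://shared-password@h:2") == 1   # set dedup works
assert len(calls) == 3                                # exactly 3 new proxies
```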
Comment thread on proxylist/tests/test_tasks.py: Fixed
claude added 5 commits April 9, 2026 08:34
Replace substring checks ("githubusercontent.com" in url) with exact
equality checks (url == "https://...") so CodeQL's incomplete URL
sanitization rule is satisfied.

https://claude.ai/code/session_01CfdJ8qPDaFewUmo8SpgCXw
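
A small illustration of why CodeQL flags the substring form: an attacker-controlled URL can embed the trusted host anywhere, e.g. in its query string. The trusted URL below is hypothetical, not the PR's actual constant:

```python
# TRUSTED is a hypothetical URL; the PR's actual constants live in the repo.
TRUSTED = "https://raw.githubusercontent.com/example/proxies/main/list.txt"

def is_trusted_substring(url):
    return "githubusercontent.com" in url   # flagged: incomplete URL sanitization

def is_trusted_exact(url):
    return url == TRUSTED                   # exact equality satisfies CodeQL

# An attacker can smuggle the trusted host into a query string.
evil = "https://evil.example/?x=githubusercontent.com"
assert is_trusted_substring(evil) is True   # substring check is bypassed
assert is_trusted_exact(evil) is False
```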
Extract three helpers to make the two phases readable at a glance:

  _decode_subscription_lines(r, subscription) -> list[str]
      Decodes an HTTP response into proxy lines based on the
      subscription kind (PLAIN or BASE64).

  _collect_candidate_urls(subscriptions) -> set[str]
      Fetches every enabled subscription, extracts and deduplicates
      valid SIP002 addresses, and persists alive/error state.

  _test_candidate_urls(candidate_urls) -> list[Proxy | None]
      Connectivity-tests the deduplicated, new-only addresses in
      parallel via ThreadPoolExecutor.

poll_subscriptions is now a short orchestrator: load DB URLs,
collect candidates, subtract DB, test, save.

https://claude.ai/code/session_01CfdJ8qPDaFewUmo8SpgCXw
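
A sketch of the decoding step under simplified assumptions (subscription kind as a plain string, response body as text); the real `_decode_subscription_lines` takes an HTTP response and a subscription model instance:

```python
import base64

def decode_subscription_lines(body, kind):
    # BASE64 subscriptions carry the whole proxy list base64-encoded.
    if kind == "BASE64":
        body = base64.b64decode(body).decode("utf-8")
    # PLAIN (and decoded BASE64) bodies are newline-separated proxy lines.
    return [line for line in body.splitlines() if line.strip()]
```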
- Remove flatten() — never called after the proxies_lists refactor
- Remove Iterator import — was only used by flatten
- Remove process_line() — no production callers; its scheme-rejection
  behaviour is already covered by ExtractSip002UrlTest
- Remove ProcessLineTest and ProcessLineRejectsKnownUrlTest
- Flatten save_proxies signature: accepts list[Proxy | None] directly
  instead of a list-of-lists (it was always called with a single-element
  outer list)

https://claude.ai/code/session_01CfdJ8qPDaFewUmo8SpgCXw
When ProcessLineTest was removed, three protocols fell out of coverage:
hysteria://, hy2://, and tuic:// (present in NON_SS_SCHEMES but never
explicitly tested).  Also split test_rejects_http into separate cases
for http:// and https://.

https://claude.ai/code/session_01CfdJ8qPDaFewUmo8SpgCXw
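
The restored coverage amounts to asserting rejection for each scheme individually; `extract_sip002_url` is stubbed here to the behaviour the PR describes, not the project's exact code:

```python
# Stub mirroring the described behaviour: only ss:// lines are accepted.
def extract_sip002_url(line):
    return line if line.startswith("ss://") else None

# One explicit case per scheme, as in the split-out tests.
for scheme in ("hysteria://", "hy2://", "tuic://", "http://", "https://"):
    assert extract_sip002_url(scheme + "host:443") is None

assert extract_sip002_url("ss://creds@host:8388") is not None
```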
@jadolg jadolg self-assigned this Apr 9, 2026
@jadolg jadolg linked an issue Apr 9, 2026 that may be closed by this pull request
claude added 2 commits April 9, 2026 09:19
The check was dead code: lines reaching it had already passed
startswith("ss://"), so they could never match any NON_SS_SCHEMES
prefix. The constant is now unused and removed as well.

https://claude.ai/code/session_01CfdJ8qPDaFewUmo8SpgCXw
@jadolg jadolg added this pull request to the merge queue Apr 9, 2026
Merged via the queue into main with commit 592afdc Apr 9, 2026
5 checks passed
@jadolg jadolg deleted the claude/fix-issue-626-CeawT branch April 9, 2026 09:31

Development

Successfully merging this pull request may close these issues.

Reduce the number of checks while scraping subscriptions
