Skip to content

Fix URL joining for non-HTTP schemes (s3://, file://, etc.)#867

Open
jezdez wants to merge 2 commits intomainfrom
fix/s3-url-join
Open

Fix URL joining for non-HTTP schemes (s3://, file://, etc.)#867
jezdez wants to merge 2 commits intomainfrom
fix/s3-url-join

Conversation

@jezdez
Copy link
Member

@jezdez jezdez commented Feb 5, 2026

Summary

Python's urllib.parse.urljoin doesn't handle non-HTTP URL schemes properly. When joining an s3:// URL with "." or "", it returns just "." instead of the original URL with a trailing slash. This caused package URLs to be malformed (e.g., .zlib-1.2.11-h7b6447c_3.tar.bz2 instead of the full S3 URL).

This PR adds a _safe_urljoin_with_slash() helper function that:

  • Uses standard urljoin for HTTP/HTTPS (works correctly)
  • Handles s3://, file://, ftp://, and other non-HTTP schemes manually
  • Always returns URLs with trailing slash for proper filename concatenation

Changes

  • Added _safe_urljoin_with_slash() helper in shards.py
  • Updated ShardBase.base_url property to use the new helper
  • Updated _shards_base_url() function to use the new helper
  • Added parametrized tests covering HTTP, S3, file, and FTP URL schemes

Alternative considered

An alternative fix would be to register the s3 scheme with urllib.parse globally:

from urllib.parse import uses_netloc, uses_relative
uses_netloc.append("s3")
uses_relative.append("s3")

This makes urljoin work correctly for S3 URLs. However, this approach was rejected because:

  1. Global side effects: Modifies urllib.parse behavior for all Python code in the process, not just conda/cls
  2. Behavioral changes: Changes urljoin results significantly (e.g., urljoin("s3://bucket/path", "file.txt") changes from "file.txt" to "s3://bucket/file.txt")
  3. Potential breakage: Other libraries might (incorrectly) depend on the current behavior

The local _safe_urljoin_with_slash() helper is safer as it only affects the specific code paths that need fixing.

Fixes #866

Python's urllib.parse.urljoin doesn't handle non-HTTP URL schemes properly.
When joining an s3:// URL with "." or "", it returns just "." instead of the
original URL with a trailing slash. This caused package URLs to be malformed
(e.g., ".zlib-1.2.11-h7b6447c_3.tar.bz2" instead of the full S3 URL).

Add _safe_urljoin_with_slash() helper that:
- Uses standard urljoin for HTTP/HTTPS (works correctly)
- Handles s3://, file://, ftp://, and other schemes manually
- Always returns URLs with trailing slash for proper filename concatenation

Fixes #866
@jezdez jezdez requested a review from a team as a code owner February 5, 2026 12:14
@github-project-automation github-project-automation bot moved this to 🆕 New in 🔎 Review Feb 5, 2026
@conda-bot conda-bot added the cla-signed [bot] added once the contributor has signed the CLA label Feb 5, 2026
ZSTD_MAX_SHARD_SIZE = 2**20 * 16 # maximum size necessary when compressed data has no size header

# URL schemes that urljoin handles correctly
_URLJOIN_SAFE_SCHEMES = frozenset(("http", "https", ""))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't urljoin's longer list?

_URLJOIN_SAFE_SCHEMES = frozenset(("http", "https", ""))


def _safe_urljoin_with_slash(base_url: str, relative_url: str = "") -> str:
Copy link
Contributor

@jaimergp jaimergp Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be simpler to replace the scheme with http if it's something else, and then undo?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla-signed [bot] added once the contributor has signed the CLA

Projects

Status: 🆕 New

Development

Successfully merging this pull request may close these issues.

S3 channel URLs are corrupted due to urljoin not handling s3:// scheme

4 participants