Merged
61 commits
30c2ed6
Find lockfiles in other subdirs
marcleblanc2 Jul 21, 2025
5d061dc
Fix git svn fetch error message processing regexes
marcleblanc2 Jul 21, 2025
14af2fa
Switch customer1 to lateste
marcleblanc2 Jul 21, 2025
4d12771
update top config file
marcleblanc2 Jul 21, 2025
047288f
Fix retry math
marcleblanc2 Jul 21, 2025
d577e5c
Add BUILD_COMMIT_MESSAGE to GHA builds
marcleblanc2 Jul 21, 2025
fbe2ad1
Shorten interval
marcleblanc2 Jul 21, 2025
e98ecd4
Add git_dir_size math
marcleblanc2 Jul 21, 2025
7d31278
Prevent deadlocks
marcleblanc2 Jul 23, 2025
5ed8c9b
Refactor pid sessions and groups, logging repo_key, fixed missing svn…
marcleblanc2 Jul 23, 2025
0db7f3a
Reset job context in main process
marcleblanc2 Jul 23, 2025
a715c28
Allow disabling TLS verification
marcleblanc2 Jul 23, 2025
d28e8bd
Include svn output in svn info errors
marcleblanc2 Jul 23, 2025
5584e35
Fix disable TLS option
marcleblanc2 Jul 23, 2025
6fb279e
Fix math with unknown NoneType
marcleblanc2 Jul 23, 2025
7eb95c1
Try this
marcleblanc2 Jul 23, 2025
8967b37
Whatever
marcleblanc2 Jul 23, 2025
65c03eb
Fit git svn init
marcleblanc2 Jul 23, 2025
1beb5d0
Refactor and simplify URLs and local file paths to ensure uniqueness
marcleblanc2 Jul 25, 2025
be32b1d
Mount host cert trust store into container
marcleblanc2 Jul 25, 2025
90b53ba
Disable TLS verification when needed
marcleblanc2 Jul 25, 2025
7c7a7d4
Try new trust-server-cert arg
marcleblanc2 Jul 25, 2025
bfe4d32
Try this
marcleblanc2 Jul 25, 2025
7ea650c
Or this
marcleblanc2 Jul 25, 2025
903e1a9
Prep for run_subprocess stdin / out interaction
marcleblanc2 Jul 28, 2025
4c6859f
svn info should work with CLI args to disable TLS cert verification
marcleblanc2 Jul 28, 2025
3e5b5f0
Fix duplicate stderr
marcleblanc2 Jul 29, 2025
ab0cfaa
Update Dockerfile for faster builds
marcleblanc2 Jul 29, 2025
fa38c49
Adding interactivity with svn CLI to trust server cert
marcleblanc2 Jul 29, 2025
92a42c5
Fix git svn init command
marcleblanc2 Jul 29, 2025
0be255d
Debug sub_process interaction
marcleblanc2 Jul 29, 2025
9b025c0
Fixed subprocess interaction
marcleblanc2 Jul 29, 2025
05ed85e
Test running svn info as shell
marcleblanc2 Jul 29, 2025
aae77b1
Add better error handling for svn info command
marcleblanc2 Jul 29, 2025
2c6b9c9
Get more verbose output
marcleblanc2 Jul 29, 2025
5531a16
Try to force interactive
marcleblanc2 Jul 29, 2025
2a69c09
Try no shell, with --force-interactive
marcleblanc2 Jul 29, 2025
8d519b5
Whoops, this part too
marcleblanc2 Jul 29, 2025
874eb87
Try stderr, I guess
marcleblanc2 Jul 29, 2025
a0b040e
Try pexpect
marcleblanc2 Jul 29, 2025
51a3d82
Fix tuple type error
marcleblanc2 Jul 29, 2025
36ede3f
Try that syntax
marcleblanc2 Jul 29, 2025
5fb776c
Or this shit?
marcleblanc2 Jul 29, 2025
8d790c3
Are you happy now??
marcleblanc2 Jul 29, 2025
f49a6ce
Why byte string and no match??
marcleblanc2 Jul 29, 2025
05246ed
Ugh
marcleblanc2 Jul 29, 2025
5ae129e
Come on man
marcleblanc2 Jul 29, 2025
b825e5e
why not
marcleblanc2 Jul 29, 2025
f852a09
sure
marcleblanc2 Jul 29, 2025
e0b5285
Try pysvn
marcleblanc2 Jul 29, 2025
da0b696
Try to fix TypeError for password
marcleblanc2 Jul 29, 2025
b8e261f
Fix pysvn info command
marcleblanc2 Jul 29, 2025
a08ddec
Fix cmd_svn_info error
marcleblanc2 Jul 29, 2025
948d371
Output python config on startup
marcleblanc2 Jul 29, 2025
69561fe
Seems to be working better with pysvn than svn cli
marcleblanc2 Jul 29, 2025
e9cf933
Fix syntax issues
marcleblanc2 Jul 29, 2025
8ebc0dd
Fix KeyError
marcleblanc2 Jul 29, 2025
2de418d
Fix git svn init --layout std
marcleblanc2 Jul 29, 2025
1b96ab7
Prep for v0.5.0 release
marcleblanc2 Jul 30, 2025
285c00d
Tidy up repo
marcleblanc2 Jul 30, 2025
9698c46
pop remaining_output when retrying a git svn fetch
marcleblanc2 Jul 30, 2025
2 changes: 2 additions & 0 deletions .github/workflows/github-actions-podman-build.sh
@@ -31,6 +31,7 @@ platform_architecture="linux/amd64"
declare -a env_vars=(
"BUILD_BRANCH"
"BUILD_COMMIT"
"BUILD_COMMIT_MESSAGE"
"BUILD_DATE"
"BUILD_TAG"
)
@@ -44,6 +45,7 @@ declare -a image_tags=(
# Fill in env vars
BUILD_BRANCH="$(git rev-parse --abbrev-ref HEAD | sed 's/[^a-zA-Z0-9]/-/g' )" # Somehow turned out to be HEAD on a tag build???
BUILD_COMMIT="$(git rev-parse --short HEAD)"
BUILD_COMMIT_MESSAGE="$(git log -1 --pretty=%B)"
BUILD_DATE="$(date -u +'%Y-%m-%dT%H:%M:%SZ')"
BUILD_TAG="$(git tag --points-at HEAD)"
LATEST_TAG="latest"
8 changes: 4 additions & 4 deletions .gitignore
@@ -1,9 +1,9 @@
# Sourcegraph
*.code-workspace
config/cloud-agent*
config/config.yaml
config/repos-to-convert*
config/service-account-key.json
config/cloud-agent/*
!config/cloud-agent/*example*
config/repo-converter/*
!config/repo-converter/*example*
dev/stats/repos.txt
logs/
notes/
3 changes: 1 addition & 2 deletions AGENT.md
@@ -72,7 +72,6 @@ Last Changed Date: 2024-10-23 13:00:39 +0000 (Wed... <- The date of the last
- The `--log-window-size` is too large, and the request times out / gets dropped after 10 minutes of processing time
- The `--log-window-size` is too small, which multiplies the number of requests required to convert the repo, and every new request has its own possibility of timing out

5. The `_calculate_batch_revisions`, then `_git_svn_fetch` functions are only ever called after the `_check_if_repo_already_up_to_date` function concludes that the local git repo is behind the remote svn server, based on the "Last Changed Rev" response to the `svn info` command above
- Therefore, any `git svn fetch` execution which doesn't return any lines of output is considered a failure, even if the return_code is 0
5. `git svn fetch` is only called after determining that the local clone is out of date, so any `git svn fetch` execution which doesn't return any lines of output is considered a failure, even if the return_code is 0
- Therefore, we cannot trust the return code as an indicator of task success
- We must determine task success based on data in the local git repo, and the lines in stdout from the `git svn fetch` command
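The rule above can be sketched as a small check. This is only an illustration, not code from this repo: `fetch_succeeded` and its parameters are hypothetical names, and it assumes the caller has already read the latest converted revision from the local git repo before and after the fetch.

```python
from typing import Sequence


def fetch_succeeded(
    return_code: int,
    stdout_lines: Sequence[str],
    rev_before: int,
    rev_after: int,
) -> bool:
    """Hypothetical check: decide whether a `git svn fetch` run made progress.

    The return code alone is not trusted. A run only counts as a success
    if it exited cleanly, printed at least one non-blank line, and the
    local repo's latest converted revision advanced.
    """
    if return_code != 0:
        return False

    # A fetch that was expected to find new revisions but printed nothing
    # is treated as a failure, even with return code 0
    has_output = any(line.strip() for line in stdout_lines)
    if not has_output:
        return False

    # Confirm progress from data in the local git repo
    return rev_after > rev_before
```

For example, `fetch_succeeded(0, [], 100, 100)` is `False` even though the return code is 0, which matches the behaviour described in the list item.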
158 changes: 23 additions & 135 deletions README.md
@@ -9,11 +9,11 @@ This repo was created for Sourcegraph Implementation Engineering deployments, an
- Sourcegraph was built with Git-native support, but customers have a variety of version control systems
- Sourcegraph has integrated the p4-fusion FOSS project into the product to support Perforce more directly
- Other version control systems are left up to the customer to convert to Git
- This project builds a framework to convert repos from other VCS to Git
- This project builds a framework to convert repos from other VCSes to Git

## Deployment

For Sourcegraph Cloud customers, they'll need to run the repo-converter, src serve-git, and the Sourcegraph Cloud Private Access Agent on a container platform with connectivity to both their code hosts, and their Sourcegraph Cloud instance. This can be done quite securely, as the src serve-git API endpoint does not need any ports exposed outside of the container network. Running src serve-git and the agent together on the same container network allows the agent to use the container platform's local DNS service to reach src serve-git, and prevents src serve-git's unauthenticated HTTP endpoint from needing to be opened outside of the container network.
For Sourcegraph Cloud customers, they'll need to run the repo-converter, src serve-git, and the Sourcegraph Cloud Private Access Agent on a container platform with connectivity to both their Sourcegraph Cloud instance, and their code hosts. This can be done quite securely, as the src serve-git API endpoint does not need any ports exposed outside of the container network. Running src serve-git and the agent together on the same container network allows the agent to use the container platform's local DNS service to reach src serve-git, and prevents src serve-git's unauthenticated HTTP endpoint from needing to be opened outside of the container network.

For Self-hosted Sourcegraph customers, they'll need to run the repo-converter and src serve-git together in a location that can reach their code hosts, and where their Sourcegraph instance can reach the src serve-git API.

@@ -34,57 +34,36 @@ Deploying via containers allows for easier upgrades, troubleshooting, monitoring
3. SSD with low latency random writes
2. 2x the original repos' sum total size
4. CPU
1. The container runs a separate repo conversion process for each repo, in parallel, so maximum performance during the initial conversion process can be achieved with at least 1 thread or core for each repo in scope for conversion, plus threads for overhead
2. Repo conversion speed is more I/O-bound than CPU or memory
1. The container runs a separate repo conversion process for each repo, in parallel, so maximum performance during the initial conversion process can be achieved with at least 1 thread or core for each repo in scope for conversion
2. Repo conversion speed is more network-bound than CPU or memory
5. Memory
1. ~ 1 GB / repo to be converted in parallel
2. Depends on the size of the largest commit
3. `run.py` doesn't handle the repo content; this is handled by the git and subversion CLIs
1. ~1 GB / repo to be converted in parallel
2. Depends on the size of the largest commit, as `git svn fetch` seems to hold entire commits' contents in memory, and the number of parallel jobs
2. Code host
1. Subversion
1. HTTP(S)
2. Username and password for a user account that has read access to the needed repos
3. Support for SSH authentication hasn't been built, but could just be a matter of mounting the key, and not providing a username / password
2. TFVC (Microsoft Team Foundation Version Control)
1. Future, depending on availability of third party TFVC API clients
1. Future, depending on availability of TFVC API clients
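
The sizing rules of thumb in the requirements list can be turned into a rough first estimate. The sketch below is an assumption-based illustration: `HostEstimate` and `estimate_host_size` are hypothetical names, not part of this repo, and the memory figure ignores the caveat that real usage depends on the size of the largest commit.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class HostEstimate:
    cpu_threads: int
    memory_gb: float
    disk_gb: float


def estimate_host_size(repo_sizes_gb: Sequence[float], parallel_jobs: int) -> HostEstimate:
    """Rough starting point only, using the rules of thumb listed above."""
    if parallel_jobs < 1:
        raise ValueError("parallel_jobs must be at least 1")

    return HostEstimate(
        cpu_threads=parallel_jobs,         # 1 thread or core per parallel conversion
        memory_gb=1.0 * parallel_jobs,     # ~1 GB per parallel conversion
        disk_gb=2.0 * sum(repo_sizes_gb),  # 2x the original repos' total size
    )
```

Treat the result as a floor to start from, then adjust after watching the first few conversions.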

## Setup with Sourcegraph Cloud - Sourcegraph Staff Only

1. Add the needed entries to the sourcegraphConnect targetGroups list in the Cloud instance's config.yaml, and get your PR approved and merged
```yaml
- dnsName: src-serve-git-ubuntu.local
listeningAddress: 100.100.100.0
name: src-serve-git-ubuntu-local
ports:
- 443
- dnsName: src-serve-git-wsl.local
listeningAddress: 100.100.100.1
name: src-serve-git-wsl-local
ports:
- 443
```
1. Follow the SG Cloud team's documentation to add the needed entries to the sourcegraphConnect targetGroups list in the Cloud instance's config.yaml, and get your PR approved and merged
2. Clone this repo to a VM on the customer's network, and either install Docker and Docker's Compose plugin, or connect to a container platform
3. Copy the `config.yaml` and `service-account-key.json` files using the instructions on the instance's Cloud Ops dashboard
- Paste them into `./config/cloud-agent-config.yaml` and `./config/cloud-agent-service-account-key.json`
4. Modify the contents of the `./config/cloud-agent-config.yaml` file:
- `serviceAccountKeyFile: /sg/cloud-agent-service-account-key.json` so that the Go binary inside the agent container finds this file in the path that's mapped via the docker-compose.yaml files
- Save them in the `./config/cloud-agent/` directory
4. Modify the contents of the `./config/cloud-agent/config.yaml` file:
- `serviceAccountKeyFile: /sg/config/service-account-key.json` so the Go binary inside the agent container finds this file in the path as it's mapped via the docker-compose.yaml files
- Only include the `- dialAddress` entries that this cloud agent instance can reach, remove the others, so the Cloud instance doesn't try using this agent instance for code hosts it can't reach
- Use extra caution when pasting the config.yaml in Windows, as it may use Windows' line endings or extra spaces, which breaks YAML, as a whitespace-dependent format
- Use extra caution when pasting the `config.yaml` file in Windows, as it may use Windows' line endings or extra spaces, which breaks YAML, as a whitespace-dependent format
5. Run `docker compose up -d`
6. Add a Code Host config to the customer's Cloud instance
- Type: src serve-git
- `"url": "http://src-serve-git-ubuntu.local:443",`
- or
- `"url": "http://src-serve-git-wsl.local:443",`
- The url is the name of the container, ex.
- `"url": "http://src-serve-git-ubuntu.local:443",`
- `"url": "http://src-serve-git-wsl.local:443",`
- Note the port 443, even when used with http://
7. Use the repo-converter to convert SVN, TFVC, or Git repos, to Git format, which will store them in the `src-serve-root` directory, or use any other means to get the repos into the directory
- There are docker-compose.yaml and override files in a few different directories in this repo, separated by use case, so that each use case only needs to run `docker compose up -d` in one directory, and not fuss around with `-f` paths.
- The only difference between the docker-compose-override.yaml files in host-ubuntu vs host-wsl is the src-serve-git container's name, which is how we get a separate `dnsName` for each.
- If you're using the repo-converter:
- If you're using the pre-built images, `cd ./deploy && docker compose up -d`
- If you're building the Docker images, `cd ./build && docker compose up -d --build`
- Either of these will start all 3 containers: cloud-agent, src-serve-git, and the repo-converter

7. Use the repo-converter to convert SVN, ~~TFVC, or Git repos,~~ to Git format, which will store them in the `../src-serve-root` directory, or use any other means to get the repos into the directory
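
Step 4 warns that pasting `config.yaml` on Windows can introduce Windows line endings or stray whitespace. A quick pre-flight check might look like the sketch below. It is not part of this repo; `find_yaml_whitespace_problems` is a hypothetical helper, and it only flags the whitespace issues named in that step, without parsing the YAML itself.

```python
from typing import List


def find_yaml_whitespace_problems(raw: bytes) -> List[str]:
    """Report whitespace issues that commonly appear after pasting on Windows."""
    problems = []
    for number, line in enumerate(raw.split(b"\n"), start=1):
        if line.endswith(b"\r"):
            problems.append(f"line {number}: Windows (CRLF) line ending")
            line = line[:-1]
        # Leading whitespace of the line, used to spot tabs in indentation
        indent = line[: len(line) - len(line.lstrip())]
        if b"\t" in indent:
            problems.append(f"line {number}: tab character in indentation")
        if line != line.rstrip():
            problems.append(f"line {number}: trailing whitespace")
    return problems
```

Read the file in binary mode (`open(path, "rb").read()`) before passing it in, so the line endings are not normalized before the check runs.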

## Configuration

@@ -94,116 +73,25 @@ Deploying via containers allows for easier upgrades, troubleshooting, monitoring
- See `./src/config/load_env.py` for the list of environment variables, their data types, and default values
- See `./src/config/validate_env.py` for any validation rules

### ./config/repos-to-convert.yaml
### repos-to-convert.yaml

- The contents of this file can be changed while the container is running, and the current version will be read at the start of each main loop in main.py
- Note, the syntax in the below examples is quite out of date, but the explanations of each may still be useful
- See `./config/example-repos-to-convert.yaml` for an example of the config layout
- See `./config/repo-converter/repos-to-convert-example.yaml` for an example of the config layout
- See `./src/config/load_repos.py` for the list of config keys
- TODO: Move the config schema to a separate file, and read it into the code

```YAML
xmlbeans:
# Usage: This key is used as the converted Git repo's name
# Required: Yes
# Format: String of YAML / git / filepath / URL-safe characters [A-Za-z0-9_-.]
# Default if unspecified: Invalid

type: SVN
# Usage: The type of repo to be converted, which determines the code path, binaries, and options used
# Required: Yes
# Format: String
# Options: SVN, TFVC
# Default if unspecified: Invalid

svn-repo-code-root: https://svn.apache.org/repos/asf/xmlbeans
# Usage: The root of the Subversion repo to be converted to a Git repo, thus the root of the Git repo
# Required: Yes
# Format: URL
# Default if unspecified: Invalid

code-host-name: svn.apache.org
git-org-name: asf
# Usage: The Sourcegraph UI shows users the repo path as code-host-name/git-org-name/repo-name for ease of navigation, and the repos are stored on disk in the same tree structure
# Required: Yes; this hasn't been tested without it, but it's highly encouraged for easier user navigation
# Format: String of filepath / URL-safe characters [A-Za-z0-9_-.]
# Default if unspecified: Empty

username: super_secret_username
password: super_secret_password
# Usage: Username and password to authenticate to the code host
# Required: If code host requires authentication
# Format: String
# Default if unspecified: Empty

fetch-batch-size: 100
# Usage: Number of Subversion changesets to try converting each batch; configure a higher number for initial cloning and for repos which get more than 100 changesets per REPO_CONVERTER_INTERVAL_SECONDS
# Required: No
# Format: Int > 0
# Default if unspecified: 100

git-default-branch: main
# Usage: Sets the name of the default branch in the resulting git repo; this is the branch that Sourcegraph users will see first, and will be indexed by default
# Required: No
# Format: String, git branch name
# Default if unspecified: main

layout: standard
trunk: trunk
branches: branches
tags: tags
# Usage: Match these to your Subversion repo's directory layout.
# Use `layout: standard` by default when trunk, branches, and tags are all top level directories in the repo root
# Or, specify the relative paths to these directories from the repo root
# These values are just passed to the subversion CLI as command args
# Required: Either layout or trunk, branches, tags
# Formats:
# trunk: String
# branches: String, or list of strings
# tags: String, or list of strings
# Default if unspecified: layout:standard

git-ignore-file-path: /path/mounted/inside/container/to/.gitignore
authors-file-path: /path/mounted/inside/container/to/authors-file-path
authors-prog-path: /path/mounted/inside/container/to/authors-prog-path
# Usage: If you need to use .gitignore, an author's file, or an author's program in the repo conversion, then mount them as a volume to the container, and provide the in-container paths here
# Required: No
# Format: String, file path
# Default if unspecified: empty

bare-clone: true
# Usage: If you need to keep a checked out working copy of the latest commit on disk for debugging purposes, set this to false
# Required: No
# Format: String
# Options: true, false
# Default if unspecified: true
```
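
The first bullet of this section says the file is re-read at the start of each main loop, so edits take effect while the container is running. A minimal sketch of that pattern follows. It is an illustration under assumptions, not the code in main.py: `main_loop`, `load_config`, and `run_cycle` are hypothetical names, and `max_cycles` only exists here to make the loop easy to stop.

```python
import time
from typing import Any, Callable, Dict, Optional


def main_loop(
    load_config: Callable[[], Dict[str, Any]],
    run_cycle: Callable[[Dict[str, Any]], None],
    interval_seconds: float,
    max_cycles: Optional[int] = None,
) -> int:
    """Re-read the config at the start of every cycle, then run one cycle."""
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        # Loading inside the loop is what lets edits to the file take
        # effect without restarting the container
        config = load_config()
        run_cycle(config)
        cycles += 1
        time.sleep(interval_seconds)
    return cycles
```

Loading once before the loop would be simpler, but then a changed repo list would only be picked up after a restart.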

## Performance

1. The default interval and batch size are set for sane polling for new repo commits during regular operations, but would be quite slow for initial cloning
2. For initial cloning, adjust:
1. The `REPO_CONVERTER_INTERVAL_SECONDS` environment variable
1. This is the outer loop interval, how often `run.py` will check if a conversion task is already running for the repo, and start one if not already running
1. This is the outer loop interval, how often the service will start a repo conversion job for each repo, if one is not already running
2. Thus, the longest break between two batches would be the length of this interval
3. Try 60 seconds, and adjust based on your source code host performance load
2. The `fetch-batch-size` config for each repo in the `./config/repos-to-convert.yaml` file
1. This is the number of commits the converter will try and convert in each execution. Larger batches can be more efficient as there are fewer breaks between intervals and less batch handling, however, if a batch fails, then it may need to retry a larger batch
2. Try 1000 for larger repos, and adjust for each repo as needed

```YAML
# docker-compose.yaml
services:
repo-converter:
environment:
- REPO_CONVERTER_INTERVAL_SECONDS=60
```

```YAML
# config/repos-to-convert.yaml
allura:
fetch-batch-size: 1000
```
3. Try 60 seconds, and adjust based on your source code host's performance
2. The `MAX_CONCURRENT_CONVERSIONS_PER_SERVER` and `MAX_CONCURRENT_CONVERSIONS_GLOBAL` environment variables
1. These are the maximum number of concurrent / parallel repo conversion jobs which can be run, per source code host, and total for this service
2. The defaults are 10 each, so if you're converting repos from two Subversion servers at the same time, a maximum of 10 jobs can run in parallel, from either server
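
How the two limits interact can be sketched as a single admission check. This is an assumption-based illustration, not the service's actual scheduler: `can_start_job` and `running_by_server` are hypothetical names, and the defaults of 10 are taken from the text above.

```python
from typing import Mapping


def can_start_job(
    running_by_server: Mapping[str, int],
    server: str,
    max_per_server: int = 10,
    max_global: int = 10,
) -> bool:
    """Return True if one more job for `server` fits under both limits."""
    # The global limit counts jobs across all source code hosts
    total_running = sum(running_by_server.values())
    if total_running >= max_global:
        return False
    # The per-server limit only counts jobs for this one host
    return running_by_server.get(server, 0) < max_per_server
```

With the defaults, two servers running 6 and 4 jobs have reached the global limit, so `can_start_job({"svn-a": 6, "svn-b": 4}, "svn-b")` is `False` even though neither server has reached its own limit.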

## Contributions

9 changes: 5 additions & 4 deletions build/Dockerfile
@@ -37,6 +37,7 @@ RUN apt-get update && \
python3 \
python3-dev \
python3-pip \
python3-svn \
python3-wheel \
subversion \
systemctl \
@@ -78,6 +79,10 @@ LABEL \
org.opencontainers.image.url="https://github.com/sourcegraph/repo-converter/pkgs/container/repo-converter" \
org.opencontainers.image.vendor="Sourcegraph"

# Create the user, with a home directory
RUN groupadd sourcegraph --gid 10002 && \
useradd sourcegraph --uid 10001 --gid sourcegraph --create-home --home-dir /home/sourcegraph

# Copy the source code into the image
# The contents of this dir will change most builds
COPY src/ src/
@@ -86,10 +91,6 @@ COPY src/ src/
# The contents of this file changes every build
COPY build/.env build/.env

# Create the user, with a home directory
RUN groupadd sourcegraph --gid 10002 && \
useradd sourcegraph --uid 10001 --gid sourcegraph --create-home --home-dir /home/sourcegraph

# Give ownership of the whole /sg dir and all its contents to the new user
RUN chown -R sourcegraph:sourcegraph /sg

15 changes: 10 additions & 5 deletions build/build.sh
@@ -87,9 +87,6 @@ ENV_FILE=".env"
echo "Environment variables:"
cat "$ENV_FILE"

# Run the build
echo "Running podman build"

# If an m is passed in the args
if [[ "$1" == *"m"* ]]
then
@@ -100,12 +97,19 @@ then
podman machine stop
# Start it as a background process
# and disown it, so it continues to run after this script ends
podman machine start & disown
# The disown doesn't seem to be working
# podman machine start & disown
nohup podman machine start >/dev/null 2>&1 &
# But give it some time to start up
sleep 20
sleep_time=20
echo "Giving podman VM $sleep_time seconds to start up"
sleep $sleep_time

fi

# Run the build
echo "Running podman build"

podman build \
--file ./Dockerfile \
--format docker \
@@ -123,6 +127,7 @@ then
# because podman-compose can't figure this out on its own
echo "Stopping old containers"
podman-compose down
podman network rm build_default -f

# Pull the latest tags of the other images
if [[ "$1" == *"p"* ]]
10 changes: 4 additions & 6 deletions build/docker-compose.yaml
@@ -2,22 +2,20 @@
services:

cloud-agent:
command: ["-config=/sg/config/config.yaml"]
container_name: cloud-agent
command: ["-config=/sg/cloud-agent-config.yaml"]
image: index.docker.io/sourcegraph/src-tunnel-agent:latest
networks:
- default
restart: always
volumes:
- ../config/cloud-agent-service-account-key.json:/sg/cloud-agent-service-account-key.json:ro
- ../config/cloud-agent-config.yaml:/sg/cloud-agent-config.yaml:ro
- ../config/cloud-agent/:/sg/config:ro

repo-converter:
container_name: repo-converter
environment:
- CONCURRENCY_MONITOR_INTERVAL=300
- CONCURRENCY_MONITOR_INTERVAL=30
- LOG_LEVEL=DEBUG # DEBUG INFO WARNING ERROR CRITICAL # Default is INFO
# - LOG_RECENT_COMMITS=5
- MAX_CONCURRENT_CONVERSIONS_GLOBAL=20
- MAX_CONCURRENT_CONVERSIONS_PER_SERVER=5
# - MAX_CYCLES=5
@@ -28,8 +26,8 @@ services:
user: "10001:10002"
userns_mode: "keep-id:uid=10001,gid=10002"
volumes:
- ../config/repo-converter:/sg/config:ro
- ../dev/toprc:/home/sourcegraph/.config/procps/toprc:z # `top` config file
- ../config/repos-to-convert.yaml:/sg/repos-to-convert.yaml:ro
- ../src-serve-root/:/sg/src-serve-root:z

src-serve-git:
@@ -1 +1,2 @@
# Paste the config.yaml contents from the Cloud Ops dashboard here
# Paste the config.yaml contents from the Cloud Ops dashboard here
# Rename to config.yaml
@@ -1 +1,2 @@
// Paste the config.yaml contents from the Cloud Ops dashboard here
// Paste the config.yaml contents from the Cloud Ops dashboard here
// Rename to service-account-key.json