
feat/add scaleway backend #3

Open

arthemide wants to merge 10 commits into teilomillet:main from arthemide:feat/add-scaleway-backend

Conversation

@arthemide (Contributor) commented Apr 27, 2026

Summary by Sourcery

Add a new Scaleway backend that provisions GPU instances via Terraform, exposes training over HTTP, and integrates with retrain as a remote TrainHelper.

New Features:

  • Introduce a ScalewayTrainHelper backend that drives remote training and sampling on a Scaleway GPU instance via vLLM or SGLang.
  • Add a FastAPI-based training server and TerraformRunner to provision, manage, and tear down Scaleway GPU instances for training.
  • Provide a SROIE receipt-extraction campaign and dataset integration, including a custom reward function and campaign config.

Enhancements:

  • Register the Scaleway backend in backend_definitions with configurable options and backend capabilities.
  • Extend project configuration to include a scaleway extras group, a retrain-training-server CLI entrypoint, and updated dev dependencies and Makefile targets for dev setup and tests.

Build:

  • Add Terraform configuration, Makefile helpers, and associated infrastructure files for managing Scaleway GPU instances.

Documentation:

  • Document the Scaleway backend design and its implementation status in new markdown files.

Tests:

  • Add a backend contract test for the Scaleway backend using a fake ScalewayTrainHelper to validate option forwarding and lifecycle behavior.

@sourcery-ai bot commented Apr 27, 2026

Reviewer's Guide

Adds a new Scaleway GPU backend that provisions remote inference/training infrastructure via Terraform, exposes it through a FastAPI training server, wires it into the existing backend system with contract tests, and introduces an SROIE sample campaign plus supporting docs and dev tooling updates.

Sequence diagram for Scaleway backend training lifecycle

sequenceDiagram
  actor User
  participant RetrainCLI
  participant ScalewayTrainHelper
  participant TerraformRunner
  participant TerraformCLI
  participant GPUInstance
  participant InferenceEngine
  participant TrainingServer

  User->>RetrainCLI: run retrain with backend=scaleway
  RetrainCLI->>ScalewayTrainHelper: instantiate with TrainConfig

  ScalewayTrainHelper->>TerraformRunner: new TerraformRunner(...)
  ScalewayTrainHelper->>TerraformRunner: apply()
  TerraformRunner->>TerraformCLI: terraform init/apply
  TerraformCLI-->>GPUInstance: provision GPU instance
  GPUInstance-->>TerraformCLI: outputs inference_url, training_url
  TerraformCLI-->>TerraformRunner: apply success
  TerraformRunner-->>ScalewayTrainHelper: inference_url, training_url

  ScalewayTrainHelper->>InferenceEngine: GET /health (poll)
  ScalewayTrainHelper->>TrainingServer: GET /health (poll)
  InferenceEngine-->>ScalewayTrainHelper: 200 ok
  TrainingServer-->>ScalewayTrainHelper: 200 ok

  loop sampling
    RetrainCLI->>ScalewayTrainHelper: sample(prompt_ids_list,...)
    ScalewayTrainHelper->>InferenceEngine: POST /v1/chat/completions
    InferenceEngine-->>ScalewayTrainHelper: token_ids, logprobs
    ScalewayTrainHelper-->>RetrainCLI: SampleBatch
  end

  loop training steps
    RetrainCLI->>ScalewayTrainHelper: train_step(tokens, logprobs, advantages,...)
    ScalewayTrainHelper->>TrainingServer: POST /train_step
    TrainingServer->>TrainingServer: LocalTrainHelper.train_step
    TrainingServer-->>ScalewayTrainHelper: {loss}
    ScalewayTrainHelper-->>RetrainCLI: loss
  end

  RetrainCLI->>ScalewayTrainHelper: checkpoint(name)
  ScalewayTrainHelper->>TrainingServer: POST /checkpoint
  TrainingServer->>LocalTrainHelper: checkpoint
  TrainingServer->>InferenceEngine: POST /v1/load_lora_adapter or /add_lora
  InferenceEngine-->>TrainingServer: ok

  User->>RetrainCLI: finish run
  RetrainCLI->>ScalewayTrainHelper: close() / __exit__
  ScalewayTrainHelper->>TerraformRunner: destroy()
  TerraformRunner->>TerraformCLI: terraform destroy
  TerraformCLI-->>GPUInstance: tear down instance

Class diagram for Scaleway backend and SROIE components

classDiagram
  class ScalewayTrainHelper {
    -str _model
    -int _lora_rank
    -str _inference_engine
    -int _health_timeout_s
    -float _health_poll_s
    -TerraformRunner _runner
    -str _inference_url
    -str _training_url
    -httpx.Client _client
    +ScalewayTrainHelper(model, lora_rank, gpu_type, zone, inference_engine, health_timeout_s, health_poll_s, max_model_len, state_dir)
    +checkpoint(name str) void
    +sample(prompt_ids_list list~list~int~~, num_samples int, max_tokens int, temperature float, top_p float) SampleBatch
    +sample_with_entropy(prompt_ids_list list~list~int~~, num_samples int, max_tokens int, temperature float, top_p float) EnrichedSampleBatch
    +train_step(all_tokens list~list~int~~, all_logprobs list~list~float~~, all_advantages list~list~float~~, lr float, weight_decay float) float
    +save_adapter(path str, name str) str
    +load_state(name str) void
    +close() void
    +__enter__() ScalewayTrainHelper
    +__exit__() void
    +_wait_healthy() void
    +_sample_one(prompt str, max_tokens int, temperature float, top_p float) tuple~list~int~~ list~float~~
    +_post_training(path str, body dict) dict
    +_get_tokenizer() object
    +_safe_extractall(tar TarFile, dest Path) void
  }

  class TerraformRunner {
    -str _zone
    -str _instance_type
    -str _model
    -int _lora_rank
    -str _inference_engine
    -str _caller_ip
    -int _max_model_len
    -Path _tf_dir
    -Path _state_dir
    -Path state_file
    -bool _applied
    +TerraformRunner(zone str, gpu_type str, model str, lora_rank int, inference_engine str, caller_ip str, max_model_len int, tf_dir Path, state_dir Path)
    +apply() tuple~str str~
    +destroy() void
    +_state_args() list~str~
    +_var_args() list~str~
    +_run_tf(args str) void
    +_read_outputs() dict
  }

  class TerraformError {
    <<exception>>
  }

  class ServerState {
    +object helper
    +str inference_url
    +str inference_engine
    +str adapter_base
  }

  class TrainingServerApp {
    +create_app(state ServerState) FastAPI
    +health(request Request) dict~str str~
    +train_step(req TrainStepRequest, request Request) dict~str float~
    +checkpoint(req CheckpointRequest, request Request) dict
    +save_adapter(req SaveAdapterRequest, request Request, background_tasks BackgroundTasks) StreamingResponse
    +load_state(req LoadStateRequest, request Request) dict
    +_reload_lora_on_inference(state ServerState, name str) void
    +main() void
  }

  class TrainStepRequest {
    +list~list~int~~ tokens
    +list~list~float~~ logprobs
    +list~list~float~~ advantages
    +float lr
    +float weight_decay
  }

  class CheckpointRequest {
    +str name
  }

  class SaveAdapterRequest {
    +str name
  }

  class LoadStateRequest {
    +str name
  }

  class SROIEDataSource {
    +int max_examples
    +SROIEDataSource(max_examples int)
    +load() list~Example~
  }

  class SROIEReward {
    +score(response str, reference str) float
  }

  class Example {
    +list~dict~ prompt
    +str reference
    +str task
    +int example_id
  }

  class LocalTrainHelper {
    +LocalTrainHelper(model str, adapter_path str, device str, lora_rank int, engine_type str)
    +train_step(tokens list~list~int~~, logprobs list~list~float~~, advantages list~list~float~~, lr float, weight_decay float) float
    +checkpoint(name str) void
    +save_adapter(path str, name str) str
    +load_state(name str) void
  }

  ScalewayTrainHelper --> TerraformRunner : uses
  TerraformRunner --> TerraformError : raises
  TrainingServerApp --> ServerState : holds
  ServerState --> LocalTrainHelper : helper
  TrainingServerApp --> TrainStepRequest : uses
  TrainingServerApp --> CheckpointRequest : uses
  TrainingServerApp --> SaveAdapterRequest : uses
  TrainingServerApp --> LoadStateRequest : uses
  SROIEDataSource --> Example : produces
  SROIEReward --> Example : consumes reference

File-Level Changes

Introduce a ScalewayTrainHelper backend and register it in the backend registry with a full option schema and capabilities.
  • Add _create_scaleway factory that imports ScalewayTrainHelper, normalizes backend options, and sets a Terraform state dir under the run log_dir.
  • Register a new 'scaleway' BackendDefinition with capability flags and a typed option_schema (gpu_type, zone, inference_engine, health timeouts, max_model_len).
retrain/backend_definitions.py
Implement ScalewayTrainHelper that provisions/destroys a Scaleway GPU instance via Terraform, delegates sampling to a remote inference engine, and training to a remote FastAPI server.
  • Create ScalewayTrainHelper that constructs a TerraformRunner, applies Terraform, obtains inference/training URLs, and waits for both health endpoints before use.
  • Implement TrainHelper-like methods (checkpoint, sample, sample_with_entropy, train_step, save_adapter, load_state) that call the remote HTTP APIs, including safe tar extraction of adapters and lazy tokenizer loading via transformers.
  • Ensure deterministic teardown by calling TerraformRunner.destroy() and closing the httpx client in close()/__exit__, with a warning in __del__ if the instance was not explicitly closed.
retrain/scaleway_backend.py
retrain/scaleway/terraform_runner.py
Add a FastAPI-based training server that exposes LocalTrainHelper over HTTP and reloads LoRA adapters on the inference engine.
  • Define a ServerState dataclass and FastAPI app that stores a LocalTrainHelper plus inference endpoint metadata, with a /health endpoint for readiness checks.
  • Expose /train_step, /checkpoint, /save_adapter, and /load_state endpoints that mirror the TrainHelper protocol and stream tarred adapters as responses.
  • Implement LoRA reload logic that calls either vLLM’s /v1/load_lora_adapter or SGLang’s /add_lora endpoint depending on the configured inference_engine, and provide a CLI entrypoint (main) wired to uvicorn for running the server on the GPU instance.
retrain/scaleway/training_server.py
Create Terraform-based infrastructure definitions and helper tooling to spin up a single Scaleway GPU instance for inference and training.
  • Add Terraform variables, main, outputs, and versions files defining a VPC, security group, flexible IP, and GPU instance wired with restricted ports 8000/8001 and cloud-init bootstrap.
  • Provide a cloud-init.yaml that installs a Python venv, installs retrain[local,scaleway] plus vLLM or SGLang, and sets up systemd services for the inference engine and training server.
  • Add a Terraform Makefile with convenience targets for Scaleway server lifecycle and sops-based secrets management, and encapsulate GPU type mapping, caller IP detection, and Terraform invocation in TerraformRunner.
retrain/scaleway/terraform/main.tf
retrain/scaleway/terraform/variables.tf
retrain/scaleway/terraform/outputs.tf
retrain/scaleway/terraform/versions.tf
retrain/scaleway/terraform/cloud-init.yaml
retrain/scaleway/terraform/Makefile
retrain/scaleway/terraform_runner.py
Extend tests to cover the Scaleway backend contract and ensure option forwarding and lifecycle behavior.
  • Introduce a _FakeScalewayTrainHelper capturing init kwargs and patch retrain.scaleway_backend with it in the test.
  • Verify that backend.create('scaleway', cfg) constructs the helper with both explicit and default options (gpu_type, zone, inference_engine, health timeouts, max_model_len, model, lora_rank, state_dir) and that the standard lifecycle (checkpoint/sample/train_step) is exercised via helper methods.
tests/test_backend_contract.py
Add an SROIE receipt extraction dataset, custom reward, and sample RLVR campaign configured to run on the Scaleway backend.
  • Implement SROIEDataSource that loads the darentang/sroie dataset from HuggingFace, builds system/user prompts for structured JSON extraction, and returns retrain Example objects.
  • Provide a score() function computing a simple field-wise accuracy reward over company/date/address/total, robust to markdown-wrapped JSON and minor formatting differences (see the sketch below).
  • Add campaigns/sroie.toml that configures the scaleway backend (gpu_type, zone, model, lora_rank, inference_engine), RLVR algorithm parameters, training hyperparameters, data source, custom reward module, and logging directory.
campaigns/sroie_dataset.py
campaigns/sroie.toml
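
For illustration, here is a minimal sketch of such a field-wise reward. This is not the PR's actual code: the exact field handling and normalization are assumptions based on the description above.

```python
import json
import re

_FIELDS = ("company", "date", "address", "total")


def score(response: str, reference: str) -> float:
    """Field-wise accuracy over the four SROIE fields (illustrative sketch)."""
    # Tolerate markdown-wrapped JSON: strip leading/trailing code fences.
    cleaned = re.sub(r"^`{3}(?:json)?\s*|\s*`{3}$", "", response.strip())
    try:
        predicted = json.loads(cleaned)
        expected = json.loads(reference)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    # One point per field whose normalized value matches the reference.
    hits = sum(
        str(predicted.get(f, "")).strip().lower() == str(expected.get(f, "")).strip().lower()
        for f in _FIELDS
    )
    return hits / len(_FIELDS)
```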
Update project configuration, dependencies, and dev tooling to support the Scaleway backend and training server.
  • Add a 'scaleway' optional dependency extra (httpx, fastapi, uvicorn, scaleway SDK) and a retrain-training-server console script entrypoint in pyproject.toml.
  • Include mojo and ordeal in core/dev dependencies and simplify/remove previous uv-specific dev dependency configuration.
  • Introduce a Makefile setup-dev target that installs dev extras via uv, and wire test-python and typecheck targets to depend on setup-dev so tests/typechecking run with the new dev dependencies.
  • Document the Scaleway backend design and implementation status in new markdown files, including open questions and validation checklist items.
pyproject.toml
Makefile
docs/scaleway-backend-design.md
docs/scaleway-implementation-status.md
.gitignore
uv.lock

arthemide changed the title from "Feat/add scaleway backend" to "wip: Feat/add scaleway backend" on Apr 27, 2026
@sourcery-ai bot left a comment

Hey - I've found 2 security issues, 4 other issues, and left some high-level feedback:

Security issues:

  • Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.quote()'. (link)
  • Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.quote()'. (link)

General comments:

  • In TerraformRunner._var_args you reference self._hf_token, but this attribute is never defined, so constructing the var args will currently raise an AttributeError; either initialize _hf_token in __init__ or drop the local variable and read directly from the environment.
  • Relying on ScalewayTrainHelper.__del__ to call TerraformRunner.destroy() makes teardown best-effort and GC-dependent; consider adding an explicit context manager or close method so callers can deterministically destroy the Scaleway instance (a minimal sketch follows).
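
A minimal sketch of that explicit-teardown pattern (attribute names such as `_runner` and `_client` follow the class diagram above; this is illustrative, not the PR's implementation):

```python
class ScalewayTrainHelper:
    def close(self) -> None:
        """Deterministically destroy the Scaleway instance and release the HTTP client."""
        runner = getattr(self, "_runner", None)
        if runner is not None:
            runner.destroy()  # terraform destroy of the GPU instance
        client = getattr(self, "_client", None)
        if client is not None:
            client.close()

    def __enter__(self) -> "ScalewayTrainHelper":
        return self

    def __exit__(self, *exc_info) -> None:
        self.close()
```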
## Individual Comments

### Comment 1
<location path="retrain/backend_definitions.py" line_range="205-210" />
<code_context>
     return None


+class ScalewayOptions(TypedDict):
+    gpu_type: str
+    zone: str
+    inference_engine: str
+    health_timeout_s: int
+    health_poll_s: float
+
+
</code_context>
<issue_to_address>
**issue:** ScalewayOptions TypedDict omits `max_model_len`, which is used in the backend options.

Since `option_schema` and `_create_scaleway` both use `max_model_len`, please add `max_model_len: int` to `ScalewayOptions` so the TypedDict matches the actual options and type checking remains accurate.
</issue_to_address>
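
Per this comment, the corrected TypedDict would simply gain the missing field (a sketch based on the code context above):

```python
from typing import TypedDict


class ScalewayOptions(TypedDict):
    gpu_type: str
    zone: str
    inference_engine: str
    health_timeout_s: int
    health_poll_s: float
    max_model_len: int  # added so the TypedDict matches option_schema and _create_scaleway
```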

### Comment 2
<location path="retrain/scaleway_backend.py" line_range="142-143" />
<code_context>
+        # Unpack the tar.gz returned by the server into local path/name/
+        local_dir = Path(path) / name
+        local_dir.mkdir(parents=True, exist_ok=True)
+        with tarfile.open(fileobj=io.BytesIO(resp.content), mode="r:gz") as tar:
+            tar.extractall(local_dir)
+        return str(local_dir)
+
</code_context>
<issue_to_address>
**🚨 issue (security):** Using `tar.extractall` directly can allow path traversal if the archive is compromised.

Even on a trusted training server, this remains a path traversal risk if the tar ever becomes attacker-controlled (e.g., future features or misconfigurations). Please validate each member path before extraction (e.g., resolve to an absolute path and ensure it stays under `local_dir`, rejecting `../` and absolute paths) or apply a standard safe-extract pattern.
</issue_to_address>

### Comment 3
<location path="docs/scaleway-backend-design.md" line_range="3" />
<code_context>
+# Scaleway Backend — Design Proposal
+
+**Statut** : RFC — en attente d'avis équipe  
+**Auteur** : Gireg Roussel  
+**Date** : 2026-04-24
</code_context>
<issue_to_address>
**issue (typo):** Fix the phrase « en attente d'avis équipe » by adding « de l' ».

Use « en attente d'avis de l'équipe » for a correct French formulation.

```suggestion
**Statut** : RFC — en attente d'avis de l'équipe  
```
</issue_to_address>

### Comment 4
<location path="docs/scaleway-backend-design.md" line_range="110" />
<code_context>
+```
+Training server                Inference engine
+      │                               │
+      │  sauvegarde poids LoRA        │
+      │──────────────────────────────→│ POST /v1/load_lora_adapter  (vLLM)
+      │                               │ POST /add_lora               (SGLang)
</code_context>
<issue_to_address>
**issue (typo):** Add a missing article to « sauvegarde poids LoRA ».

In the diagram, replace « sauvegarde poids LoRA » with « sauvegarde les poids LoRA » or « sauvegarde des poids LoRA » to fix the grammar.

```suggestion
      │  sauvegarde les poids LoRA    │
```
</issue_to_address>

### Comment 5
<location path="retrain/scaleway/terraform_runner.py" line_range="123-129" />
<code_context>
        result = subprocess.run(
            cmd,
            cwd=str(self._tf_dir),
            env=env,
            capture_output=False,
            text=True,
        )
</code_context>
<issue_to_address>
**security (python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.quote()'.

*Source: opengrep*
</issue_to_address>

### Comment 6
<location path="retrain/scaleway/terraform_runner.py" line_range="136-142" />
<code_context>
        result = subprocess.run(
            cmd,
            cwd=str(self._tf_dir),
            env=env,
            capture_output=True,
            text=True,
        )
</code_context>
<issue_to_address>
**security (python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.quote()'.

*Source: opengrep*
</issue_to_address>

arthemide changed the title from "wip: Feat/add scaleway backend" to "feat/add scaleway backend" on Apr 27, 2026
@teilomillet (Owner) commented:

@sourcery-ai review

@sourcery-ai bot left a comment

Hey - I've found 3 issues, and left some high-level feedback:

  • In TerraformRunner._var_args, using shlex.quote when building the -var=key=value arguments is incorrect for a list-based subprocess.run (there is no shell to interpret the quotes), so the quotes end up in the actual Terraform variable values; consider removing shlex.quote and either passing raw strings or using a -var-file instead (see the sketch after this list).
  • The ScalewayOptions TypedDict in backend_definitions.py is currently unused and the Scaleway backend option schema doesn’t cover all the fields shown in the example campaign (e.g. model, lora_rank under [backend.options]), so either wire this TypedDict into normalization and extend the schema or remove it and align the example config to only use validated backend options.
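
To illustrate the first point: with a list-based `subprocess.run`, each argv element reaches Terraform verbatim, so no quoting is needed (a sketch; the attribute names follow the class diagram, not the PR's exact code):

```python
def _var_args(self) -> list[str]:
    # shell=False + list argv: no shell ever parses these strings, so
    # shlex.quote would only embed literal quotes into the variable values.
    tf_vars = {
        "zone": self._zone,
        "instance_type": self._instance_type,
        "model": self._model,
        "lora_rank": self._lora_rank,
        "inference_engine": self._inference_engine,
        "caller_ip": self._caller_ip,
        "max_model_len": self._max_model_len,
    }
    return [f"-var={key}={value}" for key, value in tf_vars.items()]
```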
## Individual Comments

### Comment 1
<location path="retrain/scaleway_backend.py" line_range="216-217" />
<code_context>
+        token_ids: list[int] = []
+        logprobs: list[float] = []
+        lp_content = (choice.get("logprobs") or {}).get("content") or []
+        for entry in lp_content:
+            token_ids.append(entry["token_id"] if "token_id" in entry else entry.get("token", 0))
+            top = entry.get("top_logprobs") or []
+            logprobs.append(top[0]["logprob"] if top else 0.0)
</code_context>
<issue_to_address>
**issue (bug_risk):** Fallback to `entry["token"]` for token IDs is unsafe because that field is typically text, not an ID.

In vLLM’s OpenAI-compatible responses, `token_id` is the numeric ID while `token` is the string. Using `entry.get("token", 0)` risks inserting a string or `0` into `token_ids`, which breaks the expectation that this list only contains ints.

If you only support engines that return `token_id`, consider skipping entries without it (or raising):

```python
for entry in lp_content:
    if "token_id" not in entry:
        continue  # or raise
    token_ids.append(entry["token_id"])
```

If you must support responses without `token_id`, derive the ID via the tokenizer from `token` instead of appending the raw string or a default value.
</issue_to_address>

### Comment 2
<location path="retrain/scaleway_backend.py" line_range="222-229" />
<code_context>
+
+    @staticmethod
+    def _safe_extractall(tar: tarfile.TarFile, dest: Path) -> None:
+        dest_resolved = dest.resolve()
+        for member in tar.getmembers():
+            member_path = (dest / member.name).resolve()
+            if not str(member_path).startswith(str(dest_resolved) + "/") and member_path != dest_resolved:
+                raise ValueError(f"Unsafe tar path rejected: {member.name}")
+        tar.extractall(dest)
+
+    def _post_training(self, path: str, body: dict) -> dict:
</code_context>
<issue_to_address>
**🚨 suggestion (security):** Path traversal check in _safe_extractall is OS-path-sensitive and may be bypassed on Windows.

The check depends on `str(member_path).startswith(str(dest_resolved) + "/")`, which is POSIX-specific and can misbehave on Windows where `Path.resolve()` uses backslashes. Prefix-based string comparisons are also fragile in general.

Consider instead enforcing that all extracted paths stay under `dest_resolved` using `Path.relative_to` (or `Path.is_relative_to` on Python 3.9+):

```python
    dest_resolved = dest.resolve()
    for member in tar.getmembers():
        member_path = (dest / member.name).resolve()
        try:
            member_path.relative_to(dest_resolved)
        except ValueError:
            raise ValueError(f"Unsafe tar path rejected: {member.name}")
```

This removes manual string checks and is portable across platforms.

```suggestion
    @staticmethod
    def _safe_extractall(tar: tarfile.TarFile, dest: Path) -> None:
        dest_resolved = dest.resolve()
        for member in tar.getmembers():
            member_path = (dest / member.name).resolve()
            try:
                member_path.relative_to(dest_resolved)
            except ValueError:
                raise ValueError(f"Unsafe tar path rejected: {member.name}")
        tar.extractall(dest)
```
</issue_to_address>

### Comment 3
<location path="tests/test_backend_contract.py" line_range="88-85" />
<code_context>
         super().__init__()


+class _FakeScalewayTrainHelper(_BaseFakeHelper):
+    def __init__(self, *args, **kwargs):
+        _ = args, kwargs
+        super().__init__()
+
+
</code_context>
<issue_to_address>
**suggestion (testing):** Capture constructor kwargs in the fake helper so the test can assert backend options are wired correctly.

Currently `_FakeScalewayTrainHelper.__init__` ignores `kwargs`, so `test_scaleway_backend_contract` only checks that a helper is returned and its lifecycle methods are invoked. It doesn’t verify that `backend_definitions._create_scaleway` forwards options like `gpu_type`, `zone`, `inference_engine`, and `health_timeout_s` to `ScalewayTrainHelper`.

Please have `_FakeScalewayTrainHelper` store `kwargs` (e.g., `self.kwargs = kwargs`) and extend the test to assert that both default and explicit `backend_options` are propagated correctly.

Suggested implementation:

```python
class _FakeScalewayTrainHelper(_BaseFakeHelper):
    def __init__(self, *args, **kwargs):
        super().__init__()
        # Capture constructor arguments so tests can assert backend options are forwarded correctly
        self.args = args
        self.kwargs = kwargs

```

To fully implement your suggestion, the following updates are also needed in `tests/test_backend_contract.py` (or related test files), but they are not shown in the provided snippet:

1. In `test_scaleway_backend_contract`, after obtaining the helper instance (which should now be a `_FakeScalewayTrainHelper`), add assertions on `helper.kwargs`:
   - Verify that explicit `backend_options` (e.g. `gpu_type`, `zone`, `inference_engine`, `health_timeout_s`) passed via `backend_definitions._create_scaleway(...)` are present with the expected values.
   - Verify that defaulted options are present in `helper.kwargs` with their default values when not explicitly provided.
2. If `backend_definitions._create_scaleway` sometimes merges user-provided options with defaults, the test should construct the expected full options dict and compare it to `helper.kwargs` (or a relevant subset).
3. Ensure that the test uses the fake helper (e.g., via dependency injection or monkeypatching) so that the captured `kwargs` are available for assertions.
</issue_to_address>

@teilomillet (Owner) commented:

@sourcery-ai review

@sourcery-ai bot left a comment

Hey - I've found 3 issues, and left some high-level feedback:

  • In ScalewayTrainHelper.__del__, accessing self._runner unguarded can raise AttributeError if __init__ fails early; consider checking hasattr(self, "_runner") (or using a separate teardown helper) before calling destroy() to make object cleanup robust to partially-initialized instances (see the sketch after this list).
  • The Terraform caller_ip default of 0.0.0.0/0 combined with opening ports 8000/8001 is quite permissive; it might be safer to require an explicit caller IP or at least emit a clear warning when the default is used, to reduce the risk of unintentionally exposing inference and training endpoints.
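
A guarded `__del__` along the lines of the first point might look like this (a sketch; `logger` and the `_applied` flag are assumed from the surrounding module):

```python
def __del__(self) -> None:
    # __init__ can fail before _runner is assigned, so never touch the
    # attribute unguarded during garbage collection.
    runner = getattr(self, "_runner", None)
    if runner is None or not getattr(runner, "_applied", False):
        return
    logger.warning("ScalewayTrainHelper was garbage-collected without close(); destroying instance")
    runner.destroy()
```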
## Individual Comments

### Comment 1
<location path="retrain/scaleway/training_server.py" line_range="118-123" />
<code_context>
+# LoRA reload on inference engine
+# ---------------------------------------------------------------------------
+
+def _reload_lora_on_inference(name: str) -> None:
+    # The adapter was just saved by checkpoint(); we need to tell the inference
+    # engine to reload it. vLLM and SGLang use different endpoints.
+    try:
+        if _inference_engine == "vllm":
+            httpx.post(
+                f"{_inference_url}/v1/load_lora_adapter",
+                json={"lora_name": name, "lora_path": name},
</code_context>
<issue_to_address>
**issue (bug_risk):** The LoRA path sent to the inference engine likely needs to include the adapter base directory, not just the name.

`lora_path` is currently set to `name` for both vLLM and SGLang, but adapters are saved under the base directory given by `--adapter-path` (default `/tmp/retrain_adapter`), e.g. `/tmp/retrain_adapter/<name>`. Unless the inference engine’s CWD matches that directory, passing just `name` will point to the wrong path.

Please construct `lora_path` using the same adapter base path as `LocalTrainHelper` (e.g. `os.path.join(adapter_base, name)`) so that reloads use the correct directory and don’t fail silently.
</issue_to_address>
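
A sketch of the suggested fix, joining the server's adapter base directory with the adapter name (the module-level names `_adapter_base`, `_inference_engine`, and `_inference_url` follow the surrounding code context; the SGLang payload shape is an assumption):

```python
import os

import httpx


def _reload_lora_on_inference(name: str) -> None:
    # Point the engine at the directory the adapter was actually saved under,
    # e.g. /tmp/retrain_adapter/<name>, not just the bare adapter name.
    lora_path = os.path.join(_adapter_base, name)
    if _inference_engine == "vllm":
        httpx.post(
            f"{_inference_url}/v1/load_lora_adapter",
            json={"lora_name": name, "lora_path": lora_path},
            timeout=30,
        )
    else:  # SGLang — payload keys assumed to mirror the vLLM case
        httpx.post(
            f"{_inference_url}/add_lora",
            json={"lora_name": name, "lora_path": lora_path},
            timeout=30,
        )
```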

### Comment 2
<location path="retrain/scaleway/training_server.py" line_range="89-98" />
<code_context>
+    return {}
+
+
+@app.post("/save_adapter")
+def save_adapter(req: SaveAdapterRequest) -> Response:
+    if _helper is None:
+        raise HTTPException(status_code=status.HTTP_503_SERVICE_UNAVAILABLE, detail="helper not initialized")
+    saved_path = _helper.save_adapter(req.path, req.name)
+    # Read all safetensors files and pack them as a tar archive
+    import io
+    import tarfile
+    buf = io.BytesIO()
+    adapter_dir = Path(saved_path)
+    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
+        for f in adapter_dir.rglob("*"):
+            if f.is_file():
</code_context>
<issue_to_address>
**🚨 issue (security):** Saving adapters to an arbitrary `path` from the HTTP request can expose the server filesystem unnecessarily.

The server currently passes `req.path` directly into `_helper.save_adapter`, so any HTTP client can write files anywhere the process has permissions. Since the client only needs the resulting tar, not control over the on-disk location, consider either always writing under a fixed, server-controlled base directory or enforcing that `req.path` is confined within such a base directory before calling `LocalTrainHelper`.
</issue_to_address>
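
One way to confine writes as suggested is to resolve the target under a server-controlled base directory and reject anything that escapes it (a sketch; `_adapter_base` is assumed from the surrounding server code):

```python
from pathlib import Path

from fastapi import HTTPException, status


def _resolve_save_dir(name: str) -> Path:
    base = Path(_adapter_base).resolve()
    target = (base / name).resolve()
    try:
        target.relative_to(base)  # raises ValueError if `name` escapes the base dir
    except ValueError:
        raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="invalid adapter name")
    return target
```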

### Comment 3
<location path="tests/test_backend_contract.py" line_range="88-92" />
<code_context>
         super().__init__()


+class _FakeScalewayTrainHelper(_BaseFakeHelper):
+    def __init__(self, *args, **kwargs):
+        super().__init__()
+        self.init_args = args
+        self.init_kwargs = kwargs
+
+
</code_context>
<issue_to_address>
**suggestion (testing):** Also assert that `state_dir` is forwarded correctly to `ScalewayTrainHelper`

The fake helper already records `init_kwargs`, but `test_scaleway_backend_contract` currently only asserts GPU, zone, inference_engine, health timeouts, max_model_len, model, and lora_rank. Please extend the test to also assert that `state_dir` is forwarded correctly (e.g. equals `Path(config.log_dir) / '.terraform-state'`) so the backend contract covers Terraform state wiring and prevents regressions in the state location.

Suggested implementation:

```python
    assert helper.init_kwargs["lora_rank"] == config.lora_rank
    assert helper.init_kwargs["state_dir"] == Path(config.log_dir) / ".terraform-state"

```

If `Path` is not already imported in `tests/test_backend_contract.py`, add:
- `from pathlib import Path` alongside the other imports at the top of the file.
</issue_to_address>

@arthemide (Contributor, Author) commented:

@sourcery-ai review

@sourcery-ai bot left a comment

Hey - I've found 1 issue, and left some high-level feedback:

  • In TerraformRunner.apply, _applied is only set to True after a successful apply, so a failed terraform apply can leave partially-created resources without any subsequent destroy; consider marking _applied based on the presence of a state file (or setting it before apply) so destroy() is attempted even after partial failures (see the sketch after this list).
  • The Terraform security group defaults caller_ip to 0.0.0.0/0, effectively exposing ports 8000/8001 publicly despite the warning in TerraformRunner; it would be safer to require an explicit caller_ip (or auto-detect and inject the caller’s IP) and fail fast when it remains at the insecure default.
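
A sketch of the first point, marking `_applied` before the apply call so `destroy()` is attempted even after partial failures (method names follow the class diagram; the exact flags and output keys are assumptions):

```python
def apply(self) -> tuple[str, str]:
    self._run_tf("init")
    # Mark applied *before* running: a failed apply can still leave
    # partially-created resources that destroy() must clean up.
    self._applied = True
    self._run_tf("apply", "-auto-approve", *self._state_args(), *self._var_args())
    outputs = self._read_outputs()
    return str(outputs["inference_url"]), str(outputs["training_url"])
```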
## Individual Comments

### Comment 1
<location path="retrain/scaleway/terraform_runner.py" line_range="42" />
<code_context>
+        tf_dir: Path | None = None,
+        state_dir: Path | None = None,
+    ) -> None:
+        if caller_ip == "0.0.0.0/0":
+            logger.warning(
+                "caller_ip is 0.0.0.0/0 — ports 8000/8001 will be open to the internet. "
</code_context>
<issue_to_address>
**🚨 suggestion (security):** Consider enforcing or tightening the default `caller_ip` to avoid accidentally exposing ports.

`caller_ip` currently defaults to `0.0.0.0/0` with only a warning, which makes it easy to unintentionally expose inference/training endpoints publicly. Please either require an explicit `caller_ip` (no default) or gate `0.0.0.0/0` behind an explicit flag (e.g. `allow_public=True`) so that public exposure is a deliberate choice.

Suggested implementation:

```python
        zone: str,
        gpu_type: str,
        model: str,
        lora_rank: int,
        inference_engine: str,
        caller_ip: str,
        max_model_len: int = 32768,
        allow_public: bool = False,
        tf_dir: Path | None = None,
        state_dir: Path | None = None,
    ) -> None:
        if caller_ip == "0.0.0.0/0":
            if not allow_public:
                raise ValueError(
                    "caller_ip 0.0.0.0/0 will expose ports 8000/8001 to the internet. "
                    "Pass allow_public=True to explicitly allow public access or set "
                    "caller_ip to your public IP/CIDR to restrict access."
                )
            logger.warning(
                "caller_ip is 0.0.0.0/0 — ports 8000/8001 will be open to the internet. "
                "Set caller_ip to your public IP to restrict access."
            )

```

```python
        self._inference_engine = inference_engine
        self._caller_ip = caller_ip
        self._max_model_len = max_model_len
        self._allow_public = allow_public
        self._tf_dir = tf_dir or _TF_DIR

```

1. All call sites constructing this class must now pass `caller_ip` explicitly (it no longer has a default).
2. Any caller that intentionally wants fully public exposure must also pass `allow_public=True` alongside `caller_ip="0.0.0.0/0"`.
3. If `self._allow_public` is used elsewhere in the class (e.g. when rendering Terraform variables), ensure it is wired through consistently; if not needed, you may omit storing `_allow_public` and just rely on the constructor validation.
</issue_to_address>


@SourceryAI left a comment

Hey - I've found 2 issues, and left some high-level feedback:

  • In ScalewayTrainHelper.sample, the inline from transformers import AutoTokenizer import is unused and will raise an ImportError unless transformers is installed; since _get_tokenizer already lazily imports it, you can safely drop this redundant import or explicitly add transformers to the appropriate dependency group.
  • The Terraform security group currently allows SSH from 0.0.0.0/0 and defaults caller_ip to 0.0.0.0/0, effectively exposing ports 22/8000/8001 publicly; consider tightening these defaults (e.g. restricting to the caller IP by default or requiring explicit configuration) to avoid accidental exposure in production.
## Individual Comments

### Comment 1
<location path="retrain/scaleway_backend.py" line_range="179-185" />
<code_context>
+    # Internals
+    # ------------------------------------------------------------------
+
+    def _wait_healthy(self) -> None:
+        deadline = time.monotonic() + self._health_timeout_s
+        for url in (f"{self._inference_url}/health", f"{self._training_url}/health"):
+            logger.info("Waiting for %s …", url)
+            while True:
+                try:
+                    r = self._client.get(url, timeout=5)
+                    if r.status_code == 200:
+                        break
+                except Exception:
+                    pass
+                if time.monotonic() > deadline:
</code_context>
<issue_to_address>
**suggestion:** Swallowing all exceptions during health checks makes debugging connectivity issues harder.

In `_wait_healthy`, any exception from `self._client.get` is silently ignored. This preserves the retry logic but hides important diagnostics (e.g., DNS/TLS failures). Consider logging at least the exception type/message at debug level (or on first occurrence) so users can see why the service never became healthy while keeping the current control flow.

```suggestion
                try:
                    r = self._client.get(url, timeout=5)
                    if r.status_code == 200:
                        break
                except Exception:
                    logger.debug("Health check request to %s failed", url, exc_info=True)
                if time.monotonic() > deadline:
```
</issue_to_address>

### Comment 2
<location path="retrain/scaleway/terraform/main.tf" line_range="20-28" />
<code_context>
+# Security Group: GPU instance
+# Inference (8000) and training (8001) ports are restricted to caller_ip.
+# -----------------------------------------------------------------------------
+resource "scaleway_instance_security_group" "gpu" {
+  name                    = "${var.project_name}-sg-gpu"
+  description             = "GPU: inference/training ports restricted to caller, SSH via Scaleway IAM keys"
+  zone                    = var.zone
+  inbound_default_policy  = "drop"
+  outbound_default_policy = "accept"
+
+  # SSH — open to all IPs; key-based auth enforced via Scaleway IAM SSH keys
+  inbound_rule {
+    action   = "accept"
+    protocol = "TCP"
</code_context>
<issue_to_address>
**🚨 suggestion (security):** Opening SSH (22) to 0.0.0.0/0 may be unnecessarily broad for many users.

Given SSH is open to `0.0.0.0/0`, this increases exposure to internet-wide scans even with key-based auth. Consider parameterizing this (like `caller_ip`) so SSH can be restricted when possible, or introduce an `ssh_cidr` variable with a more restrictive default to improve the default security posture.

Suggested implementation:

```
  # SSH — CIDR range configurable via `ssh_cidr` (default can be more restrictive than 0.0.0.0/0)
  inbound_rule {
    action   = "accept"
    protocol = "TCP"
    port     = 22
    ip_range = var.ssh_cidr
  }

```

To fully implement this change you should also:
1. Define a new `ssh_cidr` variable (for example in `retrain/scaleway/terraform/variables.tf` or alongside your other variables), e.g.:
   ```hcl
   variable "ssh_cidr" {
     description = "CIDR block allowed to SSH into the GPU instances (port 22)."
     type        = string
     default     = "0.0.0.0/0" # Consider overriding this per-environment with a more restrictive CIDR (e.g. your office/VPN IP).
   }
   ```
2. Optionally set a more restrictive default for `ssh_cidr` in your environment-specific `*.tfvars` files (for example a specific office or VPN CIDR, or the same value as `caller_ip`).
</issue_to_address>

@arthemide (Contributor, Author) commented:

@sourcery-ai review

@sourcery-ai bot left a comment

Hey - I've found 4 issues, and left some high-level feedback:

  • The training server relies heavily on module-level globals (_helper, _inference_url, _inference_engine, _adapter_base), which makes reuse and testing harder; consider moving this state into a startup hook or dependency-injected object (e.g., FastAPI dependency or lifespan handler) so you can construct multiple server instances with different settings without global mutation (see the sketch after this list).
  • In ScalewayTrainHelper.__del__, the broad hasattr(self, "_runner") + blanket exception swallowing can hide real teardown issues and may run in an unsafe interpreter-shutdown context; it would be safer to rely on the explicit close/context manager for normal lifecycle and keep __del__ either minimal or logging-only, with TerraformRunner.destroy() already idempotent.
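
The first point matches the `create_app(state)` shape shown in the class diagram; a minimal sketch of that dependency-injected style (field names taken from the `ServerState` dataclass above, endpoint body illustrative):

```python
from dataclasses import dataclass

from fastapi import FastAPI, Request


@dataclass
class ServerState:
    helper: object
    inference_url: str
    inference_engine: str
    adapter_base: str


def create_app(state: ServerState) -> FastAPI:
    app = FastAPI()
    app.state.server = state  # no module-level globals: state travels with the app

    @app.get("/health")
    def health(request: Request) -> dict[str, str]:
        # Each endpoint reads the injected state from its owning app instance.
        _ = request.app.state.server
        return {"status": "ok"}

    return app
```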
## Individual Comments

### Comment 1
<location path="retrain/scaleway/terraform_runner.py" line_range="29-35" />
<code_context>
+    pass
+
+
+def _detect_public_ip() -> str:
+    """Best-effort detection of the caller's public IP via ipify. Returns '0.0.0.0/0' on failure."""
+    try:
+        ip = httpx.get("https://api4.ipify.org", timeout=5).text.strip()
+        return f"{ip}/32"
+    except Exception:
+        return "0.0.0.0/0"
+
+
</code_context>
<issue_to_address>
**🚨 issue (security):** Defaulting to 0.0.0.0/0 on IP detection failure is risky; consider failing fast or requiring explicit caller_ip instead.

If IP detection fails, this silently expands access to `0.0.0.0/0`, effectively exposing ports 8000/8001. The warning log is easy to miss and materially changes the security posture. Please either fail and require an explicit `caller_ip`, or gate the `0.0.0.0/0` fallback behind an explicit opt-in (e.g., env flag) so users don’t unknowingly run with fully open ports when detection breaks (e.g., behind proxies/VPNs).
</issue_to_address>
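
A fail-fast variant of `_detect_public_ip`, per this comment (a sketch: `TerraformError` is the module's own exception, and the opt-in environment-variable name is hypothetical):

```python
import os

import httpx


def _detect_public_ip() -> str:
    """Detect the caller's public IP via ipify; fail instead of opening 0.0.0.0/0."""
    try:
        ip = httpx.get("https://api4.ipify.org", timeout=5).text.strip()
        return f"{ip}/32"
    except Exception as exc:
        if os.environ.get("RETRAIN_ALLOW_PUBLIC_INGRESS") == "1":  # explicit opt-in
            return "0.0.0.0/0"
        raise TerraformError(
            "Could not detect public IP; pass caller_ip explicitly or set "
            "RETRAIN_ALLOW_PUBLIC_INGRESS=1 to open ports 8000/8001 to the internet."
        ) from exc
```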

### Comment 2
<location path="retrain/scaleway/terraform_runner.py" line_range="149-160" />
<code_context>
+            args.extend(pair)
+        return args
+
+    def _run_tf(self, *args: str) -> None:
+        cmd = ["terraform", *args]
+        env = {**os.environ, "TF_DATA_DIR": str(self._state_dir)}
+        result = subprocess.run(  # nosec B603 — list argv, shell=False, no user input reaches here
+            cmd,
+            cwd=str(self._tf_dir),
+            env=env,
+            capture_output=False,
+            text=True,
+        )
+        if result.returncode != 0:
+            raise TerraformError(f"terraform {args[0]} exited with code {result.returncode}")
+
+    def _read_outputs(self) -> dict[str, object]:
</code_context>
<issue_to_address>
**suggestion:** Terraform errors lose stderr/stdout context; consider capturing/logging output to aid debugging.

In `_run_tf`, using `capture_output=False` means Terraform’s stderr/stdout is only visible on the console, and `TerraformError` carries only the exit code. In non-interactive or aggregated-log environments, that makes failures hard to diagnose.

Consider enabling `capture_output=True` and, on non-zero exit codes, including trimmed stderr in `TerraformError` or logging it (e.g., via `logger.error`), so `init/apply/destroy` errors (provider/auth/network issues) are visible in logs.

```suggestion
    def _run_tf(self, *args: str) -> None:
        cmd = ["terraform", *args]
        env = {**os.environ, "TF_DATA_DIR": str(self._state_dir)}
        result = subprocess.run(  # nosec B603 — list argv, shell=False, no user input reaches here
            cmd,
            cwd=str(self._tf_dir),
            env=env,
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            stderr = (result.stderr or "").strip()
            stdout = (result.stdout or "").strip()
            details = []
            if stderr:
                details.append(f"stderr:\n{stderr}")
            if stdout:
                details.append(f"stdout:\n{stdout}")
            details_str = "\n".join(details)
            if details_str:
                raise TerraformError(
                    f"terraform {args[0]} exited with code {result.returncode}\n{details_str}"
                )
            raise TerraformError(f"terraform {args[0]} exited with code {result.returncode}")
```
</issue_to_address>

### Comment 3
<location path="retrain/scaleway/training_server.py" line_range="90-99" />
<code_context>
+    return {}
+
+
+@app.post("/save_adapter")
+def save_adapter(req: SaveAdapterRequest) -> Response:
+    if _helper is None:
+        raise HTTPException(status_code=status.HTTP_503_SERVICE_UNAVAILABLE, detail="helper not initialized")
+    # Always write under the server-controlled base dir — client does not control on-disk location
+    saved_path = _helper.save_adapter(_adapter_base, req.name)
+    import io
+    import tarfile
+    buf = io.BytesIO()
+    adapter_dir = Path(saved_path)
+    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
+        for f in adapter_dir.rglob("*"):
+            if f.is_file():
+                tar.add(f, arcname=f.relative_to(adapter_dir))
+    return Response(content=buf.getvalue(), media_type="application/octet-stream")
+
+
</code_context>
<issue_to_address>
**suggestion (performance):** save_adapter loads the entire adapter tarball into memory; consider streaming or temporary files for large adapters.

This approach buffers the entire tarball in memory via `BytesIO`, which can cause significant memory spikes for larger adapters. Instead, consider either streaming the tarball with `StreamingResponse` as you walk the files, or writing to a temporary file and streaming from disk to keep memory usage bounded as adapter sizes grow.

Suggested implementation:

```python
@app.post("/save_adapter")
def save_adapter(req: SaveAdapterRequest, background_tasks: BackgroundTasks) -> StreamingResponse:
    if _helper is None:
        raise HTTPException(status_code=status.HTTP_503_SERVICE_UNAVAILABLE, detail="helper not initialized")

    # Always write under the server-controlled base dir — client does not control on-disk location
    saved_path = _helper.save_adapter(_adapter_base, req.name)

    adapter_dir = Path(saved_path)

    # Build the tarball on disk to avoid buffering large adapters in memory.
    with tempfile.NamedTemporaryFile(suffix=".tar.gz", delete=False) as tmp:
        tmp_name = tmp.name
        with tarfile.open(fileobj=tmp, mode="w:gz") as tar:
            for f in adapter_dir.rglob("*"):
                if f.is_file():
                    tar.add(f, arcname=f.relative_to(adapter_dir))

    def iter_file(path: str, chunk_size: int = 1024 * 1024):
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                yield chunk

    # Ensure the temporary tarball is removed after the response is fully sent.
    background_tasks.add_task(os.remove, tmp_name)

    return StreamingResponse(
        iter_file(tmp_name),
        media_type="application/octet-stream",
    )
```

```python
import argparse
import logging
import os
import tempfile
import tarfile
from fastapi import BackgroundTasks
from fastapi.responses import StreamingResponse

```

1. If `tarfile`, `tempfile`, `os`, `BackgroundTasks`, or `StreamingResponse` are already imported elsewhere in this file, remove the duplicate imports from the new block to avoid linter/CI warnings.
2. If your project uses a centralized response type alias or custom response helpers, you may want to align the `save_adapter` return annotation accordingly (e.g., a union including `StreamingResponse`).
3. If you want clients to receive a filename, consider adding a `Content-Disposition` header to the `StreamingResponse` (e.g., `headers={"Content-Disposition": f'attachment; filename="{req.name}.tar.gz"'}`).
</issue_to_address>

### Comment 4
<location path="docs/scaleway-backend-design.md" line_range="158" />
<code_context>
+
1. `__init__` → `TerraformRunner.apply()` (blocks ~2–5 min, logs progress)
+2. Polls `/health` on both services
+3. Delegates `sample()` to the inference engine (OpenAI-compat, identical for vLLM and SGLang)
+4. Delegates `train_step()` / `checkpoint()` to the Training Server
+5. `checkpoint()` triggers LoRA reload on the inference engine via the appropriate endpoint:
</code_context>
<issue_to_address>
**suggestion (typo):** Use "OpenAI-compatible" instead of "OpenAI-compat" to match more standard wording.

Also update this occurrence to use "OpenAI-compatible" (e.g., "OpenAI-compatible, identical for vLLM and SGLang") to keep terminology consistent and avoid it looking like a typo.

Suggested implementation:

```
3. Delegates `sample()` to the inference engine (OpenAI-compatible, identical for vLLM and SGLang)

```

There may be other occurrences of "OpenAI-compat" elsewhere in this document or in related docs.
To keep terminology consistent and avoid this looking like a typo, search the repo (e.g., `rg "OpenAI-compat" docs/`) and update all instances to "OpenAI-compatible".
</issue_to_address>

@arthemide (Contributor, Author) commented:

@sourcery-ai review

@sourcery-ai Bot left a comment:

Hey - I've found 2 issues and left some high-level feedback:

  • The Scaleway backend currently hardcodes `caller_ip` to the auto-detected public IP in `TerraformRunner`, with no way to override it from `backend_options`; consider exposing a `caller_ip` backend option so users in locked-down or proxy-only environments can bypass ipify and/or explicitly control the security-group source CIDR (see the sketch after this list).
  • In `training_server.py` there are two FastAPI app instances (`app` and the one returned by `create_app`), but only the latter has `state.server` initialized; consider either removing the unused module-level `app` or wiring it through `create_app` to avoid accidental use of an uninitialized app (e.g., via `uvicorn retrain.scaleway.training_server:app`).
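
For the first point, a minimal sketch of how an override could be threaded through, assuming a plain `backend_options` dict; `resolve_caller_ip` and `_detect_public_ip` are illustrative names, not helpers from this PR:

```python
import urllib.request


def _detect_public_ip() -> str:
    # Stand-in for the PR's ipify-based lookup; shown only to make the
    # sketch self-contained.
    with urllib.request.urlopen("https://api.ipify.org") as resp:  # nosec B310
        return resp.read().decode()


def resolve_caller_ip(backend_options: dict[str, str]) -> str:
    # An explicit `caller_ip` option wins over auto-detection, so locked-down
    # or proxy-only environments never call out to ipify and can pin the
    # security-group source CIDR themselves.
    explicit = backend_options.get("caller_ip")
    if explicit:
        return explicit
    return _detect_public_ip()
```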

## Individual Comments

### Comment 1
<location path="retrain/scaleway/training_server.py" line_range="52-53" />
<code_context>
+    return _app
+
+
+# Module-level app instance used when launched via CLI (main())
+app = FastAPI(title="retrain-training-server")
+
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Routes are registered on a different FastAPI instance than the one you run in `main()`.

The route decorators bind to the module-level `app`, while `main()` runs a separate `server = create_app(state)`. That leaves `server` without routes and the global `app` without `state.server`, causing 404s/503s. Please consolidate to a single FastAPI instance, e.g. either construct `app = create_app(default_state)` at module level and set `state.server` there, or set `app.state.server = state` in `main()` and pass `app` to `uvicorn.run` instead of `server`.
</issue_to_address>
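
One way the consolidation could look — a sketch only, with `ServerState` as a placeholder and a `create_app` that both registers routes and stores state (the module's real signatures may differ):

```python
from fastapi import FastAPI


class ServerState:
    # Illustrative placeholder for the module's real server state.
    pass


def create_app(state: ServerState) -> FastAPI:
    application = FastAPI(title="retrain-training-server")
    application.state.server = state  # handlers read this via request.app.state
    # ... register all routes on `application` here, never on a separate global ...
    return application


# The module-level `app` is the same instance main() serves, so
# `uvicorn retrain.scaleway.training_server:app` exposes identical routes.
app = create_app(ServerState())


def main() -> None:
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
```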

### Comment 2
<location path="pyproject.toml" line_range="23" />
<code_context>

 [project.optional-dependencies]
 local = ["torch>=2.4.0"]
+scaleway = ["httpx>=0.28", "fastapi>=0.136", "uvicorn[standard]>=0.46", "scaleway>=2.11"]
 tinker = ["tinker>=0.13.1"]
 mlx = ["mlx-lm>=0.20.0; platform_system == 'Darwin'"]
</code_context>
<issue_to_address>
**suggestion (bug_risk):** The `fastapi>=0.136` pin is likely ahead of current releases and may be unresolvable for users.

This lower bound likely points to a non-existent FastAPI version and can cause `pip`/`uv` resolution failures when installing the `scaleway` extra. Please either pin to a currently released, tested range (e.g. `fastapi>=0.110,<1.0`) or drop the lower bound until you know the minimum compatible version. The same applies to `uvicorn[standard]>=0.46`, though FastAPI is more likely to fail resolution first.

```suggestion
scaleway = ["httpx>=0.28", "fastapi>=0.110,<1.0", "uvicorn[standard]>=0.22,<1.0", "scaleway>=2.11"]
```
</issue_to_address>


