feat/add scaleway backend #3
Conversation
Reviewer's Guide

Adds a new Scaleway GPU backend that provisions remote inference/training infrastructure via Terraform, exposes it through a FastAPI training server, wires it into the existing backend system with contract tests, and introduces an SROIE sample campaign plus supporting docs and dev tooling updates.

Sequence diagram for Scaleway backend training lifecycle:

sequenceDiagram
actor User
participant RetrainCLI
participant ScalewayTrainHelper
participant TerraformRunner
participant TerraformCLI
participant GPUInstance
participant InferenceEngine
participant TrainingServer
User->>RetrainCLI: run retrain with backend=scaleway
RetrainCLI->>ScalewayTrainHelper: instantiate with TrainConfig
ScalewayTrainHelper->>TerraformRunner: new TerraformRunner(...)
ScalewayTrainHelper->>TerraformRunner: apply()
TerraformRunner->>TerraformCLI: terraform init/apply
TerraformCLI-->>GPUInstance: provision GPU instance
GPUInstance-->>TerraformCLI: outputs inference_url, training_url
TerraformCLI-->>TerraformRunner: apply success
TerraformRunner-->>ScalewayTrainHelper: inference_url, training_url
ScalewayTrainHelper->>InferenceEngine: GET /health (poll)
ScalewayTrainHelper->>TrainingServer: GET /health (poll)
InferenceEngine-->>ScalewayTrainHelper: 200 ok
TrainingServer-->>ScalewayTrainHelper: 200 ok
loop sampling
RetrainCLI->>ScalewayTrainHelper: sample(prompt_ids_list,...)
ScalewayTrainHelper->>InferenceEngine: POST /v1/chat/completions
InferenceEngine-->>ScalewayTrainHelper: token_ids, logprobs
ScalewayTrainHelper-->>RetrainCLI: SampleBatch
end
loop training steps
RetrainCLI->>ScalewayTrainHelper: train_step(tokens, logprobs, advantages,...)
ScalewayTrainHelper->>TrainingServer: POST /train_step
TrainingServer->>TrainingServer: LocalTrainHelper.train_step
TrainingServer-->>ScalewayTrainHelper: {loss}
ScalewayTrainHelper-->>RetrainCLI: loss
end
RetrainCLI->>ScalewayTrainHelper: checkpoint(name)
ScalewayTrainHelper->>TrainingServer: POST /checkpoint
TrainingServer->>LocalTrainHelper: checkpoint
TrainingServer->>InferenceEngine: POST /v1/load_lora_adapter or /add_lora
InferenceEngine-->>TrainingServer: ok
User->>RetrainCLI: finish run
RetrainCLI->>ScalewayTrainHelper: close() / __exit__
ScalewayTrainHelper->>TerraformRunner: destroy()
TerraformRunner->>TerraformCLI: terraform destroy
TerraformCLI-->>GPUInstance: tear down instance
Class diagram for Scaleway backend and SROIE components:

classDiagram
class ScalewayTrainHelper {
-str _model
-int _lora_rank
-str _inference_engine
-int _health_timeout_s
-float _health_poll_s
-TerraformRunner _runner
-str _inference_url
-str _training_url
-httpx.Client _client
+ScalewayTrainHelper(model, lora_rank, gpu_type, zone, inference_engine, health_timeout_s, health_poll_s, max_model_len, state_dir)
+checkpoint(name str) void
+sample(prompt_ids_list list~list~int~~, num_samples int, max_tokens int, temperature float, top_p float) SampleBatch
+sample_with_entropy(prompt_ids_list list~list~int~~, num_samples int, max_tokens int, temperature float, top_p float) EnrichedSampleBatch
+train_step(all_tokens list~list~int~~, all_logprobs list~list~float~~, all_advantages list~list~float~~, lr float, weight_decay float) float
+save_adapter(path str, name str) str
+load_state(name str) void
+close() void
+__enter__() ScalewayTrainHelper
+__exit__() void
+_wait_healthy() void
+_sample_one(prompt str, max_tokens int, temperature float, top_p float) tuple~list~int~~ list~float~~
+_post_training(path str, body dict) dict
+_get_tokenizer() object
+_safe_extractall(tar TarFile, dest Path) void
}
class TerraformRunner {
-str _zone
-str _instance_type
-str _model
-int _lora_rank
-str _inference_engine
-str _caller_ip
-int _max_model_len
-Path _tf_dir
-Path _state_dir
-Path state_file
-bool _applied
+TerraformRunner(zone str, gpu_type str, model str, lora_rank int, inference_engine str, caller_ip str, max_model_len int, tf_dir Path, state_dir Path)
+apply() tuple~str str~
+destroy() void
+_state_args() list~str~
+_var_args() list~str~
+_run_tf(args str) void
+_read_outputs() dict
}
class TerraformError {
<<exception>>
}
class ServerState {
+object helper
+str inference_url
+str inference_engine
+str adapter_base
}
class TrainingServerApp {
+create_app(state ServerState) FastAPI
+health(request Request) dict~str str~
+train_step(req TrainStepRequest, request Request) dict~str float~
+checkpoint(req CheckpointRequest, request Request) dict
+save_adapter(req SaveAdapterRequest, request Request, background_tasks BackgroundTasks) StreamingResponse
+load_state(req LoadStateRequest, request Request) dict
+_reload_lora_on_inference(state ServerState, name str) void
+main() void
}
class TrainStepRequest {
+list~list~int~~ tokens
+list~list~float~~ logprobs
+list~list~float~~ advantages
+float lr
+float weight_decay
}
class CheckpointRequest {
+str name
}
class SaveAdapterRequest {
+str name
}
class LoadStateRequest {
+str name
}
class SROIEDataSource {
+int max_examples
+SROIEDataSource(max_examples int)
+load() list~Example~
}
class SROIEReward {
+score(response str, reference str) float
}
class Example {
+list~dict~ prompt
+str reference
+str task
+int example_id
}
class LocalTrainHelper {
+LocalTrainHelper(model str, adapter_path str, device str, lora_rank int, engine_type str)
+train_step(tokens list~list~int~~, logprobs list~list~float~~, advantages list~list~float~~, lr float, weight_decay float) float
+checkpoint(name str) void
+save_adapter(path str, name str) str
+load_state(name str) void
}
ScalewayTrainHelper --> TerraformRunner : uses
TerraformRunner --> TerraformError : raises
TrainingServerApp --> ServerState : holds
ServerState --> LocalTrainHelper : helper
TrainingServerApp --> TrainStepRequest : uses
TrainingServerApp --> CheckpointRequest : uses
TrainingServerApp --> SaveAdapterRequest : uses
TrainingServerApp --> LoadStateRequest : uses
SROIEDataSource --> Example : produces
SROIEReward --> Example : consumes reference
Hey - I've found 2 security issues, 4 other issues, and left some high level feedback:
Security issues:
- Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'. (link)
- Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'. (link)
General comments:
- In `TerraformRunner._var_args` you reference `self._hf_token`, but this attribute is never defined, so constructing the var args will currently raise an `AttributeError`; either initialize `_hf_token` in `__init__` or drop the local variable and read directly from the environment.
- Relying on `ScalewayTrainHelper.__del__` to call `TerraformRunner.destroy()` makes teardown best-effort and GC-dependent; consider adding an explicit context manager or close method so callers can deterministically destroy the Scaleway instance.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `TerraformRunner._var_args` you reference `self._hf_token`, but this attribute is never defined, so constructing the var args will currently raise an `AttributeError`; either initialize `_hf_token` in `__init__` or drop the local variable and read directly from the environment.
- Relying on `ScalewayTrainHelper.__del__` to call `TerraformRunner.destroy()` makes teardown best-effort and GC-dependent; consider adding an explicit context manager or close method so callers can deterministically destroy the Scaleway instance.
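A minimal way to address the first point, sketched with assumed names (an `HF_TOKEN` environment variable and a `_var_args` method emitting Terraform `-var` flags; the real class takes many more constructor arguments):

```python
import os


class TerraformRunner:
    """Illustrative fragment only: initialize _hf_token once in __init__."""

    def __init__(self) -> None:
        # Read the Hugging Face token at construction time so _var_args
        # never touches an undefined attribute.
        self._hf_token = os.environ.get("HF_TOKEN", "")

    def _var_args(self) -> list[str]:
        args: list[str] = []
        if self._hf_token:
            args.append(f"-var=hf_token={self._hf_token}")
        return args
```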
## Individual Comments
### Comment 1
<location path="retrain/backend_definitions.py" line_range="205-210" />
<code_context>
return None
+class ScalewayOptions(TypedDict):
+ gpu_type: str
+ zone: str
+ inference_engine: str
+ health_timeout_s: int
+ health_poll_s: float
+
+
</code_context>
<issue_to_address>
**issue:** ScalewayOptions TypedDict omits `max_model_len`, which is used in the backend options.
Since `option_schema` and `_create_scaleway` both use `max_model_len`, please add `max_model_len: int` to `ScalewayOptions` so the TypedDict matches the actual options and type checking remains accurate.
</issue_to_address>
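The requested shape as a sketch, with the field names copied from the diff above and `max_model_len` added:

```python
from typing import TypedDict


class ScalewayOptions(TypedDict):
    gpu_type: str
    zone: str
    inference_engine: str
    health_timeout_s: int
    health_poll_s: float
    max_model_len: int  # previously missing; used by option_schema and _create_scaleway
```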
### Comment 2
<location path="retrain/scaleway_backend.py" line_range="142-143" />
<code_context>
+ # Unpack the tar.gz returned by the server into local path/name/
+ local_dir = Path(path) / name
+ local_dir.mkdir(parents=True, exist_ok=True)
+ with tarfile.open(fileobj=io.BytesIO(resp.content), mode="r:gz") as tar:
+ tar.extractall(local_dir)
+ return str(local_dir)
+
</code_context>
<issue_to_address>
**🚨 issue (security):** Using `tar.extractall` directly can allow path traversal if the archive is compromised.
Even on a trusted training server, this remains a path traversal risk if the tar ever becomes attacker-controlled (e.g., future features or misconfigurations). Please validate each member path before extraction (e.g., resolve to an absolute path and ensure it stays under `local_dir`, rejecting `../` and absolute paths) or apply a standard safe-extract pattern.
</issue_to_address>
### Comment 3
<location path="docs/scaleway-backend-design.md" line_range="3" />
<code_context>
+# Scaleway Backend — Design Proposal
+
+**Statut** : RFC — en attente d'avis équipe
+**Auteur** : Gireg Roussel
+**Date** : 2026-04-24
</code_context>
<issue_to_address>
**issue (typo):** Corriger la tournure « en attente d'avis équipe » en ajoutant « de l' ».
Utiliser « en attente d'avis de l'équipe » pour assurer une formulation correcte en français.
```suggestion
**Statut** : RFC — en attente d'avis de l'équipe
```
</issue_to_address>
### Comment 4
<location path="docs/scaleway-backend-design.md" line_range="110" />
<code_context>
+```
+Training server Inference engine
+ │ │
+ │ sauvegarde poids LoRA │
+ │──────────────────────────────→│ POST /v1/load_lora_adapter (vLLM)
+ │ │ POST /add_lora (SGLang)
</code_context>
<issue_to_address>
**issue (typo):** Ajout d'un article pour « sauvegarde poids LoRA ».
Dans le schéma, remplacez « sauvegarde poids LoRA » par « sauvegarde les poids LoRA » ou « sauvegarde des poids LoRA » pour corriger la grammaire.
```suggestion
│ sauvegarde les poids LoRA │
```
</issue_to_address>
### Comment 5
<location path="retrain/scaleway/terraform_runner.py" line_range="123-129" />
<code_context>
result = subprocess.run(
cmd,
cwd=str(self._tf_dir),
env=env,
capture_output=False,
text=True,
)
</code_context>
<issue_to_address>
**security (python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.
*Source: opengrep*
</issue_to_address>
### Comment 6
<location path="retrain/scaleway/terraform_runner.py" line_range="136-142" />
<code_context>
result = subprocess.run(
cmd,
cwd=str(self._tf_dir),
env=env,
capture_output=True,
text=True,
)
</code_context>
<issue_to_address>
**security (python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.
*Source: opengrep*
</issue_to_address>
@sourcery-ai review
Hey - I've found 3 issues, and left some high level feedback:
- In `TerraformRunner._var_args`, using `shlex.quote` when building the `-var=key=value` arguments is incorrect for a list-based `subprocess.run` (there is no shell to interpret the quotes), so the quotes end up in the actual Terraform variable values; consider removing `shlex.quote` and either pass raw strings or use a `-var-file` instead.
- The `ScalewayOptions` `TypedDict` in `backend_definitions.py` is currently unused and the Scaleway backend option schema doesn't cover all the fields shown in the example campaign (e.g. `model`, `lora_rank` under `[backend.options]`), so either wire this `TypedDict` into normalization and extend the schema or remove it and align the example config to only use validated backend options.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `TerraformRunner._var_args`, using `shlex.quote` when building the `-var=key=value` arguments is incorrect for a list-based `subprocess.run` (there is no shell to interpret the quotes), so the quotes end up in the actual Terraform variable values; consider removing `shlex.quote` and either pass raw strings or use a `-var-file` instead.
- The `ScalewayOptions` `TypedDict` in `backend_definitions.py` is currently unused and the Scaleway backend option schema doesn’t cover all the fields shown in the example campaign (e.g. `model`, `lora_rank` under `[backend.options]`), so either wire this `TypedDict` into normalization and extend the schema or remove it and align the example config to only use validated backend options.
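A small demonstration of why `shlex.quote` is unnecessary for the first point: with a list argv there is no shell, so `subprocess.run` hands each element to the child verbatim (the `-var=` naming mirrors the review; the child command here is purely illustrative):

```python
import subprocess
import sys

# A value containing a space, which would need quoting only under a shell.
value = "hello world"

# List-based argv: each element reaches the child process exactly as written,
# so wrapping `value` in shlex.quote would embed literal quote characters
# into the Terraform variable value.
cmd = [sys.executable, "-c", "import sys; print(sys.argv[1])", f"-var=name={value}"]
out = subprocess.run(cmd, capture_output=True, text=True).stdout.strip()
# The child sees the raw string, spaces and all, with no added quotes.
```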
## Individual Comments
### Comment 1
<location path="retrain/scaleway_backend.py" line_range="216-217" />
<code_context>
+ token_ids: list[int] = []
+ logprobs: list[float] = []
+ lp_content = (choice.get("logprobs") or {}).get("content") or []
+ for entry in lp_content:
+ token_ids.append(entry["token_id"] if "token_id" in entry else entry.get("token", 0))
+ top = entry.get("top_logprobs") or []
+ logprobs.append(top[0]["logprob"] if top else 0.0)
</code_context>
<issue_to_address>
**issue (bug_risk):** Fallback to `entry["token"]` for token IDs is unsafe because that field is typically text, not an ID.
In vLLM’s OpenAI-compatible responses, `token_id` is the numeric ID while `token` is the string. Using `entry.get("token", 0)` risks inserting a string or `0` into `token_ids`, which breaks the expectation that this list only contains ints.
If you only support engines that return `token_id`, consider skipping entries without it (or raising):
```python
for entry in lp_content:
if "token_id" not in entry:
continue # or raise
token_ids.append(entry["token_id"])
```
If you must support responses without `token_id`, derive the ID via the tokenizer from `token` instead of appending the raw string or a default value.
</issue_to_address>
### Comment 2
<location path="retrain/scaleway_backend.py" line_range="222-229" />
<code_context>
+
+ @staticmethod
+ def _safe_extractall(tar: tarfile.TarFile, dest: Path) -> None:
+ dest_resolved = dest.resolve()
+ for member in tar.getmembers():
+ member_path = (dest / member.name).resolve()
+ if not str(member_path).startswith(str(dest_resolved) + "/") and member_path != dest_resolved:
+ raise ValueError(f"Unsafe tar path rejected: {member.name}")
+ tar.extractall(dest)
+
+ def _post_training(self, path: str, body: dict) -> dict:
</code_context>
<issue_to_address>
**🚨 suggestion (security):** Path traversal check in _safe_extractall is OS-path-sensitive and may be bypassed on Windows.
The check depends on `str(member_path).startswith(str(dest_resolved) + "/")`, which is POSIX-specific and can misbehave on Windows where `Path.resolve()` uses backslashes. Prefix-based string comparisons are also fragile in general.
Consider instead enforcing that all extracted paths stay under `dest_resolved` using `Path.relative_to` (or `Path.is_relative_to` on Python 3.9+):
```python
dest_resolved = dest.resolve()
for member in tar.getmembers():
member_path = (dest / member.name).resolve()
try:
member_path.relative_to(dest_resolved)
except ValueError:
raise ValueError(f"Unsafe tar path rejected: {member.name}")
```
This removes manual string checks and is portable across platforms.
```suggestion
@staticmethod
def _safe_extractall(tar: tarfile.TarFile, dest: Path) -> None:
dest_resolved = dest.resolve()
for member in tar.getmembers():
member_path = (dest / member.name).resolve()
try:
member_path.relative_to(dest_resolved)
except ValueError:
raise ValueError(f"Unsafe tar path rejected: {member.name}")
tar.extractall(dest)
```
</issue_to_address>
### Comment 3
<location path="tests/test_backend_contract.py" line_range="88-85" />
<code_context>
super().__init__()
+class _FakeScalewayTrainHelper(_BaseFakeHelper):
+ def __init__(self, *args, **kwargs):
+ _ = args, kwargs
+ super().__init__()
+
+
</code_context>
<issue_to_address>
**suggestion (testing):** Capture constructor kwargs in the fake helper so the test can assert backend options are wired correctly.
Currently `_FakeScalewayTrainHelper.__init__` ignores `kwargs`, so `test_scaleway_backend_contract` only checks that a helper is returned and its lifecycle methods are invoked. It doesn’t verify that `backend_definitions._create_scaleway` forwards options like `gpu_type`, `zone`, `inference_engine`, and `health_timeout_s` to `ScalewayTrainHelper`.
Please have `_FakeScalewayTrainHelper` store `kwargs` (e.g., `self.kwargs = kwargs`) and extend the test to assert that both default and explicit `backend_options` are propagated correctly.
Suggested implementation:
```python
class _FakeScalewayTrainHelper(_BaseFakeHelper):
def __init__(self, *args, **kwargs):
super().__init__()
# Capture constructor arguments so tests can assert backend options are forwarded correctly
self.args = args
self.kwargs = kwargs
```
To fully implement your suggestion, the following updates are also needed in `tests/test_backend_contract.py` (or related test files), but they are not shown in the provided snippet:
1. In `test_scaleway_backend_contract`, after obtaining the helper instance (which should now be a `_FakeScalewayTrainHelper`), add assertions on `helper.kwargs`:
- Verify that explicit `backend_options` (e.g. `gpu_type`, `zone`, `inference_engine`, `health_timeout_s`) passed via `backend_definitions._create_scaleway(...)` are present with the expected values.
- Verify that defaulted options are present in `helper.kwargs` with their default values when not explicitly provided.
2. If `backend_definitions._create_scaleway` sometimes merges user-provided options with defaults, the test should construct the expected full options dict and compare it to `helper.kwargs` (or a relevant subset).
3. Ensure that the test uses the fake helper (e.g., via dependency injection or monkeypatching) so that the captured `kwargs` are available for assertions.
</issue_to_address>
@sourcery-ai review
Hey - I've found 3 issues, and left some high level feedback:
- In `ScalewayTrainHelper.__del__`, accessing `self._runner` unguarded can raise `AttributeError` if `__init__` fails early; consider checking `hasattr(self, "_runner")` (or using a separate teardown helper) before calling `destroy()` to make object cleanup robust to partially-initialized instances.
- The Terraform `caller_ip` default of `0.0.0.0/0` combined with opening ports 8000/8001 is quite permissive; it might be safer to require an explicit caller IP or at least emit a clear warning when the default is used, to reduce the risk of unintentionally exposing inference and training endpoints.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `ScalewayTrainHelper.__del__`, accessing `self._runner` unguarded can raise `AttributeError` if `__init__` fails early; consider checking `hasattr(self, "_runner")` (or using a separate teardown helper) before calling `destroy()` to make object cleanup robust to partially-initialized instances.
- The Terraform `caller_ip` default of `0.0.0.0/0` combined with opening ports 8000/8001 is quite permissive; it might be safer to require an explicit caller IP or at least emit a clear warning when the default is used, to reduce the risk of unintentionally exposing inference and training endpoints.
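A sketch of the guarded cleanup suggested in the first point (class and attribute names mirror the review; the runner is a self-contained stand-in for `TerraformRunner`):

```python
class _FakeRunner:
    """Stand-in for TerraformRunner so the sketch is self-contained."""

    def __init__(self) -> None:
        self.destroyed = False

    def destroy(self) -> None:
        self.destroyed = True


class ScalewayTrainHelper:
    """Illustrative fragment: cleanup that tolerates a partially-built instance."""

    def __init__(self) -> None:
        # In the real class, earlier steps of __init__ can raise before this
        # assignment ever happens.
        self._runner = _FakeRunner()

    def __del__(self) -> None:
        # getattr with a default never raises, even when __init__ failed
        # before _runner was assigned.
        runner = getattr(self, "_runner", None)
        if runner is not None:
            runner.destroy()
```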
## Individual Comments
### Comment 1
<location path="retrain/scaleway/training_server.py" line_range="118-123" />
<code_context>
+# LoRA reload on inference engine
+# ---------------------------------------------------------------------------
+
+def _reload_lora_on_inference(name: str) -> None:
+ # The adapter was just saved by checkpoint(); we need to tell the inference
+ # engine to reload it. vLLM and SGLang use different endpoints.
+ try:
+ if _inference_engine == "vllm":
+ httpx.post(
+ f"{_inference_url}/v1/load_lora_adapter",
+ json={"lora_name": name, "lora_path": name},
</code_context>
<issue_to_address>
**issue (bug_risk):** The LoRA path sent to the inference engine likely needs to include the adapter base directory, not just the name.
`lora_path` is currently set to `name` for both vLLM and SGLang, but adapters are saved under the base directory given by `--adapter-path` (default `/tmp/retrain_adapter`), e.g. `/tmp/retrain_adapter/<name>`. Unless the inference engine’s CWD matches that directory, passing just `name` will point to the wrong path.
Please construct `lora_path` using the same adapter base path as `LocalTrainHelper` (e.g. `os.path.join(adapter_base, name)`) so that reloads use the correct directory and don’t fail silently.
</issue_to_address>
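A sketch of the suggested fix, assuming the adapter base directory is the `--adapter-path` default cited in the comment (names are illustrative):

```python
import os

# Assumed default of --adapter-path, as mentioned in the review comment.
ADAPTER_BASE = "/tmp/retrain_adapter"


def lora_path_for(name: str) -> str:
    # Send the full on-disk location of the saved adapter to the inference
    # engine, not just the bare adapter name.
    return os.path.join(ADAPTER_BASE, name)
```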
### Comment 2
<location path="retrain/scaleway/training_server.py" line_range="89-98" />
<code_context>
+ return {}
+
+
+@app.post("/save_adapter")
+def save_adapter(req: SaveAdapterRequest) -> Response:
+ if _helper is None:
+ raise HTTPException(status_code=status.HTTP_503_SERVICE_UNAVAILABLE, detail="helper not initialized")
+ saved_path = _helper.save_adapter(req.path, req.name)
+ # Read all safetensors files and pack them as a tar archive
+ import io
+ import tarfile
+ buf = io.BytesIO()
+ adapter_dir = Path(saved_path)
+ with tarfile.open(fileobj=buf, mode="w:gz") as tar:
+ for f in adapter_dir.rglob("*"):
+ if f.is_file():
</code_context>
<issue_to_address>
**🚨 issue (security):** Saving adapters to an arbitrary `path` from the HTTP request can expose the server filesystem unnecessarily.
The server currently passes `req.path` directly into `_helper.save_adapter`, so any HTTP client can write files anywhere the process has permissions. Since the client only needs the resulting tar, not control over the on-disk location, consider either always writing under a fixed, server-controlled base directory or enforcing that `req.path` is confined within such a base directory before calling `LocalTrainHelper`.
</issue_to_address>
### Comment 3
<location path="tests/test_backend_contract.py" line_range="88-92" />
<code_context>
super().__init__()
+class _FakeScalewayTrainHelper(_BaseFakeHelper):
+ def __init__(self, *args, **kwargs):
+ super().__init__()
+ self.init_args = args
+ self.init_kwargs = kwargs
+
+
</code_context>
<issue_to_address>
**suggestion (testing):** Also assert that `state_dir` is forwarded correctly to `ScalewayTrainHelper`
The fake helper already records `init_kwargs`, but `test_scaleway_backend_contract` currently only asserts GPU, zone, inference_engine, health timeouts, max_model_len, model, and lora_rank. Please extend the test to also assert that `state_dir` is forwarded correctly (e.g. equals `Path(config.log_dir) / '.terraform-state'`) so the backend contract covers Terraform state wiring and prevents regressions in the state location.
Suggested implementation:
```python
assert helper.init_kwargs["lora_rank"] == config.lora_rank
assert helper.init_kwargs["state_dir"] == Path(config.log_dir) / ".terraform-state"
```
If `Path` is not already imported in `tests/test_backend_contract.py`, add:
- `from pathlib import Path` alongside the other imports at the top of the file.
</issue_to_address>
@sourcery-ai review
Hey - I've found 1 issue, and left some high level feedback:
- In `TerraformRunner.apply`, `_applied` is only set to `True` after a successful `apply`, so a failed `terraform apply` can leave partially-created resources without any subsequent `destroy`; consider marking `_applied` based on the presence of a state file (or setting it before `apply`) so `destroy()` is attempted even after partial failures.
- The Terraform security group defaults `caller_ip` to `0.0.0.0/0`, effectively exposing ports 8000/8001 publicly despite the warning in `TerraformRunner`; it would be safer to require an explicit `caller_ip` (or auto-detect and inject the caller's IP) and fail fast when it remains at the insecure default.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `TerraformRunner.apply`, `_applied` is only set to `True` after a successful `apply`, so a failed `terraform apply` can leave partially-created resources without any subsequent `destroy`; consider marking `_applied` based on the presence of a state file (or setting it before `apply`) so `destroy()` is attempted even after partial failures.
- The Terraform security group defaults `caller_ip` to `0.0.0.0/0`, effectively exposing ports 8000/8001 publicly despite the warning in `TerraformRunner`; it would be safer to require an explicit `caller_ip` (or auto-detect and inject the caller’s IP) and fail fast when it remains at the insecure default.
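A sketch of the state-file-based approach from the first point (names assumed; the real runner would run `terraform destroy` whenever this returns `True`):

```python
from pathlib import Path


class TerraformRunner:
    """Illustrative fragment: decide whether to destroy from the state file."""

    def __init__(self, state_file: Path) -> None:
        self.state_file = state_file

    def should_destroy(self) -> bool:
        # Terraform typically writes state as soon as it starts creating
        # resources, so an existing state file means a destroy is worth
        # attempting even after a partially failed apply.
        return self.state_file.exists()
```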
## Individual Comments
### Comment 1
<location path="retrain/scaleway/terraform_runner.py" line_range="42" />
<code_context>
+ tf_dir: Path | None = None,
+ state_dir: Path | None = None,
+ ) -> None:
+ if caller_ip == "0.0.0.0/0":
+ logger.warning(
+ "caller_ip is 0.0.0.0/0 — ports 8000/8001 will be open to the internet. "
</code_context>
<issue_to_address>
**🚨 suggestion (security):** Consider enforcing or tightening the default `caller_ip` to avoid accidentally exposing ports.
`caller_ip` currently defaults to `0.0.0.0/0` with only a warning, which makes it easy to unintentionally expose inference/training endpoints publicly. Please either require an explicit `caller_ip` (no default) or gate `0.0.0.0/0` behind an explicit flag (e.g. `allow_public=True`) so that public exposure is a deliberate choice.
Suggested implementation:
```python
zone: str,
gpu_type: str,
model: str,
lora_rank: int,
inference_engine: str,
caller_ip: str,
max_model_len: int = 32768,
allow_public: bool = False,
tf_dir: Path | None = None,
state_dir: Path | None = None,
) -> None:
if caller_ip == "0.0.0.0/0":
if not allow_public:
raise ValueError(
"caller_ip 0.0.0.0/0 will expose ports 8000/8001 to the internet. "
"Pass allow_public=True to explicitly allow public access or set "
"caller_ip to your public IP/CIDR to restrict access."
)
logger.warning(
"caller_ip is 0.0.0.0/0 — ports 8000/8001 will be open to the internet. "
"Set caller_ip to your public IP to restrict access."
)
```
```python
self._inference_engine = inference_engine
self._caller_ip = caller_ip
self._max_model_len = max_model_len
self._allow_public = allow_public
self._tf_dir = tf_dir or _TF_DIR
```
1. All call sites constructing this class must now pass `caller_ip` explicitly (it no longer has a default).
2. Any caller that intentionally wants fully public exposure must also pass `allow_public=True` alongside `caller_ip="0.0.0.0/0"`.
3. If `self._allow_public` is used elsewhere in the class (e.g. when rendering Terraform variables), ensure it is wired through consistently; if not needed, you may omit storing `_allow_public` and just rely on the constructor validation.
</issue_to_address>
SourceryAI left a comment
Hey - I've found 2 issues, and left some high level feedback:
- In `ScalewayTrainHelper.sample`, the inline `from transformers import AutoTokenizer` import is unused and will raise an ImportError unless `transformers` is installed; since `_get_tokenizer` already lazily imports it, you can safely drop this redundant import or explicitly add `transformers` to the appropriate dependency group.
- The Terraform security group currently allows SSH from `0.0.0.0/0` and defaults `caller_ip` to `0.0.0.0/0`, effectively exposing ports 22/8000/8001 publicly; consider tightening these defaults (e.g. restricting to the caller IP by default or requiring explicit configuration) to avoid accidental exposure in production.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `ScalewayTrainHelper.sample`, the inline `from transformers import AutoTokenizer` import is unused and will raise an ImportError unless `transformers` is installed; since `_get_tokenizer` already lazily imports it, you can safely drop this redundant import or explicitly add `transformers` to the appropriate dependency group.
- The Terraform security group currently allows SSH from `0.0.0.0/0` and defaults `caller_ip` to `0.0.0.0/0`, effectively exposing ports 22/8000/8001 publicly; consider tightening these defaults (e.g. restricting to the caller IP by default or requiring explicit configuration) to avoid accidental exposure in production.
## Individual Comments
### Comment 1
<location path="retrain/scaleway_backend.py" line_range="179-185" />
<code_context>
+ # Internals
+ # ------------------------------------------------------------------
+
+ def _wait_healthy(self) -> None:
+ deadline = time.monotonic() + self._health_timeout_s
+ for url in (f"{self._inference_url}/health", f"{self._training_url}/health"):
+ logger.info("Waiting for %s …", url)
+ while True:
+ try:
+ r = self._client.get(url, timeout=5)
+ if r.status_code == 200:
+ break
+ except Exception:
+ pass
+ if time.monotonic() > deadline:
</code_context>
<issue_to_address>
**suggestion:** Swallowing all exceptions during health checks makes debugging connectivity issues harder.
In `_wait_healthy`, any exception from `self._client.get` is silently ignored. This preserves the retry logic but hides important diagnostics (e.g., DNS/TLS failures). Consider logging at least the exception type/message at debug level (or on first occurrence) so users can see why the service never became healthy while keeping the current control flow.
```suggestion
try:
r = self._client.get(url, timeout=5)
if r.status_code == 200:
break
except Exception:
logger.debug("Health check request to %s failed", url, exc_info=True)
if time.monotonic() > deadline:
```
</issue_to_address>
### Comment 2
<location path="retrain/scaleway/terraform/main.tf" line_range="20-28" />
<code_context>
+# Security Group: GPU instance
+# Inference (8000) and training (8001) ports are restricted to caller_ip.
+# -----------------------------------------------------------------------------
+resource "scaleway_instance_security_group" "gpu" {
+ name = "${var.project_name}-sg-gpu"
+ description = "GPU: inference/training ports restricted to caller, SSH via Scaleway IAM keys"
+ zone = var.zone
+ inbound_default_policy = "drop"
+ outbound_default_policy = "accept"
+
+ # SSH — open to all IPs; key-based auth enforced via Scaleway IAM SSH keys
+ inbound_rule {
+ action = "accept"
+ protocol = "TCP"
</code_context>
<issue_to_address>
**🚨 suggestion (security):** Opening SSH (22) to 0.0.0.0/0 may be unnecessarily broad for many users.
Given SSH is open to `0.0.0.0/0`, this increases exposure to internet-wide scans even with key-based auth. Consider parameterizing this (like `caller_ip`) so SSH can be restricted when possible, or introduce an `ssh_cidr` variable with a more restrictive default to improve the default security posture.
Suggested implementation:
```
# SSH — CIDR range configurable via `ssh_cidr` (default can be more restrictive than 0.0.0.0/0)
inbound_rule {
action = "accept"
protocol = "TCP"
port = 22
ip_range = var.ssh_cidr
}
```
To fully implement this change you should also:
1. Define a new `ssh_cidr` variable (for example in `retrain/scaleway/terraform/variables.tf` or alongside your other variables), e.g.:
```hcl
variable "ssh_cidr" {
description = "CIDR block allowed to SSH into the GPU instances (port 22)."
type = string
default = "0.0.0.0/0" # Consider overriding this per-environment with a more restrictive CIDR (e.g. your office/VPN IP).
}
```
2. Optionally set a more restrictive default for `ssh_cidr` in your environment-specific `*.tfvars` files (for example a specific office or VPN CIDR, or the same value as `caller_ip`).
</issue_to_address>

Hi @arthemide! 👋
Thanks for trying out Sourcery by commenting with @sourcery-ai review! 🚀
Install the sourcery-ai bot to get automatic code reviews on every pull request ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@sourcery-ai review
Hey - I've found 4 issues, and left some high level feedback:
- The training server relies heavily on module-level globals (`_helper`, `_inference_url`, `_inference_engine`, `_adapter_base`), which makes reuse and testing harder; consider moving this state into a startup hook or dependency-injected object (e.g., FastAPI dependency or lifespan handler) so you can construct multiple server instances with different settings without global mutation.
- In `ScalewayTrainHelper.__del__`, the broad `hasattr(self, "_runner")` + blanket exception swallowing can hide real teardown issues and may run in an unsafe interpreter-shutdown context; it would be safer to rely on the explicit `close`/context manager for normal lifecycle and keep `__del__` either minimal or logging-only, with `TerraformRunner.destroy()` already idempotent.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The training server relies heavily on module-level globals (`_helper`, `_inference_url`, `_inference_engine`, `_adapter_base`), which makes reuse and testing harder; consider moving this state into a startup hook or dependency-injected object (e.g., FastAPI dependency or lifespan handler) so you can construct multiple server instances with different settings without global mutation.
- In `ScalewayTrainHelper.__del__`, the broad `hasattr(self, "_runner")` + blanket exception swallowing can hide real teardown issues and may run in an unsafe interpreter-shutdown context; it would be safer to rely on the explicit `close`/context manager for normal lifecycle and keep `__del__` either minimal or logging-only, with `TerraformRunner.destroy()` already idempotent.
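A minimal sketch of that lifecycle split, under the stated assumption that `TerraformRunner.destroy()` is idempotent. The class name and structure below are illustrative, not the PR's actual implementation:

```python
import logging

logger = logging.getLogger(__name__)


class ScalewayTrainHelperSketch:
    """Explicit close()/context manager owns teardown; __del__ is logging-only."""

    def __init__(self) -> None:
        self._closed = False

    def close(self) -> None:
        if self._closed:
            return
        # self._runner.destroy() would run here; idempotent, so double-close is safe.
        self._closed = True

    def __enter__(self) -> "ScalewayTrainHelperSketch":
        return self

    def __exit__(self, *exc) -> None:
        self.close()

    def __del__(self) -> None:
        # Never run Terraform during interpreter shutdown — only warn about leaks.
        if not getattr(self, "_closed", True):
            logger.warning("helper was never closed; GPU instance may leak")


with ScalewayTrainHelperSketch() as helper:
    pass  # sample()/train_step() calls would go here
```

The context manager guarantees teardown on the normal path, while `__del__` degrades to a diagnostic rather than attempting subprocess work at an unsafe time.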
## Individual Comments
### Comment 1
<location path="retrain/scaleway/terraform_runner.py" line_range="29-35" />
<code_context>
+ pass
+
+
+def _detect_public_ip() -> str:
+ """Best-effort detection of the caller's public IP via ipify. Returns '0.0.0.0/0' on failure."""
+ try:
+ ip = httpx.get("https://api4.ipify.org", timeout=5).text.strip()
+ return f"{ip}/32"
+ except Exception:
+ return "0.0.0.0/0"
+
+
</code_context>
<issue_to_address>
**🚨 issue (security):** Defaulting to 0.0.0.0/0 on IP detection failure is risky; consider failing fast or requiring explicit caller_ip instead.
If IP detection fails, this silently expands access to `0.0.0.0/0`, effectively exposing ports 8000/8001. The warning log is easy to miss and materially changes the security posture. Please either fail and require an explicit `caller_ip`, or gate the `0.0.0.0/0` fallback behind an explicit opt-in (e.g., env flag) so users don’t unknowingly run with fully open ports when detection breaks (e.g., behind proxies/VPNs).
</issue_to_address>
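One possible shape for the fail-fast behavior. The `detect_caller_cidr` name, the injectable `fetch` parameter, and the `allow_open` opt-in are hypothetical illustrations (the opt-in could equally be wired to an env flag or CLI option), not the PR's API:

```python
import urllib.request


class IPDetectionError(RuntimeError):
    pass


def detect_caller_cidr(fetch=None, allow_open: bool = False) -> str:
    """Fail fast instead of silently widening to 0.0.0.0/0 when detection breaks.

    `allow_open` must be set explicitly before the wide-open fallback is used.
    """
    if fetch is None:
        def fetch() -> str:
            with urllib.request.urlopen("https://api4.ipify.org", timeout=5) as resp:
                return resp.read().decode().strip()
    try:
        return f"{fetch()}/32"
    except Exception as exc:
        if allow_open:
            # Explicit opt-in: the caller has accepted an open security group.
            return "0.0.0.0/0"
        raise IPDetectionError(
            "public IP detection failed; pass caller_ip explicitly "
            "or opt in to an open security group"
        ) from exc
```

With this shape, a broken proxy/VPN path surfaces as a hard error at apply time rather than as silently open ports.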
### Comment 2
<location path="retrain/scaleway/terraform_runner.py" line_range="149-160" />
<code_context>
+ args.extend(pair)
+ return args
+
+ def _run_tf(self, *args: str) -> None:
+ cmd = ["terraform", *args]
+ env = {**os.environ, "TF_DATA_DIR": str(self._state_dir)}
+ result = subprocess.run( # nosec B603 — list argv, shell=False, no user input reaches here
+ cmd,
+ cwd=str(self._tf_dir),
+ env=env,
+ capture_output=False,
+ text=True,
+ )
+ if result.returncode != 0:
+ raise TerraformError(f"terraform {args[0]} exited with code {result.returncode}")
+
+ def _read_outputs(self) -> dict[str, object]:
</code_context>
<issue_to_address>
**suggestion:** Terraform errors lose stderr/stdout context; consider capturing/logging output to aid debugging.
In `_run_tf`, using `capture_output=False` means Terraform’s stderr/stdout is only visible on the console, and `TerraformError` carries only the exit code. In non-interactive or aggregated-log environments, that makes failures hard to diagnose.
Consider enabling `capture_output=True` and, on non-zero exit codes, including trimmed stderr in `TerraformError` or logging it (e.g., via `logger.error`), so `init/apply/destroy` errors (provider/auth/network issues) are visible in logs.
```suggestion
def _run_tf(self, *args: str) -> None:
cmd = ["terraform", *args]
env = {**os.environ, "TF_DATA_DIR": str(self._state_dir)}
result = subprocess.run( # nosec B603 — list argv, shell=False, no user input reaches here
cmd,
cwd=str(self._tf_dir),
env=env,
capture_output=True,
text=True,
)
if result.returncode != 0:
stderr = (result.stderr or "").strip()
stdout = (result.stdout or "").strip()
details = []
if stderr:
details.append(f"stderr:\n{stderr}")
if stdout:
details.append(f"stdout:\n{stdout}")
details_str = "\n".join(details)
if details_str:
raise TerraformError(
f"terraform {args[0]} exited with code {result.returncode}\n{details_str}"
)
raise TerraformError(f"terraform {args[0]} exited with code {result.returncode}")
```
</issue_to_address>
### Comment 3
<location path="retrain/scaleway/training_server.py" line_range="90-99" />
<code_context>
+ return {}
+
+
+@app.post("/save_adapter")
+def save_adapter(req: SaveAdapterRequest) -> Response:
+ if _helper is None:
+ raise HTTPException(status_code=status.HTTP_503_SERVICE_UNAVAILABLE, detail="helper not initialized")
+ # Always write under the server-controlled base dir — client does not control on-disk location
+ saved_path = _helper.save_adapter(_adapter_base, req.name)
+ import io
+ import tarfile
+ buf = io.BytesIO()
+ adapter_dir = Path(saved_path)
+ with tarfile.open(fileobj=buf, mode="w:gz") as tar:
+ for f in adapter_dir.rglob("*"):
+ if f.is_file():
+ tar.add(f, arcname=f.relative_to(adapter_dir))
+ return Response(content=buf.getvalue(), media_type="application/octet-stream")
+
+
</code_context>
<issue_to_address>
**suggestion (performance):** save_adapter loads the entire adapter tarball into memory; consider streaming or temporary files for large adapters.
This approach buffers the entire tarball in memory via `BytesIO`, which can cause significant memory spikes for larger adapters. Instead, consider either streaming the tarball with `StreamingResponse` as you walk the files, or writing to a temporary file and streaming from disk to keep memory usage bounded as adapter sizes grow.
Suggested implementation:
```python
+@app.post("/save_adapter")
+def save_adapter(req: SaveAdapterRequest, background_tasks: BackgroundTasks) -> StreamingResponse:
+ if _helper is None:
+ raise HTTPException(status_code=status.HTTP_503_SERVICE_UNAVAILABLE, detail="helper not initialized")
+
+ # Always write under the server-controlled base dir — client does not control on-disk location
+ saved_path = _helper.save_adapter(_adapter_base, req.name)
+
+ adapter_dir = Path(saved_path)
+
+ # Build the tarball on disk to avoid buffering large adapters in memory.
+ with tempfile.NamedTemporaryFile(suffix=".tar.gz", delete=False) as tmp:
+ tmp_name = tmp.name
+ with tarfile.open(fileobj=tmp, mode="w:gz") as tar:
+ for f in adapter_dir.rglob("*"):
+ if f.is_file():
+ tar.add(f, arcname=f.relative_to(adapter_dir))
+
+ def iter_file(path: str, chunk_size: int = 1024 * 1024):
+ with open(path, "rb") as f:
+ while True:
+ chunk = f.read(chunk_size)
+ if not chunk:
+ break
+ yield chunk
+
+ # Ensure the temporary tarball is removed after the response is fully sent.
+ background_tasks.add_task(os.remove, tmp_name)
+
+ return StreamingResponse(
+ iter_file(tmp_name),
+ media_type="application/octet-stream",
+ )
+
+
```
```python
import argparse
import logging
import os
import tempfile
import tarfile
from fastapi import BackgroundTasks
from fastapi.responses import StreamingResponse
```
1. If `tarfile`, `tempfile`, `os`, `BackgroundTasks`, or `StreamingResponse` are already imported elsewhere in this file, remove the duplicate imports from the new block to avoid linter/CI warnings.
2. If your project uses a centralized response type alias or custom response helpers, you may want to align the `save_adapter` return annotation accordingly (e.g., a union including `StreamingResponse`).
3. If you want clients to receive a filename, consider adding a `Content-Disposition` header to the `StreamingResponse` (e.g., `headers={"Content-Disposition": f'attachment; filename="{req.name}.tar.gz"'}`).
</issue_to_address>
### Comment 4
<location path="docs/scaleway-backend-design.md" line_range="158" />
<code_context>
+
+1. `__init__` → `TerraformRunner.apply()` (blocks ~2–5 min, logs progress)
+2. Polls `/health` on both services
+3. Delegates `sample()` to the inference engine (OpenAI-compat, identical for vLLM and SGLang)
+4. Delegates `train_step()` / `checkpoint()` to the Training Server
+5. `checkpoint()` triggers LoRA reload on the inference engine via the appropriate endpoint:
</code_context>
<issue_to_address>
**suggestion (typo):** Use "OpenAI-compatible" instead of "OpenAI-compat" to match more standard wording.
Also update this occurrence to use "OpenAI-compatible" (e.g., "OpenAI-compatible, identical for vLLM and SGLang") to keep terminology consistent and avoid it looking like a typo.
Suggested implementation:
```
3. Delegates `sample()` to the inference engine (OpenAI-compatible, identical for vLLM and SGLang)
```
There may be other occurrences of "OpenAI-compat" elsewhere in this document or in related docs.
To keep terminology consistent and avoid this looking like a typo, search the repo (e.g., `rg "OpenAI-compat" docs/`) and update all instances to "OpenAI-compatible".
</issue_to_address>

Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
@sourcery-ai review
Hey - I've found 2 issues, and left some high level feedback:
- The Scaleway backend currently hardcodes `caller_ip` to the auto-detected public IP in `TerraformRunner` with no way to override it from `backend_options`; consider exposing a `caller_ip` backend option so users in locked-down or proxy-only environments can bypass ipify and/or explicitly control the security group source CIDR.
- In `training_server.py` there are two FastAPI app instances (`app` and the one returned by `create_app`), but only the latter has `state.server` initialized; consider either removing the unused module-level `app` or wiring it through `create_app` to avoid accidental use of an uninitialized app (e.g., via `uvicorn retrain.scaleway.training_server:app`).
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The Scaleway backend currently hardcodes `caller_ip` to auto-detected public IP in `TerraformRunner` with no way to override it from `backend_options`; consider exposing a `caller_ip` backend option so users in locked-down or proxy-only environments can bypass ipify and/or explicitly control the security group source CIDR.
- In `training_server.py` there are two FastAPI app instances (`app` and the one returned by `create_app`), but only the latter has `state.server` initialized; consider either removing the unused module-level `app` or wiring it through `create_app` to avoid accidental use of an uninitialized app (e.g., via `uvicorn retrain.scaleway.training_server:app`).
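A possible shape for the `caller_ip` backend option. `resolve_caller_cidr` and the injectable `detect` default (a documentation-range IP) are illustrative, not code from the PR:

```python
def resolve_caller_cidr(backend_options: dict,
                        detect=lambda: "203.0.113.7/32") -> str:
    """Prefer an explicit `caller_ip` backend option over ipify auto-detection."""
    explicit = backend_options.get("caller_ip")
    if explicit:
        # Accept either a bare address or a full CIDR.
        return explicit if "/" in explicit else f"{explicit}/32"
    return detect()  # fall back to auto-detection only when no override is given


print(resolve_caller_cidr({"caller_ip": "198.51.100.4"}))  # → 198.51.100.4/32
```

`TerraformRunner` would then receive the resolved CIDR instead of calling the detector unconditionally, so proxy-only environments can pin the security group source explicitly.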
## Individual Comments
### Comment 1
<location path="retrain/scaleway/training_server.py" line_range="52-53" />
<code_context>
+ return _app
+
+
+# Module-level app instance used when launched via CLI (main())
+app = FastAPI(title="retrain-training-server")
+
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Routes are registered on a different FastAPI instance than the one you run in `main()`.
The route decorators bind to the module-level `app`, while `main()` runs a separate `server = create_app(state)`. That leaves `server` without routes and the global `app` without `state.server`, causing 404s/503s. Please consolidate to a single FastAPI instance, e.g. either construct `app = create_app(default_state)` at module level and set `state.server` there, or set `app.state.server = state` in `main()` and pass `app` to `uvicorn.run` instead of `server`.
</issue_to_address>
### Comment 2
<location path="pyproject.toml" line_range="23" />
<code_context>
[project.optional-dependencies]
local = ["torch>=2.4.0"]
+scaleway = ["httpx>=0.28", "fastapi>=0.136", "uvicorn[standard]>=0.46", "scaleway>=2.11"]
tinker = ["tinker>=0.13.1"]
mlx = ["mlx-lm>=0.20.0; platform_system == 'Darwin'"]
</code_context>
<issue_to_address>
**suggestion (bug_risk):** The `fastapi>=0.136` pin is likely ahead of current releases and may be unresolvable for users.
This lower bound likely points to a non-existent FastAPI version and can cause `pip`/`uv` resolution failures when installing the `scaleway` extra. Please either pin to a currently released, tested range (e.g. `fastapi>=0.110,<1.0`) or drop the lower bound until you know the minimum compatible version. The same applies to `uvicorn[standard]>=0.46`, though FastAPI is more likely to fail resolution first.
```suggestion
scaleway = ["httpx>=0.28", "fastapi>=0.110,<1.0", "uvicorn[standard]>=0.22,<1.0", "scaleway>=2.11"]
```
</issue_to_address>

Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Summary by Sourcery
Add a new Scaleway backend that provisions GPU instances via Terraform, exposes training over HTTP, and integrates with retrain as a remote TrainHelper.
New Features:
Enhancements:
Build:
Documentation:
Tests: