Conversation

@guygir guygir commented Dec 15, 2025

The PVC Evictor is a multi-process Kubernetes deployment designed to automatically manage disk space on PVCs used for vLLM KV cache offloading. It monitors disk usage and automatically deletes cold cache files when storage thresholds are exceeded, ensuring continuous operation of vLLM workloads without manual intervention.

  • Follow-up optimization plans are tracked in Issue #218 (PVC Evictor optimizations).
  • Once complete benchmark results on llm-d are available, I'll post them as a comment here.
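To make the description above concrete, here is a minimal single-process sketch of the core idea. The names, paths, and thresholds are illustrative (they mirror the defaults discussed later in this PR: 85% cleanup threshold, 70% target, 60-minute hot-file window); the actual implementation uses the multi-process design described below.

```python
import os
import shutil
import time

MOUNT = "/kv-cache"            # assumed PVC mount path
CLEANUP_THRESHOLD = 85.0       # start deleting above this usage %
TARGET_THRESHOLD = 70.0        # stop deleting below this usage %
COLD_AGE_SECONDS = 60 * 60     # files not accessed for an hour are considered "cold"

def disk_usage_percent(path: str) -> float:
    total, used, _free = shutil.disk_usage(path)
    return used / total * 100.0

def evict_cold_files(root: str) -> None:
    """Delete the coldest files (oldest access time) until usage drops to the target."""
    files = []
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            full = os.path.join(dirpath, name)
            try:
                files.append((os.stat(full).st_atime, full))
            except OSError:
                continue  # file may have been removed concurrently
    now = time.time()
    for atime, full in sorted(files):            # coldest first
        if disk_usage_percent(root) <= TARGET_THRESHOLD:
            break
        if now - atime < COLD_AGE_SECONDS:
            continue                              # still "hot", skip
        try:
            os.remove(full)
        except OSError:
            pass

if disk_usage_percent(MOUNT) >= CLEANUP_THRESHOLD:
    evict_cold_files(MOUNT)
```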

@kfirtoledo kfirtoledo removed the request for review from elevran December 15, 2025 18:03

guygir commented Dec 16, 2025

Follow-up optimization plans are tracked in Issue #218.
Currently working on addressing Kfir's feedback.

guygir commented Dec 18, 2025

The commit above follows @kfirtoledo's feedback.
The changes maintain all existing functionality; they are mostly about maintainability and design practices.

  • Code Structure: Split pvc_evictor.py into modular components: config.py for configuration management, processes/ directory for crawler/activator/deleter processes, utils/ for system utilities (mostly logging), and evictor.py as the main process.

  • Deployment: Moved to Docker image deployment (currently stored at ghcr.io/guygir/pvc-evictor:latest). Simplified deploy.sh script. Changed all oc commands to kubectl.

  • Code Quality: Renamed variables, moved non-critical logs to DEBUG level, added inline design decision comments and renamed processes.

  • Logic: Simplified cache path structure and exception handling. Removed overkill fallbacks.

  • Documentation: Updated README.md and QUICK_START.md to reflect (hopefully) all changes.

@orozery orozery left a comment

Code looks very good overall!
General thoughts:

  1. Currently the crawler assumes a specific directory structure. Any reason why not remove this dependency?
  2. Each process logs its own periodic outputs. I think it's cleaner to aggregate all stats to the main evictor process and have it do the periodic outputs.

guygir commented Jan 4, 2026

Code looks very good overall! General thoughts:

  1. Currently the crawler assumes a specific directory structure. Any reason why not remove this dependency?
  2. Each process logs its own periodic outputs. I think it's cleaner to aggregate all stats to the main evictor process and have it do the periodic outputs.

Thanks!

  1. The hardcoded structure was mainly for efficiency (the offload connector uses a known structure) and safety (to avoid deleting unwanted files in the folder - this concern may be overkill...). It could be that flexibility is more important.
  2. I think it's a good idea. The original design was due to process isolation, and to make sure that important logs could be sent immediately without coordination overhead, but this does lead to messier logs. Aggregated logging would be cleaner (maybe apart from crucial warning/error logs - but even that separation may be unnecessary); see the sketch below.
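For illustration only, a rough sketch of what aggregated logging could look like: per-process stats funneled through a shared queue and emitted periodically by the main evictor process. Names and intervals here are assumptions, not the PR's actual code.

```python
import multiprocessing as mp
import queue
import time

def crawler(stats_q) -> None:
    # Instead of logging directly, each worker pushes counters to the main process.
    files_seen = 0
    while True:
        files_seen += 1                      # stand-in for real crawl work
        stats_q.put({"proc": "crawler", "files_seen": files_seen})
        time.sleep(1)

def main() -> None:
    stats_q = mp.Queue()
    mp.Process(target=crawler, args=(stats_q,), daemon=True).start()

    totals: dict[str, int] = {}
    last_emit = time.time()
    while True:
        try:
            stat = stats_q.get(timeout=0.5)
            totals[stat["proc"]] = stat["files_seen"]
        except queue.Empty:
            pass
        if time.time() - last_emit >= 30:    # one periodic, aggregated log line
            print(f"[evictor] stats: {totals}")
            last_emit = time.time()

if __name__ == "__main__":
    main()
```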

Collaborator

I'm a bit concerned about the maintainability of deploy.sh long-term. While bash is okay for simple cases, at 100+ lines it becomes unreadable and difficult to debug.

Can we decouple this into a helm chart? It would allow us to define defaults cleanly in values.yaml and let users inject overrides via --set or -f values.yaml without needing custom parsing logic.

Collaborator Author

Thanks for the suggestion, I agree - this makes more sense.
The first implementation was more of a PoC; Helm should fit better within the llm-d framework and will improve consistency and maintainability.

Copilot AI left a comment

Pull request overview

This PR introduces the PVC Evictor, a multi-process Kubernetes deployment that automatically manages disk space on PVCs used for vLLM KV cache offloading. The system monitors disk usage and deletes cold cache files when storage thresholds are exceeded, implementing a hot/cold cache eviction strategy based on file access time.

Changes:

  • Implements an N+2 process architecture with configurable crawler processes (1, 2, 4, 8, or 16), an activator process for monitoring, and a deleter process for batch file deletion
  • Uses multiprocessing with Queue and Event primitives for inter-process communication and coordination (a minimal sketch follows this list)
  • Provides automated deployment via deploy.sh script with auto-detection of namespace-specific security contexts
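A condensed sketch of the N+2 layout described above. Function bodies are elided; the real process targets live in the processes/ package, and the names and argument lists here are assumptions.

```python
import multiprocessing as mp

def crawler(worker_id, file_queue, stop_event): ...
def activator(deletion_event, stop_event): ...
def deleter(file_queue, deletion_event, stop_event): ...

def spawn(num_crawlers: int = 8) -> list[mp.Process]:
    file_queue = mp.Queue(maxsize=10_000)   # crawlers -> deleter
    deletion_event = mp.Event()             # activator toggles deletion ON/OFF
    stop_event = mp.Event()                 # graceful shutdown signal

    procs = [
        mp.Process(target=crawler, args=(i, file_queue, stop_event))
        for i in range(num_crawlers)        # N crawler processes
    ]
    procs.append(mp.Process(target=activator, args=(deletion_event, stop_event)))
    procs.append(mp.Process(target=deleter, args=(file_queue, deletion_event, stop_event)))
    for p in procs:
        p.start()
    return procs
```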

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 16 comments.

Summary per file:

| File | Description |
|------|-------------|
| config.py | Configuration management using environment variables with dataclass-based Config |
| evictor.py | Main coordinator spawning N+2 processes with graceful shutdown handling |
| utils/system.py | Disk usage monitoring via statvfs() and logging setup utilities |
| utils/__init__.py | Empty initialization module for utils package |
| processes/crawler.py | Crawler processes that discover and queue cache files with hex-based partitioning |
| processes/activator.py | Activator process monitoring disk usage and controlling deletion triggers |
| processes/deleter.py | Deleter process performing batch file deletions using xargs |
| processes/__init__.py | Empty initialization module for processes package |
| deployment_evictor.yaml | Kubernetes deployment manifest with configurable environment variables |
| deploy.sh | Bash deployment script with auto-detection of security context values |
| Dockerfile | Container image definition using Python 3.12-slim base |
| README.md | Comprehensive documentation covering architecture, configuration, and deployment |
| QUICK_START.md | Quick start guide with deployment examples |

"file_access_time_threshold_minutes": self.config.file_access_time_threshold_minutes,
}

self.logger.info("PVC Cleanup Service v4 (10-Process Architecture) initialized")

Copilot AI Jan 12, 2026

The comment on line 74 states "10-Process Architecture" but this is misleading since the number of processes is actually N+2 where N is configurable (1, 2, 4, 8, or 16). With the default of 8 crawlers, this would be 10 processes, but the architecture is not fixed at 10 processes. This should be updated to reflect the dynamic nature of the process count.

Suggested change:
- self.logger.info("PVC Cleanup Service v4 (10-Process Architecture) initialized")
+ self.logger.info(
+     f"PVC Cleanup Service v4 (N+2-Process Architecture: "
+     f"{config.num_crawler_processes + 2} total processes) initialized"
+ )

should_process_partial = (
    current_batch and (
        (time_since_last_check >= partial_batch_timeout and len(current_batch) > 0) or

Copilot AI Jan 12, 2026

The redundant condition "len(current_batch) > 0" in line 202 is unnecessary since line 201 already checks "current_batch" which is falsy when empty. The additional length check adds no value and reduces code clarity.

Suggested change:
- (time_since_last_check >= partial_batch_timeout and len(current_batch) > 0) or
+ (time_since_last_check >= partial_batch_timeout) or

Comment on lines 59 to 60
except:
    pass

Copilot AI Jan 12, 2026

Bare except clause on line 59 silently catches all exceptions without logging. If there's an issue with the deletion event check or queue status, it will be silently ignored. Consider either being more specific about which exceptions to catch or at least logging that an exception occurred.

Suggested change:
- except:
-     pass
+ except Exception as e:
+     logger.warning(
+         f"Queue status check failed: {e}"
+     )

Comment on lines 264 to 266
if int(time.time()) % 30 == 0: # Every 30 seconds
    logger.debug("Deletion OFF - waiting for trigger")
time.sleep(0.5)

Copilot AI Jan 12, 2026

The timing-based periodic check on line 264 using int(time.time()) % 30 == 0 may log multiple times or miss the logging window entirely. If the loop iterates rapidly, it could log multiple times within the same second. If it takes longer than 1 second between iterations, it might skip the exact second where the modulo equals 0. Consider using a last_log_time variable to track when the message was last logged, similar to the heartbeat_interval pattern used elsewhere in the code.
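For reference, the time-since-last-log pattern the comment suggests looks roughly like this (variable names are illustrative):

```python
import logging
import time

logger = logging.getLogger("deleter")
LOG_INTERVAL_SECONDS = 30.0
last_log_time = 0.0

while True:  # stand-in for the deleter's wait loop
    now = time.time()
    if now - last_log_time >= LOG_INTERVAL_SECONDS:
        # Emitted at most once per interval, regardless of how fast or slow the loop runs.
        logger.debug("Deletion OFF - waiting for trigger")
        last_log_time = now
    time.sleep(0.5)
```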

Comment on lines 53 to 60
if int(current_time) % 10 == 0: # Every 10 seconds
    try:
        # Try to get queue size (might fail if queue is in different process)
        logger.debug(
            f"Queue status check (deletion={'ON' if deletion_event.is_set() else 'OFF'})"
        )
    except:
        pass

Copilot AI Jan 12, 2026

The timing-based periodic check on line 53 using int(current_time) % 10 == 0 has the same issue as found in the deleter process. This could log multiple times per second if the loop is fast, or skip the logging window entirely if iterations take longer than 1 second. Consider using a time-since-last-log pattern instead.

Comment on lines 194 to 203
queue_empty = False
try:
    queue_empty = file_queue.empty()
except Exception:
    pass

should_process_partial = (
    current_batch and (
        (time_since_last_check >= partial_batch_timeout and len(current_batch) > 0) or
        queue_empty

Copilot AI Jan 12, 2026

The logic to check if the queue is empty uses a try/except block and sets queue_empty to False by default. However, if an exception occurs, queue_empty remains False, which means the code will assume the queue is NOT empty when in fact the state is unknown. This could lead to partial batches not being processed when they should be. Consider either setting queue_empty to True on exception (conservative approach to process batches), or handling the exception case differently.
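A minimal sketch of the conservative approach suggested here, treating an unknown queue state as empty so pending partial batches still get flushed (the helper name is hypothetical):

```python
def is_queue_empty(file_queue) -> bool:
    """Conservative check: treat an unreadable queue state as empty."""
    try:
        return file_queue.empty()
    except Exception:
        # State is unknown; assuming "empty" lets pending partial batches flush.
        return True
```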

Comment on lines 236 to 247
def log_timing(event_type: str, duration_ms: float, **kwargs):
    """Log timing event."""
    try:
        unix_timestamp = time.time()
        extra_fields = ",".join(f"{k}={v}" for k, v in kwargs.items())
        log_line = f"TIMING_{event_type}:{unix_timestamp:.3f},{duration_ms:.3f}"
        if extra_fields:
            log_line += f",{extra_fields}"
        logger.debug(log_line)
    except Exception:
        pass


Copilot AI Jan 12, 2026

Variable log_timing is not used.

Suggested change (removes the unused function):
- def log_timing(event_type: str, duration_ms: float, **kwargs):
-     """Log timing event."""
-     try:
-         unix_timestamp = time.time()
-         extra_fields = ",".join(f"{k}={v}" for k, v in kwargs.items())
-         log_line = f"TIMING_{event_type}:{unix_timestamp:.3f},{duration_ms:.3f}"
-         if extra_fields:
-             log_line += f",{extra_fields}"
-         logger.debug(log_line)
-     except Exception:
-         pass

        if extra_fields:
            log_line += f",{extra_fields}"
        logger.debug(log_line)
    except Exception:

Copilot AI Jan 12, 2026

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change:
- except Exception:
+ except Exception:
+     # Best-effort timing log: ignore any logging errors to avoid impacting crawler behavior.

Comment on lines 124 to 125
except Exception:
    pass

Copilot AI Jan 12, 2026

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change:
- except Exception:
-     pass
+ except Exception as e:
+     # Never let timing/logging failures affect core deletion logic,
+     # but emit a debug message so issues can be diagnosed if needed.
+     logger.debug(f"Failed to log timing event '{event_type}': {e}", exc_info=True)

Comment on lines 107 to 108
except Exception:
    pass

Copilot AI Jan 12, 2026

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change:
- except Exception:
-     pass
+ except OSError as exc:
+     # Continue retrying, but log the error to aid diagnostics.
+     self.logger.warning(
+         "Error while checking PVC mount path '%s': %s",
+         self.config.pvc_mount_path,
+         exc,
+     )

@@ -0,0 +1,98 @@
# Quick Start Guide - PVC Evictor

## Prerequisites
Collaborator

Start with a description of the quick start

1. OpenShift/Kubernetes cluster access
2. `kubectl` CLI installed
3. PVC exists and is bound
4. Appropriate RBAC permissions
Collaborator

What do you mean by appropriate? And why do you need it?

@@ -0,0 +1,228 @@
#!/bin/bash
Collaborator

Can we convert it to a Python script with arguments, defaults, and required items?

**Arguments:**
- `pvc-name`: Name of the PVC to manage - **Required** (first positional argument)
- `--namespace=<namespace>`: Kubernetes namespace - **Optional** (auto-detected from `kubectl config` context if not provided)
- `--fsgroup=<fsgroup>`: Filesystem group ID - **Optional but Recommended** (auto-detected from existing pods/deployments if not provided)
Collaborator

Why do we need `fsgroup`, `selinux-level`, `runasuser`? Can you also explain how you can auto-detect them when there can be multiple deployments?


## Quick Deployment

### Option 1: Using deploy.sh (Recommended)
Collaborator

where is option 2?
Add an option using yaml with the image configuration

value: "/tmp/evictor_all_logs.txt"

volumeMounts:
- name: kv-cache-storage
Collaborator

This should also be defined by the user; maybe we can use Helm


resources:
  requests:
    cpu: 1000m # Base CPU for processes
Collaborator

replace with 1

cpu: 4000m # Allow more CPU for parallel processing (adjust based on NUM_CRAWLER_PROCESSES)
memory: 2Gi # Allow more memory for parallel processing

# Health checks
Collaborator

Try to remove what is not necessary

cache_directory: str # Subdirectory within PVC containing cache files (default: kv/model-cache/models)
dry_run: bool # If true, simulate deletion without actually deleting files (default: false)
log_level: str # Logging verbosity: DEBUG, INFO, WARNING, ERROR (default: INFO)
timing_file_path: str # Path for timing analysis file (default: /tmp/timing_analysis.txt, reserved for future use)
Collaborator

Why do we need it?

file_access_time_threshold_minutes: (
    float # Skip files accessed within this time (default: 60.0 minutes)
)
log_file_path: Optional[str] = None # Optional file path to write logs to
Collaborator

Not sure if we need it

guygir commented Jan 22, 2026

@kfirtoledo All comments addressed and fixes applied.

@sagearc sagearc left a comment

Great job getting this all working!

A quick note on process: This PR is massive (~3k lines). In the future, it would be great to break this down into smaller, stackable PRs so we can review the logic more carefully.

Comment on lines 1 to 62
"""Configuration management for PVC Evictor."""

import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class Config:
"""Configuration loaded from environment variables.
Environment variables are used instead of CLI arguments.
"""

pvc_mount_path: str # Mount path of PVC in pod (default: /kv-cache)
cleanup_threshold: float # Disk usage % to trigger deletion (default: 85.0)
target_threshold: float # Disk usage % to stop deletion (default: 70.0)
cache_directory: str # Subdirectory within PVC containing cache files (default: kv/model-cache/models)
dry_run: bool # If true, simulate deletion without actually deleting files (default: false)
log_level: str # Logging verbosity: DEBUG, INFO, WARNING, ERROR (default: INFO)
# timing_file_path: Currently not actively used but kept for backward compatibility and future extensibility
timing_file_path: str # Path for timing analysis file (default: /tmp/timing_analysis.txt)
num_crawler_processes: int # P1-PN (default: 8, valid: 1, 2, 4, 8, 16)
logger_interval: float # P9 monitoring interval (default: 0.5s)
file_queue_maxsize: int # Max items in file queue (default: 10000)
file_queue_min_size: (
int # Min queue size to maintain when deletion OFF (default: 1000)
)
deletion_batch_size: int # Files per deletion batch (default: 100)
file_access_time_threshold_minutes: (
float # Skip files accessed within this time (default: 60.0 minutes)
)
aggregated_logging: bool # Enable aggregated logging in main process (default: true)
aggregated_logging_interval: float # Interval for aggregated log output in seconds (default: 30.0)
cache_structure_mode: str # Cache structure mode: "file_mapper" (default) or "vllm" (legacy)
# log_file_path: Optional file logging for persistent log storage and debugging
log_file_path: Optional[str] = None # Optional file path to write logs to (default: None, stdout only)

@classmethod
def from_env(cls) -> "Config":
"""Load configuration from environment variables."""
return cls(
pvc_mount_path=os.getenv("PVC_MOUNT_PATH", "/kv-cache"),
cleanup_threshold=float(os.getenv("CLEANUP_THRESHOLD", "85.0")),
target_threshold=float(os.getenv("TARGET_THRESHOLD", "70.0")),
cache_directory=os.getenv("CACHE_DIRECTORY", "kv/model-cache/models"),
dry_run=os.getenv("DRY_RUN", "false").lower() == "true",
log_level=os.getenv("LOG_LEVEL", "INFO"),
timing_file_path=os.getenv("TIMING_FILE_PATH", "/tmp/timing_analysis.txt"),
num_crawler_processes=int(os.getenv("NUM_CRAWLER_PROCESSES", "8")),
logger_interval=float(os.getenv("LOGGER_INTERVAL_SECONDS", "0.5")),
file_queue_maxsize=int(os.getenv("FILE_QUEUE_MAXSIZE", "10000")),
file_queue_min_size=int(os.getenv("FILE_QUEUE_MIN_SIZE", "1000")),
deletion_batch_size=int(os.getenv("DELETION_BATCH_SIZE", "100")),
log_file_path=os.getenv("LOG_FILE_PATH", None),
file_access_time_threshold_minutes=float(
os.getenv("FILE_ACCESS_TIME_THRESHOLD_MINUTES", "60.0")
),
aggregated_logging=os.getenv("AGGREGATED_LOGGING", "true").lower() == "true",
aggregated_logging_interval=float(os.getenv("AGGREGATED_LOGGING_INTERVAL", "30.0")),
cache_structure_mode=os.getenv("CACHE_STRUCTURE_MODE", "file_mapper"),
)

Copy link
Collaborator

@sagearc sagearc Jan 22, 2026

Where is config.Config being used?

Collaborator Author

Config is imported in evictor.py and used to initialize PVCEvictor using these settings, keeping the configuration logic separate from the main code.
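Roughly, the wiring described here looks like the sketch below. Only `Config.from_env()` comes from the code above; the `PVCEvictor` stub and its `run()` method name are assumptions for illustration.

```python
from config import Config  # the dataclass shown above

class PVCEvictor:
    """Stub standing in for the real class defined in evictor.py."""
    def __init__(self, config: Config) -> None:
        self.config = config

    def run(self) -> None:  # hypothetical entry-point name
        ...

if __name__ == "__main__":
    evictor = PVCEvictor(Config.from_env())  # settings come only from environment variables
    evictor.run()
```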

For complete Helm documentation, see [helm/README.md](helm/README.md).
### Option 2: Using deploy.sh (Legacy - Automated Script)
Collaborator

I noticed this script is being introduced in this PR but is already labeled as 'Legacy.' We generally shouldn't merge code that is deprecated upon arrival. Since this file wasn't in the codebase previously, let's remove it entirely and stick to the Helm chart as the single source of truth.

Collaborator Author

Will do - awaiting final confirmation from Kfir regarding the file_mapper.py structure, and then the legacy code will be removed completely.

@kfirtoledo kfirtoledo left a comment

Looks good. Some minor fixes:


| Parameter | Description | Default |
|-----------|-------------|---------|
| `image.repository` | Container image repository | `ghcr.io/guygir/pvc-evictor` |
Collaborator

Can we open in quay for now, a more general place?

Collaborator Author

Agreed that the image should not remain in my personal ghcr.io account. Will update all image references once the hosting location is decided.


*Figure: PVC Evictor multi-process architecture showing N crawler processes, activator, deleter, and IPC mechanisms (Queue and Events)*

### N+2 Process Design
Collaborator

Maybe combine this with the architecture and process details - I don't want the README to be too long.

Collaborator Author

Made it more concise.


### Hot Cache Configuration

#### `FILE_ACCESS_TIME_THRESHOLD_MINUTES`
Collaborator

Imagine an extreme case where the cache fills the disk quickly, even though the files are within FILE_ACCESS_TIME_THRESHOLD_MINUTES. Should the evictor force delete files if the size exceeds a "hard threshold" such as 95%?

What's the behavior today if this happens? Failing to write new cache files?

@guygir guygir Feb 8, 2026

Yes - failing to write new cache files is the behavior today. I've updated the documentation to address this.

For the initial release, the soft threshold approach is designed for large storage deployments where this edge case is unlikely.

Issue #218 tracks optimizations for the PVC Evictor that will handle this edge case by moving from a binary cache model to a continuum, so this issue won't occur.
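Purely for illustration, a hypothetical hard-threshold override layered on top of the current soft-threshold logic (not part of this PR; issue #218 tracks the actual follow-up):

```python
HARD_THRESHOLD = 95.0  # hypothetical emergency limit, not in the current code

def file_is_deletable(file_age_minutes: float, disk_usage_percent: float,
                      access_threshold_minutes: float = 60.0) -> bool:
    """Return True if a cache file may be deleted under this hypothetical policy."""
    if disk_usage_percent >= HARD_THRESHOLD:
        return True  # emergency: evict even files still considered "hot"
    return file_age_minutes >= access_threshold_minutes
```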


## Prerequisites

1. OpenShift/Kubernetes cluster access
Collaborator

Is there anything specific for Openshift, or does this generally work for any k8s cluster?

Collaborator Author

This should work on any k8s cluster. OpenShift is mentioned because it’s what we use in our test environment. The examples use kubectl to show it works everywhere.

3. Helm 3.0+ installed (for Helm deployment)
4. PVC exists and is bound
5. Appropriate RBAC permissions to create deployments
6. **Docker image available** - The evictor uses `ghcr.io/guygir/pvc-evictor:latest`
Collaborator

should we host this image in llm-d?

Collaborator Author

Agreed that the image should not remain in my personal ghcr.io account. Will update all image references once the hosting location is decided.

log_level: str # Logging verbosity: DEBUG, INFO, WARNING, ERROR (default: INFO)
# timing_file_path: Currently not actively used but kept for backward compatibility and future extensibility
timing_file_path: str # Path for timing analysis file (default: /tmp/timing_analysis.txt)
num_crawler_processes: int # P1-PN (default: 8, valid: 1, 2, 4, 8, 16)
Collaborator

Why does this need to be a power of 2? Can this be, say, 6?

Collaborator Author

A power-of-2 worker count is recommended for best performance, not required.
The crawler partitions work into 16 hex buckets. With power-of-2 worker counts, the buckets are (±) evenly distributed; other counts still work but may cause uneven load.

Updated the documentation to clarify this.
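To make the bucket math concrete, an illustrative sketch (not the exact crawler code, and round-robin assignment is an assumption): with 16 hex prefixes, power-of-2 worker counts divide the buckets evenly, while other counts leave some workers with one extra bucket.

```python
def buckets_for_worker(worker_id: int, num_workers: int) -> list[str]:
    """Assign the 16 hex prefixes (0-f) round-robin across crawler workers."""
    hex_buckets = "0123456789abcdef"
    return [h for i, h in enumerate(hex_buckets) if i % num_workers == worker_id]

# 8 workers -> 2 buckets each; 6 workers -> four workers get 3 buckets, two get 2.
print(buckets_for_worker(0, 8))   # ['0', '8']
print(buckets_for_worker(0, 6))   # ['0', '6', 'c']
```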

For complete Helm documentation, see [helm/README.md](helm/README.md).
### Option 2: Using deploy.sh (Legacy - Automated Script)
Collaborator

This is a new tool right? What's the "legacy" here?

I suggest removing this option given the yaml and helm options. Unless there is a specific reason

Collaborator Author

Now that llmd_fs_backend is integrated, there is only one standard approach, so legacy code is no longer needed and has been fully removed.

- Default: `8`
- Valid values: 1, 2, 4, 8, 16
- More crawlers = faster file discovery on large directories
- Recommendation: Use 8-16 for multi-TB volumes, 1-4 for smaller volumes
Collaborator

do you have some benchmark data for these recommended values?

Collaborator Author

These values are based on empirical testing during development, not formal benchmarks, and are meant as starting points.
Optimal settings depend on storage performance, file characteristics, etc.
Updated the documentation to mark them as recommended starting values that should be tuned per workload.
Proper benchmarking is tracked in issue #218 .

# kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.securityContext}'
# kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].securityContext}'
securityContext:
fsGroup: <your_fsgroup> # REQUIRED: Set to your namespace's fsGroup. Will try to auto-detect if not provided.
Collaborator

Should this match the security context of the vllm pods?

Collaborator Author

Yes, the security context must match the vllm pods. The PVC is written by vllm with a specific UID/GID, and the evictor needs the same UID/GID to read and delete files. Mismatches will cause permission denied errors.

…bugs, refactor stats sending and update documentation.
@vMaroon vMaroon requested review from liu-cong and sagearc February 8, 2026 19:22

guygir commented Feb 8, 2026

All feedback has been addressed. The only remaining item is deciding the official image hosting location.
