PVC Evictor #215
Conversation
Follow-up optimization plans can be seen in issue #218.

Commit: …rove code quality

The above commit follows @kfirtoledo's feedback.
orozery left a comment:
Code looks very good overall!
General thoughts:
- Currently the crawler assumes a specific directory structure. Any reason not to remove this dependency?
- Each process logs its own periodic outputs. I think it's cleaner to aggregate all stats to the main evictor process and have it do the periodic outputs.
Thanks!
I'm a bit concerned about the maintainability of deploy.sh long-term. While bash is okay for simple cases, at 100+ lines it becomes unreadable and difficult to debug.
Can we decouple this into a helm chart? It would allow us to define defaults cleanly in values.yaml and let users inject overrides via --set or -f values.yaml without needing custom parsing logic.
Thanks for the suggestion, I agree - this makes more sense.
The first implementation was more like a PoC, but Helm should fit better within the llm-d framework and will improve consistency and maintainability.
Pull request overview
This PR introduces the PVC Evictor, a multi-process Kubernetes deployment that automatically manages disk space on PVCs used for vLLM KV cache offloading. The system monitors disk usage and deletes cold cache files when storage thresholds are exceeded, implementing a hot/cold cache eviction strategy based on file access time.
Changes:
- Implements an N+2 process architecture with configurable crawler processes (1, 2, 4, 8, or 16), an activator process for monitoring, and a deleter process for batch file deletion
- Uses multiprocessing with Queue and Event primitives for inter-process communication and coordination
- Provides automated deployment via deploy.sh script with auto-detection of namespace-specific security contexts
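For orientation, here is a minimal sketch of the Queue/Event coordination pattern described above; the function bodies and names are hypothetical, not the PR's actual code:

```python
import multiprocessing as mp

def crawler(file_queue, stop):
    """Discover cache files and enqueue deletion candidates."""

def activator(deletion_event, stop):
    """Watch disk usage; set/clear deletion_event around the thresholds."""

def deleter(file_queue, deletion_event, stop):
    """When deletion_event is set, drain the queue and delete files in batches."""

if __name__ == "__main__":
    num_crawlers = 8                      # N; configurable (1, 2, 4, 8, or 16)
    file_queue = mp.Queue(maxsize=10000)  # inter-process work queue
    deletion_event, stop = mp.Event(), mp.Event()

    procs = [mp.Process(target=crawler, args=(file_queue, stop))
             for _ in range(num_crawlers)]
    procs.append(mp.Process(target=activator, args=(deletion_event, stop)))
    procs.append(mp.Process(target=deleter, args=(file_queue, deletion_event, stop)))

    for p in procs:
        p.start()  # N+2 processes in total
    for p in procs:
        p.join()
```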
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 16 comments.
Summary per file:
| File | Description |
|---|---|
| config.py | Configuration management using environment variables with dataclass-based Config |
| evictor.py | Main coordinator spawning N+2 processes with graceful shutdown handling |
| utils/system.py | Disk usage monitoring via statvfs() and logging setup utilities |
| `utils/__init__.py` | Empty initialization module for utils package |
| processes/crawler.py | Crawler processes that discover and queue cache files with hex-based partitioning |
| processes/activator.py | Activator process monitoring disk usage and controlling deletion triggers |
| processes/deleter.py | Deleter process performing batch file deletions using xargs |
| `processes/__init__.py` | Empty initialization module for processes package |
| deployment_evictor.yaml | Kubernetes deployment manifest with configurable environment variables |
| deploy.sh | Bash deployment script with auto-detection of security context values |
| Dockerfile | Container image definition using Python 3.12-slim base |
| README.md | Comprehensive documentation covering architecture, configuration, and deployment |
| QUICK_START.md | Quick start guide with deployment examples |
kv_connectors/pvc_evictor/evictor.py (outdated)

```python
    "file_access_time_threshold_minutes": self.config.file_access_time_threshold_minutes,
}

self.logger.info("PVC Cleanup Service v4 (10-Process Architecture) initialized")
```
Copilot AI · Jan 12, 2026
The comment on line 74 states "10-Process Architecture" but this is misleading since the number of processes is actually N+2 where N is configurable (1, 2, 4, 8, or 16). With the default of 8 crawlers, this would be 10 processes, but the architecture is not fixed at 10 processes. This should be updated to reflect the dynamic nature of the process count.
Suggested change:

```python
self.logger.info(
    f"PVC Cleanup Service v4 (N+2-Process Architecture: "
    f"{self.config.num_crawler_processes + 2} total processes) initialized"
)
```
```python
should_process_partial = (
    current_batch and (
        (time_since_last_check >= partial_batch_timeout and len(current_batch) > 0) or
```
Copilot AI · Jan 12, 2026
The redundant condition "len(current_batch) > 0" in line 202 is unnecessary since line 201 already checks "current_batch" which is falsy when empty. The additional length check adds no value and reduces code clarity.
Suggested change:

```python
(time_since_last_check >= partial_batch_timeout) or
```
```python
except:
    pass
```
Copilot AI · Jan 12, 2026
Bare except clause on line 59 silently catches all exceptions without logging. If there's an issue with the deletion event check or queue status, it will be silently ignored. Consider either being more specific about which exceptions to catch or at least logging that an exception occurred.
Suggested change:

```python
except Exception as e:
    logger.warning(
        f"Queue status check failed: {e}"
    )
```
```python
if int(time.time()) % 30 == 0:  # Every 30 seconds
    logger.debug("Deletion OFF - waiting for trigger")
time.sleep(0.5)
```
Copilot AI · Jan 12, 2026
The timing-based periodic check on line 264 using int(time.time()) % 30 == 0 may log multiple times or miss the logging window entirely. If the loop iterates rapidly, it could log multiple times within the same second. If it takes longer than 1 second between iterations, it might skip the exact second where the modulo equals 0. Consider using a last_log_time variable to track when the message was last logged, similar to the heartbeat_interval pattern used elsewhere in the code.
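A minimal sketch of the suggested time-since-last-log pattern (variable and function names are illustrative, not from the PR):

```python
import time

LOG_INTERVAL = 30.0   # seconds between periodic messages
last_log_time = 0.0   # timestamp of the last emitted message

def maybe_log_waiting():
    """Log at most once per LOG_INTERVAL, regardless of how fast the loop spins."""
    global last_log_time
    now = time.time()
    if now - last_log_time >= LOG_INTERVAL:
        print("Deletion OFF - waiting for trigger")  # logger.debug(...) in the real code
        last_log_time = now

# Called from the wait loop in place of the int(time.time()) % 30 check:
#     while not deletion_event.is_set():
#         maybe_log_waiting()
#         time.sleep(0.5)
```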
```python
if int(current_time) % 10 == 0:  # Every 10 seconds
    try:
        # Try to get queue size (might fail if queue is in different process)
        logger.debug(
            f"Queue status check (deletion={'ON' if deletion_event.is_set() else 'OFF'})"
        )
    except:
        pass
```
Copilot AI · Jan 12, 2026
The timing-based periodic check on line 53 using int(current_time) % 10 == 0 has the same issue as found in the deleter process. This could log multiple times per second if the loop is fast, or skip the logging window entirely if iterations take longer than 1 second. Consider using a time-since-last-log pattern instead.
```python
queue_empty = False
try:
    queue_empty = file_queue.empty()
except Exception:
    pass

should_process_partial = (
    current_batch and (
        (time_since_last_check >= partial_batch_timeout and len(current_batch) > 0) or
        queue_empty
```
Copilot AI · Jan 12, 2026
The logic to check if the queue is empty uses a try/except block and sets queue_empty to False by default. However, if an exception occurs, queue_empty remains False, which means the code will assume the queue is NOT empty when in fact the state is unknown. This could lead to partial batches not being processed when they should be. Consider either setting queue_empty to True on exception (conservative approach to process batches), or handling the exception case differently.
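A sketch of the conservative variant this comment suggests - treating unknown queue state as empty so pending partial batches still flush:

```python
try:
    queue_empty = file_queue.empty()
except Exception:
    # State is unknown: assume empty so partial batches are processed
    # rather than silently starved.
    queue_empty = True
```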
```python
def log_timing(event_type: str, duration_ms: float, **kwargs):
    """Log timing event."""
    try:
        unix_timestamp = time.time()
        extra_fields = ",".join(f"{k}={v}" for k, v in kwargs.items())
        log_line = f"TIMING_{event_type}:{unix_timestamp:.3f},{duration_ms:.3f}"
        if extra_fields:
            log_line += f",{extra_fields}"
        logger.debug(log_line)
    except Exception:
        pass
```
Copilot AI · Jan 12, 2026
The function `log_timing` is defined but never used.
Suggested change: delete the unused `log_timing` function shown above.
```python
        if extra_fields:
            log_line += f",{extra_fields}"
        logger.debug(log_line)
    except Exception:
```
Copilot AI · Jan 12, 2026
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

```python
except Exception:
    # Best-effort timing log: ignore any logging errors to avoid impacting crawler behavior.
```
```python
except Exception:
    pass
```
Copilot AI · Jan 12, 2026
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

```python
except Exception as e:
    # Never let timing/logging failures affect core deletion logic,
    # but emit a debug message so issues can be diagnosed if needed.
    logger.debug(f"Failed to log timing event '{event_type}': {e}", exc_info=True)
```
kv_connectors/pvc_evictor/evictor.py (outdated)

```python
except Exception:
    pass
```
Copilot AI · Jan 12, 2026
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

```python
except OSError as exc:
    # Continue retrying, but log the error to aid diagnostics.
    self.logger.warning(
        "Error while checking PVC mount path '%s': %s",
        self.config.pvc_mount_path,
        exc,
    )
```
```
@@ -0,0 +1,98 @@
# Quick Start Guide - PVC Evictor

## Prerequisites
```
Start with a description of the quick start
```markdown
1. OpenShift/Kubernetes cluster access
2. `kubectl` CLI installed
3. PVC exists and is bound
4. Appropriate RBAC permissions
```
What do you mean by appropriate? And why do you need it?
```
@@ -0,0 +1,228 @@
#!/bin/bash
```
Can we convert this to a Python script with arguments, plus default and required items?
**Arguments:**
- `pvc-name`: Name of the PVC to manage - **Required** (first positional argument)
- `--namespace=<namespace>`: Kubernetes namespace - **Optional** (auto-detected from `kubectl config` context if not provided)
- `--fsgroup=<fsgroup>`: Filesystem group ID - **Optional but Recommended** (auto-detected from existing pods/deployments if not provided)
Why do we need `fsgroup`, `selinux-level`, and `runasuser`? Can you also explain how auto-detection works when there can be multiple deployments?
```markdown
## Quick Deployment

### Option 1: Using deploy.sh (Recommended)
```
Where is option 2? Add an option using YAML with the image configuration.
| value: "/tmp/evictor_all_logs.txt" | ||
|
|
||
| volumeMounts: | ||
| - name: kv-cache-storage |
This should also be defined by the user; maybe we can use Helm
```yaml
resources:
  requests:
    cpu: 1000m  # Base CPU for processes
```
Replace with `1`.
```yaml
    cpu: 4000m   # Allow more CPU for parallel processing (adjust based on NUM_CRAWLER_PROCESSES)
    memory: 2Gi  # Allow more memory for parallel processing

# Health checks
```
Try to remove what is not necessary
kv_connectors/pvc_evictor/config.py (outdated)

```python
cache_directory: str  # Subdirectory within PVC containing cache files (default: kv/model-cache/models)
dry_run: bool  # If true, simulate deletion without actually deleting files (default: false)
log_level: str  # Logging verbosity: DEBUG, INFO, WARNING, ERROR (default: INFO)
timing_file_path: str  # Path for timing analysis file (default: /tmp/timing_analysis.txt, reserved for future use)
```
Why do we need it?
kv_connectors/pvc_evictor/config.py (outdated)

```python
file_access_time_threshold_minutes: (
    float  # Skip files accessed within this time (default: 60.0 minutes)
)
log_file_path: Optional[str] = None  # Optional file path to write logs to
```
Not sure if we need it
@kfirtoledo All comments addressed and fixes applied.
sagearc left a comment:
Great job getting this all working!
A quick note on process: This PR is massive (~3k lines). In the future, it would be great to break this down into smaller, stackable PRs so we can review the logic more carefully.
| """Configuration management for PVC Evictor.""" | ||
|
|
||
| import os | ||
| from dataclasses import dataclass | ||
| from typing import Optional | ||
|
|
||
|
|
||
| @dataclass | ||
| class Config: | ||
| """Configuration loaded from environment variables. | ||
| Environment variables are used instead of CLI arguments. | ||
| """ | ||
|
|
||
| pvc_mount_path: str # Mount path of PVC in pod (default: /kv-cache) | ||
| cleanup_threshold: float # Disk usage % to trigger deletion (default: 85.0) | ||
| target_threshold: float # Disk usage % to stop deletion (default: 70.0) | ||
| cache_directory: str # Subdirectory within PVC containing cache files (default: kv/model-cache/models) | ||
| dry_run: bool # If true, simulate deletion without actually deleting files (default: false) | ||
| log_level: str # Logging verbosity: DEBUG, INFO, WARNING, ERROR (default: INFO) | ||
| # timing_file_path: Currently not actively used but kept for backward compatibility and future extensibility | ||
| timing_file_path: str # Path for timing analysis file (default: /tmp/timing_analysis.txt) | ||
| num_crawler_processes: int # P1-PN (default: 8, valid: 1, 2, 4, 8, 16) | ||
| logger_interval: float # P9 monitoring interval (default: 0.5s) | ||
| file_queue_maxsize: int # Max items in file queue (default: 10000) | ||
| file_queue_min_size: ( | ||
| int # Min queue size to maintain when deletion OFF (default: 1000) | ||
| ) | ||
| deletion_batch_size: int # Files per deletion batch (default: 100) | ||
| file_access_time_threshold_minutes: ( | ||
| float # Skip files accessed within this time (default: 60.0 minutes) | ||
| ) | ||
| aggregated_logging: bool # Enable aggregated logging in main process (default: true) | ||
| aggregated_logging_interval: float # Interval for aggregated log output in seconds (default: 30.0) | ||
| cache_structure_mode: str # Cache structure mode: "file_mapper" (default) or "vllm" (legacy) | ||
| # log_file_path: Optional file logging for persistent log storage and debugging | ||
| log_file_path: Optional[str] = None # Optional file path to write logs to (default: None, stdout only) | ||
|
|
||
| @classmethod | ||
| def from_env(cls) -> "Config": | ||
| """Load configuration from environment variables.""" | ||
| return cls( | ||
| pvc_mount_path=os.getenv("PVC_MOUNT_PATH", "/kv-cache"), | ||
| cleanup_threshold=float(os.getenv("CLEANUP_THRESHOLD", "85.0")), | ||
| target_threshold=float(os.getenv("TARGET_THRESHOLD", "70.0")), | ||
| cache_directory=os.getenv("CACHE_DIRECTORY", "kv/model-cache/models"), | ||
| dry_run=os.getenv("DRY_RUN", "false").lower() == "true", | ||
| log_level=os.getenv("LOG_LEVEL", "INFO"), | ||
| timing_file_path=os.getenv("TIMING_FILE_PATH", "/tmp/timing_analysis.txt"), | ||
| num_crawler_processes=int(os.getenv("NUM_CRAWLER_PROCESSES", "8")), | ||
| logger_interval=float(os.getenv("LOGGER_INTERVAL_SECONDS", "0.5")), | ||
| file_queue_maxsize=int(os.getenv("FILE_QUEUE_MAXSIZE", "10000")), | ||
| file_queue_min_size=int(os.getenv("FILE_QUEUE_MIN_SIZE", "1000")), | ||
| deletion_batch_size=int(os.getenv("DELETION_BATCH_SIZE", "100")), | ||
| log_file_path=os.getenv("LOG_FILE_PATH", None), | ||
| file_access_time_threshold_minutes=float( | ||
| os.getenv("FILE_ACCESS_TIME_THRESHOLD_MINUTES", "60.0") | ||
| ), | ||
| aggregated_logging=os.getenv("AGGREGATED_LOGGING", "true").lower() == "true", | ||
| aggregated_logging_interval=float(os.getenv("AGGREGATED_LOGGING_INTERVAL", "30.0")), | ||
| cache_structure_mode=os.getenv("CACHE_STRUCTURE_MODE", "file_mapper"), | ||
| ) | ||
|
|
Where is config.Config being used?
Config is imported in evictor.py and used to initialize PVCEvictor using these settings, keeping the configuration logic separate from the main code.
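A minimal usage sketch of that wiring; the PVCEvictor entry point below is assumed from the PR description rather than copied from the code:

```python
from config import Config
from evictor import PVCEvictor  # hypothetical import path

config = Config.from_env()    # reads PVC_MOUNT_PATH, CLEANUP_THRESHOLD, etc.
evictor = PVCEvictor(config)  # main coordinator that spawns the N+2 processes
evictor.run()                 # assumed entry point name
```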
```markdown
For complete Helm documentation, see [helm/README.md](helm/README.md).

### Option 2: Using deploy.sh (Legacy - Automated Script)
```
I noticed this script is being introduced in this PR but is already labeled as 'Legacy.' We generally shouldn't merge code that is deprecated upon arrival. Since this file wasn't in the codebase previously, let's remove it entirely and stick to the Helm chart as the single source of truth.
Will do; awaiting final confirmation from Kfir regarding the file_mapper.py structure, and then the legacy code will be removed completely.
kfirtoledo left a comment:
Looks good. Some minor fixes:
| Parameter | Description | Default |
|-----------|-------------|---------|
| `image.repository` | Container image repository | `ghcr.io/guygir/pvc-evictor` |
Can we host it on quay for now, as a more general place?
Agreed that the image should not remain in my personal ghcr.io account. Will update all image references once the hosting location is decided.
kv_connectors/pvc_evictor/README.md (outdated)

```markdown
*Figure: PVC Evictor multi-process architecture showing N crawler processes, activator, deleter, and IPC mechanisms (Queue and Events)*

### N+2 Process Design
```
Maybe combine this with the architecture and process details - I don't want the README to be too long.
Made it more concise.
kv_connectors/pvc_evictor/README.md (outdated)

```markdown
### Hot Cache Configuration

#### `FILE_ACCESS_TIME_THRESHOLD_MINUTES`
```
Imagine an extreme case where the cache fills the disk quickly, even though the files are within FILE_ACCESS_TIME_THRESHOLD_MINUTES. Should the evictor force deletion if usage exceeds a "hard threshold" such as 95%?
What's the behavior today if this happens - failing to write new cache files?
Yes - failing to write new cache files is the behavior today. I've updated the documentation to address this.
For the initial release, the soft-threshold approach is designed for large storage deployments where this edge case is unlikely.
Issue #218 tracks optimizations for the PVC Evictor that will handle this edge case by moving from a binary cache model to a continuum, so the issue won't occur.
```markdown
## Prerequisites

1. OpenShift/Kubernetes cluster access
```
Is there anything specific for Openshift, or does this generally work for any k8s cluster?
This should work on any k8s cluster. OpenShift is mentioned because it’s what we use in our test environment. The examples use kubectl to show it works everywhere.
```markdown
3. Helm 3.0+ installed (for Helm deployment)
4. PVC exists and is bound
5. Appropriate RBAC permissions to create deployments
6. **Docker image available** - The evictor uses `ghcr.io/guygir/pvc-evictor:latest`
```
Should we host this image in llm-d?
Agreed that the image should not remain in my personal ghcr.io account. Will update all image references once the hosting location is decided.
```python
log_level: str  # Logging verbosity: DEBUG, INFO, WARNING, ERROR (default: INFO)
# timing_file_path: Currently not actively used but kept for backward compatibility and future extensibility
timing_file_path: str  # Path for timing analysis file (default: /tmp/timing_analysis.txt)
num_crawler_processes: int  # P1-PN (default: 8, valid: 1, 2, 4, 8, 16)
```
Why does this need to be a power of 2? Can it be, say, 6?
Power-of-2 worker counts are recommended for best performance, not required.
The crawler partitions work into 16 hex buckets; with power-of-2 worker counts the buckets are (almost) evenly distributed, while non-power-of-2 counts still work but may cause uneven load.
Updated the documentation to clarify this.
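For illustration, a sketch of how 16 hex buckets distribute across worker counts; the round-robin assignment below is an assumption, not taken from the PR:

```python
HEX_DIGITS = "0123456789abcdef"

def buckets_for_worker(worker_index, num_workers):
    """Round-robin assignment of the 16 hex prefixes to crawler workers."""
    return [d for i, d in enumerate(HEX_DIGITS) if i % num_workers == worker_index]

print(buckets_for_worker(0, 8))  # ['0', '8']      -> every worker gets exactly 2
print(buckets_for_worker(0, 6))  # ['0', '6', 'c'] -> 4 workers get 3 buckets, 2 get 2
```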
```markdown
For complete Helm documentation, see [helm/README.md](helm/README.md).

### Option 2: Using deploy.sh (Legacy - Automated Script)
```
This is a new tool, right? What's "legacy" here?
I suggest removing this option given the YAML and Helm options, unless there is a specific reason.
Now that llmd_fs_backend is integrated, there is only one standard approach, so legacy code is no longer needed and has been fully removed.
```markdown
- Default: `8`
- Valid values: 1, 2, 4, 8, 16
- More crawlers = faster file discovery on large directories
- Recommendation: Use 8-16 for multi-TB volumes, 1-4 for smaller volumes
```
Do you have benchmark data for these recommended values?
These values are based on empirical testing during development, not formal benchmarks, and are meant as starting points; optimal settings depend on storage performance, file characteristics, and so on.
Updated the documentation to mark them as recommended starting values that should be tuned per workload.
Proper benchmarking is tracked in issue #218.
```yaml
# kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.securityContext}'
# kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].securityContext}'
securityContext:
  fsGroup: <your_fsgroup>  # REQUIRED: Set to your namespace's fsGroup. Will try to auto-detect if not provided.
```
Should this match the security context of the vllm pods?
Yes, the security context must match the vLLM pods. The PVC is written by vLLM with a specific UID/GID, and the evictor needs the same UID/GID to read and delete files; mismatches will cause permission-denied errors.
Commit: …bugs, refactor stats sending and update documentation.

All feedback has been addressed. The only remaining item is deciding the official image hosting location.
The PVC Evictor is a multi-process Kubernetes deployment designed to automatically manage disk space on PVCs used for vLLM KV cache offloading. It monitors disk usage and automatically deletes cold cache files when storage thresholds are exceeded, ensuring continuous operation of vLLM workloads without manual intervention.
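As a rough illustration of that threshold logic - a hypothetical helper; per the file summary above, the real check lives in utils/system.py and uses statvfs():

```python
import os

CLEANUP_THRESHOLD = 85.0  # start deleting cold files above this usage (documented default)
TARGET_THRESHOLD = 70.0   # stop deleting once usage falls below this (documented default)

def disk_usage_percent(path):
    """Used space as a percentage of total capacity, via statvfs."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    free = st.f_bavail * st.f_frsize  # space available to non-root users
    return 100.0 * (total - free) / total

usage = disk_usage_percent("/kv-cache")  # documented default mount path
if usage >= CLEANUP_THRESHOLD:
    print(f"{usage:.1f}% used - deletion ON")
elif usage <= TARGET_THRESHOLD:
    print(f"{usage:.1f}% used - deletion OFF")
```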