feat: add support for configurable step executors in Argo backend (#221)
Conversation
- Introduced `PROMETHEUS_GATEWAY` environment variable to determine the step executor.
- Updated `ArgoBackend` to accept an optional executor parameter in `from_values` and the constructor.
- Enhanced the CLI to allow specifying the executor via the command line.
- Added tests to verify executor selection logic based on the environment variable and explicit settings.
Pull request overview
Adds configurability for which step executor class is embedded into generated wurzel run commands, with Argo defaulting based on PROMETHEUS_GATEWAY and a CLI override.
Changes:
- Add an optional `executor` parameter to backend construction paths (`from_values` and direct constructor usage).
- Make the Argo backend select a default step executor based on `PROMETHEUS_GATEWAY`, while still allowing an explicit override.
- Extend the `wurzel generate` CLI and tests to validate executor selection and propagation.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `wurzel/cli/cmd_generate.py` | Thread executor selection through backend resolution and the generation entrypoint. |
| `wurzel/cli/_main.py` | Add a `--executor` option to `generate` and pass the resolved executor into generation. |
| `wurzel/backend/backend_dvc.py` | Allow an executor override when instantiating from values files. |
| `wurzel/backend/backend_argo.py` | Add env-driven default executor selection and propagate the executor into generated CLI calls. |
| `tests/cli/test_cmd_generate.py` | Verify the executor is forwarded through backend resolution and command generation. |
| `tests/backend/test_backend_argo.py` | Assert generated Argo container commands include `-e ...` for default, env, and explicit executor cases. |
| `tests/backend/conftest.py` | Clear `PROMETHEUS_GATEWAY` between backend tests to avoid env leakage. |
🎉 Pipeline Test Results

The e2e pipeline test completed successfully!

Sample output from `SimpleSplitterStep`:

[
{
"md": "# Introduction to Wurzel\n\nWelcome to Wurzel, an advanced ETL framework designed specifically for Retrieval-Augmented Generation (RAG) systems.\n\n## What is Wurzel?\n\nWurzel is a Python library that streamlines the process of building data pipelines for RAG applications. It provides:\n\n- **Type-safe pipeline definitions** using Pydantic and Pandera\n- **Modular step architecture** for easy composition and reuse\n- **Built-in support** for popular vector databases like Qdrant and Milvus\n- **Cloud-native deployment** capabilities with Docker and Kubernetes\n- **DVC integration** for data versioning and pipeline orchestration\n\n## Key Features\n\n### Pipeline Composition\n\nBuild complex data processing pipelines by chaining simple, reusable steps together.\n\n### Vector Database Support\n\nOut-of-the-box integration with:\n\n- Qdrant for high-performance vector search\n- Milvus for scalable vector databases\n- Easy extension for other vector stores\n\n### Document Processing\n\nAdvanced document processing capabilities including:\n\n- PDF extraction with Docling\n- Markdown processing and splitting\n- Text embedding generation\n- Duplicate detection and removal\n\n## Getting Started\n\nTo create your first Wurzel pipeline:\n\n1. Define your data processing steps\n1. Chain them together using the `>>` operator\n1. Configure your environment variables\n1. Run with DVC or Argo Workflows\n This demo shows a simple pipeline that processes markdown documents and prepares them for vector storage.",
"keywords": "introduction",
"url": "ManualMarkdownStep//usr/app/demo-data/introduction.md",
"metadata": {
"token_len": 300,
"char_len": 1456,
"source_sha256_hash": "f81ab0ce39ef126c6626ea8db0424a916006d0acdd4f6f661447a8324ec1b68c",
"chunk_index": 0,
"chunks_count": 1
}
},
{
"md": "# Wurzel Pipeline Architecture\n\nUnderstanding the architecture of Wurzel pipelines is essential for building effective RAG systems.\n\n## Core Concepts\n\n### TypedStep\n\nThe fundamental building block of Wurzel pipelines. Each TypedStep defines:\n\n- Input data contract (what data it expects)\n- Output data contract (what data it produces)\n- Processing logic (how it transforms the data)\n- Configuration settings (how it can be customized)\n\n### Pipeline Composition\n\nSteps are composed using the `>>` operator:\n\n```python\nsource >> processor >> sink\n```\n\nThis creates a directed acyclic graph (DAG) that DVC can execute efficiently.\n\n### Data Contracts\n\nWurzel uses Pydantic models to define strict data contracts between steps:\n\n- **MarkdownDataContract**: For document content with metadata\n- **EmbeddingResult**: For vectorized text chunks\n- **QdrantResult**: For vector database storage results\n\n## Built-in Steps\n\n### ManualMarkdownStep\n\nLoads markdown files from a specified directory. Configuration:\n\n- `FOLDER_PATH`: Directory containing markdown files\n\n### EmbeddingStep\n\nGenerates vector embeddings for text content. Features:\n\n- Automatic text splitting and chunking\n- Configurable embedding models\n- Batch processing for efficiency\n\n### QdrantConnectorStep\n\nStores embeddings in Qdrant vector database. Capabilities:\n\n- Automatic collection management\n- Index creation and optimization\n- Metadata preservation\n\n## Extension Points\n\nCreate custom steps by inheriting from `TypedStep`:\n\n```python\nclass CustomStep(TypedStep[CustomSettings, InputContract, OutputContract]):\n def run(self, input_data: InputContract) -> OutputContract:\n # Your processing logic here\n return processed_data\n```\n\n## Best Practices\n\n- Keep steps focused on single responsibilities\n- Use type hints for better IDE support and validation\n- Test steps independently before chaining\n- Monitor resource usage for large datasets",
"keywords": "architecture",
"url": "ManualMarkdownStep//usr/app/demo-data/architecture.md",
"metadata": {
"token_len": 387,
"char_len": 1895,
"source_sha256_hash": "f9c2098b67204f39c058860e1a89670a9fa4c054f04a54bbff4ac8f573a646e8",
"chunk_index": 0,
"chunks_count": 1
}
},
{
"md": "# Setting Up Your RAG Pipeline\n\nThis guide walks through the process of setting up a Retrieval-Augmented Generation pipeline using Wurzel.\n\n## Prerequisites\n\nBefore you begin, ensure you have:\n\n- Docker installed on your system\n- Access to a vector database (Qdrant or Milvus)\n- Your documents ready for processing\n\n## Configuration Steps\n\n### Step 1: Prepare Your Documents\n\nPlace your markdown files in the `demo-data` directory. Wurzel will automatically discover and process all `.md` files in this location.\n\n### Step 2: Environment Configuration\n\nSet the following environment variables:\n\n```bash\nexport MANUALMARKDOWNSTEP__FOLDER_PATH=/path/to/your/documents\nexport WURZEL_PIPELINE=your_pipeline:pipeline\n```\n\n### Step 3: Vector Database Setup\n\nConfigure your vector database connection:\n\n- **For Qdrant**: Set `QDRANT__URI` and `QDRANT__APIKEY`\n- **For Milvus**: Set `MILVUS__URI` and connection parameters\n\n### Step 4: Run the Pipeline\n\nExecute your pipeline using Docker Compose:\n\n```bash\ndocker-compose up wurzel-pipeline\n```\n\n## Pipeline Stages\n\n1. **Document Loading**: Read markdown files from the configured directory\n1. **Text Processing**: Clean and split documents into manageable chunks\n1. **Embedding Generation**: Create vector embeddings for text chunks\n1. **Vector Storage**: Store embeddings in your chosen vector database\n\n## Monitoring and Debugging\n\n- Check DVC status for pipeline execution details\n- Review container logs for processing information\n- Use the built-in Git integration to track changes",
"keywords": "setup-guide",
"url": "ManualMarkdownStep//usr/app/demo-data/setup-guide.md",
"metadata": {
"token_len": 343,
"char_len": 1509,
"source_sha256_hash": "d344be37936af8f75933eed843b2b9e9a501a5f1053ae469fe6821c73785ed4e",
"chunk_index": 0,
"chunks_count": 1
}
}
]
…te CLI option
Agent-Logs-Url: https://github.com/telekom/wurzel/sessions/716ae7ca-5a91-40f3-87f9-b553976238dd
Co-authored-by: sam-hey <40773225+sam-hey@users.noreply.github.com>
Description
Checklist