feat: add support for configurable step executors in Argo backend #221

Merged
sam-hey merged 3 commits into main from fix/argo_prometheus
Apr 13, 2026
sam-hey (Collaborator) commented Apr 13, 2026

  • Introduced PROMETHEUS_GATEWAY environment variable to determine the step executor.
  • Updated ArgoBackend to accept an optional executor parameter in from_values and constructor.
  • Enhanced CLI to allow specifying the executor via command line.
  • Added tests to verify executor selection logic based on environment variable and explicit settings.
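
The selection order described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `BaseExecutor`, `MetricsExecutor`, and `resolve_executor` are hypothetical names, and it assumes that any non-empty `PROMETHEUS_GATEWAY` value switches the default to a Prometheus-aware executor, which the PR text does not spell out.

```python
import os
from typing import Optional, Type


class BaseExecutor:
    """Placeholder for a plain step executor (hypothetical name)."""


class MetricsExecutor(BaseExecutor):
    """Placeholder for an executor that reports to a Prometheus gateway (hypothetical name)."""


def resolve_executor(explicit: Optional[Type[BaseExecutor]] = None) -> Type[BaseExecutor]:
    """Pick the executor class: an explicit choice wins, otherwise consult the environment."""
    if explicit is not None:
        return explicit
    # Assumption: any non-empty PROMETHEUS_GATEWAY value changes the default.
    if os.environ.get("PROMETHEUS_GATEWAY"):
        return MetricsExecutor
    return BaseExecutor
```

Resolving the executor in one small function keeps the precedence (explicit argument first, environment second, plain default last) easy to cover with tests.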

Description

Checklist

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have run the linter and ensured the code is formatted correctly
  • I have updated the documentation accordingly

Copilot AI (Contributor) left a comment

Pull request overview

Makes the step executor class embedded into generated wurzel run commands configurable. The Argo backend chooses its default from PROMETHEUS_GATEWAY, and a CLI option can override it.

Changes:

  • Add an optional executor parameter to backend construction paths (from_values and direct constructor usage).
  • Make Argo backend select a default step executor based on PROMETHEUS_GATEWAY, while still allowing explicit override.
  • Extend wurzel generate CLI and tests to validate executor selection and propagation.
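
To illustrate what "embedding" the executor into a generated command could look like, here is a small sketch. Only the `-e` flag comes from this review (see the test description for the Argo backend); the helper name `build_run_command`, the step path format, and the argument order are assumptions made for the example.

```python
from typing import List


def build_run_command(step_path: str, executor_name: str) -> List[str]:
    """Return the argv for one pipeline step, passing the executor via -e."""
    # Only the `-e` flag is taken from the PR text; the step path format
    # and the argument order are assumptions for illustration.
    return ["wurzel", "run", "-e", executor_name, step_path]


# Hypothetical step path and executor name.
cmd = build_run_command("my_pipeline.steps:MyStep", "BaseExecutor")
```

Building the command as a list rather than a single string avoids shell quoting issues when the list is used as a container command.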

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Summary per file:

| File | Description |
| --- | --- |
| wurzel/cli/cmd_generate.py | Thread executor selection through backend resolution and generation entrypoint. |
| wurzel/cli/_main.py | Add --executor option to generate and pass resolved executor into generation. |
| wurzel/backend/backend_dvc.py | Allow executor override when instantiating from values files. |
| wurzel/backend/backend_argo.py | Add env-driven default executor selection and propagate executor into generated CLI calls. |
| tests/cli/test_cmd_generate.py | Verify executor is forwarded through backend resolution and command generation. |
| tests/backend/test_backend_argo.py | Assert generated Argo container commands include -e ... for default/env/explicit executor cases. |
| tests/backend/conftest.py | Clear PROMETHEUS_GATEWAY between backend tests to avoid env leakage. |
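
The environment isolation mentioned for tests/backend/conftest.py can be approximated with the standard library alone. The sketch below is not the fixture from this PR (which presumably uses pytest); `without_prometheus_gateway` is a hypothetical helper that removes the variable for the duration of a block and restores the previous environment afterwards.

```python
import os
from contextlib import contextmanager
from unittest import mock

ENV_VAR = "PROMETHEUS_GATEWAY"


@contextmanager
def without_prometheus_gateway():
    """Temporarily remove PROMETHEUS_GATEWAY; restore the environment on exit."""
    # patch.dict snapshots os.environ and restores it when the block exits,
    # so changes made inside the block do not leak into later tests.
    with mock.patch.dict(os.environ):
        os.environ.pop(ENV_VAR, None)
        yield
```

In a pytest suite the same effect is usually achieved with an autouse fixture built on `monkeypatch.delenv`, which also undoes the change after each test.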

github-actions (Contributor) commented

🎉 Pipeline Test Results

The e2e pipeline test completed successfully!

Sample Output Document

Sample output from SimpleSplitterStep:
[
  {
    "md": "# Introduction to Wurzel\n\nWelcome to Wurzel, an advanced ETL framework designed specifically for Retrieval-Augmented Generation (RAG) systems.\n\n## What is Wurzel?\n\nWurzel is a Python library that streamlines the process of building data pipelines for RAG applications. It provides:\n\n- **Type-safe pipeline definitions** using Pydantic and Pandera\n- **Modular step architecture** for easy composition and reuse\n- **Built-in support** for popular vector databases like Qdrant and Milvus\n- **Cloud-native deployment** capabilities with Docker and Kubernetes\n- **DVC integration** for data versioning and pipeline orchestration\n\n## Key Features\n\n### Pipeline Composition\n\nBuild complex data processing pipelines by chaining simple, reusable steps together.\n\n### Vector Database Support\n\nOut-of-the-box integration with:\n\n- Qdrant for high-performance vector search\n- Milvus for scalable vector databases\n- Easy extension for other vector stores\n\n### Document Processing\n\nAdvanced document processing capabilities including:\n\n- PDF extraction with Docling\n- Markdown processing and splitting\n- Text embedding generation\n- Duplicate detection and removal\n\n## Getting Started\n\nTo create your first Wurzel pipeline:\n\n1. Define your data processing steps\n1. Chain them together using the `>>` operator\n1. Configure your environment variables\n1. Run with DVC or Argo Workflows\n   This demo shows a simple pipeline that processes markdown documents and prepares them for vector storage.",
    "keywords": "introduction",
    "url": "ManualMarkdownStep//usr/app/demo-data/introduction.md",
    "metadata": {
      "token_len": 300,
      "char_len": 1456,
      "source_sha256_hash": "f81ab0ce39ef126c6626ea8db0424a916006d0acdd4f6f661447a8324ec1b68c",
      "chunk_index": 0,
      "chunks_count": 1
    }
  },
  {
    "md": "# Wurzel Pipeline Architecture\n\nUnderstanding the architecture of Wurzel pipelines is essential for building effective RAG systems.\n\n## Core Concepts\n\n### TypedStep\n\nThe fundamental building block of Wurzel pipelines. Each TypedStep defines:\n\n- Input data contract (what data it expects)\n- Output data contract (what data it produces)\n- Processing logic (how it transforms the data)\n- Configuration settings (how it can be customized)\n\n### Pipeline Composition\n\nSteps are composed using the `>>` operator:\n\n```python\nsource >> processor >> sink\n```\n\nThis creates a directed acyclic graph (DAG) that DVC can execute efficiently.\n\n### Data Contracts\n\nWurzel uses Pydantic models to define strict data contracts between steps:\n\n- **MarkdownDataContract**: For document content with metadata\n- **EmbeddingResult**: For vectorized text chunks\n- **QdrantResult**: For vector database storage results\n\n## Built-in Steps\n\n### ManualMarkdownStep\n\nLoads markdown files from a specified directory. Configuration:\n\n- `FOLDER_PATH`: Directory containing markdown files\n\n### EmbeddingStep\n\nGenerates vector embeddings for text content. Features:\n\n- Automatic text splitting and chunking\n- Configurable embedding models\n- Batch processing for efficiency\n\n### QdrantConnectorStep\n\nStores embeddings in Qdrant vector database. Capabilities:\n\n- Automatic collection management\n- Index creation and optimization\n- Metadata preservation\n\n## Extension Points\n\nCreate custom steps by inheriting from `TypedStep`:\n\n```python\nclass CustomStep(TypedStep[CustomSettings, InputContract, OutputContract]):\n    def run(self, input_data: InputContract) -> OutputContract:\n        # Your processing logic here\n        return processed_data\n```\n\n## Best Practices\n\n- Keep steps focused on single responsibilities\n- Use type hints for better IDE support and validation\n- Test steps independently before chaining\n- Monitor resource usage for large datasets",
    "keywords": "architecture",
    "url": "ManualMarkdownStep//usr/app/demo-data/architecture.md",
    "metadata": {
      "token_len": 387,
      "char_len": 1895,
      "source_sha256_hash": "f9c2098b67204f39c058860e1a89670a9fa4c054f04a54bbff4ac8f573a646e8",
      "chunk_index": 0,
      "chunks_count": 1
    }
  },
  {
    "md": "# Setting Up Your RAG Pipeline\n\nThis guide walks through the process of setting up a Retrieval-Augmented Generation pipeline using Wurzel.\n\n## Prerequisites\n\nBefore you begin, ensure you have:\n\n- Docker installed on your system\n- Access to a vector database (Qdrant or Milvus)\n- Your documents ready for processing\n\n## Configuration Steps\n\n### Step 1: Prepare Your Documents\n\nPlace your markdown files in the `demo-data` directory. Wurzel will automatically discover and process all `.md` files in this location.\n\n### Step 2: Environment Configuration\n\nSet the following environment variables:\n\n```bash\nexport MANUALMARKDOWNSTEP__FOLDER_PATH=/path/to/your/documents\nexport WURZEL_PIPELINE=your_pipeline:pipeline\n```\n\n### Step 3: Vector Database Setup\n\nConfigure your vector database connection:\n\n- **For Qdrant**: Set `QDRANT__URI` and `QDRANT__APIKEY`\n- **For Milvus**: Set `MILVUS__URI` and connection parameters\n\n### Step 4: Run the Pipeline\n\nExecute your pipeline using Docker Compose:\n\n```bash\ndocker-compose up wurzel-pipeline\n```\n\n## Pipeline Stages\n\n1. **Document Loading**: Read markdown files from the configured directory\n1. **Text Processing**: Clean and split documents into manageable chunks\n1. **Embedding Generation**: Create vector embeddings for text chunks\n1. **Vector Storage**: Store embeddings in your chosen vector database\n\n## Monitoring and Debugging\n\n- Check DVC status for pipeline execution details\n- Review container logs for processing information\n- Use the built-in Git integration to track changes",
    "keywords": "setup-guide",
    "url": "ManualMarkdownStep//usr/app/demo-data/setup-guide.md",
    "metadata": {
      "token_len": 343,
      "char_len": 1509,
      "source_sha256_hash": "d344be37936af8f75933eed843b2b9e9a501a5f1053ae469fe6821c73785ed4e",
      "chunk_index": 0,
      "chunks_count": 1
    }
  }
]

…te CLI option

Agent-Logs-Url: https://github.com/telekom/wurzel/sessions/716ae7ca-5a91-40f3-87f9-b553976238dd

Co-authored-by: sam-hey <40773225+sam-hey@users.noreply.github.com>
sam-hey merged commit 1952dea into main on Apr 13, 2026
26 checks passed
sam-hey deleted the fix/argo_prometheus branch on April 13, 2026 13:00