Add SonarQube integration and refactor E2E tests #4

Open
johannhartmann wants to merge 108 commits into main from sonar-fixes
Conversation

@johannhartmann
Member

Summary

  • Add SonarQube configuration and CI integration
  • Refactor E2E tests for reliability (no LLM output string matching)

Changes

SonarQube Integration

  • Add sonar-project.properties with coverage/junit paths
  • Add flake8-bandit (S) security rules to ruff
  • Update CI workflow with SonarQube scan job on mayflower-k8s-runners
  • Generate SARIF, junit.xml, and coverage.xml for SonarQube ingestion

Test Reliability

  • Replace 11 flaky LLM-dependent tests with 8 deterministic unit tests
  • Tests call tools directly with pre-defined state (no LLM)
  • Assertions check VFS file existence/content, not stdout strings
  • Remove fragile regex parsing of LLM markdown output

Test plan

  • Unit tests pass locally (8/8)
  • CI workflow runs successfully
  • SonarQube receives analysis data

- schema_codegen.py: Filter __future__ imports to avoid syntax errors
- test_skills_install.py: Fix GitHub URL path (skills/algorithmic-art)
- conftest.py: Add DENO_AVAILABLE check and requires_deno marker
- test_mcp_bind.py: Set up module hierarchy for relative imports
- test_tools.py: Update assertions and handle Command return type
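The `__future__` filtering mentioned for schema_codegen.py can be sketched as follows. This is a minimal illustration, not the actual implementation; the helper name is hypothetical. The point is that Python requires `from __future__ import ...` to precede all other statements, so generated code that splices snippets together must hoist (or drop) such imports to avoid a SyntaxError:

```python
def hoist_future_imports(source: str) -> str:
    """Move `from __future__ import ...` lines to the top of generated code.

    __future__ imports are only legal before any other statement, so a
    codegen step that concatenates snippets can otherwise produce
    syntactically invalid modules.
    """
    future_lines: list[str] = []
    other_lines: list[str] = []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("from __future__ import"):
            if stripped not in future_lines:  # de-duplicate
                future_lines.append(stripped)
        else:
            other_lines.append(line)
    return "\n".join(future_lines + other_lines)
```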
Tests should not rely on string matching in LLM responses since this is
inherently unreliable. Updated E2E tests to use deterministic validation:

- Added helper functions: get_tool_messages(), get_tool_stdout(),
  get_created_files(), check_vfs_file_exists(), get_vfs_file_content()
- test_python_run_prepared_*: Check tool stdout for expected output
- test_file_write_*: Verify files exist in VFS with correct content
- Document tests: Verify file creation and magic bytes in VFS
- test_agent_state: Check ToolMessage content instead of final AI message

This makes tests deterministic and not dependent on LLM response formatting.
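The VFS-based assertion style described above can be sketched like this. The real helpers operate on sandbox state; here the VFS is modeled as a plain `dict[str, bytes]` and the signatures are assumptions, but the testing pattern (assert on file existence and magic bytes, never on LLM prose) is the one the commit describes:

```python
def check_vfs_file_exists(vfs: dict[str, bytes], path: str) -> bool:
    """Deterministic check: does the sandbox VFS contain this path?"""
    return path in vfs


def get_vfs_file_content(vfs: dict[str, bytes], path: str) -> bytes:
    """Fetch file content from the VFS, failing loudly if missing."""
    assert path in vfs, f"expected {path} in VFS, found {sorted(vfs)}"
    return vfs[path]


# Assertion style: verify the artifact, not the model's wording.
vfs = {"/home/report.png": b"\x89PNG\r\n\x1a\n" + b"\x00" * 8}
assert check_vfs_file_exists(vfs, "/home/report.png")
assert get_vfs_file_content(vfs, "/home/report.png").startswith(b"\x89PNG")
```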
SonarQube integration:
- Add sonar-project.properties with coverage/junit paths
- Add flake8-bandit (S) security rules to ruff
- Configure pytest to output junit.xml

Test reliability improvements:
- Replace 11 flaky LLM-dependent tests with 8 deterministic unit tests
- Tests now call tools directly with pre-defined state (no LLM)
- Assertions check VFS file existence/content, not stdout strings
- Remove fragile regex parsing of LLM markdown output
- Add optional @pytest.mark.slow integration test for full agent flow
- Add SARIF output for ruff linting
- Add junit.xml and coverage.xml output for pytest
- Add SonarQube scan job on mayflower-k8s-runners
- Upload artifacts between jobs for SonarQube ingestion
- Skip slow tests in CI (marked with @pytest.mark.slow)
- Add .pre-commit-config.yaml with ruff and mypy hooks
- Add pre-commit to dev dependencies
- Add S314 ignore for document helper XML parsing (trusted source)
- Fix formatting in test files
- Replace mypy with Astral ty in pre-commit hooks
- Fix type annotation: date_iso: str | None = None
- Fix deprecated datetime.utcnow() -> datetime.now(timezone.utc)
- Ignore unresolved-import for Pyodide-only modules
- Ignore unknown-argument for langchain (lacks type stubs)
- Fix tests to use POSTGRES_PORT instead of PGPORT (CI uses 5432)
- Run SonarQube job even if tests fail (if: always())
- Fix flaky test: verify VFS files instead of LLM string matching
- Add junit.xml, coverage.xml, ruff-results.sarif to .gitignore
- Update uv.lock with pre-commit dependencies
- Delete debug test files: test_debug_recursion.py, test_debug_stream.py,
  test_pptx_debug.py (no assertions, just print output)
- Delete redundant test_helpers_import.py (covered by test_helper_direct.py)
- Add missing @skipif decorator to test_agent_pdf_with_multiple_pages
- Mark all LLM-based test modules with pytest.mark.slow:
  - test_langgraph_integration.py
  - test_langgraph_realistic.py
  - test_langgraph_skills.py
  - test_document_skills.py
  - test_docx_add_comment.py
  - test_docx_to_markdown.py

Run fast tests with: pytest -m "not slow"
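The `datetime.utcnow()` deprecation fix listed above is the standard migration to timezone-aware datetimes:

```python
from datetime import datetime, timezone

# datetime.utcnow() is deprecated (since Python 3.12) and returns a
# naive datetime; datetime.now(timezone.utc) returns an aware one.
now = datetime.now(timezone.utc)
assert now.tzinfo is timezone.utc
iso = now.isoformat()  # aware UTC timestamps serialize with a +00:00 offset
```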
New test file covering:
- Corrupt/malformed Excel files (invalid ZIP, truncated, empty)
- Corrupt/malformed Word documents (invalid data, malformed XML)
- Memory boundary tests (large strings, large lists)

These tests verify the sandbox handles edge cases gracefully
without crashing.
- Convert mypy JSON output to SARIF format via inline Python script
- Upload mypy SARIF as CI artifact for SonarQube ingestion
- Include mypy-results.sarif in dynamic SARIF path collection
- Update sonar.sources to src/mayflower_sandbox for coverage path matching
- Clean up sonar-project.properties comments
- Add coverage.py config with relative_files=true for portable paths
- Revert sonar.sources to 'src' for proper path matching
- Upgrade SonarQube scan action from v5 to v6 (security fix)
- Add Bandit security scanner for Python code
- Add ESLint with TypeScript plugin for TS files
- Configure SonarQube to import external analyzer reports
- Update dependencies (bandit>=1.7.0)

External analyzer reports now sent to SonarQube:
- bandit-results.json: Python security issues
- eslint-results.json: TypeScript linting issues
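The mypy-JSON-to-SARIF conversion mentioned above can be sketched as below. This is a hedged illustration, not the inline script from CI: it assumes mypy's JSON-lines output carries `file`, `line`, `message`, `severity`, and `code` keys, and emits only the minimal SARIF 2.1.0 structure SonarQube needs:

```python
import json


def mypy_json_to_sarif(mypy_lines: list[str]) -> dict:
    """Convert mypy JSON-lines diagnostics to a minimal SARIF 2.1.0 log.

    Field names on the input records are assumptions about mypy's
    --output=json format; adjust if your mypy version differs.
    """
    results = []
    for raw in mypy_lines:
        rec = json.loads(raw)
        results.append({
            "ruleId": rec.get("code") or "mypy",
            "level": "error" if rec.get("severity") == "error" else "warning",
            "message": {"text": rec["message"]},
            "locations": [{
                "physicalLocation": {
                    "artifactLocation": {"uri": rec["file"]},
                    "region": {"startLine": rec["line"]},
                }
            }],
        })
    return {
        "version": "2.1.0",
        "runs": [{"tool": {"driver": {"name": "mypy"}}, "results": results}],
    }
```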
Enables SonarQube card display in Backstage catalog for this component.
- Change techdocs-ref to dir:. for proper source resolution
- Remove duplicate search plugin (bundled in techdocs-core)
@johannhartmann
Member Author

Code Review

Critical Issue Found

File: src/mayflower_sandbox/helpers/document/pptx_ooxml.py
Line: 183
Confidence: 98%

Issue: Incorrect XPath breaks pptx_rearrange() function

The XPath expression .//p:slide_id_list is incorrect. The actual XML element in OpenXML is sldIdLst (camelCase), not slide_id_list (snake_case). This was introduced when refactoring variable names to snake_case, but the XPath string was incorrectly changed along with it.

Current code:

slide_id_list = pres.find(".//p:slide_id_list", NS)  # WRONG

Should be:

slide_id_list = pres.find(".//p:sldIdLst", NS)  # CORRECT

Evidence:

  1. Test file tests/test_pptx_helpers.py line 91 shows correct XML structure: <p:sldIdLst>
  2. OpenXML specification uses camelCase element names
  3. Git history shows this was introduced in commit d39855b when refactoring for SonarQube compliance

Impact:

  • The pptx_rearrange() function will always return the original PPTX unchanged because the element will never be found
  • Any LangGraph agent attempting to reorder PowerPoint slides will fail silently
  • This breaks existing functionality that was working before the refactoring
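The silent-failure mode is easy to reproduce: `Element.find()` returns `None` when the XPath matches nothing, so the snake_case name never raises. A minimal demonstration (plain `ElementTree` here for illustration; the codebase parses via defusedxml):

```python
import xml.etree.ElementTree as ET

NS = {"p": "http://schemas.openxmlformats.org/presentationml/2006/main"}
xml = (
    '<p:presentation '
    'xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main">'
    "<p:sldIdLst><p:sldId id='256'/></p:sldIdLst>"
    "</p:presentation>"
)
pres = ET.fromstring(xml)

# The OOXML element name is camelCase; the snake_case XPath finds
# nothing and returns None instead of raising, hence the silent failure.
assert pres.find(".//p:slide_id_list", NS) is None   # the bug
assert pres.find(".//p:sldIdLst", NS) is not None    # the fix
```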

Automated code review by Claude Code

@johannhartmann force-pushed the sonar-fixes branch 2 times, most recently from 8c864c5 to 803c8f0 on January 31, 2026 at 16:47
Johann-Peter Hartmann and others added 6 commits January 31, 2026 17:49
Revert the techdocs-ref annotation back to url: format and restore
the search plugin while investigating TechDocs issues.
techdocs-core already includes the search plugin, so having both
causes duplicate entries. Consistent with scry configuration.
Phase 2.1 - Extract constants for duplicated literals:
- docx_ooxml.py: Extract _DOCUMENT_XML, _COMMENTS_XML, _XPATH_TEXT
- pptx_ooxml.py: Extract _PRESENTATION_XML, rename sldIdLst to snake_case
- integrations.py: Extract _INIT_PY constant
- execute.py: Extract _EXPLANATION_MARKER, _RECOMMENDATION_MARKER

Phase 2.2 - Fix minor issues:
- bootstrap.py: Remove unused thread_id parameter
- integrations.py: Use \W instead of [^0-9a-zA-Z_] for regex
- sandbox_executor.py, test_maistack_tools.py: Update callers

Phase 1.2 - Fix async issues:
- worker_pool.py: Use contextlib.suppress for CancelledError, store task refs
- cleanup.py: Use contextlib.suppress for CancelledError
- pdf_creation.py: Use run_in_executor for non-blocking file write

Phase 1.1 - Refactor cognitive complexity:
- file_grep.py: Extract _search_*, _format_* helpers, add _process_file method
- file_glob.py: Extract _matches_pattern, _format_file_info helpers
- run_file.py: Extract _format_*, _build_response_message, _try_langgraph_command
- execute_code.py: Extract AguiEventEmitter class, _format_* helpers
- base.py: Extract _get_thread_id_from_metadata, _get_thread_id_from_tags
- /push-watch: Push branch and monitor CI until completion
- /sonar-status: Check quality gate and metrics
- /sonar-fix: Find and fix issues by severity
- /pr-quality: Create PR with quality metrics included
Skills must be in <name>/SKILL.md format, not <name>.md
- Fix type annotation in AguiEventEmitter._writer field
- Replace silent try/except/pass with explanatory comments
- Add defusedxml dependency for secure XML parsing
- Replace xml.etree.ElementTree.fromstring with defusedxml in:
  - docx_ooxml.py (7 instances)
  - pptx_ooxml.py (5 instances)
- Add history_extraction module for extracting fenced code from messages
- Fix shell executor to handle redirections with shared OutputCapture
- Add thisProgram: "busybox" for proper applet invocation
- Update tool tests for history-based code extraction fallback
- Fix test_python_run_prepared_e2e state injection to include messages
- history_extraction.py: Extract _extract_text_from_item and _normalize_list_content helpers
- shell_executor.ts: Extract isExitStatusError, ensureParentDir, writeRedirectContent, handleRedirection helpers
- shell_executor.ts: Extract isSpecialEntry, buildEntryPath, tryReadDir, processEntry helpers
- shell_executor.ts: Use RegExp.exec() instead of String.match()
- shell_executor.ts: Use for-of loop instead of indexed for loop
- deepagents_backend.py: Extract _grep_file_matches and _matches_glob_filter helpers
When pool.start() fails, cls._pool was already set to the failed pool
instance. Subsequent calls would see the non-None pool and skip
re-initialization, leaving the system in a broken state.

Now the pool is only assigned to cls._pool after successful startup.
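The assign-after-successful-start pattern can be sketched as below. Class and method names are hypothetical; the point is that the class attribute is only written once `start()` has returned, so a failed startup leaves the singleton unset and a later call retries:

```python
class WorkerPoolSingleton:
    """Sketch: cache the pool only after start() succeeds."""

    _pool = None

    @classmethod
    async def get_pool(cls, make_pool):
        if cls._pool is None:
            pool = make_pool()
            await pool.start()  # may raise; nothing cached yet
            cls._pool = pool    # assigned only on successful startup
        return cls._pool
```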
Replace readline() with chunked read() in _wait_ready() to avoid pipe
buffering issues that caused workers to timeout during initialization.
readline() blocks indefinitely on unbuffered pipe streams, while read()
correctly returns data as it becomes available.
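The chunked-read handshake can be sketched as below. The marker value and helper name are assumptions; the shape matches the description: accumulate whatever `read()` returns until the ready marker appears, instead of blocking in `readline()`:

```python
READY_MARKER = b"READY\n"  # hypothetical handshake token


def wait_ready(stream, chunk_size: int = 4096) -> bytes:
    """Accumulate chunked read() output until the ready marker appears.

    read(n) returns up to n bytes as soon as any are available on a
    pipe, so partial writes from the worker still make progress here.
    """
    buf = b""
    while READY_MARKER not in buf:
        chunk = stream.read(chunk_size)
        if not chunk:  # EOF: the worker died before signalling ready
            raise RuntimeError("worker exited before signalling ready")
        buf += chunk
    return buf
```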
- Add comprehensive unit tests for history_extraction.py (59 tests)
- Add unit tests for deepagents_backend.py (54 tests) with mock for optional dep
- Exclude shell_executor.ts from coverage (TypeScript, not measurable by pytest)
Implement SharedArrayBuffer ring buffer pipes connecting Deno Workers
for proper shell pipe support (echo hello | cat). Each pipeline stage
runs in isolated Worker with VFS files mounted.

- Add PipeWriter/PipeReader classes with Atomics synchronization
- Add executePipeline() for multi-stage command execution
- Pass VFS files to pipeline workers for file access
- Add pipe tests: simple, multi-stage, grep, wc, file operations
johannhartmann and others added 20 commits January 31, 2026 17:58
- Fix XPath typo in pptx_ooxml.py: sldIdLst not slide_id_list
- Add error logging to empty catch blocks in shell Worker code
- Add 30s timeout to pipe reads to prevent infinite loops
- Add timeout to future.result() for cross-thread calls
- Make exception handling more specific in aupload_files
- Add JSON-RPC error response for malformed worker requests
- Remove unused ProcessManager import and function
- Add logging import and debug logs for error paths
- Delete development notes: VFS_FALLBACK_TEST_SPEC.md, testfiles.md, sandbox_changes.md
- Add Shell Executor section to architecture docs
- Document BusyBox WASM, pipe support, Worker-based isolation
- Add shell execution to key features in index.md
- Remove "simple mode" as a feature - it's an optimization for non-pipe commands
- Explain Worker isolation is necessary due to BusyBox global state
- Update code comments to reflect actual architecture
- Remove redundant simple mode code path
- Implement proper output redirection via state-based capture
- Add shell command parsing inside Workers (&&, ;, >, >>)
- Execute applets directly to avoid vfork/waitpid syscalls
- All commands now run in isolated Deno Workers with BusyBox WASM
Replace regex patterns with overlapping \s* quantifiers that could
cause exponential backtracking with indexOf-based string parsing:

- Split on (&&|;) without \s* padding, trim parts afterward
- Use extractFirstToken() helper instead of /^\S+/ regex
- Parse redirections (>, >>, <) using lastIndexOf/indexOf

This eliminates denial-of-service risk from malicious input strings.
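The split-then-trim idea transfers directly; here it is in Python rather than the TypeScript of shell_executor.ts. Keeping `\s*` padding out of the alternation avoids overlapping quantifiers around the split points; whitespace is trimmed afterward in linear time:

```python
import re

_SEPARATORS = re.compile(r"&&|;")  # no \s* padding around the alternation


def split_commands(line: str) -> list[str]:
    """Split a command line on && and ;, trimming parts afterward.

    Patterns like \\s*(&&|;)\\s* put adjacent unbounded quantifiers
    around the split point, which adversarial input can exploit;
    splitting on the bare separators and calling strip() avoids that.
    """
    return [part.strip() for part in _SEPARATORS.split(line) if part.strip()]
```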
- Add DeepAgents integration and shell execution to key features
- Update architecture diagram to show both LangGraph and DeepAgents paths
- Add ShellExecutor and BusyBox WASM to architecture components
- Add new DeepAgents Backend section with usage examples
- Document shell execution features (pipes, chaining, redirections)
- Update security section to mention WASM and Worker-based isolation
- Add DeepAgents and BusyBox to related projects
Add table listing all protocol methods with their async versions
and descriptions. Update usage examples to show all method types:
execute, file operations, and search operations.
Add comparison table showing which features are available in each API.
LangChain tools have skills/MCP/state-based execution; DeepAgents
backend has shell execution and batch operations. Neither is a
superset of the other.
MayflowerSandboxBackend now returns proper files_update dict in WriteResult
and EditResult, matching DeepAgents StateBackend format. This allows the
FilesystemMiddleware to populate the 'files' state field, enabling the
frontend to display files in the Files view.
- Remove entire src/mayflower_sandbox/tools/ directory (12 LangChain tool files)
- Remove tools_skills_mcp.py and agent_state.py
- Add fallback dataclass types in deepagents_backend.py for testing without deepagents
- Add create_docx_bytes function to docx_ooxml.py for document creation
- Update conftest.py with deepagents mocking fixture
- Convert docx and document skills tests to use MayflowerSandboxBackend directly
- Add Python script detection tests to test_deepagents_backend.py
- Remove all LangChain-related test files

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Extract PostgresBackend from MayflowerSandboxBackend as standalone class
- PostgresBackend implements BackendProtocol (file ops only, no execution)
- MayflowerSandboxBackend extends PostgresBackend with SandboxBackendProtocol
- PostgresBackend can be used standalone or with CompositeBackend for routing
- Export PostgresBackend from package __init__.py
- Add BackendProtocol to test mocks
- Fix datetime.utcnow() deprecation warning

Example usage:
  # Standalone file storage
  postgres = PostgresBackend(db_pool, thread_id)

  # With CompositeBackend routing
  composite = CompositeBackend(
      default=StateBackend(runtime),
      routes={"/memories/": postgres}
  )

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- 10 tests for PostgresBackend (BackendProtocol): write, read, edit,
  ls_info, glob_info, grep_raw, upload_files, download_files
- 9 tests for MayflowerSandboxBackend (SandboxBackendProtocol): shell
  execution via BusyBox, Python execution via Pyodide worker pool,
  file I/O with VFS sync, id property, inheritance verification
- Tests use real PostgreSQL from docker-compose (port 5433)
- Tests use real Deno worker pool for Pyodide execution

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- SECURITY.md: vulnerability disclosure policy with 24h SLA, safe harbor
- SUPPORT.md: rolling support model, update channels (PyPI, GitHub Releases)
- CONTRIBUTING.md: secure development policy, mandatory CI gates
- CHANGELOG.md: Keep a Changelog format with [SECURITY] tagging convention
- .github/dependabot.yml: weekly pip + github-actions dependency scanning
- .github/workflows/release.yml: SBOM generation (CycloneDX + Syft) on tags
- .github/workflows/ci.yml: add SBOM validation step to quality checks
- README.md: add Bandit and CycloneDX badges
The build and cyclonedx-bom packages are installed in the uv venv,
so commands must run via uv run to use the correct Python environment.
Use -o (not --output) for output file and --of JSON (uppercase).
Add --mc-type library and --pyproject for proper SBOM metadata.
…te routing

MayflowerSandboxBackend._parse_python_command() only matched `python script.py`
patterns, causing `python -c "..."` inline execution and the __PYTHON__ sentinel
(used by ToolCallContentMiddleware) to fall through to BusyBox shell, which has
no Python binary.

Add _extract_inline_python() static method and extend execute()/aexecute() to
check for sentinel prefix and python -c patterns before file-based routing.
This eliminates the need for external monkey-patches.
- Remove B904, B905, B008, N806 from global ruff ignores; add `from err`
  to bare raises, `strict=False` to zip calls
- Replace blanket mypy ignore_errors with targeted ignore_missing_imports
  for helpers; fix 7 real type errors in helpers (None checks, no-redef,
  arg-type) and 23 in tests (asyncpg overload matching, attr-defined,
  operator)
- Refactor worker_pool.py: replace asyncio.StreamReader._buffer hack with
  external read buffer; extract typed params dict to remove type: ignore
- Merge duplicate ruff per-file-ignores (tests/* + **/test_*.py → tests/**)
- Trim SonarQube coverage exclusions from 13 to 6 entries, exposing ~1,872
  lines of production code for coverage tracking
…onal

- Add if: always() to test results upload step so artifacts are available
  for SonarQube even when tests fail
- Make deno-coverage.xml path conditional in SonarQube scan to prevent
  hard failure when the file is missing
…/home

snapshotFiles previously only watched /tmp and /home, missing files
written to / (e.g. fibonacci_spiral.png from matplotlib). Changed to
snapshot from root with system path filtering (/lib, /proc, /dev, etc.)
to catch all user-created files while skipping Pyodide internals.
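The path filter can be sketched as a simple prefix check. The commit names `/lib`, `/proc`, and `/dev`; the other prefixes below are plausible additions for a Pyodide filesystem and should be treated as assumptions:

```python
# /lib, /proc, /dev come from the commit; /usr, /etc, /sys are assumed.
SYSTEM_PREFIXES = ("/lib", "/proc", "/dev", "/usr", "/etc", "/sys")


def is_user_file(path: str) -> bool:
    """True for paths the snapshot should keep when walking from /."""
    return not any(
        path == prefix or path.startswith(prefix + "/")
        for prefix in SYSTEM_PREFIXES
    )
```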
After Python code execution, files detected by snapshotFiles (e.g.
matplotlib PNGs) are now stored as pending files_update keyed by
thread_id. The tool node can consume these via the classmethod
consume_pending_files_update() to inject into state.files, making
them visible in the frontend Files panel.
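The store/consume handshake can be sketched as below. Class and storage details are hypothetical; the essential property, matching `consume_pending_files_update()` as described, is that the tool node pops each thread's pending files exactly once before merging them into `state['files']`:

```python
class PendingFiles:
    """Sketch of pending files_update keyed by thread_id."""

    _pending: dict[str, dict[str, bytes]] = {}

    @classmethod
    def store(cls, thread_id: str, files: dict[str, bytes]) -> None:
        """Called after execution when snapshotFiles detects new files."""
        cls._pending.setdefault(thread_id, {}).update(files)

    @classmethod
    def consume_pending_files_update(cls, thread_id: str) -> dict[str, bytes]:
        """Pop-once semantics: a second consume returns an empty dict."""
        return cls._pending.pop(thread_id, {})
```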
