Add SonarQube integration and refactor E2E tests #4

Open
johannhartmann wants to merge 108 commits into main from sonar-fixes
Conversation

@johannhartmann
Member

Summary

  • Add SonarQube configuration and CI integration
  • Refactor E2E tests for reliability (no LLM output string matching)

Changes

SonarQube Integration

  • Add sonar-project.properties with coverage/junit paths
  • Add flake8-bandit (S) security rules to ruff
  • Update CI workflow with SonarQube scan job on mayflower-k8s-runners
  • Generate SARIF, junit.xml, and coverage.xml for SonarQube ingestion

Test Reliability

  • Replace 11 flaky LLM-dependent tests with 8 deterministic unit tests
  • Tests call tools directly with pre-defined state (no LLM)
  • Assertions check VFS file existence/content, not stdout strings
  • Remove fragile regex parsing of LLM markdown output

Test plan

  • Unit tests pass locally (8/8)
  • CI workflow runs successfully
  • SonarQube receives analysis data

- schema_codegen.py: Filter __future__ imports to avoid syntax errors
- test_skills_install.py: Fix GitHub URL path (skills/algorithmic-art)
- conftest.py: Add DENO_AVAILABLE check and requires_deno marker
- test_mcp_bind.py: Set up module hierarchy for relative imports
- test_tools.py: Update assertions and handle Command return type
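The `__future__` filtering mentioned for schema_codegen.py can be sketched as follows. This is a minimal illustration, not the actual implementation; the helper name is hypothetical. The point is that Python requires `from __future__ import ...` to precede all other statements, so generated code that splices snippets together must hoist (or drop) such imports to avoid a SyntaxError:

```python
def hoist_future_imports(source: str) -> str:
    """Move `from __future__ import ...` lines to the top of generated code.

    __future__ imports are only legal before any other statement, so a
    codegen step that concatenates snippets can otherwise produce
    syntactically invalid modules.
    """
    future_lines: list[str] = []
    other_lines: list[str] = []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("from __future__ import"):
            if stripped not in future_lines:  # de-duplicate
                future_lines.append(stripped)
        else:
            other_lines.append(line)
    return "\n".join(future_lines + other_lines)
```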
Tests should not rely on string matching in LLM responses since this is
inherently unreliable. Updated E2E tests to use deterministic validation:

- Added helper functions: get_tool_messages(), get_tool_stdout(),
  get_created_files(), check_vfs_file_exists(), get_vfs_file_content()
- test_python_run_prepared_*: Check tool stdout for expected output
- test_file_write_*: Verify files exist in VFS with correct content
- Document tests: Verify file creation and magic bytes in VFS
- test_agent_state: Check ToolMessage content instead of final AI message

This makes tests deterministic and not dependent on LLM response formatting.
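The VFS-based assertion style described above can be sketched like this. The real helpers operate on sandbox state; here the VFS is modeled as a plain `dict[str, bytes]` and the signatures are assumptions, but the testing pattern (assert on file existence and magic bytes, never on LLM prose) is the one the commit describes:

```python
def check_vfs_file_exists(vfs: dict[str, bytes], path: str) -> bool:
    """Deterministic check: does the sandbox VFS contain this path?"""
    return path in vfs


def get_vfs_file_content(vfs: dict[str, bytes], path: str) -> bytes:
    """Fetch file content from the VFS, failing loudly if missing."""
    assert path in vfs, f"expected {path} in VFS, found {sorted(vfs)}"
    return vfs[path]


# Assertion style: verify the artifact, not the model's wording.
vfs = {"/home/report.png": b"\x89PNG\r\n\x1a\n" + b"\x00" * 8}
assert check_vfs_file_exists(vfs, "/home/report.png")
assert get_vfs_file_content(vfs, "/home/report.png").startswith(b"\x89PNG")
```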
SonarQube integration:
- Add sonar-project.properties with coverage/junit paths
- Add flake8-bandit (S) security rules to ruff
- Configure pytest to output junit.xml

Test reliability improvements:
- Replace 11 flaky LLM-dependent tests with 8 deterministic unit tests
- Tests now call tools directly with pre-defined state (no LLM)
- Assertions check VFS file existence/content, not stdout strings
- Remove fragile regex parsing of LLM markdown output
- Add optional @pytest.mark.slow integration test for full agent flow
- Add SARIF output for ruff linting
- Add junit.xml and coverage.xml output for pytest
- Add SonarQube scan job on mayflower-k8s-runners
- Upload artifacts between jobs for SonarQube ingestion
- Skip slow tests in CI (marked with @pytest.mark.slow)
- Add .pre-commit-config.yaml with ruff and mypy hooks
- Add pre-commit to dev dependencies
- Add S314 ignore for document helper XML parsing (trusted source)
- Fix formatting in test files
- Replace mypy with Astral ty in pre-commit hooks
- Fix type annotation: date_iso: str | None = None
- Fix deprecated datetime.utcnow() -> datetime.now(timezone.utc)
- Ignore unresolved-import for Pyodide-only modules
- Ignore unknown-argument for langchain (lacks type stubs)
- Fix tests to use POSTGRES_PORT instead of PGPORT (CI uses 5432)
- Run SonarQube job even if tests fail (if: always())
- Fix flaky test: verify VFS files instead of LLM string matching
- Add junit.xml, coverage.xml, ruff-results.sarif to .gitignore
- Update uv.lock with pre-commit dependencies
- Delete debug test files: test_debug_recursion.py, test_debug_stream.py,
  test_pptx_debug.py (no assertions, just print output)
- Delete redundant test_helpers_import.py (covered by test_helper_direct.py)
- Add missing @skipif decorator to test_agent_pdf_with_multiple_pages
- Mark all LLM-based test modules with pytest.mark.slow:
  - test_langgraph_integration.py
  - test_langgraph_realistic.py
  - test_langgraph_skills.py
  - test_document_skills.py
  - test_docx_add_comment.py
  - test_docx_to_markdown.py

Run fast tests with: pytest -m "not slow"
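The `datetime.utcnow()` deprecation fix listed above is the standard migration to timezone-aware datetimes:

```python
from datetime import datetime, timezone

# datetime.utcnow() is deprecated (since Python 3.12) and returns a
# naive datetime; datetime.now(timezone.utc) returns an aware one.
now = datetime.now(timezone.utc)
assert now.tzinfo is timezone.utc
iso = now.isoformat()  # aware UTC timestamps serialize with a +00:00 offset
```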
New test file covering:
- Corrupt/malformed Excel files (invalid ZIP, truncated, empty)
- Corrupt/malformed Word documents (invalid data, malformed XML)
- Memory boundary tests (large strings, large lists)

These tests verify the sandbox handles edge cases gracefully
without crashing.
- Convert mypy JSON output to SARIF format via inline Python script
- Upload mypy SARIF as CI artifact for SonarQube ingestion
- Include mypy-results.sarif in dynamic SARIF path collection
- Update sonar.sources to src/mayflower_sandbox for coverage path matching
- Clean up sonar-project.properties comments
- Add coverage.py config with relative_files=true for portable paths
- Revert sonar.sources to 'src' for proper path matching
- Upgrade SonarQube scan action from v5 to v6 (security fix)
- Add Bandit security scanner for Python code
- Add ESLint with TypeScript plugin for TS files
- Configure SonarQube to import external analyzer reports
- Update dependencies (bandit>=1.7.0)

External analyzer reports now sent to SonarQube:
- bandit-results.json: Python security issues
- eslint-results.json: TypeScript linting issues
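The mypy-JSON-to-SARIF conversion mentioned above can be sketched as below. This is a hedged illustration, not the inline script from CI: it assumes mypy's JSON-lines output carries `file`, `line`, `message`, `severity`, and `code` keys, and emits only the minimal SARIF 2.1.0 structure SonarQube needs:

```python
import json


def mypy_json_to_sarif(mypy_lines: list[str]) -> dict:
    """Convert mypy JSON-lines diagnostics to a minimal SARIF 2.1.0 log.

    Field names on the input records are assumptions about mypy's
    --output=json format; adjust if your mypy version differs.
    """
    results = []
    for raw in mypy_lines:
        rec = json.loads(raw)
        results.append({
            "ruleId": rec.get("code") or "mypy",
            "level": "error" if rec.get("severity") == "error" else "warning",
            "message": {"text": rec["message"]},
            "locations": [{
                "physicalLocation": {
                    "artifactLocation": {"uri": rec["file"]},
                    "region": {"startLine": rec["line"]},
                }
            }],
        })
    return {
        "version": "2.1.0",
        "runs": [{"tool": {"driver": {"name": "mypy"}}, "results": results}],
    }
```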
Enables SonarQube card display in Backstage catalog for this component.
- Change techdocs-ref to dir:. for proper source resolution
- Remove duplicate search plugin (bundled in techdocs-core)
@johannhartmann
Member Author

Code Review

Critical Issue Found

File: src/mayflower_sandbox/helpers/document/pptx_ooxml.py
Line: 183
Confidence: 98%

Issue: Incorrect XPath breaks pptx_rearrange() function

The XPath expression .//p:slide_id_list is incorrect. The actual XML element in OpenXML is sldIdLst (camelCase), not slide_id_list (snake_case). This was introduced when refactoring variable names to snake_case, but the XPath string was incorrectly changed along with it.

Current code:

slide_id_list = pres.find(".//p:slide_id_list", NS)  # WRONG

Should be:

slide_id_list = pres.find(".//p:sldIdLst", NS)  # CORRECT

Evidence:

  1. Test file tests/test_pptx_helpers.py line 91 shows correct XML structure: <p:sldIdLst>
  2. OpenXML specification uses camelCase element names
  3. Git history shows this was introduced in commit d39855b when refactoring for SonarQube compliance

Impact:

  • The pptx_rearrange() function will always return the original PPTX unchanged because the element will never be found
  • Any LangGraph agent attempting to reorder PowerPoint slides will fail silently
  • This breaks existing functionality that was working before the refactoring
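The silent-failure mode is easy to reproduce: `Element.find()` returns `None` when the XPath matches nothing, so the snake_case name never raises. A minimal demonstration (plain `ElementTree` here for illustration; the codebase parses via defusedxml):

```python
import xml.etree.ElementTree as ET

NS = {"p": "http://schemas.openxmlformats.org/presentationml/2006/main"}
xml = (
    '<p:presentation '
    'xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main">'
    "<p:sldIdLst><p:sldId id='256'/></p:sldIdLst>"
    "</p:presentation>"
)
pres = ET.fromstring(xml)

# The OOXML element name is camelCase; the snake_case XPath finds
# nothing and returns None instead of raising, hence the silent failure.
assert pres.find(".//p:slide_id_list", NS) is None   # the bug
assert pres.find(".//p:sldIdLst", NS) is not None    # the fix
```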

Automated code review by Claude Code

@johannhartmann force-pushed the sonar-fixes branch 2 times, most recently from 8c864c5 to 803c8f0 on January 31, 2026 at 16:47
Johann-Peter Hartmann and others added 6 commits January 31, 2026 17:49
Revert the techdocs-ref annotation back to url: format and restore
the search plugin while investigating TechDocs issues.
techdocs-core already includes the search plugin, so having both
causes duplicate entries. Consistent with scry configuration.
Phase 2.1 - Extract constants for duplicated literals:
- docx_ooxml.py: Extract _DOCUMENT_XML, _COMMENTS_XML, _XPATH_TEXT
- pptx_ooxml.py: Extract _PRESENTATION_XML, rename sldIdLst to snake_case
- integrations.py: Extract _INIT_PY constant
- execute.py: Extract _EXPLANATION_MARKER, _RECOMMENDATION_MARKER

Phase 2.2 - Fix minor issues:
- bootstrap.py: Remove unused thread_id parameter
- integrations.py: Use \W instead of [^0-9a-zA-Z_] for regex
- sandbox_executor.py, test_maistack_tools.py: Update callers

Phase 1.2 - Fix async issues:
- worker_pool.py: Use contextlib.suppress for CancelledError, store task refs
- cleanup.py: Use contextlib.suppress for CancelledError
- pdf_creation.py: Use run_in_executor for non-blocking file write

Phase 1.1 - Refactor cognitive complexity:
- file_grep.py: Extract _search_*, _format_* helpers, add _process_file method
- file_glob.py: Extract _matches_pattern, _format_file_info helpers
- run_file.py: Extract _format_*, _build_response_message, _try_langgraph_command
- execute_code.py: Extract AguiEventEmitter class, _format_* helpers
- base.py: Extract _get_thread_id_from_metadata, _get_thread_id_from_tags
- /push-watch: Push branch and monitor CI until completion
- /sonar-status: Check quality gate and metrics
- /sonar-fix: Find and fix issues by severity
- /pr-quality: Create PR with quality metrics included
Skills must be in <name>/SKILL.md format, not <name>.md
- Fix type annotation in AguiEventEmitter._writer field
- Replace silent try/except/pass with explanatory comments
- Add defusedxml dependency for secure XML parsing
- Replace xml.etree.ElementTree.fromstring with defusedxml in:
  - docx_ooxml.py (7 instances)
  - pptx_ooxml.py (5 instances)
- Add history_extraction module for extracting fenced code from messages
- Fix shell executor to handle redirections with shared OutputCapture
- Add thisProgram: "busybox" for proper applet invocation
- Update tool tests for history-based code extraction fallback
- Fix test_python_run_prepared_e2e state injection to include messages
- history_extraction.py: Extract _extract_text_from_item and _normalize_list_content helpers
- shell_executor.ts: Extract isExitStatusError, ensureParentDir, writeRedirectContent, handleRedirection helpers
- shell_executor.ts: Extract isSpecialEntry, buildEntryPath, tryReadDir, processEntry helpers
- shell_executor.ts: Use RegExp.exec() instead of String.match()
- shell_executor.ts: Use for-of loop instead of indexed for loop
- deepagents_backend.py: Extract _grep_file_matches and _matches_glob_filter helpers
When pool.start() fails, cls._pool was already set to the failed pool
instance. Subsequent calls would see the non-None pool and skip
re-initialization, leaving the system in a broken state.

Now the pool is only assigned to cls._pool after successful startup.
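The assign-after-successful-start pattern can be sketched as below. Class and method names are hypothetical; the point is that the class attribute is only written once `start()` has returned, so a failed startup leaves the singleton unset and a later call retries:

```python
class WorkerPoolSingleton:
    """Sketch: cache the pool only after start() succeeds."""

    _pool = None

    @classmethod
    async def get_pool(cls, make_pool):
        if cls._pool is None:
            pool = make_pool()
            await pool.start()  # may raise; nothing cached yet
            cls._pool = pool    # assigned only on successful startup
        return cls._pool
```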
Replace readline() with chunked read() in _wait_ready() to avoid pipe
buffering issues that caused workers to timeout during initialization.
readline() blocks indefinitely on unbuffered pipe streams, while read()
correctly returns data as it becomes available.
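The chunked-read handshake can be sketched as below. The marker value and helper name are assumptions; the shape matches the description: accumulate whatever `read()` returns until the ready marker appears, instead of blocking in `readline()`:

```python
READY_MARKER = b"READY\n"  # hypothetical handshake token


def wait_ready(stream, chunk_size: int = 4096) -> bytes:
    """Accumulate chunked read() output until the ready marker appears.

    read(n) returns up to n bytes as soon as any are available on a
    pipe, so partial writes from the worker still make progress here.
    """
    buf = b""
    while READY_MARKER not in buf:
        chunk = stream.read(chunk_size)
        if not chunk:  # EOF: the worker died before signalling ready
            raise RuntimeError("worker exited before signalling ready")
        buf += chunk
    return buf
```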
- Add comprehensive unit tests for history_extraction.py (59 tests)
- Add unit tests for deepagents_backend.py (54 tests) with mock for optional dep
- Exclude shell_executor.ts from coverage (TypeScript, not measurable by pytest)
Implement SharedArrayBuffer ring buffer pipes connecting Deno Workers
for proper shell pipe support (echo hello | cat). Each pipeline stage
runs in isolated Worker with VFS files mounted.

- Add PipeWriter/PipeReader classes with Atomics synchronization
- Add executePipeline() for multi-stage command execution
- Pass VFS files to pipeline workers for file access
- Add pipe tests: simple, multi-stage, grep, wc, file operations
johannhartmann and others added 20 commits January 31, 2026 17:58
- Fix XPath typo in pptx_ooxml.py: sldIdLst not slide_id_list
- Add error logging to empty catch blocks in shell Worker code
- Add 30s timeout to pipe reads to prevent infinite loops
- Add timeout to future.result() for cross-thread calls
- Make exception handling more specific in aupload_files
- Add JSON-RPC error response for malformed worker requests
- Remove unused ProcessManager import and function
- Add logging import and debug logs for error paths
- Delete development notes: VFS_FALLBACK_TEST_SPEC.md, testfiles.md, sandbox_changes.md
- Add Shell Executor section to architecture docs
- Document BusyBox WASM, pipe support, Worker-based isolation
- Add shell execution to key features in index.md
- Remove "simple mode" as a feature - it's an optimization for non-pipe commands
- Explain Worker isolation is necessary due to BusyBox global state
- Update code comments to reflect actual architecture
- Remove redundant simple mode code path
- Implement proper output redirection via state-based capture
- Add shell command parsing inside Workers (&&, ;, >, >>)
- Execute applets directly to avoid vfork/waitpid syscalls
- All commands now run in isolated Deno Workers with BusyBox WASM
Replace regex patterns with overlapping \s* quantifiers that could
cause exponential backtracking with indexOf-based string parsing:

- Split on (&&|;) without \s* padding, trim parts afterward
- Use extractFirstToken() helper instead of /^\S+/ regex
- Parse redirections (>, >>, <) using lastIndexOf/indexOf

This eliminates denial-of-service risk from malicious input strings.
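The split-then-trim idea transfers directly; here it is in Python rather than the TypeScript of shell_executor.ts. Keeping `\s*` padding out of the alternation avoids overlapping quantifiers around the split points; whitespace is trimmed afterward in linear time:

```python
import re

_SEPARATORS = re.compile(r"&&|;")  # no \s* padding around the alternation


def split_commands(line: str) -> list[str]:
    """Split a command line on && and ;, trimming parts afterward.

    Patterns like \\s*(&&|;)\\s* put adjacent unbounded quantifiers
    around the split point, which adversarial input can exploit;
    splitting on the bare separators and calling strip() avoids that.
    """
    return [part.strip() for part in _SEPARATORS.split(line) if part.strip()]
```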
- Add DeepAgents integration and shell execution to key features
- Update architecture diagram to show both LangGraph and DeepAgents paths
- Add ShellExecutor and BusyBox WASM to architecture components
- Add new DeepAgents Backend section with usage examples
- Document shell execution features (pipes, chaining, redirections)
- Update security section to mention WASM and Worker-based isolation
- Add DeepAgents and BusyBox to related projects
Add table listing all protocol methods with their async versions
and descriptions. Update usage examples to show all method types:
execute, file operations, and search operations.
Add comparison table showing which features are available in each API.
LangChain tools have skills/MCP/state-based execution; DeepAgents
backend has shell execution and batch operations. Neither is a
superset of the other.
MayflowerSandboxBackend now returns proper files_update dict in WriteResult
and EditResult, matching DeepAgents StateBackend format. This allows the
FilesystemMiddleware to populate the 'files' state field, enabling the
frontend to display files in the Files view.
- Remove entire src/mayflower_sandbox/tools/ directory (12 LangChain tool files)
- Remove tools_skills_mcp.py and agent_state.py
- Add fallback dataclass types in deepagents_backend.py for testing without deepagents
- Add create_docx_bytes function to docx_ooxml.py for document creation
- Update conftest.py with deepagents mocking fixture
- Convert docx and document skills tests to use MayflowerSandboxBackend directly
- Add Python script detection tests to test_deepagents_backend.py
- Remove all LangChain-related test files

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Extract PostgresBackend from MayflowerSandboxBackend as standalone class
- PostgresBackend implements BackendProtocol (file ops only, no execution)
- MayflowerSandboxBackend extends PostgresBackend with SandboxBackendProtocol
- PostgresBackend can be used standalone or with CompositeBackend for routing
- Export PostgresBackend from package __init__.py
- Add BackendProtocol to test mocks
- Fix datetime.utcnow() deprecation warning

Example usage:
  # Standalone file storage
  postgres = PostgresBackend(db_pool, thread_id)

  # With CompositeBackend routing
  composite = CompositeBackend(
      default=StateBackend(runtime),
      routes={"/memories/": postgres}
  )

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- 10 tests for PostgresBackend (BackendProtocol): write, read, edit,
  ls_info, glob_info, grep_raw, upload_files, download_files
- 9 tests for MayflowerSandboxBackend (SandboxBackendProtocol): shell
  execution via BusyBox, Python execution via Pyodide worker pool,
  file I/O with VFS sync, id property, inheritance verification
- Tests use real PostgreSQL from docker-compose (port 5433)
- Tests use real Deno worker pool for Pyodide execution

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- SECURITY.md: vulnerability disclosure policy with 24h SLA, safe harbor
- SUPPORT.md: rolling support model, update channels (PyPI, GitHub Releases)
- CONTRIBUTING.md: secure development policy, mandatory CI gates
- CHANGELOG.md: Keep a Changelog format with [SECURITY] tagging convention
- .github/dependabot.yml: weekly pip + github-actions dependency scanning
- .github/workflows/release.yml: SBOM generation (CycloneDX + Syft) on tags
- .github/workflows/ci.yml: add SBOM validation step to quality checks
- README.md: add Bandit and CycloneDX badges
The build and cyclonedx-bom packages are installed in the uv venv,
so commands must run via uv run to use the correct Python environment.
Use -o (not --output) for output file and --of JSON (uppercase).
Add --mc-type library and --pyproject for proper SBOM metadata.
…te routing

MayflowerSandboxBackend._parse_python_command() only matched `python script.py`
patterns, causing `python -c "..."` inline execution and the __PYTHON__ sentinel
(used by ToolCallContentMiddleware) to fall through to BusyBox shell, which has
no Python binary.

Add _extract_inline_python() static method and extend execute()/aexecute() to
check for sentinel prefix and python -c patterns before file-based routing.
This eliminates the need for external monkey-patches.
- Remove B904, B905, B008, N806 from global ruff ignores; add `from err`
  to bare raises, `strict=False` to zip calls
- Replace blanket mypy ignore_errors with targeted ignore_missing_imports
  for helpers; fix 7 real type errors in helpers (None checks, no-redef,
  arg-type) and 23 in tests (asyncpg overload matching, attr-defined,
  operator)
- Refactor worker_pool.py: replace asyncio.StreamReader._buffer hack with
  external read buffer; extract typed params dict to remove type: ignore
- Merge duplicate ruff per-file-ignores (tests/* + **/test_*.py → tests/**)
- Trim SonarQube coverage exclusions from 13 to 6 entries, exposing ~1,872
  lines of production code for coverage tracking
…onal

- Add if: always() to test results upload step so artifacts are available
  for SonarQube even when tests fail
- Make deno-coverage.xml path conditional in SonarQube scan to prevent
  hard failure when the file is missing
…/home

snapshotFiles previously only watched /tmp and /home, missing files
written to / (e.g. fibonacci_spiral.png from matplotlib). Changed to
snapshot from root with system path filtering (/lib, /proc, /dev, etc.)
to catch all user-created files while skipping Pyodide internals.
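The path filter can be sketched as a simple prefix check. The commit names `/lib`, `/proc`, and `/dev`; the other prefixes below are plausible additions for a Pyodide filesystem and should be treated as assumptions:

```python
# /lib, /proc, /dev come from the commit; /usr, /etc, /sys are assumed.
SYSTEM_PREFIXES = ("/lib", "/proc", "/dev", "/usr", "/etc", "/sys")


def is_user_file(path: str) -> bool:
    """True for paths the snapshot should keep when walking from /."""
    return not any(
        path == prefix or path.startswith(prefix + "/")
        for prefix in SYSTEM_PREFIXES
    )
```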
After Python code execution, files detected by snapshotFiles (e.g.
matplotlib PNGs) are now stored as pending files_update keyed by
thread_id. The tool node can consume these via the classmethod
consume_pending_files_update() to inject into state.files, making
them visible in the frontend Files panel.
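The store/consume handshake can be sketched as below. Class and storage details are hypothetical; the essential property, matching `consume_pending_files_update()` as described, is that the tool node pops each thread's pending files exactly once before merging them into `state['files']`:

```python
class PendingFiles:
    """Sketch of pending files_update keyed by thread_id."""

    _pending: dict[str, dict[str, bytes]] = {}

    @classmethod
    def store(cls, thread_id: str, files: dict[str, bytes]) -> None:
        """Called after execution when snapshotFiles detects new files."""
        cls._pending.setdefault(thread_id, {}).update(files)

    @classmethod
    def consume_pending_files_update(cls, thread_id: str) -> dict[str, bytes]:
        """Pop-once semantics: a second consume returns an empty dict."""
        return cls._pending.pop(thread_id, {})
```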
