Skip to content

Fix OWL generation to include nested schema modules#52

Merged
cmungall merged 1 commit intomainfrom
fix-owl-and-merge-hierarchy
Feb 21, 2026
Merged

Fix OWL generation to include nested schema modules#52
cmungall merged 1 commit intomainfrom
fix-owl-and-merge-hierarchy

Conversation

@cmungall
Copy link
Member

Summary

The merge-hierarchy script (scripts/merge_enums_hierarchy.py) used glob('*.yaml') to find schemas within domain directories, which only searches one level deep. This missed 18 schemas in nested subdirectories:

  • energy/nuclear/*.yaml (fusion, nuclear_cleanup, nuclear_forensics, etc.)
  • energy/renewable/*.yaml (hydrogen, geothermal, bioenergy)
  • medical/oncology/*.yaml, medical/pediatric_oncology/**/*.yaml

Changed to rglob('*.yaml') to recurse into subdirectories.

Before: 675 concrete enums in merged hierarchy
After: 693 concrete enums

The regenerated OWL now includes all Genesis enums (AquiferType, FusionConfinementType, etc.) that were previously missing from https://w3id.org/valuesets/valuesets.owl.ttl.

Test plan

  • just merge-hierarchy now finds 693 concrete enums (was 675)
  • grep AquiferType project/owl/valuesets.owl.ttl returns results
  • After merge, verify curl -L -s https://w3id.org/valuesets/valuesets.owl.ttl | grep AquiferType returns content

🤖 Generated with Claude Code

The merge script used glob('*.yaml') which only found schemas one
level deep within domain directories, missing nested schemas like
energy/nuclear/*.yaml and medical/pediatric_oncology/*.yaml (18
schemas total). Changed to rglob('*.yaml') to recurse.

Regenerated OWL now includes all Genesis enums (AquiferType, etc.)
and other previously missing nested schemas.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings February 21, 2026 22:39
@cmungall cmungall merged commit 80c4a29 into main Feb 21, 2026
7 checks passed
@claude
Copy link
Contributor

claude bot commented Feb 21, 2026

Claude finished @cmungall's task —— View job


PR Review Complete ✅

I've thoroughly reviewed PR #52: "Fix OWL generation to include nested schema modules". Here's my comprehensive feedback:

Todo List:

  • Read and analyze the Python script changes
  • Review the repository structure to understand nested schema organization
  • Check for test coverage of the merge-hierarchy functionality
  • Evaluate code quality and best practices
  • Assess performance implications
  • Check for security considerations
  • Provide comprehensive feedback

Code Quality & Best Practices ✅

The Fix is Excellent:
The change from glob('*.yaml') to rglob('*.yaml') on line 53 in scripts/merge_enums_hierarchy.py is exactly the right fix:

# Before: Only searched one level deep
for yaml_file in domain_dir.glob('*.yaml'):

# After: Recursively searches all subdirectories  
for yaml_file in domain_dir.rglob('*.yaml'):

Code Quality Assessment:

  • Minimal, targeted change: Only one line modified, following the principle of least change
  • Semantically correct: rglob() is the standard Python method for recursive globbing
  • Consistent with codebase: The script already uses Path objects from pathlib, making this change idiomatic
  • Self-documenting: The method name clearly indicates recursive behavior

Impact Analysis 📊

Before vs After Results:

  • Before: 675 concrete enums (missed 18 schemas in nested directories)
  • After: 693 concrete enums (found all schemas including nested ones)
  • Newly Found: Enums like AquiferType, FusionConfinementType from nested directories

Verified Missing Schemas Found:

  • energy/nuclear/*.yaml (11 files: fusion, nuclear_cleanup, nuclear_forensics, etc.)
  • energy/renewable/*.yaml (3 files: hydrogen, geothermal, bioenergy)
  • medical/oncology/*.yaml & medical/pediatric_oncology/**/*.yaml (4+ files)

Performance Considerations ✅

Performance Impact: Minimal

  • rglob() has acceptable performance overhead for this use case
  • The schema directory structure is relatively shallow (2-3 levels max)
  • File count is manageable (~200 YAML files total)
  • This is a build-time script, not runtime-critical
  • Recommendation: Current approach is appropriate; no optimization needed

Security Considerations ✅

No Security Concerns:

  • ✅ Uses pathlib with proper path handling
  • ✅ Maintains existing directory exclusions (merged directory)
  • ✅ No new file access patterns introduced
  • ✅ Continues to skip meta files appropriately
  • ✅ No external input processing changes

Test Coverage Gaps ⚠️

Missing Test Coverage:
The most significant issue is the complete lack of tests for the merge_enums_hierarchy.py script:

  • No unit tests for the collect_all_schemas() function
  • No integration tests for the full merge process
  • No regression tests to prevent this type of issue from recurring

Recommended Test Strategy:

def test_collect_all_schemas_finds_nested():
    """Test that rglob finds schemas in nested directories"""
    # Should find energy/nuclear/*.yaml and energy/renewable/*.yaml
    
def test_schema_count_matches_expected():
    """Test that we find the expected number of schemas"""
    # Should find 693 concrete enums, not 675

Additional Observations 🔍

Documentation:

  • The PR description excellently documents the before/after enum counts
  • Good validation approach using grep AquiferType to verify the fix

Integration:

  • Well-integrated with the existing just merge-hierarchy workflow
  • Properly triggers OWL regeneration via just gen-owl

Recommendations for Future 📝

  1. Add Test Coverage: Implement unit tests for collect_all_schemas() to prevent regression
  2. Consider CI Integration: Add automated validation that expected schema count is maintained
  3. Directory Structure Documentation: Document the nested schema organization pattern in CLAUDE.md

Overall Assessment 🌟

This is a high-quality, surgical fix that addresses the exact problem with minimal risk. The change is:

  • Correct: Solves the stated problem completely
  • Safe: No side effects or breaking changes
  • Efficient: Minimal performance impact
  • Maintainable: Uses standard library methods appropriately

Score: 9/10 (only losing one point for lack of test coverage)


Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes incomplete OWL/value set enum generation by ensuring the merge-hierarchy script discovers YAML schema modules nested under domain subdirectories (e.g., energy/nuclear/*, medical/pediatric_oncology/**), rather than only scanning one level deep.

Changes:

  • Update schema discovery in collect_all_schemas() from glob('*.yaml') to recursive rglob('*.yaml') for domain directories.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants