docs: add json output and search extensions #1588

Merged
miyoungc merged 6 commits into develop from miyoungc/docs-metadata
Jan 28, 2026

Conversation

@miyoungc
Collaborator

@miyoungc miyoungc commented Jan 16, 2026

Description

Add the JSON output extension, and apply the metadata-fed search interface extension from DORI, built by @lbliii.

Screenshot 2026-01-16 at 12 07 41 PM

Related Issue(s)

Checklist

  • I've read the CONTRIBUTING guidelines.
  • I've updated the documentation if applicable.
  • I've added tests if applicable.
  • @mentions of the person or team responsible for reviewing proposed changes.

@github-actions
Contributor

Documentation preview

https://nvidia-nemo.github.io/Guardrails/review/pr-1588

@codecov

codecov bot commented Jan 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@greptile-apps
Contributor

greptile-apps bot commented Jan 16, 2026

Greptile Summary

  • Adds a comprehensive JSON output extension to the Sphinx documentation system, enabling structured data export alongside HTML output for search indexing and programmatic access
  • Implements a modular architecture with components for document discovery, content extraction, metadata processing, caching, and JSON formatting
  • Introduces configuration support in docs/conf.py to enable the extension with configurable settings for parallel processing, content filtering, and performance optimization

Important Files Changed

Filename | Overview
docs/_extensions/json_output/processing/processor.py | New parallel processing system for document processing with potential thread safety issues and exception handling that may mask errors
docs/_extensions/json_output/core/json_writer.py | New JSONWriter class with type annotation mismatch where dict type is expected but list is passed
docs/_extensions/json_output/core/document_discovery.py | New document discovery system with potential circular dependency risks between DocumentDiscovery and JSONOutputBuilder classes
docs/_extensions/json_output/utils.py | New utility functions with exclude pattern matching using string prefix instead of proper pattern matching
docs/_extensions/json_output/content/text.py | New comprehensive text extraction module with complex regex-based cleaning and keyword extraction for LLM consumption
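
To illustrate the exclude-pattern note for utils.py, here is a minimal sketch of the difference between a string-prefix check and glob-style matching. The function names and the sample patterns are hypothetical and are not taken from the PR; the only library call used is the standard library's `fnmatch.fnmatchcase`.

```python
from fnmatch import fnmatchcase


def is_excluded_prefix(docname: str, patterns: list[str]) -> bool:
    """Prefix-only check: each pattern is treated as a literal leading string."""
    return any(docname.startswith(pattern) for pattern in patterns)


def is_excluded_glob(docname: str, patterns: list[str]) -> bool:
    """Glob check: wildcards such as * and ? in each pattern are honoured."""
    return any(fnmatchcase(docname, pattern) for pattern in patterns)


# Hypothetical exclude patterns, written the way a glob-aware config would expect.
patterns = ["_build/*", "*/drafts/*"]

for name in ["_build/index", "guide/drafts/intro", "guide/intro"]:
    print(name, is_excluded_prefix(name, patterns), is_excluded_glob(name, patterns))
```

With the prefix check, a pattern containing `*` only matches docnames that literally start with that text, so wildcard patterns silently exclude nothing.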

Confidence score: 3/5

  • This PR introduces significant new functionality but contains several implementation issues that could cause problems in production
  • Score lowered due to thread safety concerns in parallel processing, type annotation mismatches, potential circular dependencies, and incomplete pattern matching logic
  • Pay close attention to docs/_extensions/json_output/processing/processor.py and docs/_extensions/json_output/core/json_writer.py for critical issues that need review
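
As a rough illustration of the thread safety and error masking concerns raised for processor.py, the sketch below shows one way to run per-document work in a thread pool while keeping failures visible. It is not the PR's implementation: `process_document` and `process_all` are hypothetical names, and the worker body is a stand-in.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_document(docname: str) -> dict:
    """Hypothetical per-document worker; raises on invalid input."""
    if not docname:
        raise ValueError("empty docname")
    return {"id": docname, "length": len(docname)}


def process_all(docnames, max_workers: int = 4):
    """Process documents in parallel and report failures instead of hiding them."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_document, name): name for name in docnames}
        for future in as_completed(futures):
            name = futures[future]
            try:
                # Shared dicts are only mutated here, in the calling thread.
                results[name] = future.result()
            except Exception as exc:
                # Record the failure per document so the caller can log or re-raise.
                errors[name] = exc
    return results, errors


results, errors = process_all(["index", "", "guide/intro"])
print(sorted(results), {name: type(exc).__name__ for name, exc in errors.items()})
```

Collecting results in the calling thread avoids concurrent writes to shared state, and returning the errors lets the build decide whether a failed document should be a warning or a hard failure.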

Sequence Diagram

sequenceDiagram
    participant User
    participant "Sphinx Builder"
    participant "JSONOutputBuilder"
    participant "DocumentDiscovery"
    participant "JSONFormatter"
    participant "JSONWriter"
    participant "Content Extractor"
    participant "File System"

    User->>Sphinx Builder: "Build documentation"
    Sphinx Builder->>JSONOutputBuilder: "on_build_finished()"
    JSONOutputBuilder->>DocumentDiscovery: "get_all_documents_recursive()"
    DocumentDiscovery-->>JSONOutputBuilder: "filtered document list"
    
    loop "For each document"
        JSONOutputBuilder->>Content Extractor: "extract_document_content(docname)"
        Content Extractor->>File System: "read source file"
        File System-->>Content Extractor: "raw content"
        Content Extractor-->>JSONOutputBuilder: "extracted content data"
        
        JSONOutputBuilder->>JSONFormatter: "build_json_data(docname)"
        JSONFormatter-->>JSONOutputBuilder: "structured JSON data"
        
        JSONOutputBuilder->>JSONWriter: "write_json_file(docname, data)"
        JSONWriter->>File System: "write JSON file"
        File System-->>JSONWriter: "file written"
    end
    
    JSONOutputBuilder-->>Sphinx Builder: "JSON generation complete"
    Sphinx Builder-->>User: "Documentation built with JSON output"
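
The loop in the diagram above can be sketched in a few lines of plain Python. This is a simplified illustration, not the extension's code: the function names echo the diagram, but their signatures, the `.md` source layout, and the JSON fields are assumptions.

```python
import json
from pathlib import Path


def discover_documents(src_dir: Path) -> list[str]:
    """Return docnames (relative paths without suffix) for every .md file."""
    return sorted(
        path.relative_to(src_dir).with_suffix("").as_posix()
        for path in src_dir.rglob("*.md")
    )


def extract_document_content(src_dir: Path, docname: str) -> str:
    """Read the raw source text for one document."""
    return (src_dir / f"{docname}.md").read_text(encoding="utf-8")


def build_json_data(docname: str, content: str) -> dict:
    """Wrap the extracted content in a small structured record (assumed fields)."""
    return {"id": docname, "content": content}


def write_json_file(out_dir: Path, docname: str, data: dict) -> Path:
    """Write one JSON file per document, mirroring the source tree."""
    target = out_dir / f"{docname}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return target


def generate_json(src_dir: Path, out_dir: Path) -> list[Path]:
    """Discover, extract, format, and write, in the order the diagram shows."""
    written = []
    for docname in discover_documents(src_dir):
        content = extract_document_content(src_dir, docname)
        data = build_json_data(docname, content)
        written.append(write_json_file(out_dir, docname, data))
    return written
```

Calling `generate_json(Path("docs"), Path("docs/_build/json"))` would write one `.json` file per `.md` source; both paths are placeholders.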

Contributor

@greptile-apps greptile-apps bot left a comment

20 files reviewed, 20 comments

miyoungc and others added 3 commits January 16, 2026 11:59
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>
@miyoungc miyoungc changed the title from "docs: add json output extension" to "docs: add json output and search extensions" Jan 16, 2026
Collaborator

@tgasser-nv tgasser-nv left a comment

I built it locally with make docs-serve and the new search looks much better! I have a few questions:

  1. Could you address the Greptile feedback before merging?
  2. Are there any tests for the Python code that verify it works correctly?

Thanks!

@miyoungc
Collaborator Author

> I built it locally with make docs-serve and the new search looks much better! I have a few questions:
>
>   1. Could you address the Greptile feedback before merging?
>   2. Are there any tests for the Python code that verify it works correctly?
>
> Thanks!

Thank you!
I'll look into #1.
For #2, I'd say the doc build itself would be the test. @lbliii do you know of any tests that could be added for these extensions?

@lbliii

lbliii commented Jan 27, 2026

we can look into adding some tests, but for now there aren't any for these extensions. Ideally, this extension code would be living in the theme bundle itself... this is kind of a workaround haha.

@tgasser-nv
Collaborator

> we can look into adding some tests, but for now there aren't any for these extensions. Ideally, this extension code would be living in the theme bundle itself... this is kind of a workaround haha.

At a minimum, can we make sure the build pipeline is self-checking? So if for some reason the extensions broke the docs build we'd get an alert and could go back in and fix it? My concern is we'd end up breaking the docs due to the nice search code and not realise.

Do you have any plans to build the extension into the theme longer-term?

@lbliii

lbliii commented Jan 28, 2026

> At a minimum, can we make sure the build pipeline is self-checking? So if for some reason the extensions broke the docs build we'd get an alert and could go back in and fix it? My concern is we'd end up breaking the docs due to the nice search code and not realise.
>
> Do you have any plans to build the extension into the theme longer-term?

Hey @tgasser-nv, the real long-term plan is actually to migrate to Fern and stop using Sphinx. I built this extension before that decision was made, but FWIW both extensions are also designed defensively:

  • search_assets only runs after a successful build and handles missing files gracefully
  • json_output generates parallel JSON files without touching HTML output

The publish command is already strict/self checking and fails even on warnings -- so even if one link is broken or page not added to a toctree, you'd get blocked.

If you'd prefer to just close this and focus on migration, that might be better. I have staged PRs for exactly that for Data Designer and NeMo Curator you can take a look at. Heavy AI was the first to make the move.
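
The defensive behaviour described for search_assets above could look roughly like the sketch below. Sphinx passes `(app, exception)` to `build-finished` handlers, with `exception` set to `None` on success; everything else here (the function name, the plain-path arguments, the `_static` target directory) is a hypothetical simplification rather than the extension's actual code.

```python
import shutil
from pathlib import Path


def copy_search_assets(out_dir: Path, assets_dir: Path, exception) -> list[str]:
    """Copy search assets after a build; do nothing if the build failed."""
    if exception is not None:
        # The build raised, so leave the output directory untouched.
        return []
    if not assets_dir.is_dir():
        # Missing assets are treated as "nothing to copy" rather than an error.
        return []
    target_dir = out_dir / "_static"
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for asset in sorted(assets_dir.iterdir()):
        if asset.is_file():
            shutil.copy2(asset, target_dir / asset.name)
            copied.append(asset.name)
    return copied
```

In a real extension, a thin `build-finished` handler would pull the output directory from the Sphinx app and delegate to a helper like this one.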

@tgasser-nv
Collaborator

>> At a minimum, can we make sure the build pipeline is self-checking? So if for some reason the extensions broke the docs build we'd get an alert and could go back in and fix it? My concern is we'd end up breaking the docs due to the nice search code and not realise.
>> Do you have any plans to build the extension into the theme longer-term?
>
> Hey @tgasser-nv, the real long-term plan is actually to migrate to Fern and stop using Sphinx. I built this extension before that decision was made, but FWIW both extensions are also designed defensively:
>
>   • search_assets only runs after a successful build and handles missing files gracefully
>   • json_output generates parallel JSON files without touching HTML output
>
> The publish command is already strict/self checking and fails even on warnings -- so even if one link is broken or page not added to a toctree, you'd get blocked.
>
> If you'd prefer to just close this and focus on migration, that might be better. I have staged PRs for exactly that for Data Designer and NeMo Curator you can take a look at. Heavy AI was the first to make the move.

Hi @lbliii thanks for your reply. As long as we catch broken docs builds and don't push them to prod I'm happy to merge the extension. It's a much better experience for customers compared to the search page today and I'd rather get the improvements now rather than wait for a future migration. Thanks!

Collaborator

@tgasser-nv tgasser-nv left a comment

Approved as we catch failing docs builds and don't push them to production. Thanks @lbliii !

@miyoungc miyoungc merged commit 59f8d4f into develop Jan 28, 2026
37 checks passed
@miyoungc miyoungc deleted the miyoungc/docs-metadata branch January 28, 2026 21:05
@miyoungc
Collaborator Author

Applied some of the Greptile suggestions; we decided to ignore the rest. Thank you @tgasser-nv for your review!


3 participants