Fix Windows path collision in hotel invoice data#2495

Open
itsryu-dq wants to merge 5 commits into openai:main from itsryu-dq:Fix/windows-path-portability

@itsryu-dq itsryu-dq commented Mar 6, 2026

Summary

This PR fixes a cross-platform checkout failure caused by a trailing-space path collision under:

examples/data/hotel_invoices/

The repository previously contained two directory paths:

  • examples/data/hotel_invoices/extracted_invoice_json/
  • examples/data/hotel_invoices/extracted_invoice_json / (note the space before the final slash)

The second directory name ended in a trailing ASCII space. Because both directories contained the same 31 filenames, Windows path normalization collapsed them into identical paths, causing checkout failures on native Windows environments.
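
The collapse described above can be illustrated with a minimal sketch. It is not the script added by this PR: `windows_normalize` and `find_collisions` are hypothetical names, and the normalization rule (strip trailing spaces and periods from each component, then lowercase to approximate case-insensitive matching) is an assumption that only roughly mirrors what Windows does.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def windows_normalize(path: str) -> str:
    """Approximate how Windows would see a '/'-separated repository path."""
    normalized = []
    for component in path.split("/"):
        # Assumption: trailing spaces and periods are dropped from each component.
        stripped = component.rstrip(" .")
        # Keep components that would otherwise vanish entirely (e.g. "." or "..").
        # Lowercasing approximates case-insensitive matching.
        normalized.append((stripped or component).lower())
    return "/".join(normalized)


def find_collisions(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group paths by normalized form and keep only groups with more than one member."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        groups[windows_normalize(path)].append(path)
    return {key: members for key, members in groups.items() if len(members) > 1}


if __name__ == "__main__":
    tracked = [
        "examples/data/hotel_invoices/extracted_invoice_json/20190119_002_extracted.json",
        "examples/data/hotel_invoices/extracted_invoice_json /20190119_002_extracted.json",
    ]
    for key, members in find_collisions(tracked).items():
        print(f"collision on {key!r}: {members!r}")
```

Grouping by the normalized form reports every set of colliding paths at once, rather than stopping at the first pair.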

This PR removes the trailing-space dataset tree and keeps the canonical dataset path already referenced in the repository:

examples/data/hotel_invoices/extracted_invoice_json/

Additionally, this PR introduces a repository path portability guard and CI validation to prevent similar filesystem portability issues from occurring again.


Motivation

This change restores compatibility for contributors using native Windows environments.

Windows filesystems do not support path components that end with a trailing space. When both directories existed in the repository, Windows normalized them to the same path during checkout, causing failures such as:


error: invalid path 'examples/data/hotel_invoices/extracted_invoice_json /20190119_002_extracted.json'
fatal: unable to checkout working tree

This issue originated from commit:

ffdd52937d0c82d4fe3e85314ad88439c4a0e3ce

which was merged through:

PR #1273 – "Data Extraction & Transformation with GPT-4o"

The PR was opened and merged by charu-openai, with several content commits contributed by charuj and reviewed by msingh-openai.

Because Linux filesystems allow trailing-space directory names while Windows does not, the issue remained invisible until a Windows checkout attempted to materialize the working tree.


Changes in this PR

  • removed examples/data/hotel_invoices/extracted_invoice_json (the copy whose directory name ended in a trailing space)
  • retained canonical dataset path examples/data/hotel_invoices/extracted_invoice_json
  • added .github/scripts/check_path_portability.py
  • added CI validation to detect:
    • trailing-space path components
    • trailing-period path components
    • Windows reserved device names
    • Windows-normalized path collisions
  • integrated the portability check into .github/workflows/validate-notebooks.yaml
  • added cross-platform path safety documentation to:
    • README.md
    • CONTRIBUTING.md
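
As a rough illustration of the per-component checks listed above, the following sketch flags trailing spaces, trailing periods, and reserved device names. It is not the contents of `check_path_portability.py`: the names `WINDOWS_RESERVED`, `component_problems`, and `check_path` are hypothetical, and the reserved-name set is a commonly cited subset rather than an exhaustive list.

```python
from typing import List, Tuple

# Assumption: a commonly cited subset of Windows reserved device names.
WINDOWS_RESERVED = {
    "CON", "PRN", "AUX", "NUL",
    *(f"COM{i}" for i in range(1, 10)),
    *(f"LPT{i}" for i in range(1, 10)),
}


def component_problems(component: str) -> List[str]:
    """Return the portability problems found in a single path component."""
    problems = []
    if component.endswith(" "):
        problems.append("trailing space")
    if component.endswith(".") and component not in {".", ".."}:
        problems.append("trailing period")
    # A reserved name stays reserved when an extension follows it (e.g. "aux.md").
    stem = component.split(".", 1)[0].rstrip(" ").upper()
    if stem in WINDOWS_RESERVED:
        problems.append("reserved device name")
    return problems


def check_path(path: str) -> List[Tuple[str, str]]:
    """Return (component, problem) pairs for a '/'-separated repository path."""
    return [
        (component, problem)
        for component in path.split("/")
        for problem in component_problems(component)
    ]


if __name__ == "__main__":
    bad = "examples/data/hotel_invoices/extracted_invoice_json /20190119_002_extracted.json"
    for component, problem in check_path(bad):
        print(f"{problem}: {component!r}")
```

Reporting the offending component alongside the problem keeps CI output actionable when a path is deeply nested.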

Result

After this change, the repository can be checked out in full on:

  • Windows
  • macOS
  • Linux
  • WSL

without requiring sparse checkout or filesystem workarounds.

The CI portability guard prevents similar path portability issues from being introduced in future commits.


For new content

This PR does not add new cookbook content and only addresses repository portability and infrastructure.

  • I have added a new entry in registry.yaml so that my content renders on the cookbook website.

  • I have conducted a self-review of my content based on the contribution guidelines:

    • Relevance
    • Uniqueness
    • Spelling and Grammar
    • Clarity
    • Correctness
    • Completeness

Not applicable for this PR.

ryu-omnithrex added 2 commits March 6, 2026 20:51
Add author metadata for ryu-omnithrex so the GPT-5.4 prompting guide notebook displays proper attribution on cookbook.openai.com.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a7bee06161

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@itsryu-dq
Author

This PR is ready for maintainer review.

Summary of final updates:

Removed the trailing-space dataset directory:

examples/data/hotel_invoices/extracted_invoice_json (the copy whose directory name ended in a trailing space)

and retained the canonical dataset path:

examples/data/hotel_invoices/extracted_invoice_json

The repository previously contained both paths with the same 31 JSON filenames. Because Windows filesystems normalize trailing spaces in path components, this created 31 path collisions and caused checkout failures on native Windows environments.

The issue originated from commit:

ffdd52937d0c82d4fe3e85314ad88439c4a0e3ce

merged through PR #1273 – Data Extraction & Transformation with GPT-4o.

This PR also adds a repository path portability guard and CI validation to prevent similar cross-platform filesystem issues from being introduced in the future.

Validation performed:

  • ran .github/scripts/check_path_portability.py
  • verified no trailing-space or trailing-period path components remain
  • verified no Windows-normalized collisions remain
  • performed clean repository clone using core.protectNTFS=true
  • confirmed notebook validation passes
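
The comment does not show how the script gathers the paths it validates. One plausible approach, sketched here as an assumption, is to read tracked paths from `git ls-files -z`, whose NUL-separated output keeps names with spaces or other unusual characters intact. `split_nul_separated` and `tracked_paths` are hypothetical helper names.

```python
import subprocess
from typing import List


def split_nul_separated(raw: bytes) -> List[str]:
    """Split NUL-separated output into path strings, skipping empty entries."""
    return [
        entry.decode("utf-8", errors="surrogateescape")
        for entry in raw.split(b"\0")
        if entry
    ]


def tracked_paths(repo_root: str = ".") -> List[str]:
    """List tracked files in the repository at repo_root via `git ls-files -z`."""
    result = subprocess.run(
        ["git", "ls-files", "-z"],
        cwd=repo_root,
        check=True,
        capture_output=True,
    )
    return split_nul_separated(result.stdout)
```

Reading from the index rather than walking the working tree means the check sees paths that Windows could never materialize on disk.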

From my side the repository tree is now free of Windows-hostile path components and protected by CI path validation.

Happy to adjust anything maintainers prefer regarding dataset handling or documentation wording.

@itsryu-dq itsryu-dq force-pushed the Fix/windows-path-portability branch from 6589d37 to ad5f52f Compare March 6, 2026 20:25
@itsryu-dq
Author

Final update from my side.

Addressed the automated Codex review suggestion regarding control characters in the path portability guard. The check previously exempted \t, \r, and \n, but Windows forbids all control characters (0x01–0x1F) in path components. The guard now rejects all control characters so the CI rule fully aligns with Windows filesystem constraints.
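
A minimal sketch of the tightened rule, assuming the guard inspects each path component character by character. `has_control_character` is a hypothetical name, and the range mirrors the 0x01–0x1F span stated above.

```python
def has_control_character(component: str) -> bool:
    """Return True if the component contains any character in 0x01-0x1F."""
    # No exemption for tab, carriage return, or newline.
    return any(0x01 <= ord(ch) <= 0x1F for ch in component)


if __name__ == "__main__":
    for name in ["extracted_invoice_json", "bad\tname", "line\nbreak"]:
        print(repr(name), has_control_character(name))
```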

Current PR changes include:

  • removal of the trailing-space dataset directory that caused Windows checkout collisions
  • introduction of the repository path portability guard
  • CI validation to prevent future cross-platform path issues
  • Windows contributor guidance added to README.md and CONTRIBUTING.md
  • authors.yaml entry for ryu-omnithrex
  • follow-up fix tightening the control-character validation

All local checks pass and the portability guard validates the repository tree successfully.

From my side this PR is complete and ready for maintainer review.
