Feature/266 bump clone multipliers to hit 40m100m dataset targets #268

Merged
jathavaan merged 3 commits into main from feature/266-bump-clone-multipliers-to-hit-40m100m-dataset-targets on May 18, 2026

@jathavaan (Collaborator)

This pull request updates the synthetic dataset generation process to increase the number of clones per source polygon for both the MEDIUM and LARGE dataset sizes. The documentation and code have been updated to reflect these new clone counts, and the expected runtimes for dataset synthesis have been adjusted accordingly.

Dataset synthesis changes:

  • Increased the number of synthetic clones per source polygon for DatasetSize.MEDIUM from 7 to 9 and for DatasetSize.LARGE from 19 to 23 in the DatasetSize enum (src/domain/enums/dataset_size.py).
  • Updated the documentation in README.md to reflect the new clone counts for medium and large dataset sizes, including the code examples and explanatory text.
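
As a rough illustration of the change described above, the enum could look something like the sketch below. Only the names `DatasetSize`, `MEDIUM`, `LARGE` and the clone counts come from this PR. The `clones_per_polygon` property, the use of the clone count as the enum value, and the omission of any other members are assumptions made for the example; the real `src/domain/enums/dataset_size.py` may be structured differently.

```python
from enum import Enum


class DatasetSize(Enum):
    """Hypothetical sketch: each member's value is its clone count per source polygon."""

    MEDIUM = 9   # previously 7
    LARGE = 23   # previously 19

    @property
    def clones_per_polygon(self) -> int:
        # Assumed accessor name; it is not taken from the repository.
        return self.value


for size in DatasetSize:
    print(f"{size.name}: {size.clones_per_polygon} clones per source polygon")
```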

Benchmarking/runtime documentation:

  • Adjusted the documented expected runtimes for synthesizing medium and large datasets in README.md to account for the increased number of clones (medium: 30–50 min, large: 60–110 min).

jathavaan added 2 commits May 18, 2026 14:01
Source dataset is 4.17M polygons, not the ~5M assumed in #255, so
prior multipliers (7/19) produced ~33M / ~83M instead of ~40M / ~100M.

Refs #255
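
The figures in the commit message can be checked with a few lines of arithmetic. The sketch below assumes that the synthesized dataset keeps every source polygon and adds its clones, so the total is `source * (clones + 1)`. That formula is an inference: it reproduces the ~33M / ~83M figures quoted above, but it is not confirmed by the PR. The function names and the exact 5,000,000 value standing in for "~5M" are likewise hypothetical.

```python
SOURCE_POLYGONS = 4_170_000   # actual source size, from the commit message
ASSUMED_POLYGONS = 5_000_000  # stand-in for the "~5M" assumed in #255


def total_rows(source_rows: int, clones: int) -> int:
    """Rows after synthesis, assuming each source polygon is kept plus its clones."""
    return source_rows * (clones + 1)


def clones_for_target(source_rows: int, target_rows: int) -> int:
    """Clone count whose total lands closest to the target (hypothetical helper)."""
    return max(0, round(target_rows / source_rows) - 1)


for label, old, new in [("MEDIUM", 7, 9), ("LARGE", 19, 23)]:
    before = total_rows(SOURCE_POLYGONS, old)
    after = total_rows(SOURCE_POLYGONS, new)
    print(f"{label}: {old} clones -> {before:,} rows; {new} clones -> {after:,} rows")
# MEDIUM: 7 clones -> 33,360,000 rows; 9 clones -> 41,700,000 rows
# LARGE: 19 clones -> 83,400,000 rows; 23 clones -> 100,080,000 rows
```

Under the same assumption, `clones_for_target(ASSUMED_POLYGONS, 40_000_000)` and `clones_for_target(ASSUMED_POLYGONS, 100_000_000)` give 7 and 19, while substituting `SOURCE_POLYGONS` gives 9 and 23, which is consistent with the old and new multipliers.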
jathavaan self-assigned this May 18, 2026
Copilot AI review requested due to automatic review settings May 18, 2026 12:03
jathavaan linked an issue May 18, 2026 that may be closed by this pull request
jathavaan enabled auto-merge May 18, 2026 12:03

Copilot AI (Contributor) left a comment

Pull request overview

This PR updates synthetic dataset sizing so medium and large benchmark datasets generate more clones per source polygon to better meet the documented row-count targets.

Changes:

  • Increased DatasetSize.MEDIUM clone multiplier from 7 to 9.
  • Increased DatasetSize.LARGE clone multiplier from 19 to 23.
  • Updated README dataset-size and runtime documentation to match the new multipliers.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/domain/enums/dataset_size.py | Updates clone multipliers and related enum documentation. |
| README.md | Reflects new medium/large clone counts and synthesis runtime estimates. |

jathavaan merged commit 8cc0262 into main May 18, 2026
36 checks passed
jathavaan deleted the feature/266-bump-clone-multipliers-to-hit-40m100m-dataset-targets branch May 18, 2026 12:30

Development

Successfully merging this pull request may close these issues.

Bump clone multipliers to hit ~40M/~100M dataset targets

2 participants