Feature/266 bump clone multipliers to hit 40m100m dataset targets by jathavaan · Pull Request #268 · kartAI/doppa

jathavaan · 2026-05-18T12:02:59Z

This pull request updates the synthetic dataset generation process to increase the number of clones per source polygon for both the MEDIUM and LARGE dataset sizes. The documentation and code have been updated to reflect these new clone counts, and the expected runtimes for dataset synthesis have been adjusted accordingly.

Dataset synthesis changes:

- Increased the number of synthetic clones per source polygon for `DatasetSize.MEDIUM` from 7 to 9 and for `DatasetSize.LARGE` from 19 to 23 in the `DatasetSize` enum (`src/domain/enums/dataset_size.py`).
- Updated the documentation in `README.md` to reflect the new clone counts for medium and large dataset sizes, including the code examples and explanatory text.
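
As a rough illustration of the enum change, here is a minimal sketch that assumes the enum value is the clone count itself. Only the member names `MEDIUM` and `LARGE`, the file path, and the numbers come from this PR. The `clones_per_polygon` accessor is hypothetical, and the real enum may define further members and a different layout.

```python
from enum import Enum


class DatasetSize(Enum):
    """Illustrative stand-in for src/domain/enums/dataset_size.py (layout assumed)."""

    MEDIUM = 9   # was 7 before this PR
    LARGE = 23   # was 19 before this PR

    @property
    def clones_per_polygon(self) -> int:
        # Hypothetical accessor: assumes the enum value is the clone count.
        return self.value
```

With this layout, `DatasetSize.LARGE.clones_per_polygon` would give the number of synthetic copies generated for each source polygon.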

Benchmarking/runtime documentation:

- Adjusted the documented expected runtimes for synthesizing medium and large datasets in `README.md` to account for the increased number of clones (medium: 30–50 min, large: 60–110 min).

Source dataset is 4.17M polygons, not the ~5M assumed in #255, so prior multipliers (7/19) produced ~33M / ~83M instead of ~40M / ~100M. Refs #255
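
The figures quoted above are consistent with total rows = source polygons × (1 + clones per polygon), meaning each original polygon is kept alongside its clones. That formula is inferred from the numbers rather than stated in the PR, and `total_rows` is a hypothetical helper used only to check the arithmetic:

```python
SOURCE_POLYGONS = 4_170_000  # source size quoted in this PR (4.17M)


def total_rows(source_rows: int, clones_per_polygon: int) -> int:
    # Assumption: each source polygon is kept and joined by its clones.
    return source_rows * (1 + clones_per_polygon)


for label, old, new in [("MEDIUM", 7, 9), ("LARGE", 19, 23)]:
    before = total_rows(SOURCE_POLYGONS, old) / 1e6
    after = total_rows(SOURCE_POLYGONS, new) / 1e6
    print(f"{label}: {old} clones -> {before:.1f}M rows, {new} clones -> {after:.1f}M rows")
```

Under that assumption, the old multipliers give about 33.4M and 83.4M rows, while the new ones give about 41.7M and 100.1M, in line with the ~40M / ~100M targets.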

Copilot

Pull request overview

This PR updates synthetic dataset sizing so medium and large benchmark datasets generate more clones per source polygon to better meet the documented row-count targets.

Changes:

- Increased `DatasetSize.MEDIUM` clone multiplier from 7 to 9.
- Increased `DatasetSize.LARGE` clone multiplier from 19 to 23.
- Updated README dataset-size and runtime documentation to match the new multipliers.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `src/domain/enums/dataset_size.py` | Updates clone multipliers and related enum documentation. |
| `README.md` | Reflects new medium/large clone counts and synthesis runtime estimates. |

jathavaan added 2 commits May 18, 2026 14:01

#266 Bump MEDIUM/LARGE clone multipliers to 9/23

dcb1a2d

Source dataset is 4.17M polygons, not the ~5M assumed in #255, so prior multipliers (7/19) produced ~33M / ~83M instead of ~40M / ~100M. Refs #255

#266 Sync README clone counts and runtime estimates

ea4472f

jathavaan self-assigned this May 18, 2026

Copilot AI review requested due to automatic review settings May 18, 2026 12:03

jathavaan linked an issue May 18, 2026 that may be closed by this pull request

Bump clone multipliers to hit ~40M/~100M dataset targets #266

Closed

jathavaan enabled auto-merge May 18, 2026 12:03

Copilot started reviewing on behalf of jathavaan May 18, 2026 12:03

Copilot AI reviewed May 18, 2026

Merge branch 'main' into feature/266-bump-clone-multipliers-to-hit-40m100m-dataset-targets

4b763bb

jathavaan merged commit 8cc0262 into main May 18, 2026
36 checks passed

jathavaan deleted the feature/266-bump-clone-multipliers-to-hit-40m100m-dataset-targets branch May 18, 2026 12:30

jathavaan merged 3 commits into main from feature/266-bump-clone-multipliers-to-hit-40m100m-dataset-targets

2 participants
