Incorrect split of train and validation data #18

@imagejan

I assume the intent of this code was to split every 10th frame into the validation dataset:

    for i in tqdm(range(raw_zarr.shape[0]), leave=False):
        annotated_plane = annotated_data[i : i + 1]
        if (annotated_plane > 0).sum() > 0:
            if i % 10 == 0:
                add_to_zarr(
                    x_zarr_container=val_x_zarr,
                    y_zarr_container=val_y_zarr,
                    raw_data=raw_zarr[i : i + 1, config.brightfield_channel],
                    seg_data=annotated_plane,
                    name="val",
                )
            else:
                add_to_zarr(
                    x_zarr_container=train_x_zarr,
                    y_zarr_container=train_y_zarr,
                    raw_data=raw_zarr[i : i + 1, config.brightfield_channel],
                    seg_data=annotated_plane,
                    name="train",
                )

However, because the if clause first checks for the presence of an annotation mask, the train/validation split is not 90/10.
Instead, an annotated frame goes into validation only if its frame index happens to be divisible by 10. For sparse annotations, the split can therefore have the wrong ratio or leave the validation set completely empty. For example, in a 300-frame dataset where we annotated frame indices 8, 14, 15, 16, 17, 18, 53, 54, 90, 100, only frames 90 and 100 end up in validation, which is an 80/20 split.
If we want to split by linear sampling, we should use a separate counter variable that is independent of the frame index. Alternatively, we can employ more elaborate methods for stratified/random sampling over multiple datasets.
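
A minimal sketch of the separate-counter idea, not a tested patch against the repository. The function name `split_annotated_frames` and the parameter `val_every` are hypothetical, and the per-frame annotation check is reduced to a list of booleans that stands in for `(annotated_plane > 0).sum() > 0`:

```python
def split_annotated_frames(has_annotation, val_every=10):
    """Assign annotated frames to train/val using a counter of annotated frames.

    has_annotation: sequence of booleans, one per frame (hypothetical stand-in
    for the `(annotated_plane > 0).sum() > 0` check in the original loop).
    Returns (train_indices, val_indices) as lists of frame indices.
    """
    train_indices, val_indices = [], []
    annotated_count = 0  # counts annotated frames only, independent of i
    for i, annotated in enumerate(has_annotation):
        if not annotated:
            continue  # frames without annotation are skipped, as before
        if annotated_count % val_every == 0:
            val_indices.append(i)  # every val_every-th annotated frame
        else:
            train_indices.append(i)
        annotated_count += 1
    return train_indices, val_indices


# Sparse example from above: 300 frames, 10 of them annotated.
annotated_frames = {8, 14, 15, 16, 17, 18, 53, 54, 90, 100}
flags = [i in annotated_frames for i in range(300)]
train_idx, val_idx = split_annotated_frames(flags)
```

In the actual loop, the same counter would replace `i % 10 == 0` as the condition for calling `add_to_zarr` with the validation containers. With this scheme the first annotated frame always goes to validation, so the validation set is non-empty whenever at least one frame is annotated.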
