Description
I assume the intent of this code was to send every 10th frame to the validation dataset:
ggrossha_SWI/source/s02_segment/prepare_train_data.py
Lines 80 to 98 in de528dc
    for i in tqdm(range(raw_zarr.shape[0]), leave=False):
        annotated_plane = annotated_data[i : i + 1]
        if (annotated_plane > 0).sum() > 0:
            if i % 10 == 0:
                add_to_zarr(
                    x_zarr_container=val_x_zarr,
                    y_zarr_container=val_y_zarr,
                    raw_data=raw_zarr[i : i + 1, config.brightfield_channel],
                    seg_data=annotated_plane,
                    name="val",
                )
            else:
                add_to_zarr(
                    x_zarr_container=train_x_zarr,
                    y_zarr_container=train_y_zarr,
                    raw_data=raw_zarr[i : i + 1, config.brightfield_channel],
                    seg_data=annotated_plane,
                    name="train",
                )
However, because the split is nested inside an if clause that checks for the presence of an annotation mask, the train/validation split is not guaranteed to be 90/10.
Instead, an annotated frame is assigned to validation only if its frame index is divisible by 10. For sparse annotations (e.g. in a 300-frame dataset where we annotated frame indices 8, 14, 15, 16, 17, 18, 53, 54, 90, 100), this can produce skewed ratios (here, 2 of the 10 annotated frames land in validation, an 80/20 split) or a completely empty validation set if none of the annotated indices happens to be a multiple of 10.
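To make the effect concrete, here is a minimal sketch that replays the current rule on the example indices. The helper name `split_by_frame_index` is hypothetical and only mimics the `i % 10 == 0` check; it does not touch any zarr containers.

```python
def split_by_frame_index(annotated_indices, val_every=10):
    """Mimic the current rule: validation iff the frame index is a multiple of val_every.

    Hypothetical helper for illustration; not part of prepare_train_data.py.
    """
    train = [i for i in annotated_indices if i % val_every != 0]
    val = [i for i in annotated_indices if i % val_every == 0]
    return train, val


# Annotated frame indices from the example above.
annotated = [8, 14, 15, 16, 17, 18, 53, 54, 90, 100]
train_idx, val_idx = split_by_frame_index(annotated)
print(len(train_idx), len(val_idx))

# Same annotations without frames 90 and 100: nothing ends up in validation.
sparse_train, sparse_val = split_by_frame_index(annotated[:8])
print(len(sparse_train), len(sparse_val))
```

With the full example, only frames 90 and 100 reach validation; dropping those two annotations leaves the validation set empty.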
If we want to split by linear sampling, we should use a separate counter variable that is incremented only for annotated frames, independent of the frame index. Alternatively, we can employ more elaborate methods for stratified/random sampling over multiple datasets.
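A possible shape for the counter-based variant, as a rough sketch only: the function name `split_by_annotation_count` and the choice to send the first annotated frame to validation are my assumptions, and the real loop would call `add_to_zarr` instead of appending to lists.

```python
def split_by_annotation_count(annotated_indices, val_every=10):
    """Assign every val_every-th *annotated* frame to validation.

    Hypothetical helper for illustration. The counter advances only for
    annotated frames, so the ratio no longer depends on the frame index.
    """
    train, val = [], []
    n_annotated = 0  # counts annotated frames only
    for frame_idx in annotated_indices:
        if n_annotated % val_every == 0:
            val.append(frame_idx)
        else:
            train.append(frame_idx)
        n_annotated += 1
    return train, val


# Same annotated frame indices as in the example above.
annotated = [8, 14, 15, 16, 17, 18, 53, 54, 90, 100]
train_idx, val_idx = split_by_annotation_count(annotated)
print(len(train_idx), len(val_idx))
```

On the example indices this yields one validation frame and nine training frames. Starting the counter at 0 means the validation set is non-empty as soon as a single annotated frame exists, at the cost of leaving the training set empty in that edge case.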