Incorrect split of train and validation data

I assume the intent of this code was to assign every 10th frame to the validation dataset:

https://github.com/fmi-basel/ggrossha_SWI/blob/de528dc66f754327942e47448e21ec111be28250/source/s02_segment/prepare_train_data.py#L80-L98

However, because the `i % 10 == 0` check is nested inside an `if` clause that tests for the presence of an annotation mask, the train/validation split is not 90/10 in general.
Instead, a frame is sorted into validation only if it is annotated and its frame index is divisible by 10. For sparse annotations (e.g. in a 300-frame dataset where we annotated frame indices 8, 14, 15, 16, 17, 18, 53, 54, 90, 100), this can lead to wrong ratios (here only frames 90 and 100 go to validation, i.e. 80/20) or to completely empty validation sets.
If we want to split by linear sampling, we should use a separate counter variable that is independent of the frame index and advances only for annotated frames. Alternatively, we can employ more elaborate methods for stratified/random sampling over multiple datasets.
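
To make the difference concrete, here is a minimal sketch that compares the current frame-index rule with a counter-based rule on the example indices above. The function names `split_by_frame_index` and `split_by_counter` are hypothetical and not part of the repository; the sketch works on plain lists of indices and leaves out the zarr handling.

```python
def split_by_frame_index(annotated_indices, val_every=10):
    """Current behaviour: validation if the frame index is divisible by val_every."""
    train = [i for i in annotated_indices if i % val_every != 0]
    val = [i for i in annotated_indices if i % val_every == 0]
    return train, val


def split_by_counter(annotated_indices, val_every=10):
    """Suggested behaviour: every val_every-th *annotated* frame goes to validation."""
    train, val = [], []
    # n counts annotated frames only, so it is independent of the frame index.
    for n, frame_idx in enumerate(annotated_indices):
        if n % val_every == 0:
            val.append(frame_idx)
        else:
            train.append(frame_idx)
    return train, val


annotated = [8, 14, 15, 16, 17, 18, 53, 54, 90, 100]
naive_train, naive_val = split_by_frame_index(annotated)
train_idx, val_idx = split_by_counter(annotated)
print("frame-index rule:", len(naive_train), "train /", len(naive_val), "val")
print("counter rule:    ", len(train_idx), "train /", len(val_idx), "val")
```

With the counter starting at 0, the first annotated frame always lands in validation, so the validation set is never empty as long as at least one frame is annotated.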


	for i in tqdm(range(raw_zarr.shape[0]), leave=False):
	    annotated_plane = annotated_data[i : i + 1]
	    if (annotated_plane > 0).sum() > 0:
	        if i % 10 == 0:
	            add_to_zarr(
	                x_zarr_container=val_x_zarr,
	                y_zarr_container=val_y_zarr,
	                raw_data=raw_zarr[i : i + 1, config.brightfield_channel],
	                seg_data=annotated_plane,
	                name="val",
	            )
	        else:
	            add_to_zarr(
	                x_zarr_container=train_x_zarr,
	                y_zarr_container=train_y_zarr,
	                raw_data=raw_zarr[i : i + 1, config.brightfield_channel],
	                seg_data=annotated_plane,
	                name="train",
	            )
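
As an alternative to linear sampling, a seeded random split over the annotated frames pooled from several datasets might look roughly like the sketch below. The function `random_split`, its parameters, and the dataset names are assumptions for illustration only; nothing like it exists in the repository as far as this issue shows, and it does not stratify per dataset.

```python
import random


def random_split(annotated_by_dataset, val_fraction=0.1, seed=0):
    """Randomly split (dataset, frame_index) pairs into train and validation.

    annotated_by_dataset maps a (hypothetical) dataset name to the list of
    annotated frame indices in that dataset.
    """
    # Pool all annotated frames; sort the names so the result is reproducible.
    pooled = [
        (name, idx)
        for name in sorted(annotated_by_dataset)
        for idx in annotated_by_dataset[name]
    ]
    rng = random.Random(seed)  # local RNG, does not touch global random state
    rng.shuffle(pooled)
    # Keep at least one validation frame whenever any frame is annotated.
    n_val = max(1, round(len(pooled) * val_fraction)) if pooled else 0
    return pooled[n_val:], pooled[:n_val]


example = {
    "dataset_a": [8, 14, 15, 16, 17, 18, 53, 54, 90, 100],
    "dataset_b": [3, 4, 5, 120, 121, 122, 200, 201, 250, 251],
}
train_pairs, val_pairs = random_split(example, val_fraction=0.1, seed=42)
print(len(train_pairs), "train /", len(val_pairs), "val")
```

Fixing the seed keeps the split reproducible between runs, which matters if the zarr containers are regenerated.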

Incorrect split of train and validation data #18
