Enhancement: clarify the element formats tighter specification NEW in qiime2_core__tools__import/2023.5.0+dist.h193f7cc9.3 #86

@jennaj

Description

https://github.com/qiime2/galaxy-tools/tree/main/tools/suite_qiime2_core__tools
toolshed.g2.bx.psu.edu/repos/q2d2/qiime2_core__tools__import/qiime2_core__tools__import/2023.5.0+dist.h193f7cc9.3

Hello! The latest wrapper update is great! It might need a small documentation update in Galaxy to reflect the newer usage. A few users have noticed that the format for element identifiers has become slightly more strict. In practice, this is the naming format of the Illumina fastq sequence files that are loaded into a list collection and later consumed by the qiime2 import tool.

For an example, please see the discussion yesterday here: https://help.galaxyproject.org/t/qiime2-2025-10-errors/16577

Now, "real" data would already have this stricter formatting, but we have many people working in Galaxy for exploratory reasons, and importantly, many instructors with downsampled data used for teaching purposes. They are newly running into confusing errors. If we could clarify this update to the format a bit more, it would be very helpful! We want them to pass through this first step as easily as possible and use the package.

I don't think the change is a bug! Stricter format is fine. My hope is that we could make the requirements clearer.

How the tool can error

This is from an example I was reviewing from an instructor.

A pair of paired-end files were found not to have the same number of records. /corral4/main/jobs/XXX/XXX/XXXXXX/working/q2galaxy-importx7a1ak5m/4_S4_L001_R2_001.fastq.gz has 131348 records. /corral4/main/jobs/XXX/XXX/XXXXXX/working/q2galaxy-importx7a1ak5m/24_S24_L001_R2_001.fastq.gz has 372128 records.

Notice how the tool is attempting to "match up" sample 4_S4 and 24_S24. The sample ID seems to be truncated when creating the qiime2 index based on the collection's element identifiers.

Detail

This is the recommendation for the original sample formatting:

.+_.+_L[0-9][0-9][0-9]_R[12]_001.fastq.gz

The update requires that the values matched by the .+ expressions have the same character length across all elements in the same Qiime2 Import batch.

Meaning, a group like this will result in an error.

1_s1_L001_R1_001.fastq.gz
2_s2_L001_R1_001.fastq.gz
11_s11_L001_R1_001.fastq.gz

But padding the values so that they all have the same character length, like this, works.


01_s01_L001_R1_001.fastq.gz
02_s02_L001_R1_001.fastq.gz
11_s11_L001_R1_001.fastq.gz
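
To make the two groups above easier to compare, here is a minimal Python sketch of a pre-flight check that users or instructors could run on their element identifiers before importing. It is not part of the wrapper or of QIIME 2. The function name `check_identifiers` is hypothetical, the regex is the one from the tool help with capture groups added and the dots escaped (my assumption), and the "same length" rule is only my reading of the behavior described in this issue.

```python
import re

# Regex from the tool help, with the two `.+` parts captured.
# Escaping the dots and using fullmatch are assumptions for this sketch.
ELEMENT_RE = re.compile(r"(.+)_(.+)_L[0-9]{3}_R[12]_001\.fastq\.gz")


def check_identifiers(names):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    lengths = set()
    for name in names:
        match = ELEMENT_RE.fullmatch(name)
        if match is None:
            problems.append(f"{name}: does not match the expected pattern")
            continue
        # Record the character length of each `.+` match.
        lengths.add((len(match.group(1)), len(match.group(2))))
    if len(lengths) > 1:
        problems.append("the .+ matches do not all have the same character length")
    return problems


unpadded = [
    "1_s1_L001_R1_001.fastq.gz",
    "2_s2_L001_R1_001.fastq.gz",
    "11_s11_L001_R1_001.fastq.gz",
]
padded = [
    "01_s01_L001_R1_001.fastq.gz",
    "02_s02_L001_R1_001.fastq.gz",
    "11_s11_L001_R1_001.fastq.gz",
]

for problem in check_identifiers(unpadded):
    print(problem)
```

With the unpadded group this reports one length problem, and with the padded group it reports none. Sample IDs that themselves contain underscores make the split between the two .+ parts ambiguous, so this sketch is only meant for simple names like the ones in this issue.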

But -- I've also noticed that if the first item in the list has the longest values, that will also work. Or, at least it does as far as the import step! Meaning, this will also work.

11_s11_L001_R1_001.fastq.gz
1_s1_L001_R1_001.fastq.gz
2_s2_L001_R1_001.fastq.gz
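
Since relying on element order seems fragile, one way to sidestep it would be to zero-pad the identifiers before building the collection. The sketch below only computes the padded names; in Galaxy the actual renaming would still be done with whatever collection tools you normally use. The helper name `pad_identifiers` is hypothetical, and it assumes the simple "number, letters plus number" naming used in the examples in this issue, with the same letter prefix on every sample.

```python
import re

# Assumes names shaped like <digits>_<letters><digits>_L###_R#_001.fastq.gz,
# as in the examples in this issue. Other naming schemes are not handled.
NAME_RE = re.compile(r"(\d+)_([A-Za-z]*)(\d+)_(L[0-9]{3}_R[12]_001\.fastq\.gz)")


def pad_identifiers(names):
    """Zero-pad the numeric parts so every name has the same field widths."""
    parsed = [NAME_RE.fullmatch(name) for name in names]
    if any(match is None for match in parsed):
        raise ValueError("at least one name does not fit the assumed pattern")
    # Widest numeric value in each position decides the padding width.
    width_first = max(len(match.group(1)) for match in parsed)
    width_second = max(len(match.group(3)) for match in parsed)
    return [
        f"{match.group(1).zfill(width_first)}_"
        f"{match.group(2)}{match.group(3).zfill(width_second)}_"
        f"{match.group(4)}"
        for match in parsed
    ]


names = [
    "11_s11_L001_R1_001.fastq.gz",
    "1_s1_L001_R1_001.fastq.gz",
    "2_s2_L001_R1_001.fastq.gz",
]
for new_name in pad_identifiers(names):
    print(new_name)
```

The input order is preserved, so the result no longer depends on which element happens to come first.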

I'm not sure if those samples could be mixed up with later tools? I saw this topic about something similar that may be related. #50

Enhancement request

If this change was intentional (setting the variable character lengths based on the first element in the list?), I think we should update the tool form help. Note: I didn't test whether the first .+ match is the root change, or if it is the second .+, or if it is both!

From this

This data should be formatted as a FastqGzFormat. See the documentation below for more information. Elements must match regex: .+_.+_L[0-9][0-9][0-9]_R[12]_001.fastq.gz

To something like this

This data should be formatted as a FastqGzFormat. Elements must match regex: .+_.+_L[0-9][0-9][0-9]_R[12]_001.fastq.gz. All .+ matches must be padded to a consistent character length. See the documentation below for more information.

Then, add more details to the Help section, maybe with an example.

I can make a suggestion in a PR -- but first I wanted to confirm that this change was intentional. Thanks!
