Description
https://github.com/qiime2/galaxy-tools/tree/main/tools/suite_qiime2_core__tools
toolshed.g2.bx.psu.edu/repos/q2d2/qiime2_core__tools__import/qiime2_core__tools__import/2023.5.0+dist.h193f7cc9.3
Hello! The latest wrapper update is great! It might need a small documentation update in Galaxy to reflect the newer usage. A few users have noticed that the format for element identifiers has become slightly more strict. In practical use, this is the file name format of the Illumina fastq sequence files that are parsed into a list collection, then later consumed by qiime2 import.
For an example, please see the discussion yesterday here: https://help.galaxyproject.org/t/qiime2-2025-10-errors/16577
Now, "real" data would already have this stricter formatting, but we have many people working in Galaxy for exploratory reasons, and importantly, many instructors with downsampled data used for teaching purposes. They are newly having problems with confusing errors. If we could clarify this update to the format a bit more, it would be very helpful! We want them to pass through this first step as easily as possible and use the package.
I don't think the change is a bug! Stricter format is fine. My hope is that we could make the requirements clearer.
How the tool can error
This is from an example I was reviewing from an instructor.
A pair of paired-end files were found not to have the same number of records. /corral4/main/jobs/XXX/XXX/XXXXXX/working/q2galaxy-importx7a1ak5m/4_S4_L001_R2_001.fastq.gz has 131348 records. /corral4/main/jobs/XXX/XXX/XXXXXX/working/q2galaxy-importx7a1ak5m/24_S24_L001_R2_001.fastq.gz has 372128 records.
Notice how the tool is attempting to "match up" sample 4_S4 and 24_S24. The sample ID seems to be truncated when creating the qiime2 index based on the collection's element identifiers.
Detail
This is the recommended format for the original sample file names:
.+_.+_L[0-9][0-9][0-9]_R[12]_001.fastq.gz
The update appears to require that the values matched by the .+ expressions have a consistent character length across all elements in the same Qiime2 Import batch.
Meaning, a group like this will result in an error.
1_s1_L001_R1_001.fastq.gz
2_s2_L001_R1_001.fastq.gz
11_s11_L001_R1_001.fastq.gz
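For anyone who wants to check a collection before running the import, here is a rough sketch of how the mismatch above could be detected. This is not part of the tool. The function names are hypothetical, the capture groups are added only to measure the two .+ parts, and the dots in the extension are escaped, which the regex on the tool form does not do.

```python
import re

# Pattern quoted from the tool form help. The capture groups are an addition
# here, used only to measure the two ".+" parts of each identifier.
PATTERN = re.compile(r"(.+)_(.+)_L[0-9][0-9][0-9]_R[12]_001\.fastq\.gz")


def field_lengths(names):
    """Return the set of (first, second) field lengths seen across all names."""
    lengths = set()
    for name in names:
        match = PATTERN.fullmatch(name)
        if match is None:
            raise ValueError(f"does not match the expected pattern: {name}")
        lengths.add((len(match.group(1)), len(match.group(2))))
    return lengths


def has_consistent_lengths(names):
    """True when every identifier uses the same length for both fields."""
    return len(field_lengths(names)) <= 1


unpadded = [
    "1_s1_L001_R1_001.fastq.gz",
    "2_s2_L001_R1_001.fastq.gz",
    "11_s11_L001_R1_001.fastq.gz",
]
print(has_consistent_lengths(unpadded))
```

This only reports whether the lengths differ. It does not model the behavior noted elsewhere in this issue, where the order of the elements also seems to matter.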
But padding out the values to all be the same character length like this works.
01_s01_L001_R1_001.fastq.gz
02_s02_L001_R1_001.fastq.gz
11_s11_L001_R1_001.fastq.gz
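The padding shown above could be generated with a small helper like the one below. This is only a sketch under some assumptions: the helper names are hypothetical, it assumes the two .+ parts differ only in their runs of digits, and it only computes the new names. Renaming the collection elements in Galaxy is a separate step.

```python
import re

# Same pattern as on the tool form, with capture groups added so the two
# ".+" parts and the fixed Illumina suffix can be handled separately.
PATTERN = re.compile(r"(.+)_(.+)(_L[0-9][0-9][0-9]_R[12]_001\.fastq\.gz)")


def pad_digits(field, width):
    """Zero-pad every run of digits in the field to at least `width` characters."""
    return re.sub(r"\d+", lambda m: m.group(0).zfill(width), field)


def pad_identifiers(names):
    """Return new names with the digit runs in both fields padded to one width."""
    matches = [PATTERN.fullmatch(name) for name in names]
    for name, match in zip(names, matches):
        if match is None:
            raise ValueError(f"does not match the expected pattern: {name}")
    # Use the longest digit run found in any field as the common width.
    width = max(
        (len(run)
         for match in matches
         for field in match.group(1, 2)
         for run in re.findall(r"\d+", field)),
        default=0,
    )
    return [
        f"{pad_digits(m.group(1), width)}_{pad_digits(m.group(2), width)}{m.group(3)}"
        for m in matches
    ]


padded = pad_identifiers([
    "1_s1_L001_R1_001.fastq.gz",
    "2_s2_L001_R1_001.fastq.gz",
    "11_s11_L001_R1_001.fastq.gz",
])
print(padded)
```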
But -- I've also noticed that if the first item in the list has the longest padding length, that will also work. Or, at least as far as the import step! Meaning, this will also work.
11_s11_L001_R1_001.fastq.gz
1_s1_L001_R1_001.fastq.gz
2_s2_L001_R1_001.fastq.gz
I'm not sure if those samples could be mixed up with later tools? I saw this topic about something similar that may be related. #50
Enhancement request
If this change was intentional (setting the variable character lengths based on the first element in the list?), I think we should update the tool form help. Note: I didn't test whether the first .+ match is the root change, or if it is the second .+, or if it is both!
This
This data should be formatted as a FastqGzFormat. See the documentation below for more information. Elements must match regex: .+_.+_L[0-9][0-9][0-9]_R[12]_001.fastq.gz
To something like this
This data should be formatted as a FastqGzFormat. Elements must match regex: .+_.+_L[0-9][0-9][0-9]_R[12]_001.fastq.gz. All .+ matches must be padded to a consistent character length. See the documentation below for more information.
Then, add more details to the Help section, maybe with an example.
I can make a suggestion in a PR -- but first I wanted to confirm that this change was intentional. Thanks!