Release 0.9.4 by KasperThystrup · Pull Request #124 · ssi-dk/MMASeq

KasperThystrup · 2026-04-28T09:30:40Z

No description provided.

* Fixed the config files and the Snakemake + rules to account for those changes
* The conda options for plasmidfinder and resfinder are back to general and no longer point to a specific env
* Fixed paths
* Fixed all the envs YAML files
* Deactivated kmerfinder rule
* Relative config paths and configfile added in Snakefile. Part of requirements for resolving #5.
* Added examples. Added a public E. coli sample as well as an Actinobacillus pleuropneumoniae sample to test the E. coli pipe as well as the pipe for unspecified samples. Addresses #6.
* Getting rid of Large File Storage dep
* Fixed some paths in configfile. Added the resources folder to gitignore. Minor fixes in workflow/.
* Added AMRFinderPlus (#8). Added AMRFinderPlus; requires a preloaded DB pointed out by the Species_config file.
* Removed message, polished comments and added organisms support. Addressed issues with pull request #8. Organism support is a bit janky for now though, as you would be expected to provide the --organism XYZ directly in the species-specific config file. Alternatively you would have to leave a blank string ('').
* Automatic dbs (#9)
* Added rules for automatically building CGEFinders databases. Added logging to CGEfinder rules. Databases for CGEfinders are now automatically downloaded and indexed with KMA if their db folder is missing. Logging has been set up to enhance debugging and usage information. This commit contributes to #5, #6 and #7.
* Prepared for automatic AMRFinderDB. Premerge automatic DB of AMRFinder prior to implementation.
* Added AMRfinder DB and polished messaging. With the addition of AMRFinderPlus, automatic database handling has to come along. Now the database is automatically installed if missing. In addition, messages are now named after their rule names, to enhance debugging.
* Corrected log positions, amended AMRFinder db setup to names. Logs are now placed into subcategory folders (based on sample or according to subject, e.g. Databases).
* Cleaning: Removed redundant tools KMERFinder & Resistance Genes Detection (#10). Some tools are not used in practice any more and are thus redundant. Objects related to KMERFinder & Resistance Genes detection rules have been purged.

Co-authored-by: SimoneScrima <simonescrima@gmail.com>

* Added MLST by Torsten Seemann (#11). MLST has been added and formatted to fit the logging and messaging conventions of #9.

---------

Co-authored-by: SimoneScrima <simonescrima@gmail.com>

* Added temporary automatic DB handler for Ecoli kmerAligner and polished kmeraligner rule (#12). The current Ecoli kmeraligner database is hosted as part of a repo; to simplify, only the db file is downloaded using wget (thus wget has been added to envs/kmeraligner.yaml). The repo is probably not stable (not maintained) and thus this fix should be seen as a temporary addition to #7. To make the rule comply with the latest logging and messaging setup, the kmeraligner rule has received a polishing.
* Amend: Replaced wget with Curl. Since firewalls seem to block wget, Curl is used as a replacement for wget.
* Force update AMRFinder setup. AMRfinder_update results in an error if the database for some reason is not completely set up. Thus, every time the setup rule is executed, it MUST overwrite the preexisting database (if it's the same version) to mitigate future setup errors. Also corrected EcoliKmerAligner message.
* Removed tasks not involved in routine. While these tasks are NOT discarded in the long run, excluding them makes the code more readable, while retaining the enhancements in commit history.
* Updated README - Pull FIRST. Make sure to update the supported tools section in all future pull requests, when relevant.
* Removed unused scripts and envs as well
* Added SerotypeFinder database creation (#14). SerotypeFinder database is now generated automatically (if missing). Added messaging and logging for serotypefinder rule.
* Added update rule for MLST (#15). Due to some limitations in MLST setup, database updates can't be tied to snakemake rules output, and are thus made a dependency of the MLST execution. To manually force update (TAKES TIME!) run
* Update db_setups.smk. Added time warning on the update_MLST rule.
* Added database creation file. The database creation file is now chained into the MLST run, ensuring that if the database creation file does not exist, the DB is built.
* Removed multithreading since it resulted in an error when downloading the database

---------

Co-authored-by: SimoneScrima <simonescrima@gmail.com>

* Added create.date files for each db_setup rule (#19). The file is a simple way of investigating the creation date of each database used in the pipe.
* Updated readme documentation (#34)
* Changed the order of sections in the README.md
* Added description of which files to change when including species
* Update README.md
* Removed abstract and TCO from TOC

---------

Co-authored-by: Kasper Thystrup Karstensen <kathka@gmail.com>

* Strict minimization - Deletions (#29). Removed a lot of files relating to NBDev, to enhance the overview of the repository. The removed files are not used in the later development of this pipe.
* Removed additional samplesheet
* Fixed config and snakemake following issue #26 suggestion (#37). Validated on Ubuntu.
* Removed unsupported rules. For stability, unsupported rules were removed prior to pre-release. Rules correspond to #38, #40, and #41.
* Polished ResFinder command. ResFinder command is no longer a python module call, rather an executable command, that is, with an implicitly stated location of the KMerAligner database.

---------

Co-authored-by: SimoneScrima <simonescrima@gmail.com>
Co-authored-by: Rasmus Amund Henriksen <36986552+RAHenriksen@users.noreply.github.com>
Co-authored-by: Thor Bech Johannesen <thej@ssi.dk>
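
The commit message above describes databases being built automatically when their db folder is missing (#9) and a create.date file being written by each db_setup rule (#19). A minimal Python sketch of that idea follows. The function name `ensure_database`, the `fake_build` placeholder, the folder layout, and the ISO date format inside `create.date` are all assumptions made for illustration; they are not taken from the MMASeq rules.

```python
import tempfile
from datetime import date
from pathlib import Path
from typing import Callable


def ensure_database(db_dir: Path, build: Callable[[Path], None]) -> bool:
    """Build the database only if its folder is missing.

    Returns True when a build was triggered, False when the folder existed.
    (Hypothetical helper; the real pipeline uses Snakemake rules.)
    """
    if db_dir.is_dir():
        return False
    db_dir.mkdir(parents=True)
    build(db_dir)
    # Record when the database was created (file content format is assumed).
    (db_dir / "create.date").write_text(date.today().isoformat() + "\n")
    return True


def fake_build(db_dir: Path) -> None:
    """Stand-in for a real download-and-index step."""
    (db_dir / "db.fasta").write_text(">placeholder\nACGT\n")


with tempfile.TemporaryDirectory() as tmp:
    db = Path(tmp) / "databases" / "example_db"
    print(ensure_database(db, fake_build))  # first call triggers the build
    print(ensure_database(db, fake_build))  # second call finds the folder and skips
```

A folder left behind by an interrupted build would be skipped by a check like this one, which may be why the message above insists that the AMRFinder setup rule overwrites a preexisting database.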

* Fixed the config files and the Snakemake + rules to account for those changes
* The conda options for plasmidfinder and resfinder are back to general and no longer point to a specific env
* Fix paths
* Fixed all the envs YAML files. Fixed rule kma, emm_typing and most finders. Fixed E. coli YAML.
* Added conda yaml for CHtyper and cgMLST. Modified the E. coli config file for the assembly lineage rule. Modified the rule for assembly lineages to use "shell" rather than "run" and to execute convert_external_genome.py. Modified convert_external_genome.py to be executed as a script.
* Deactivated kmerfinder rule
* Relative config paths and configfile added in Snakefile. Part of requirements for resolving #5.
* Amended forgotten paths and made a clean .gitignore
* Added examples. Added a public E. coli sample as well as an Actinobacillus pleuropneumoniae sample to test the E. coli pipe as well as the pipe for unspecified samples. Addresses #6.
* Amend: Example set. DAMN dat exampleset: git add examples/! Relative paths now in sample sheets...
* Getting rid of Large File Storage dep. Added a single extra BP at the end of the very first contigs in both samples. The BP is a duplicate of the previous last BP. #6
* Fixed some paths in configfile. Added the resources folder to gitignore. Minor fixes in workflow/.
* Added AMRFinderPlus (#8). Added AMRFinderPlus; requires a preloaded DB pointed out by the Species_config file.
* Removed message, polished comments and added organisms support. Addressed issues with pull request #8. Organism support is a bit janky for now though, as you would be expected to provide the --organism XYZ directly in the species-specific config file. Alternatively you would have to leave a blank string ('').
* Automatic dbs (#9)
* Added rules for automatically building CGEFinders databases. Added logging to CGEfinder rules. Databases for CGEfinders are now automatically downloaded and indexed with KMA if their db folder is missing. Logging has been set up to enhance debugging and usage information. This commit contributes to #5, #6 and #7.
* Prepared for automatic AMRFinderDB. Premerge automatic DB of AMRFinder prior to implementation.
* Added AMRfinder DB and polished messaging. With the addition of AMRFinderPlus, automatic database handling has to come along. Now the database is automatically installed if missing. In addition, messages are now named after their rule names, to enhance debugging.
* Corrected log positions, amended AMRFinder db setup to names. Logs are now placed into subcategory folders (based on sample or according to subject, e.g. Databases).
* Cleaning: Removed redundant tools KMERFinder & Resistance Genes Detection (#10). Some tools are not used in practice any more and are thus redundant. Objects related to KMERFinder & Resistance Genes detection rules have been purged.

Co-authored-by: SimoneScrima <simonescrima@gmail.com>

* Added MLST by Torsten Seemann (#11). MLST has been added and formatted to fit the logging and messaging conventions of #9.
* Amend: Forgot comma in Snakefile
* Added temporary automatic DB handler for Ecoli kmerAligner and polished kmeraligner rule (#12)
* Added temporary automatic DB handler for Ecoli kmerAligner and polished kmeraligner rule. The current Ecoli kmeraligner database is hosted as part of a repo; to simplify, only the db file is downloaded using wget (thus wget has been added to envs/kmeraligner.yaml). The repo is probably not stable (not maintained) and thus this fix should be seen as a temporary addition to #7. To make the rule comply with the latest logging and messaging setup, the kmeraligner rule has received a polishing.
* Amend: Replaced wget with Curl. Since firewalls seem to block wget, Curl is used as a replacement for wget.
* Amend: EcoliKmerAligner set to True
* Force update AMRFinder setup. AMRfinder_update results in an error if the database for some reason is not completely set up. Thus, every time the setup rule is executed, it MUST overwrite the preexisting database (if it's the same version) to mitigate future setup errors. Also corrected EcoliKmerAligner message.
* Removed tasks not involved in routine. While these tasks are NOT discarded in the long run, excluding them makes the code more readable, while retaining the enhancements in commit history.
* Updated README - Pull FIRST. Make sure to update the supported tools section in all future pull requests, when relevant.
* Removed scripts and envs as well
* Added SerotypeFinder database creation (#14). SerotypeFinder database is now generated automatically (if missing). Added messaging and logging for serotypefinder rule.
* Added update rule for MLST (#15)
* Added update rule for MLST. Due to some limitations in MLST setup, database updates can't be tied to snakemake rules output, and are thus made a dependency of the MLST execution. To manually force update (TAKES TIME!) run
* Update db_setups.smk. Added time warning on the update_MLST rule.
* Added database creation file. The database creation file is now chained into the MLST run, ensuring that if the database creation file does not exist, the DB is built.
* Removed multithreading since it resulted in an error when downloading the database
* Added create.date files for each db_setup rule (#19). The file is a simple way of investigating the creation date of each database used in the pipe.
* Updated readme documentation (#34)
* Update README.md
* Changed the order of sections in the README.md
* Added description of which files to change when including species
* Update README.md
* Removed abstract and TCO from TOC
* Strict minimization - Deletions (#29). Removed a lot of files relating to NBDev, to enhance the overview of the repository. The removed files are not used in the later development of this pipe.
* Removed additional samplesheet
* Fixed config and snakemake following issue #26 suggestion (#37). Validated on Ubuntu.
* Removed unsupported rules. For stability, unsupported rules were removed prior to pre-release. Rules correspond to #38, #40, and #41.
* Polished ResFinder command. ResFinder command is no longer a python module call, rather an executable command, that is, with an implicitly stated location of the KMerAligner database.
* Updated Cdiff database
* Modularized C. diff database creation for toxins and repeats and added a separate snakemake rule to only test setup
* Added the C. diff tools to create final output files
* updated Cdiff test data to accommodate multiple STs and their ploidy level in genotype calling procedure
* adding Cdiff kma res handling to extend samplesheet
* modularized samtools and bcftools for cdiff to accommodate potential different species and other tools requiring operations - added assembly skesa
* separated bcftools index commands
* finalized output for Cdiff
* temporary Cdiff output wrangler works for SNP/DEL at tcdC 117 and toxin identification - missing 18,36,39,54 del
* tcdC region deletion accurately determined across several cases, using .call, mpileup indels
* Cdiff Data Wrangling done
* added temp to file for cleanup to minimize the output files for Cdiff results
* updated Cdiff wrangler to handle ambiguous deletions - needs clean up
* updated bcftools call and indels processing adding multiple del confirmation steps - such as including the consensus
* finalized threshold filtering on indels and added filtering for consensus sequence to increase support for ambiguous deletions
* changed ecoli and cdiff data output wrangler py files to more appropriate names
* removed renamed files
* ensured accurate logging for cdiff
* updated bcftools indexing
* Commented out kleborate and CHtyper
* Optimized skesa assembler. Skesa assembler still automatically allocates memory, yet now it takes all workflow cores or max 8 cores per sample. In addition, skesa output has been specified to enable easy inclusion in other rules. Finally, skesa output removed from rule all.
* Environment made less version specific. Environment couldn't be installed on local system, so removed a handful of dependencies and removed version requirements.
* changed structure to add generalized KMA filter python script based on snakemake rules
* changed the threshold filtering to species specific yaml configurations
* updated assembly options, and tools to include spades. Skesa unavailable for MacOS
* creating modularized repeat identifier and made thresholds in species specific config
* modularized the repeat identification, works as a rule for all samples and all assemblies set as true
* tmp update of wrangler to solely consider variant related parts
* slight change with initial config reading
* changed variant detection to solely rely on config files
* temporary variant identifier under development
* fixed repeat identifier to determine the combined TRST
* slight improvements for variant identification in loop of steps
* re-added deletion details and only search for the defined regions in the config
* added another step to determine if deletions slightly deviate from expected length (biological/sequencing/alignment artefact)
* adding gene_list - to generalized - still fails for multiple snps..
* tmp change for variant to add the variation cdiff snakemake rule
* updated and reclassified snp and deletions
* updated config and species mapping
* updated deletion classification from the consensus sequences
* updated Snakefile to remove species map
* generalized kma_filter to accommodate both E. coli and C. diff and removed the older data wrangler to keep the rule scripts
* updated combined KMA rule, such that it does depend on the status allowing for different rules for different species
* modularized KMA index from database setup, and generalized kmeraligner to work for E. coli and C. diff
* modularized samtools faidx for db
* generalized Cdiff TRST database and accurately trigger the db download in the others.smk Repeat_Identifier
* updated all databases to handle kma index rule - changed finders.smk to accommodate changes
* reverted back to previous commit 48290c9 last step for cdiff analysis
* cleanup : removing f-strings
* clean up : removed status and commented out unused rules missing databases
* Ignoring examples/Log
* Fix the databases instances calls in the resfinder rule
* Db generalization (#55)
* Cleaning database logic, cleaning output file names, and removing redundant results from rule ALL. Database calls for CGE finders are now internalized in independent rules only, rather than using an independent generalized kmer indexing rule. AMRFinder output files have been renamed. Several samtools and bcftools rules were called explicitly in rule All despite not being final results.
* Generalized mapping rules and database calls. A lot of the C. diff and E. coli specific custom mappers have been changed, with an attempt to simplify them. Major restructuring in the way databases are created and called.
* Custom identifiers tied into pipe. Variant identifier and Repeat identifier are now tied into the pipe. Generalized GenBank fetcher rule. Made a more generalized Cdiff Toxin DB generation rule, which is applicable for variant calling.
* Connected changes to rule all. Updated rule all to contain the new wildcards and result files.
* Fixed bed header issue. When creating Ctoxin DB, the bed6 header was missing a newline, resulting in a corrupt header, where certain data wasn't analysed (it was a silent error).
* N meningitidis (#35)
* Added menigotype rule to characterizers.smk for serotyping of Meningitidis isolates
* Added Nmeningitidis config file along with associated tools and rules needed to run analysis. Also updated Snakefile to include menigotype in rule all and include assemblers.smk
* Added Nmeningitidis reads
* Introduce kleborate (#57)
* Introduce module kleborate for K. pneumoniae and E. coli
* Added K. pneumoniae support and samples
* Fixes remaining conflicts and adjusted settings in the yaml. !There are print statements ending up inside the snakemake STDOUT, but since stdout and stderr are already set to be captured, it is difficult to redirect this into the respective logs.!
* Fix 21 kleborate (#63)
* Minor polishing. Changed keywords to options in species config, updated corresponding rule.
* Log cleanup (#44)
* Log update, partial. Log files inside python scripts were redirected separately from the snakemake rules, therefore they had to be changed to follow snakemake conventions.
* Halved the provided cores for skesa and spades. For development purposes, since Skesa is really slow, halving the cores ensures that the remaining pipe can continue, without halting entirely for the assembly step.
* Deleted old smk rule
* Htslib in separate file and adding bowtie (#69)
* split the KMA alignment (kmeralignment) into two separate rules for alignment and creating consensus sequence used for different purposes
* separated the samtools and bcftools which use the htslib library into a separate smk file (#71)
* updated the mappers to include bowtie2
* Bowtie2: Threads percentage replacement and renaming. Threads have been changed to 33.33% of provided cores, to enable multiple instances to run simultaneously. Adjustments are welcome! Renamed bowtie2aligner to just bowtie2. Cleansed configs according to #61 - Fields that are yet to be inserted into the code are generally discouraged! Removed consensus and bowtie2 files from rule All, as they are considered intermediate results. Added temp for intermediates in KMERconsensus rule.
* added database wildcard to custom_wrangler kma_filter
* fixed kma filter input
* updated log file for genbank fetcher before kma generalization change
* Confs and paths (#75), Part 1
* Hard coded environment paths and database names for tools. Cleaned configs. Now environment paths are relative to the workflow/envs/ folder, while databases like Resfinder and amrfinder no longer relate to the config file and are likewise hardcoded. Cleaned config and fixed Nmeningitidis issue on AMRfinder.
* Extensive config cleanup. Several species specific configs contained irrelevant and outdated information. Reduced these drastically to only contain relevant analysis and potential optional rules.
* Empty options for empty analysis, renamed config vars. In order to make analysis configs cleaner to look at, the options keyword has been added. It's not planned that options MUST be there, but it was added to lead users to understand how to provide arguments. Cleaned up configuration file variables and made them lowercase for a cleaner look.
* Confs and paths, Part 2 of 2
* Cleaned env files. Removed unused env files and collapsed envs doing the same thing.
* Kleborate patch + minor fixes
* Separated assemblies (#72)
* Separated assemblies and made central assembly call. Assemblies are now split into skesa, spades, or shovill. It's planned to add the assembly types as part of the species configs. Right now they're hardcoded into rule all. Should resolve #59, and should support #56 in future.
* Generalized genbank fetcher rule to work with metafiles (#73)
* updated genbank fetcher to extract partial regions of CDS both relative to the gene and genome
* added strand orientation correction to reverse complement sequences from negative strand
* Updated genbank fetcher and validated it works for accession region, locus region, locus and similar with info from meta file
* updated genbank fetcher to work for generalized samples
* final changes for fetcher and meta files before PR
* Migrated metadata path to overall config. Metadata path is now managed through the main config file.
* generalized KMA custom wrangler rule to accommodate multiple loci and species from metafile (#74)
* Ambiguity issues (#80)
* Combined the indexes for htslib. Ambiguity due to multiple runs of HTSlib Index can be prevented by running indexing for each pileup.
* Raah variantidentifier meta (#79)
* modularized variant identifier - into the first part being snp identifier with its separate rule and metadata
* separated deletion identifier from variant identifier to work independently from snp_identifier
* finalized snp- and variant identifier separation and generalization while cleaning up config by using metafiles
* minor changes

---------

Co-authored-by: RAHenriksen <rah@hotmail.dk>

---------

Co-authored-by: RAHenriksen <rah@hotmail.dk>

* minor clean up
* MissingInputException in setup_custom_kmeraligner_index for genbank fetcher. Discrepancy between genbank output folder and deletion- and snpidentifier input folder, creating MissingInputException in rule setup_custom_kmeraligner_index.
* minor clean up adding index and correct environment
* Removed the need for expand functionality in Rule all by defining list of results (#83)
* tmp try for list_result. Reducing the need for expand, by creating a list of expected results from samplesheet and species configs - currently only works for KMA related files
* list_results expanded to other analysis. Generalized the analysis to run to include settings and to handle a list of assemblies, with all rules using done flag
* fixed pointfinder db setup
* update config files. Renamed config files to accurate species nomenclature and ensured correct species name mapping
* test list_results for all finders. Updated finders for cdiff and ecoli species making sure all rules work with list_results function
* updated and tested rule all for all species and fixed different issues
* updated assembly output. Fixed the specific assembly output to accommodate the .done files which are created by list_results to ensure correct activation. Works for all species with one or more assemblies for skesa, spades, shovill
* clean up snakefile. Moved the list_results into its own independent python file to make the snakefile cleaner and less complicated.
* additional clean up of snakefile. Clean up of Snakefile by separating python functionality to the rule_all_function and keeping samplesheet and config structure in Snakefile
* moved touch to shell in attempt to use temp
* Minor corrections, Done contains messages. Messages to the user should have tool names with correct casing (e.g. ResFinder not resfinder). In addition, to increase user friendliness, a rule specific message has been added for each file.
* Reduced sample amount in dl_script.sh
* Minor fixes. Removed defaults from genbank_fetcher, as it MAY create output files W/O notifying the user with dest_missing type err.
* Added functionality for S. enterica (#85)
* S. enterica rules - seqsero2
* added sistr functionality
* Added salmonella sample to dl_script
* Spatyper draft. SMK should work, but command is untested and no output file is yet pointed to.
* Fixed blast (#84)
* Cleaned the custom_blaster rule
* Cleaned the code of custom_blaster, added a rule for custom blast databases; needs to be fixed later when we have an online repo in which to store databases, such that they can be fetched
* Added the forgotten oxa_ndm databases
* Replaced lambda functions and minor renaming. Lambda functions are not necessary as input for blasting, as such these have been replaced with rule handlers. Also polished the naming and message a bit. Does NOT work with S. enterica sistr analysis!
* Reverted characteriser assembly inputs and removed assembler done. Characteriser rules' input assemblies were generally changed from lambda functions to rules.assembly. Also removed the temp done file of the assembly rule to reduce rule all steps.
* Fixed small error in rule fetch_blast_database regarding output
* Added default species config and changed first test sample config. Added a defaults scheme, for running unknown species. Changed first sample from APP to default.
* Not tested due to local Conda issues with sistr installation, contains spa-typing and LREfinder from #89 & #90
* Fixed VirulenceF out, Polished Rules 'n configs, simplified sistr.yaml
* VirulenceFinder was missing a flag for extended output, which has been added.
* Polished finder rules to point to specific output files rather than dir.
* Polished cdiff_repeat output folder to match rule name exactly.
* Reduced amount of used assemblies in config files to speed up test runs.
* Changed list objects in configs to string, if only one item was specified (e.g. assemblies: [skesa] => assemblies: skesa)
* Redirected mock sample to default
* Added support to CHtyper via kmerfinder (#92)
* Added results catalogue
* update sample sheet to configs instead of organism
* initial long table simply reading files
* updated the file using wildcards extracted from configs
* longtable creation specific to each sample
* update long table to extend across samples into one file
* updated samplesheet and config
* update workflow for new config. Updated list all output, snakefile and config files to accommodate the new samplesheet directly referring to sample config and without the organism name map
* updated all snakemake files to accommodate no sample to organism map
* species config change
* Finalized LREFinder
* Rule all points to final results
* Repurposed logic for handling configurations per sample, and merged sample_configs with result files catalogue for populating rule all with all results files.
* Corrected list_results to include all results. Polishing and corrected seqsero
* list_result now takes configuration keys as lists, as two identical dict keys will overwrite each other.
* Results_catalogue import function included for cleaning up snakefile
* seqsero pointed to output dir rather than results files, thus smk couldn't define how to create output
* Renamed helperfunctions from rule_all_functions
* Removed unused config layer ('input manager')
* updated longtable to be a rule and to accommodate new config file
* changed sample specific output folder for longtable
* works with long table format for all samples and sample specific
* update logging for cdiff
* Refactor YAML structure for analyses configuration. Updated to latest configurations
* Added NDM-5 fragment
* changed database to accommodate currently remote custom github repo
* update default species values to accommodate the samplesheet structure and rules
* Migrating Metadata to config and cleaning up config file name (#98)
* The examples folder is dedicated to temporary and results files; for consistency, Metadata has been migrated to config
* Species config files have been renamed to a more system acceptable format.
* Removed Longtable example, as the pipe now outputs actual results.
* added Cdiff (#103)
* updated SNP identification. 1) Simplified snp identification to only consider genotype calls and metafile. 2) added DP filtering for metafile. 3) changed output format
* deletion identifier changes. Simplified input options similar to snp identifier, changed the scoring scheme for deletion identification, ensured correct genomic regional search for del
* temporary 3 category classification of deletions from call or pileup
* update scoring scheme for categories
* include the usage of consensus sequence to identify deletions with N in the region
* filtering on consensus sequence N percent
* updated the categories, by upgrading based on consensus sequence support
* added sources for the categories
* update deletion identifier for assembly alignment support
* added assembly-to-reference alignment to support deletion identification
* update with assembly for deletion identifier to activate minimap2
* slight change in de novo assembly classification
* Updated long table format to include row index from original input file
* version for custom download
* update serovar list version
* etag for s enterica and lrefinder
* added chtyper
* finalized all database versions
* added database versions to blast
* added tool versions to assemblers and custom aligners for ecoli
* Added Spatyperv2 script, a faster reimplementation of the original analysis

---------

Co-authored-by: SimoneScrima <simonescrima@gmail.com>
Co-authored-by: KasperThystrup <kathka@gmail.com>
Co-authored-by: RAHenriksen <rah@hotmail.dk>
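
The commit message above notes that list_result takes configuration keys as lists because two identical dict keys overwrite each other, and that rule all is fed from a list of expected results built from the samplesheet and configs. The Python sketch below illustrates both points under stated assumptions: the analysis names, the option names, and the `results/<sample>/<analysis>/<option>.done` path layout are hypothetical, and this `list_results` is an illustrative stand-in rather than the function in the MMASeq workflow.

```python
# Hypothetical analysis and option names; not taken from the MMASeq configs.

# A dict literal keeps only the last value for a repeated key.
as_dict = {"kmeraligner": "resfinder_db", "kmeraligner": "cdiff_toxin_db"}

# A list of (analysis, option) pairs keeps both entries.
as_pairs = [("kmeraligner", "resfinder_db"), ("kmeraligner", "cdiff_toxin_db")]


def list_results(sample, analyses):
    """Build one expected result path per (analysis, option) pair.

    The path layout is an assumption made for this sketch.
    """
    return [
        f"results/{sample}/{analysis}/{option}.done"
        for analysis, option in analyses
    ]


print(as_dict)
print(list_results("sample1", as_pairs))
```

With the dict form one of the two kmeraligner entries is silently lost, whereas the list form yields one expected result file for each entry.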

KasperThystrup and others added 3 commits August 6, 2025 15:22

Merge branch 'main' into 'dev-main' and 'dev-main' into 'main'

c2f1815

KasperThystrup merged commit f7e33aa into dev Apr 28, 2026
1 check passed

KasperThystrup deleted the main branch April 28, 2026 09:30

KasperThystrup restored the main branch April 28, 2026 09:31

KasperThystrup merged 3 commits into dev from main
