Skip to content

Simple peptide sequence comparison #1057

@k-blenman

Description

@k-blenman

I am trying to use your tool to compare the short peptide sequences of 25 peptides with each other. The results.m8.gz file should have 625 pairs of peptides outputted however it only has 30 pairs. This is either a default setting that I have yet to uncover or a bug. Here is my code:

#!/bin/bash
#SBATCH --job-name=mmseq_mini
#SBATCH --partition=devel
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=1:00:00
#SBATCH --output=mmseqs_allvsall_%j.out
#SBATCH --error=mmseqs_allvsall_%j.err

-------------------------------------------------------------------

Parameters that adjust sensitivity. These are 'maxed out' for maximum

sensitivity (and maximum output file size). Tweaking can reduce the outputfile

size if needed:

MIN_SEQ_ID (minimum sequence identity) is between 0 and 100;

SENSITIVITY is between 0 and 10;

MAX_SEQS (max # sequences to match per input sequence) Between 1 and the number of input sequences

MIN_SEQ_ID=0.0
MAX_SEQS=3000000
SENSITIVITY=10

-------------------------------------------------------------------

---- Load environment ----

module purge
module load MMseqs2

---- Define paths ----

WORKROOT=$SLURM_SUBMIT_DIR
INPUT_TABLE="$WORKROOT/file_v1mini.txt"
TMPDIR_LOCAL="/tmp/$USER/mmseqs_${SLURM_JOB_ID}"
OUTDIR="$WORKROOT/mmseqs_allvsall_out"

mkdir -p "$OUTDIR"
mkdir -p "$TMPDIR_LOCAL"

---- Step 1. Convert input table (i.e., text metadata file containing SeqID and peptide sequence) to FASTA (unique headers) ----

PEPTIDE_FASTA="$TMPDIR_LOCAL/peptides_unique.fasta"
awk -F'\t' 'NR>1 && $10!="" {print ">"$1"__"NR"\n"$10}' "$INPUT_TABLE" > "$PEPTIDE_FASTA"

---- Step 2. Create MMseqs2 database ----

PEPTIDE_DB="$TMPDIR_LOCAL/peptidesDB"
mmseqs createdb "$PEPTIDE_FASTA" "$PEPTIDE_DB"

---- Step 3. Prepare header file for .m8 output ----

HEADER_FILE="$OUTDIR/out.header"
echo -e "query\ttarget\tpident\talnlen\tmismatch\tgapopen\tqstart\tqend\ttstart\ttend\tevalue\tbitscore" > "$HEADER_FILE"

---- Step 4. Run all-vs-all search ----

On HPC 'mmseqs search' command needs to be wrapped by a call to srun to avoid a parallization failure (MPI)

RESULT_DB="$TMPDIR_LOCAL/resultDB"
srun mmseqs search "$PEPTIDE_DB" "$PEPTIDE_DB" "$RESULT_DB" "$TMPDIR_LOCAL"
--threads $SLURM_CPUS_PER_TASK
--min-seq-id $MIN_SEQ_ID
--max-seqs $MAX_SEQS
-s $SENSITIVITY

---- Step 5. Convert to BLAST-like tabular .m8 (compressed) ----

OUT_M8="$OUTDIR/out.m8.gz"
TMP_M8="$TMPDIR_LOCAL/out.tmp.m8"
mmseqs convertalis "$PEPTIDE_DB" "$PEPTIDE_DB" "$RESULT_DB" "$TMP_M8"
--format-output "query,target,pident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits"
gzip -c "$TMP_M8" > "$OUT_M8"
rm -f "$TMP_M8"

---- Step 6. Filter self-hits and short/low-identity alignments ----

FILTERED_M8="$OUTDIR/out.filtered.m8.gz"
zcat "$OUT_M8" | awk -F'\t' '$1 != $2 && $3 >= 30 && $4 >= 6' | gzip > "$FILTERED_M8"

---- Step 7. Optional: global alignment on filtered hits only ----

GLOBAL_RESULT_DB="$TMPDIR_LOCAL/resultDB_global"
GLOBAL_M8="$OUTDIR/out.filtered_global.m8.gz"

Extract filtered IDs

zcat "$FILTERED_M8" | cut -f1,2 | tr '\t' '\n' | sort -u > "$TMPDIR_LOCAL/filtered_ids.txt"

---- Step 8. Cleanup ----

rm -rf "$TMPDIR_LOCAL"

echo "[$(date)] All-vs-all MMseqs2 workflow completed successfully."

Any thoughts or suggestions would be greatly appreciated.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions