Simple peptide sequence comparison

I am trying to use your tool to compare the short peptide sequences of 25 peptides with each other. The results.m8.gz file should have 625 pairs of peptides outputted however it only has 30 pairs. This is either a default setting that I have yet to uncover or a bug. Here is my code:

#!/bin/bash
#SBATCH --job-name=mmseq_mini
#SBATCH --partition=devel
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=1:00:00
#SBATCH --output=mmseqs_allvsall_%j.out
#SBATCH --error=mmseqs_allvsall_%j.err

# -------------------------------------------------------------------
# Parameters that adjust sensitivity. These are 'maxed out' for maximum
#  sensitivity (and maximum output file size). Tweaking can reduce the outputfile
#  size if needed:
#    MIN_SEQ_ID (minimum sequence identity) is between 0 and 100;
#    SENSITIVITY is between 0 and 10;
#    MAX_SEQS (max # sequences to match per input sequence) Between 1 and the number of input sequences
MIN_SEQ_ID=0.0
MAX_SEQS=3000000
SENSITIVITY=10
# -------------------------------------------------------------------

# ---- Load environment ----
module purge
module load MMseqs2

# ---- Define paths ----
WORKROOT=$SLURM_SUBMIT_DIR
INPUT_TABLE="$WORKROOT/file_v1mini.txt"
TMPDIR_LOCAL="/tmp/$USER/mmseqs_${SLURM_JOB_ID}"
OUTDIR="$WORKROOT/mmseqs_allvsall_out"

mkdir -p "$OUTDIR"
mkdir -p "$TMPDIR_LOCAL"

# ---- Step 1. Convert input table (i.e., text metadata file containing SeqID and peptide sequence) to FASTA (unique headers) ----
PEPTIDE_FASTA="$TMPDIR_LOCAL/peptides_unique.fasta"
awk -F'\t' 'NR>1 && $10!="" {print ">"$1"__"NR"\n"$10}' "$INPUT_TABLE" > "$PEPTIDE_FASTA"

# ---- Step 2. Create MMseqs2 database ----
PEPTIDE_DB="$TMPDIR_LOCAL/peptidesDB"
mmseqs createdb "$PEPTIDE_FASTA" "$PEPTIDE_DB"

# ---- Step 3. Prepare header file for .m8 output ----
HEADER_FILE="$OUTDIR/out.header"
echo -e "query\ttarget\tpident\talnlen\tmismatch\tgapopen\tqstart\tqend\ttstart\ttend\tevalue\tbitscore" > "$HEADER_FILE"

# ---- Step 4. Run all-vs-all search ----
# On HPC 'mmseqs search' command needs to be wrapped by a call to srun to avoid a parallization failure (MPI)
RESULT_DB="$TMPDIR_LOCAL/resultDB"
srun mmseqs search "$PEPTIDE_DB" "$PEPTIDE_DB" "$RESULT_DB" "$TMPDIR_LOCAL" \
    --threads $SLURM_CPUS_PER_TASK \
    --min-seq-id $MIN_SEQ_ID \
    --max-seqs $MAX_SEQS \
    -s $SENSITIVITY

# ---- Step 5. Convert to BLAST-like tabular .m8 (compressed) ----
OUT_M8="$OUTDIR/out.m8.gz"
TMP_M8="$TMPDIR_LOCAL/out.tmp.m8"
mmseqs convertalis "$PEPTIDE_DB" "$PEPTIDE_DB" "$RESULT_DB" "$TMP_M8" \
    --format-output "query,target,pident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits"
gzip -c "$TMP_M8" > "$OUT_M8"
rm -f "$TMP_M8"

# ---- Step 6. Filter self-hits and short/low-identity alignments ----
FILTERED_M8="$OUTDIR/out.filtered.m8.gz"
zcat "$OUT_M8" | awk -F'\t' '$1 != $2 && $3 >= 30 && $4 >= 6' | gzip > "$FILTERED_M8"

# ---- Step 7. Optional: global alignment on filtered hits only ----
GLOBAL_RESULT_DB="$TMPDIR_LOCAL/resultDB_global"
GLOBAL_M8="$OUTDIR/out.filtered_global.m8.gz"

# Extract filtered IDs
zcat "$FILTERED_M8" | cut -f1,2 | tr '\t' '\n' | sort -u > "$TMPDIR_LOCAL/filtered_ids.txt"

# ---- Step 8. Cleanup ----
rm -rf "$TMPDIR_LOCAL"

echo "[$(date)] All-vs-all MMseqs2 workflow completed successfully."


Any thoughts or suggestions would be greatly appreciated.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Simple peptide sequence comparison #1057

-------------------------------------------------------------------

Parameters that adjust sensitivity. These are 'maxed out' for maximum

sensitivity (and maximum output file size). Tweaking can reduce the outputfile

size if needed:

MIN_SEQ_ID (minimum sequence identity) is between 0 and 100;

SENSITIVITY is between 0 and 10;

MAX_SEQS (max # sequences to match per input sequence) Between 1 and the number of input sequences

-------------------------------------------------------------------

---- Load environment ----

---- Define paths ----

---- Step 1. Convert input table (i.e., text metadata file containing SeqID and peptide sequence) to FASTA (unique headers) ----

---- Step 2. Create MMseqs2 database ----

---- Step 3. Prepare header file for .m8 output ----

---- Step 4. Run all-vs-all search ----

On HPC 'mmseqs search' command needs to be wrapped by a call to srun to avoid a parallization failure (MPI)

---- Step 5. Convert to BLAST-like tabular .m8 (compressed) ----

---- Step 6. Filter self-hits and short/low-identity alignments ----

---- Step 7. Optional: global alignment on filtered hits only ----

Extract filtered IDs

---- Step 8. Cleanup ----

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Simple peptide sequence comparison #1057

Description

-------------------------------------------------------------------

Parameters that adjust sensitivity. These are 'maxed out' for maximum

sensitivity (and maximum output file size). Tweaking can reduce the outputfile

size if needed:

MIN_SEQ_ID (minimum sequence identity) is between 0 and 100;

SENSITIVITY is between 0 and 10;

MAX_SEQS (max # sequences to match per input sequence) Between 1 and the number of input sequences

-------------------------------------------------------------------

---- Load environment ----

---- Define paths ----

---- Step 1. Convert input table (i.e., text metadata file containing SeqID and peptide sequence) to FASTA (unique headers) ----

---- Step 2. Create MMseqs2 database ----

---- Step 3. Prepare header file for .m8 output ----

---- Step 4. Run all-vs-all search ----

On HPC 'mmseqs search' command needs to be wrapped by a call to srun to avoid a parallization failure (MPI)

---- Step 5. Convert to BLAST-like tabular .m8 (compressed) ----

---- Step 6. Filter self-hits and short/low-identity alignments ----

---- Step 7. Optional: global alignment on filtered hits only ----

Extract filtered IDs

---- Step 8. Cleanup ----

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions