Introduction

The unseen microbial world affects us every day: microbes cause and prevent disease, spoil and flavour food, and help ruminant animals digest grass. Bacteria are an important part of the world around us, and understanding their types and behaviours can improve our food and medical systems. Many environmental microbiologists estimate that less than 2% of the bacteria in the environment can be cultured under laboratory conditions. This gap may be due to a lack of suitable culturing techniques, a need for specific nutrients that cannot be supplied in the laboratory, or a dependence on interactions with other bacteria in order to live and grow. As a result, many bacteria are missed in traditional Petri plate-based studies.

Metagenomics is the study of genetic material (genomes) extracted directly from environmental samples. This contrasts with the traditional reductionist approach in microbiology, in which a specific strain is isolated, purified and eventually has its genome sequenced. Metagenomics helps us understand the ecology of the bacteria living in an environment, analyse them in their natural state and, finally, understand their importance to human and animal health.

16S rRNA sequencing is the most common structure-based metagenomics experiment one can conduct on an observed community of bacteria. A 16S rRNA-based experiment gives a general survey of "what kinds of bacteria are present in the community, and how many of each?". A key advantage of a 16S rRNA-based experiment is that this region is conserved across almost all bacteria yet variable enough to distinguish populations across samples. One can think of it as a "fingerprint" for microbes.

Figure: a 30S ribosomal subunit from Thermus thermophilus, which contains the 16S ribosomal RNA.

Source: Wikipedia

Input Data

The input data can be passed to 16SMaRT in two different ways using the input argument, either:

  • a comma-separated datasheet containing NCBI SRA IDs (or a URL to a CSV file).
  • a directory containing a list of FASTQ files.

CSV DataSheet

The CSV DataSheet must be of the following format.

| Column | Description |
| --- | --- |
| group | A group of FASTQ files (or a study). |
| sra | NCBI SRA ID. |
| layout | Single-end or paired-end sequencing (values: single, paired). |
| primer_f | Forward primer. |
| primer_r | Reverse primer. |
| trimmed | Whether this sequence has already been trimmed (values: true, false). |
| min_length | Minimum length used to screen a sequence. |
| max_length | Maximum length used to screen a sequence. |

For example, take a look at a sample.csv used in our sample pipeline. You can then provide the parameter as follows:

input="/work/input.csv|<YOUR_URL_TO_CSV_FILE>"

Each SRA ID is then fetched and the FASTQ files are saved onto disk within your data directory.
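
A datasheet in the format described above could be checked before a run with a few lines of Python. This is only a minimal sketch, not part of 16SMaRT: the function name `read_datasheet` and the values in `SAMPLE` (including the SRA ID) are made up for illustration.

```python
import csv
import io

# Column names taken from the datasheet description above.
REQUIRED_COLUMNS = (
    "group", "sra", "layout", "primer_f", "primer_r",
    "trimmed", "min_length", "max_length",
)

def read_datasheet(handle):
    """Parse and validate datasheet rows from an open text handle (hypothetical helper)."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    rows = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        layout = row["layout"].strip().lower()
        if layout not in ("single", "paired"):
            raise ValueError(f"line {line_no}: layout must be 'single' or 'paired'")
        trimmed = row["trimmed"].strip().lower()
        if trimmed not in ("true", "false"):
            raise ValueError(f"line {line_no}: trimmed must be 'true' or 'false'")
        min_length = int(row["min_length"])
        max_length = int(row["max_length"])
        if min_length > max_length:
            raise ValueError(f"line {line_no}: min_length exceeds max_length")
        rows.append({
            "group": row["group"].strip(),
            "sra": row["sra"].strip(),
            "layout": layout,
            "primer_f": row["primer_f"].strip(),
            "primer_r": row["primer_r"].strip(),
            "trimmed": trimmed == "true",
            "min_length": min_length,
            "max_length": max_length,
        })
    return rows

# Illustrative content only; the SRA ID is a placeholder, not a real accession.
SAMPLE = """group,sra,layout,primer_f,primer_r,trimmed,min_length,max_length
study_a,SRR0000001,paired,GTGCCAGCMGCCGCGGTAA,GGACTACHVGGGTWTCTAAT,false,200,300
"""

if __name__ == "__main__":
    for entry in read_datasheet(io.StringIO(SAMPLE)):
        print(entry["group"], entry["sra"], entry["layout"])
```

Validating up front means a malformed row is reported with its line number before any SRA download starts.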

Quality Control

16SMaRT uses FastQC and MultiQC for Quality Control. By default, this runs right after the FASTQ files are read. The FastQC results for each FASTQ file are written to <data_dir>/fastqc, and the MultiQC report is written to <data_dir>/multiqc_report.html.

Quality Control can be disabled by providing the parameters as follows:

fastqc=False; multiqc=False

Preprocessing

| Key | Type | Description | Default |
| --- | --- | --- | --- |
| trim_chunks | integer | Number of group configurations to run in parallel during trimming. | 8 |
| quality_average | integer | Calculate the average quality score for each sequence and remove sequences whose average is below this value. | 35 |
| maximum_ambiguity | integer | mothur's maxambig parameter, used by trim.seqs. | 0 |
| maximum_homopolymers | integer | mothur's maxhomop parameter, used by trim.seqs. | 8 |
| primer_difference | integer | mothur's pdiffs parameter, used by trim.seqs. | 5 |
| classification_cutoff | integer | Confidence cutoff for taxonomic classification (presumably mothur's cutoff parameter for classify.seqs). | 80 |
| cutoff_level | float | The cutoff parameter allows you to specify a consensus confidence threshold for your taxonomy. | 0.03 |
| filter_taxonomy | array | Taxonomy to be removed. | ["chloroplast", "mitochondria", "archaea", "eukaryota", "unknown"] |
| taxonomy_level | integer | mothur's taxlevel parameter. | 6 |
| silva_pcr_start | integer | Start position when performing a PCR over the SILVA database. | not stated |
| silva_pcr_end | integer | End position when performing a PCR over the SILVA database. | not stated |
| silva_version | string | SILVA version to be downloaded. Available versions are listed here. | 132 |
| minimal_output | boolean | A minimal output optimizes the entire pipeline to use minimal disk resources (i.e., all intermediate resources are deleted). | False |
| jobs | integer | Number of jobs to use during a pipeline run. | number of CPUs |
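
To make the relationship between these keys and mothur concrete, the sketch below formats a `trim.seqs` command string from the trimming defaults listed above. It is an assumption about how such a mapping might look, not 16SMaRT's actual code: the helper `build_trim_seqs_command`, the key-to-parameter mapping (including `quality_average` to `qaverage`) and the file names are all hypothetical.

```python
# Defaults copied from the parameter list above.
TRIM_DEFAULTS = {
    "quality_average": 35,
    "maximum_ambiguity": 0,
    "maximum_homopolymers": 8,
    "primer_difference": 5,
}

# Assumed mapping from pipeline keys to mothur trim.seqs parameter names.
KEY_TO_MOTHUR = {
    "quality_average": "qaverage",
    "maximum_ambiguity": "maxambig",
    "maximum_homopolymers": "maxhomop",
    "primer_difference": "pdiffs",
}

def build_trim_seqs_command(fasta, qfile, oligos, overrides=None):
    """Return a mothur trim.seqs command string (hypothetical helper)."""
    settings = {**TRIM_DEFAULTS, **(overrides or {})}
    unknown = sorted(set(settings) - set(KEY_TO_MOTHUR))
    if unknown:
        raise KeyError(f"unknown trimming keys: {unknown}")
    args = [f"fasta={fasta}", f"qfile={qfile}", f"oligos={oligos}"]
    # Emit parameters in a fixed order so the command is reproducible.
    args += [f"{KEY_TO_MOTHUR[key]}={settings[key]}" for key in KEY_TO_MOTHUR]
    return "trim.seqs(" + ", ".join(args) + ")"

if __name__ == "__main__":
    # File names are placeholders for illustration.
    print(build_trim_seqs_command("sample.fasta", "sample.qual", "sample.oligos"))
```

Passing `overrides={"primer_difference": 2}` would change only `pdiffs` and leave the other defaults in place.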

Diversity Analysis

Abundance Chart

[Figure: abundance charts for raw data and rarefied data]

Alpha Diversity

Alpha diversity describes the diversity or richness of the bacterial community within a sample. Its indices approximate the number of different species or operational taxonomic units (OTUs) present. Alpha diversity can be estimated in a variety of ways; this pipeline considers the Observed, Chao1, ACE, Shannon, Simpson, Inverse-Simpson and Fisher estimates. Each of these metrics has different assumptions and weaknesses, so all of them are included. For example, the Observed metric simply uses the raw observed counts tabulated from rarefied or non-rarefied sequence data, whereas Chao1 estimates total “richness” from the observed species plus the number of “rare” taxa that occur only once or twice, and assumes a Poisson distribution. Bacterial communities do generally follow this distribution, as there are many “rare” taxa in nature, so this metric is often preferred in microbial community analysis because it addresses potential skew. The Shannon metric provides additional information, as it considers both “richness” and “evenness”, that is, the proportion of each species relative to the whole community.
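
As a rough illustration of how some of these indices are computed from a single sample's OTU counts, here is a minimal Python sketch. It is not the pipeline's implementation: the function name `alpha_diversity` is hypothetical, Chao1 is written in its bias-corrected form (an assumption about which variant is wanted), and ACE and Fisher are left out for brevity.

```python
import math

def alpha_diversity(counts):
    """Compute a few alpha diversity indices from per-OTU counts of one sample."""
    counts = [c for c in counts if c > 0]  # ignore OTUs absent from the sample
    total = sum(counts)
    if total == 0:
        raise ValueError("sample contains no reads")

    observed = len(counts)                          # observed richness
    singletons = sum(1 for c in counts if c == 1)   # OTUs seen exactly once
    doubletons = sum(1 for c in counts if c == 2)   # OTUs seen exactly twice
    # Bias-corrected Chao1: richness plus a correction driven by rare OTUs.
    chao1 = observed + singletons * (singletons - 1) / (2 * (doubletons + 1))

    proportions = [c / total for c in counts]
    sum_sq = sum(p * p for p in proportions)
    return {
        "observed": observed,
        "chao1": chao1,
        "shannon": -sum(p * math.log(p) for p in proportions),  # natural log
        "simpson": 1 - sum_sq,
        "inverse_simpson": 1 / sum_sq,
    }

if __name__ == "__main__":
    # Made-up counts for illustration.
    print(alpha_diversity([10, 10, 1, 1, 2, 0]))
```

For a perfectly even community of four OTUs, the Shannon index equals ln(4) and the Inverse-Simpson index equals 4, which is a handy sanity check.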

[Figure: alpha diversity plots for raw data and rarefied data]

Beta Diversity

Beta diversity estimates give researchers a way to describe differences in microbial communities between samples, or how “related” one community is to another. “Relatedness” can be described using phylogenetic distances, the occurrence rate of individual taxa, or both. Currently, this pipeline uses an ordination based on Bray-Curtis dissimilarity. This measure is commonly used in ecological studies and uses only the abundance of each species across the samples to calculate their dissimilarity. It does not require phylogenetic distances, which can take considerable storage space and processing time to generate. Future optimization will consider the Weighted UniFrac distance, a measure that does use the phylogeny, “weighted” by occurrence rate.
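
The Bray-Curtis dissimilarity between two samples is the sum of absolute abundance differences divided by the total abundance of both samples. The Python sketch below shows that calculation on made-up counts; the function name `bray_curtis` is hypothetical, and the sketch covers only the dissimilarity itself, not the ordination step the pipeline applies on top of it.

```python
def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity between two abundance vectors over the same OTUs."""
    if len(sample_a) != len(sample_b):
        raise ValueError("both samples must list the same OTUs in the same order")
    total = sum(sample_a) + sum(sample_b)
    if total == 0:
        raise ValueError("both samples are empty")
    difference = sum(abs(a - b) for a, b in zip(sample_a, sample_b))
    return difference / total  # 0 = identical composition, 1 = no shared OTUs

if __name__ == "__main__":
    # Made-up abundances for three OTUs in two samples.
    print(bray_curtis([10, 0, 5], [5, 5, 5]))
```

Because only abundances enter the formula, no phylogenetic tree is needed, which matches the motivation given above.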

[Figure: beta diversity plots for raw data and rarefied data]