The following rulegraph was produced by Snakemake and shows the general directed acyclic graph for the steps in the pipeline. Steps are described in detail below.
Here is a list of all of the steps/rules in ViroMatch’s Snakefile, ordered by occurrence; each step of the pipeline and its purpose is outlined in detail below.
The pipeline accepts a uBAM file or paired FASTQ files as --input from the command line. This step prepares the input file for downstream processing. If a uBAM file is supplied, it is converted to paired FASTQ files. Paired FASTQ input files are simply linked into the output directory for processing.
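For illustration, here is a minimal Python sketch of this staging logic, assuming samtools is available for the uBAM-to-FASTQ conversion; the function name, file names, and the comma-separated FASTQ convention are assumptions for the example, not the pipeline’s actual invocation:

    import os
    import subprocess

    def stage_input(input_path, outdir):
        """Stage the --input as paired FASTQ files in the output directory (illustrative only)."""
        r1 = os.path.join(outdir, "input_r1.fastq")
        r2 = os.path.join(outdir, "input_r2.fastq")
        if input_path.endswith(".bam"):
            # Convert an unaligned BAM to paired FASTQ; singleton/other reads are discarded here.
            subprocess.run(
                ["samtools", "fastq", "-1", r1, "-2", r2,
                 "-0", "/dev/null", "-s", "/dev/null", input_path],
                check=True)
        else:
            # Paired FASTQ input (assumed here to be "R1.fastq,R2.fastq"):
            # simply link the files into the output directory.
            fq1, fq2 = input_path.split(",")
            os.symlink(os.path.abspath(fq1), r1)
            os.symlink(os.path.abspath(fq2), r2)
        return r1, r2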
The paired FASTQ files are adaptor trimmed in this step. Reads are trimmed for adaptor sequence (--adaptor), trimmed or revised based on end quality (--endqual), filtered based on the percent of Ns allowed post-trimming (--minn), and removed if they are too short after trimming (--readlen).
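The length and N-content filters can be pictured with a short Python sketch; this is only a simplified illustration of what --minn and --readlen control, and the default thresholds below are arbitrary assumptions rather than the trimmer ViroMatch actually runs:

    def passes_filters(seq, min_len=50, max_n_pct=40.0):
        """Keep a trimmed read only if it is long enough and not mostly Ns."""
        if len(seq) < min_len:                        # --readlen: drop too-short reads
            return False
        n_pct = 100.0 * seq.upper().count("N") / len(seq)
        return n_pct <= max_n_pct                     # --minn: cap on percent Ns

    # Example: a 10 bp read with 5 Ns (50% N) fails the assumed 40% cutoff.
    print(passes_filters("ACGTNNNNNA", min_len=5))    # False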
This step evaluates the resultant FASTQ files produced by the trim_fastq_files step to see if the FASTQ files are blank or have sequence entries present. Some of the executables in the pipeline will not accept “blank” FASTQ files, so this step prepares the next step in the pipeline if the files have no entries.
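A blank-file check like this can be done in a few lines of Python; the gzip handling and function name below are assumptions for illustration:

    import gzip

    def fastq_has_entries(path):
        """Return True if the FASTQ file contains at least one non-empty line."""
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt") as fh:
            for line in fh:
                if line.strip():
                    return True
        return False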
The trimmed FASTQ files are evaluated for low-complexity sequence in this step, and low-complexity bases are “hard masked” as Ns. Reads are also filtered based on the percent of Ns allowed post-masking (--minn).
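As a rough illustration of hard masking, the sketch below masks any window that contains very few distinct bases; the real pipeline may use a dedicated masking tool, and the window size and threshold here are arbitrary assumptions:

    def hard_mask_low_complexity(seq, window=20, min_distinct=3):
        """Hard mask (with N) any window that contains too few distinct bases."""
        seq = seq.upper()
        mask = [False] * len(seq)
        for start in range(0, max(1, len(seq) - window + 1)):
            chunk = seq[start:start + window]
            if len(set(chunk)) < min_distinct:        # e.g. poly-A or simple-repeat stretches
                for i in range(start, start + len(chunk)):
                    mask[i] = True
        return "".join("N" if m else b for b, m in zip(seq, mask))

    masked = hard_mask_low_complexity("ACGTACGTGG" + "A" * 30 + "ACGTACGTGG")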
The trimmed and masked paired FASTQ sequences are mapped by nucleotide alignment to a reference host (--host) genome in this step.
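A hedged sketch of what a host-alignment step can look like, assuming a BWA-MEM style aligner and samtools (ViroMatch’s actual aligner, options, and file names may differ):

    import subprocess

    def map_to_host(r1, r2, host_index, out_bam, threads=4):
        """Align paired reads to the host genome and write a BAM file (illustrative only)."""
        bwa = subprocess.Popen(
            ["bwa", "mem", "-t", str(threads), host_index, r1, r2],
            stdout=subprocess.PIPE)
        # Stream the SAM output from bwa mem into samtools to produce BAM.
        subprocess.run(["samtools", "view", "-b", "-o", out_bam, "-"],
                       stdin=bwa.stdout, check=True)
        bwa.stdout.close()
        if bwa.wait() != 0:
            raise RuntimeError("bwa mem failed")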
For the paired FASTQ files, the unmapped reads are partitioned into their own BAM files.
The unmapped host BAM files are converted to FASTQ format.
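With samtools, partitioning the host-unmapped reads and converting them back to FASTQ can be sketched roughly as below; the flag choice (0x4, the read-unmapped SAM flag) and file names are assumptions, and the pipeline’s exact commands may differ:

    import subprocess

    def extract_host_unmapped(host_bam, unmapped_bam, r1_out, r2_out):
        """Partition host-unmapped reads into a BAM, then convert them to paired FASTQ."""
        # SAM flag 0x4 marks reads that did not align to the host genome.
        subprocess.run(["samtools", "view", "-b", "-f", "4", "-o", unmapped_bam, host_bam],
                       check=True)
        subprocess.run(["samtools", "fastq", "-1", r1_out, "-2", r2_out,
                        "-0", "/dev/null", "-s", "/dev/null", unmapped_bam],
                       check=True)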
The unmapped host FASTQ files (reads that were trimmed, masked, and did not map to the host genome) are mapped by nucleotide alignment to a virus-only reference database in this step of the pipeline. Reads that map to this database have putative viral identities, but will undergo validation in later steps of the pipeline.
Reads that were not mapped to the viral-only database are partitioned into their own BAM files.
Reads that were not mapped to the viral-only database are written to their own FASTQ files, from the unmapped BAM files.
Reads that were mapped to the viral-only database are partitioned into their own BAM files.
Reads that were mapped to the viral-only database are written to their own FASTQ files, from the mapped BAM files.
This step evaluates the resultant FASTQ files produced by the viral nucleotide mapping steps to see if the FASTQ files are blank or have sequence entries present. Some of the executables in the pipeline will not accept “blank” FASTQ files, so this step prepares the next step in the pipeline if the files have no entries.
The reads (unmapped FASTQ files) that did not map to the viral-only databases by nucleotide alignment are aligned to a six-frame translated version of the same viral-only database by translated nucleotide alignment. Reads that did not map before may align in this manner given the greater sensitivity of the translated alignment approach.
We convert the translated alignment results from their binary format to a human readable tab-delimited format.
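These two steps follow the pattern of a DIAMOND-style translated aligner, which writes a binary archive that is then dumped to BLAST-style tab-delimited text; the sketch below assumes DIAMOND is the tool in use, which may not match ViroMatch exactly:

    import subprocess

    def translated_align(fastq, protein_db, daa_out, tsv_out, threads=4):
        """Run a translated (blastx-style) search, then convert the binary output to TSV."""
        subprocess.run(["diamond", "blastx", "--db", protein_db, "--query", fastq,
                        "--daa", daa_out, "--threads", str(threads)], check=True)
        # Dump the binary .daa archive as BLAST tabular (outfmt 6) text.
        subprocess.run(["diamond", "view", "--daa", daa_out, "--out", tsv_out,
                        "--outfmt", "6"], check=True)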
We require the read ids in list format for the reads that mapped by translated alignment. This step collects the read ids for translated alignment mapped reads.
Based on the list of read ids for translated alignment mapped reads, we can write FASTQ files for just the reads that were mapped by translated alignment.
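A minimal Python sketch of these two steps, collecting query ids from the tab-delimited hits (the query id is column 1 in BLAST tabular output) and then subsetting the FASTQ; the function names and the /1 and /2 suffix handling are assumptions:

    def mapped_read_ids(hits_tsv):
        """Collect the unique query (read) ids from a BLAST-tabular hits file."""
        with open(hits_tsv) as fh:
            return {line.split("\t", 1)[0] for line in fh if line.strip()}

    def write_subset_fastq(fastq_in, fastq_out, keep_ids):
        """Write only the FASTQ records whose read id is in keep_ids."""
        with open(fastq_in) as fin, open(fastq_out, "w") as fout:
            while True:
                record = [fin.readline() for _ in range(4)]
                if not record[0]:
                    break
                read_id = record[0][1:].split()[0].split("/")[0]
                if read_id in keep_ids:
                    fout.writelines(record)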
In this step, we merge all of the reads that were mapped by (1) nucleotide alignment and (2) translated nucleotide alignment. Cumulatively, these reads are representative of all of the putative viral identity reads from the original sequences input into the pipeline; however, these candidate reads must be further validated by the pipeline.
The candidate viral reads in the merged FASTQ files are now mapped by nucleotide alignment to all of the references in the NCBI nt database. While a read may have a good hit to the viral-only reference database, it may have a better or equally good hit to a non-viral reference in NCBI nt. Aligning to NCBI nt would be prohibitively slow for all input reads, so we only do this validation step for candidate reads that aligned to the virus-only database. NCBI nt is too large to index for alignment, so we have split nt into multiple indexed reference databases for independent alignment. A candidate read will be aligned to all of the split nt databases and a best-hit will be determined by the pipeline downstream.
All of the split nt alignment BAM files are converted to SAM format for the mapped reads in this step.
All of the split nt alignment BAM files are converted to SAM format for the unmapped reads in this step.
For the read-1 FASTQ reads, all of the split nt SAM files for the mapped reads are merged into a single SAM file that is sorted by read id names. The SAM file needs to be sorted by read names because the pipeline’s downstream best hit logic evaluates reads by read blocks: blocks of hits for the same read id.
For the read-2 FASTQ reads, all of the split nt SAM files for the mapped reads are merged into a single SAM file that is sorted by read id names. The SAM file needs to be sorted by read names because the pipeline’s downstream best hit logic evaluates reads by read blocks: blocks of hits for the same read id.
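Because the downstream best-hit logic walks name-sorted read blocks, the merged SAM has to be ordered by read id (QNAME). A simple Python sketch of merging and name-sorting the split nt SAM files, ignoring header reconciliation details that the real pipeline must handle:

    def merge_and_name_sort(sam_paths, out_path):
        """Concatenate alignment lines from split SAM files and sort them by read id (QNAME)."""
        records = []
        for path in sam_paths:
            with open(path) as fh:
                for line in fh:
                    if not line.startswith("@"):       # skip per-file SAM header lines
                        records.append(line)
        records.sort(key=lambda line: line.split("\t", 1)[0])
        with open(out_path, "w") as out:
            out.writelines(records)

The same ordering can also be produced on BAM input with samtools sort -n, which sorts alignments by read name.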
For the read-1 FASTQ reads, we require the read ids in list format for the candidate viral reads that were not mapped to NCBI nt. This step collects the read ids for NCBI nt unmapped reads.
For the read-2 FASTQ reads, we require the read ids in list format for the candidate viral reads that were not mapped to NCBI nt. This step collects the read ids for NCBI nt unmapped reads.
In this step, based on the unmapped read ids collected in previous steps, we pull the candidate reads that did not map to NCBI nt references and merge them into FASTQ files, one for each respective original read pair.
This step evaluates the resultant FASTQ files produced by the NCBI nt mapping steps to see if the FASTQ files are blank or have sequence entries present. Some of the executables in the pipeline will not accept “blank” FASTQ files, so this step prepares the next step in the pipeline if the files have no entries.
The FASTQ files (from step #26) contain the viral candidate reads that did not map to the NCBI nt database by nucleotide alignment. These reads will now be mapped to NCBI nr by translated nucleotide alignment. Reads that did not map before may align in this manner given the greater sensitivity of the translated alignment approach. Aligning to NCBI nr would be prohibitively slow for all input reads, so we only do this validation step for candidate reads that initially aligned to the virus-only database, but did not align to NCBI nt. NCBI nr is too large to index for alignment, so we have split nr into multiple indexed reference databases for independent alignment. A candidate read will be aligned to all of the split nr databases and a best-hit will be determined by the pipeline downstream.
We convert the translated alignment results from their binary format to a human readable tab-delimited format.
For the read-1 FASTQ reads, all of the split nr TSV files for the mapped reads are merged into a single TSV file that is sorted by read id names. The TSV file needs to be sorted by read names because the pipeline’s downstream best hit logic evaluates reads by read blocks: blocks of hits for the same read id.
For the read-2 FASTQ reads, all of the split nr TSV files for the mapped reads are merged into a single TSV file that is sorted by read id names. The TSV file needs to be sorted by read names because the pipeline’s downstream best hit logic evaluates reads by read blocks: blocks of hits for the same read id.
In this step, the candidate viral reads that mapped to NCBI nt are evaluated. For each candidate read, all of the associated hits (read block) for the read are evaluated and a best representative hit is chosen. It is during this step that the pipeline writes sanity files which describe the pass/fail status for all of the hits evaluated. A SAM file is written for the chosen viral best hits.
In this step, the candidate viral reads that mapped to NCBI nr are evaluated. For each candidate read, all of the associated hits (read block) for the read are evaluated and a best representative hit is chosen. It is during this step that the pipeline writes sanity files which describe the pass/fail status for all of the hits evaluated. A TSV file is written for the chosen viral best hits.
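The read-block idea in these two steps can be pictured with itertools.groupby over a name-sorted tabular file: all hits for one read id arrive together and a best representative is chosen. The criterion below (highest bit score, column 12 of BLAST tabular output) is an assumption for illustration, not necessarily the pipeline’s actual best-hit rule:

    import csv
    from itertools import groupby

    def best_hits(sorted_tsv):
        """Yield one best hit per read from a name-sorted BLAST-tabular file (illustrative)."""
        with open(sorted_tsv) as fh:
            rows = csv.reader(fh, delimiter="\t")
            # groupby relies on the file being sorted by read id (column 1),
            # so each group is one read block of hits.
            for read_id, block in groupby(rows, key=lambda row: row[0]):
                hits = list(block)
                # Assumed criterion: keep the hit with the highest bit score (column 12).
                yield read_id, max(hits, key=lambda row: float(row[11]))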
The pipeline now generates a report for hits to “other sequences” taxonomy based on the sanity reports from step #32.
The pipeline now generates a report for hits to unknown taxonomy based on the sanity reports from step #32.
The pipeline now generates a report for hits to “other sequences” taxonomy based on the sanity reports from step #33.
The pipeline now generates a report for hits to unknown taxonomy based on the sanity reports from step #33.
In this step, the nucleotide alignment SAM files from step #32 that contain the best viral hits against NCBI nt are combined into a data matrix that includes associated taxonomy for reporting purposes.
In this step, the translated nucleotide alignment TSV files from step #33 that contain the best viral hits against NCBI nr are combined into a data matrix that includes associated taxonomy for reporting purposes.
This step generates a report file that describes viral read counts based on nucleotide alignments at genus, species, and full lineage levels.
This step generates a report file that describes viral read counts based on translated nucleotide alignments at genus, species, and full lineage levels.
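A sketch of how such a count report could be assembled from a best-hit matrix that already carries genus, species, and lineage columns; the column names and file layout are assumptions (the matrix is assumed to have a header row):

    import csv
    from collections import Counter

    def count_report(matrix_tsv, out_tsv, level_column):
        """Count best-hit reads per taxon at one level (e.g. a 'genus' or 'species' column)."""
        counts = Counter()
        with open(matrix_tsv) as fh:
            for row in csv.DictReader(fh, delimiter="\t"):
                counts[row[level_column]] += 1
        with open(out_tsv, "w", newline="") as out:
            writer = csv.writer(out, delimiter="\t")
            writer.writerow([level_column, "read_count"])
            for taxon, n in counts.most_common():
                writer.writerow([taxon, n])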
The report from step #40 is copied into the main output directory for review.
The report from step #41 is copied into the main output directory for review.
The reports from steps #34 and #35 are copied into the main output directory for review.
The reports from steps #36 and #37 are copied into the main output directory for review.