Home > Pipeline Overview > Reports and Output Files

Reports and Output Files

Top-Level Structure

When the ViroMatch pipeline finishes, all output will be in the directory specified by the --output argument. Contents of this area will look like the following.

CONFIG.yaml
REPORT.nuc_ambiguous_counts.txt
REPORT.nuc_counts.txt
REPORT.trans_nuc_ambiguous_counts.txt
REPORT.trans_nuc_counts.txt
Snakefile
cmd.sh
stats.log
steps.txt
viromatch_results/

Here is a breakdown of this area.

File/Directory	Description
.snakemake	This directory contains information written by Snakemake during its execution.
CONFIG.yaml	This file contains all of the configuration information required to run ViroMatch and is used directly by Snakmake to execute the pipeline. Command line parameters are captured here.
REPORT.nuc_ambiguous_counts.txt	Final virus read counts and taxonomy based on nucleotide reference mappings. See below for detailed explanation.
REPORT.nuc_counts.txt	Final virus read counts and taxonomy based on translated nucleotide reference mappings. See below for detailed explanation.
REPORT.trans_nuc_ambiguous_counts.txt	Rejected hits/taxonomy based on nucleotide reference mappings. See below for detailed explanation
REPORT.trans_nuc_counts.txt	Rejected hits/taxonomy based on translated nucleotide reference mappings. See below for detailed explanation.
Snakefile	The Snakemake file that executes the steps in the ViroMatch pipeline. Can also be executed using Snakemake directly.
cmd.sh	The shell script that ViroMatch uses to automatically execute the Snakemake pipeline.
stats.log	Upon completion of the pipeline, Snakemake will write a benchmark file outlining runtime per pipeline step.
steps.txt	The rules (steps) in the pipeline, as executed by Snakemake.
viromatch_results/	This directory contains all of the ancillary directories and files ViroMatch writes during pipeline processing.

The viromatch_results/ Directory

This directory contains all of the ancillary directories and files ViroMatch writes during pipeline processing. Each rule (step) in the pipeline will have its own directory written here. The directory structure of a successful run will looks as follows.

viromatch_results/.viromatch/
viromatch_results/blank_eval_filter_low_complexity/
viromatch_results/blank_eval_validate_trans_nuc_nr/
viromatch_results/blank_eval_viral_trans_nuc/
viromatch_results/copy_nuc_ambiguous_report/
viromatch_results/copy_nuc_nt_report/
viromatch_results/copy_trans_nuc_ambiguous_report/
viromatch_results/copy_trans_nuc_nr_report/
viromatch_results/filter_low_complexity_fastq_files/
viromatch_results/host_screen_mapping/
viromatch_results/host_screen_write_unmapped_bam/
viromatch_results/host_screen_write_unmapped_fastq/
viromatch_results/nuc_nt_best_hit_count_prep/
viromatch_results/nuc_nt_best_hit_counts/
viromatch_results/nuc_nt_best_hit_filter_sam/
viromatch_results/nuc_nt_otherseq_hit_report/
viromatch_results/nuc_nt_unknown_hit_report/
viromatch_results/prep_fastq_files/
viromatch_results/trans_nuc_nr_best_hit_count_prep/
viromatch_results/trans_nuc_nr_best_hit_counts/
viromatch_results/trans_nuc_nr_best_hit_filter_tsv/
viromatch_results/trans_nuc_nr_otherseq_hit_report/
viromatch_results/trans_nuc_nr_unknown_hit_report/
viromatch_results/trim_fastq_files/
viromatch_results/validate_nuc_nt_mapping/
viromatch_results/validate_nuc_nt_merge_r1_mapped_sam/
viromatch_results/validate_nuc_nt_merge_r2_mapped_sam/
viromatch_results/validate_nuc_nt_write_mapped_sam/
viromatch_results/validate_nuc_nt_write_merged_unmapped_fastq/
viromatch_results/validate_nuc_nt_write_r1_unmapped_ids/
viromatch_results/validate_nuc_nt_write_r2_unmapped_ids/
viromatch_results/validate_nuc_nt_write_unmapped_sam/
viromatch_results/validate_trans_nuc_nr_mapping/
viromatch_results/validate_trans_nuc_nr_mapping_daa_to_tsv/
viromatch_results/validate_trans_nuc_nr_merge_r1_mapped_tsv/
viromatch_results/validate_trans_nuc_nr_merge_r2_mapped_tsv/
viromatch_results/viral_mapped_fastq_merge/
viromatch_results/viral_nuc_mapping/
viromatch_results/viral_nuc_write_mapped_bam/
viromatch_results/viral_nuc_write_mapped_fastq/
viromatch_results/viral_nuc_write_unmapped_bam/
viromatch_results/viral_nuc_write_unmapped_fastq/
viromatch_results/viral_trans_nuc_daa_to_tsv/
viromatch_results/viral_trans_nuc_extract_mapped_ids/
viromatch_results/viral_trans_nuc_mapping/
viromatch_results/viral_trans_nuc_write_mapped_fastq/

ViroMatch writes several internal files in the .viromatch directory during processing. Information collected here includes benchmark files for individual steps and log files for steps that run executables that generate their own output.

As ViroMatch progresses through pipeline execution, each step will create an underlying directory based on the step’s name and processing specific to the step will occur in this area. For example, if we wanted to see the exact commands related to the viral_nuc_mapping step, we would look in the viromatch_results/viral_nuc_mapping/ directory.

viromatch_results/viral_nuc_mapping/INPUT.r1.viral.sam.cmd
viromatch_results/viral_nuc_mapping/INPUT.r2.viral.sam.cmd

There are two shell scripts, one for R1 reads and another for R2 reads, that run the specific commands to align FASTQ to the viral nucleotide reference database using BWA-MEM.

cat viromatch_results/viral_nuc_mapping/INPUT.r1.viral.sam.cmd

Results:

bwa mem /viralfna/2014_12_29_complete_viral_genomes.fasta viromatch_results/host_screen_write_unmapped_fastq/INPUT.r1.host.unmapped.fastq > viromatch_results/viral_nuc_mapping/INPUT.r1.viral.sam 2> viromatch_results/.viromatch/log/INPUT.r1.viral.sam.log

cat viromatch_results/viral_nuc_mapping/INPUT.r2.viral.sam.cmd

Results:

bwa mem /viralfna/2014_12_29_complete_viral_genomes.fasta viromatch_results/host_screen_write_unmapped_fastq/INPUT.r2.host.unmapped.fastq > viromatch_results/viral_nuc_mapping/INPUT.r2.viral.sam 2> viromatch_results/.viromatch/log/INPUT.r2.viral.sam.log

All ViroMatch processing is done by Snakemake executing individual shell scripts along the pipeline. Therefore, it is relatively easy to see exactly what commands are being run throughout the pipeline. Once a run is finished, you can list all of the underlying shell scripts with the following command.

find viromatch_results/* -type f | grep '.cmd$'

To see the order of execution for the shell commands, you will just need to look at the Snakemake log files that were generated during pipeline execution.

ls .snakemake/log/*

.snakemake/log/2020-10-27T183639.808528.snakemake.log
.snakemake/log/2020-10-27T183640.109481.snakemake.log

One of the logs simply lists the rules/steps in the pipeline in order of execution defined by Snakemake, the other log is written during actual pipeline line execution. You will see the order of execution here for each rule, the associated directory being writte under viromatch_results/, input and output files for the rule, and the specific shell script being used for execution.

Sanity Files

By default, ViroMatch removes temporary files generated during execution when they are no longer needed for downstream processing. This is done to save disk space, as many of the temporary files can be large in size. Therefore, not all of the files generated during processing will be under viromatch_results/ unless the user specifies the --keep argument when executing the pipeline.

Using the --keep switch when executing the pipeline retains all of the files generated during processing. There are several useful “sanity” files that are generated within the pipeline , but be warned they can be very large in size!

While retaining all of the pipeline files provides additional information for every command executed in the pipeline, some files are more important than others in reviewing pipeline decisions. Of particular interest are the pass/fail sanity files. These files provide the pass/fail status for every read evaluated in the pipeline, including the reason why a read might fail. While the read count report files provide viral hit/taxonomy counts, the pass/fail sanity files provide information on why or why not a read was counted.

The pass/fail sanity files are located here.

viromatch_results/nuc_nt_best_hit_filter_sam/INPUT.r1.validate.nuc.mapped.filter.pass.sam.log
viromatch_results/nuc_nt_best_hit_filter_sam/INPUT.r1.validate.nuc.mapped.filter.pass.sam.log.unknown
viromatch_results/nuc_nt_best_hit_filter_sam/INPUT.r1.validate.nuc.mapped.filter.pass.sam.log.otherseq

viromatch_results/nuc_nt_best_hit_filter_sam/INPUT.r2.validate.nuc.mapped.filter.pass.sam.log
viromatch_results/nuc_nt_best_hit_filter_sam/INPUT.r2.validate.nuc.mapped.filter.pass.sam.log.unknown
viromatch_results/nuc_nt_best_hit_filter_sam/INPUT.r2.validate.nuc.mapped.filter.pass.sam.log.otherseq

viromatch_results/trans_nuc_nr_best_hit_filter_tsv/INPUT.r1.validate.trans.nuc.mapped.filter.pass.tsv.log
viromatch_results/trans_nuc_nr_best_hit_filter_tsv/INPUT.r1.validate.trans.nuc.mapped.filter.pass.tsv.log.unknown
viromatch_results/trans_nuc_nr_best_hit_filter_tsv/INPUT.r1.validate.trans.nuc.mapped.filter.pass.tsv.log.otherseq

viromatch_results/trans_nuc_nr_best_hit_filter_tsv/INPUT.r2.validate.trans.nuc.mapped.filter.pass.tsv.log
viromatch_results/trans_nuc_nr_best_hit_filter_tsv/INPUT.r2.validate.trans.nuc.mapped.filter.pass.tsv.log.unknown
viromatch_results/trans_nuc_nr_best_hit_filter_tsv/INPUT.r2.validate.trans.nuc.mapped.filter.pass.tsv.log.otherseq

There are sanity files for both nucleotide mapping and translated nucleotide mapping, both broken down into R1 and R2 reads.

The *.unknown and *.otherseq sanity files partition ambiguous hits that have been removed from consideration. See Ambiguous Counts for more details.

Sanity Examples

Using INPUT.r1.validate.nuc.mapped.filter.pass.sam.log as an example, the associated fields in the file are as follows.

Field	Description
pass/fail	Pass/fail status for the hit. If passed, the read/hit was counted as a viral identity.
code	Discrete pass/fail code, associated with the best hit logic in the pipeline.
read block size	How many hits per read were considered for the read.
read id	Associated sequence id for the read (from uBAM or FASTQ files).
comment	Comment related to the pass/fail status.
pid	Percent identity variance of the read (query) as compared to the reference hit (subject).
acc id	Accession id of the reference hit.
species	Species of the reference hit.
lineage	Full lineage of the reference hit.

An example of a single failed hit.

Field	Description
pass/fail	FAIL
code	TIED BEST HIT
read block size	2
read id	D00170:57:CA2R8ANXX:4:2315:8200:2670
comment	failed best hit (tied)
pid	0.0397
acc id	KF294862.1
species	Gyrovirus Tu789
lineage	Viruses –> Anelloviridae –> Gyrovirus –> unclassified Gyrovirus –> Gyrovirus Tu789

In the above example, the sequence read D00170:57:CA2R8ANXX:4:2315:8200:2670 had a hit to the KF294862.1 reference genome (Gyrovirus Tu789) with 3.97% variance. While this is an acceptable hit — e.g. hits a known virus with acceptable percent identity — there are 2 hits associated with this read. The other hit also was equally acceptable. In such cases, when hits are equivalent, the pipeline randomly chooses the best hit. As indicated, this hit was (randomly) failed and the other hit was the best hit for the read.

Let’s look at a more complicated example. Here is a read block (i.e. all the hits being evaluated for a single read).

Pass/Fail	Code	Read Block Size	Read ID	Comment	PIDV	Accc ID	Species	Lineage
PASS	BEST HIT	8	A00584:317:HG3LNDSXY:4:1404:2465:7639	best hit	0.0066	MH648893.1	Anelloviridae sp.	Viruses –> Anelloviridae –> unclassified Anelloviridae –> Anelloviridae sp.
FAIL	NEIGHBOR	8	A00584:317:HG3LNDSXY:4:1404:2465:7639	viral neighbor hit	0.0199	KM593803.2	SEN virus	Viruses –> Anelloviridae –> unclassified Anelloviridae –> SEN virus
FAIL	SECONDARY NN	8	A00584:317:HG3LNDSXY:4:1404:2465:7639	secondary, non-neighbor hit	0.0993	FM882010.1	Torque teno virus	Viruses –> Anelloviridae –> unclassified Anelloviridae –> Torque teno virus
FAIL	SECONDARY NN	8	A00584:317:HG3LNDSXY:4:1404:2465:7639	secondary, non-neighbor hit	0.1523	AY206683.1	SEN virus	Viruses –> Anelloviridae –> unclassified Anelloviridae –> SEN virus
FAIL	SECONDARY NN	8	A00584:317:HG3LNDSXY:4:1404:2465:7639	secondary, non-neighbor hit	0.2914	AB059353.1	SEN virus	Viruses –> Anelloviridae –> unclassified Anelloviridae –> SEN virus
FAIL	SECONDARY NN	8	A00584:317:HG3LNDSXY:4:1404:2465:7639	secondary, non-neighbor hit	0.5497	GQ179972.1	SEN virus	Viruses –> Anelloviridae –> unclassified Anelloviridae –> SEN virus
FAIL	SECONDARY NN	8	A00584:317:HG3LNDSXY:4:1404:2465:7639	secondary, non-neighbor hit	0.5686	MK820646.1	Torque teno virus	Viruses –> Anelloviridae –> unclassified Anelloviridae –> Torque teno virus
FAIL	SECONDARY NN	8	A00584:317:HG3LNDSXY:4:1404:2465:7639	secondary, non-neighbor hit	0.6291	AB856070.1	SEN virus	Viruses –> Anelloviridae –> unclassified Anelloviridae –> SEN virus

There are 8 total hits for the read, all hitting viral taxonomy; therefore, the best hit is chosen based on the best percent identity variance (pidv) value (0.0066 for MH648893.1). Note, the second best hit has a pidv value of 0.0199 to a viral identity, within the the default --pidprox value of 0.04 for “neighbor” status, so the failure code is labeled NEIGHBOR. The other viral hits are too distant from the --pidprox value, so their failure codes are SECONDARY NN for secondary, non-neighbor hits.

In the next example, we have a read block with 13 total hits.

Pass/Fail	Code	Read Block Size	Read ID	Comment	PIDV	Accc ID	Species	Lineage
FAIL	RB AMBIGUITY	13	A00584:317:HG3LNDSXY:4:2646:2248:14935	significant ambiguous, non-viral hit	0.007	LT908445.1	Spodoptera aff. frugiperda 2 RZ-2014	cellular organisms –> Eukaryota –> Opisthokonta –> Metazoa –> Eumetazoa –> Bilateria –> Protostomia –> Ecdysozoa –> Panarthropoda –> Arthropoda –> Mandibulata –> Pancrustacea –> Hexapoda –> Insecta –> Dicondylia –> Pterygota –> Neoptera –> Holometabola –> Amphiesmenoptera –> Lepidoptera –> Glossata –> Neolepidoptera –> Heteroneura –> Ditrysia –> Obtectomera –> Noctuoidea –> Noctuidae –> Amphipyrinae –> Spodoptera –> Spodoptera aff. frugiperda 2 RZ-2014
FAIL	RB AMBIGUITY	13	A00584:317:HG3LNDSXY:4:2646:2248:14935	significant ambiguous, non-viral hit	0.007	M63414.1	Orgyia pseudotsugata single capsid nuclopolyhedrovirus	Viruses –> Baculoviridae –> Alphabaculovirus –> unclassified Alphabaculovirus –> Orgyia pseudotsugata single capsid nuclopolyhedrovirus
FAIL	RB AMBIGUITY	13	A00584:317:HG3LNDSXY:4:2646:2248:14935	significant ambiguous, non-viral hit	0.007	U75930.2	Orgyia pseudotsugata multiple nucleopolyhedrovirus	Viruses –> Baculoviridae –> Alphabaculovirus –> Orgyia pseudotsugata multiple nucleopolyhedrovirus
FAIL	RB AMBIGUITY	13	A00584:317:HG3LNDSXY:4:2646:2248:14935	significant ambiguous, non-viral hit	0.0141	KP747440.1	Dasychira pudibunda nucleopolyhedrovirus	Viruses –> Baculoviridae –> Alphabaculovirus –> Dasychira pudibunda nucleopolyhedrovirus
FAIL	RB AMBIGUITY	13	A00584:317:HG3LNDSXY:4:2646:2248:14935	significant ambiguous, non-viral hit	0.4859	EF207986.1	Antheraea pernyi nucleopolyhedrovirus	Viruses –> Baculoviridae –> Alphabaculovirus –> Antheraea pernyi nucleopolyhedrovirus
FAIL	RB AMBIGUITY	13	A00584:317:HG3LNDSXY:4:2646:2248:14935	significant ambiguous, non-viral hit	0.4859	KY979487.1	Antheraea pernyi nucleopolyhedrovirus	Viruses –> Baculoviridae –> Alphabaculovirus –> Antheraea pernyi nucleopolyhedrovirus –> Antheraea proylei nucleopolyhedrovirus
FAIL	RB AMBIGUITY	13	A00584:317:HG3LNDSXY:4:2646:2248:14935	significant ambiguous, non-viral hit	0.4859	LC194889.1	Antheraea pernyi nucleopolyhedrovirus	Viruses –> Baculoviridae –> Alphabaculovirus –> Antheraea pernyi nucleopolyhedrovirus
FAIL	RB AMBIGUITY	13	A00584:317:HG3LNDSXY:4:2646:2248:14935	significant ambiguous, non-viral hit	0.4859	LC375537.1	Antheraea yamamai nucleopolyhedrovirus	Viruses –> Baculoviridae –> unclassified Baculoviridae –> Antheraea yamamai nucleopolyhedrovirus
FAIL	RB AMBIGUITY	13	A00584:317:HG3LNDSXY:4:2646:2248:14935	significant ambiguous, non-viral hit	0.4859	MH797002.1	Antheraea pernyi nucleopolyhedrovirus	Viruses –> Baculoviridae –> Alphabaculovirus –> Antheraea pernyi nucleopolyhedrovirus –> Antheraea proylei nucleopolyhedrovirus
FAIL	RB AMBIGUITY	13	A00584:317:HG3LNDSXY:4:2646:2248:14935	significant ambiguous, non-viral hit	0.6831	KJ631623.1	Condylorrhiza vestigialis MNPV	Viruses –> Baculoviridae –> Alphabaculovirus –> unclassified Alphabaculovirus –> Condylorrhiza vestigialis MNPV
FAIL	RB AMBIGUITY	13	A00584:317:HG3LNDSXY:4:2646:2248:14935	significant ambiguous, non-viral hit	0.7746	AF368905.1	Anticarsia gemmatalis nucleopolyhedrovirus	Viruses –> Baculoviridae –> Alphabaculovirus –> unclassified Alphabaculovirus –> Anticarsia gemmatalis nucleopolyhedrovirus
FAIL	RB AMBIGUITY	13	A00584:317:HG3LNDSXY:4:2646:2248:14935	significant ambiguous, non-viral hit	0.7746	DQ813662.2	Anticarsia gemmatalis multiple nucleopolyhedrovirus	Viruses –> Baculoviridae –> Alphabaculovirus –> Anticarsia gemmatalis multiple nucleopolyhedrovirus
FAIL	RB AMBIGUITY	13	A00584:317:HG3LNDSXY:4:2646:2248:14935	significant ambiguous, non-viral hit	0.7746	MG746625.1	Anticarsia gemmatalis multiple nucleopolyhedrovirus	Viruses –> Baculoviridae –> Alphabaculovirus –> Anticarsia gemmatalis multiple nucleopolyhedrovirus

The hits above are a mixture of viral and non-viral identities. We have a tie for best pidv, namely 0.007 for hits to LT908445.1, M63414.1, and U75930.2. The LT908445.1 hit is to a non-viral species, therefore the entire read block is failed and the failure code RB AMBIGUITY for significant ambiguous, non-viral hit is applied to all of the hits.

The above examples walk through some of the failure codes encountered in the sanity files. See the Failure Codes section below for details regarding other hit failures.

Failure Codes

All possible pass/failure codes for the best hit logic portion of the pipeline are outlined below.

Pass/Fail Status	Code	Comment	Description
FAIL	NEIGHBOR	viral neighbor hit	The hit was viral in nature, the score was within the `--pidprox` or `--bitprox` value range, but another viral hit was chosen based on a better score.
FAIL	RB AMBIGUITY	significant ambiguous, non-viral hit	If the best hits (or any neighbor hits within the `--pidprox` or `--bitprox` value range) are not viruses, we fail the entire read block.
FAIL	RB PID	best score X is > Y	The best score in the read block is greater than the `--pid` value; we fail the entire read block. Specific to nucleotide alignments.
FAIL	SECONDARY NN	secondary, non-neighbor hit	The hit is failed because it is greater than the `--pidprox` or `--bitprox` value.
FAILED	TIED BEST HIT	failed best hit (tied)	The hit is viral and was tied for the best hit, but was not chosen during the random best hit selection.
IGNORED	OTHER SEQ	taxonomy matches ‘other sequences’, investigate	If a hit’s lineage matches NCBI’s other sequences category, we ignore the hit, but quantify and report how many hits are affected.
IGNORED	UNKNOWN TAXA	superkingdom is unknown, investigate	If a hit’s superkingdom is unknown or unclassified, we ignore the hit, but quantify and report how many hits are affected.
PASS	BEST HIT	best hit	The hit is viral and the single best (non-tied) hit.
PASS	RANDOM BEST HIT	randomly chosen best hit (tied)	The hit is viral and was tied for the best hit, and was randomly chosen within the tied best hits.

Taxonomy/Quantification Reports

The ultimate purpose of the ViroMatch pipeline is to review metagenomic sequence reads and report hits to known viruses. Upon completion, ViroMatch will provide reports detailing viral taxonomic classification and quantification. All report files are at the top-level of the --outdir directory provided in the execution command, specifically:

REPORT.nuc_counts.txt
REPORT.trans_nuc_counts.txt
REPORT.nuc_ambiguous_counts.txt
REPORT.trans_nuc_ambiguous_counts.txt

These reports are discussed in more detail below.

Nucleotide Counts

The REPORT.nuc_counts.txt file reports read identities to viruses as provided by the nucleotide mapping portions of the ViroMatch pipeline. Reads tallied in this report have been assessed by alignment to the viral-only reference database, validation of candidate viral reads against the NCBI nt reference database, taxonomic classification, and finally assessment by best hit logic. Only reads that have passed all of these steps are considered viral identities.

Download Example Nucleotide Counts Report

The nucleotide counts report file consists of the following sections:

Section	Description
Header	The header section captures the ViroMatch configuration for the process that produced the attached report information. Pipeline parameters are listed here.
Lineage (R1 + R2)	The lineage section provides a breakdown of viral read counts at the full lineage level — i.e. all taxonomic categories as provided by NCBI. Fields for this table are read count, percent of reads represented, and full lineage. A cumulative total is also provided. For this section, counts are derived from both read pairs (R1 & R2) combined.
Genus (R1 + R2)	The genus section provides a breakdown of viral read counts at genus-level taxonomy. Fields for this table are read count, percent of reads represented, and genus. A cumulative total is also provided. For this section, counts are derived from both read pairs (R1 & R2) combined. This is the same cumulative total as listed in the lineage section, but broken down by genus.
Species (R1 + R2)	The species section provides a breakdown of viral read counts at species-level taxonomy. Fields for this table are read count, percent of reads represented, and species. A cumulative total is also provided. For this section, counts are derived from both read pairs (R1 & R2) combined. This is the same cumulative total as listed in the lineage and genus sections, but broken down by species.
Lineage (R1)	The same information as outlined in the Lineage (R1 + R2) section, but representative of only the contributing R1 reads.
Genus (R1)	The same information as outlined in the Genus (R1 + R2) section, but representative of only the contributing R1 reads.
Species (R1)	The same information as outlined in the Species (R1 + R2) section, but representative of only the contributing R1 reads.
Lineage (R2)	The same information as outlined in the Lineage (R1 + R2) section, but representative of only the contributing R2 reads.
Genus (R2)	The same information as outlined in the Genus (R1 + R2) section, but representative of only the contributing R2 reads.
Species (R2)	The same information as outlined in the Species (R1 + R2) section, but representative of only the contributing R2 reads.

Translated Nucleotide Counts

The REPORT.trans_nuc_counts.txt file reports read identities to viruses as provided by the translated nucleotide mapping portions of the ViroMatch pipeline. Reads tallied in this report have been assessed by alignment to the viral-only translated reference database, validation of candidate viral reads against the NCBI nr reference database, taxonomic classification, and finally assessment by best hit logic. Only reads that have passed all of these steps are considered viral identities.

Download Example Translated Nucleotide Counts Report

Translated nucleotide viral identities are reads that failed to map significantly during the nucleotide mapping sections of the pipeline but have identity to translated nucleotide viral references; therefore, these counts are separate and distinct from those listed in the REPORT.trans_nuc_counts.txt report. Adding together the counts in the REPORT.nuc_counts.txt and REPORT.trans_nuc_counts.txt files gives the total viral read counts for a given sample. As reads are given the opportunity to map to nucleotide references before translated nucleotide references, translated nucleotide counts are often much lower in count when compared to nucleotide reference counts.

The translated nucleotide counts report file consists of the following sections:

Section	Description
Header	The header section captures the ViroMatch configuration for the process that produced the attached report information. Pipeline parameters are listed here.
Lineage (R1 + R2)	The lineage section provides a breakdown of viral translated read counts at the full lineage level — i.e. all taxonomic categories as provided by NCBI. Fields for this table are read count, percent of reads represented, and full lineage. A cumulative total is also provided. For this section, counts are derived from both read pairs (R1 & R2) combined.
Genus (R1 + R2)	The genus section provides a breakdown of viral translated read counts at genus-level taxonomy. Fields for this table are read count, percent of reads represented, and genus. A cumulative total is also provided. For this section, counts are derived from both read pairs (R1 & R2) combined. This is the same cumulative total as listed in the lineage section, but broken down by genus.
Species (R1 + R2)	The species section provides a breakdown of viral translated read counts at species-level taxonomy. Fields for this table are read count, percent of reads represented, and species. A cumulative total is also provided. For this section, counts are derived from both read pairs (R1 & R2) combined. This is the same cumulative total as listed in the lineage and genus sections, but broken down by species.
Lineage (R1)	The same information as outlined in the Lineage (R1 + R2) section, but representative of only the contributing R1 reads.
Genus (R1)	The same information as outlined in the Genus (R1 + R2) section, but representative of only the contributing R1 reads.
Species (R1)	The same information as outlined in the Species (R1 + R2) section, but representative of only the contributing R1 reads.
Lineage (R2)	The same information as outlined in the Lineage (R1 + R2) section, but representative of only the contributing R2 reads.
Genus (R2)	The same information as outlined in the Genus (R1 + R2) section, but representative of only the contributing R2 reads.
Species (R2)	The same information as outlined in the Species (R1 + R2) section, but representative of only the contributing R2 reads.

Ambiguous Counts

The REPORT.nuc_ambiguous_counts.txt and REPORT.trans_nuc_ambiguous_counts.txt files report ambiguous hit identities encountered by the nucleotide and translated nucleotide mapping portions of the ViroMatch pipeline.

Download Example Nucleotide Ambiguous Report Download Example Translated Nucleotide Ambiguous Report

In early tests of ViroMatch, we noticed that NCBI had questionable taxonomic classifications for some of their reference sequences, leading to misleading identities. Some reads were aligning to references that were: 1) reference sequences that NCBI designates as unclassified sequences; 2) other sequences consisting of cloning vector, synthetic constructs, or other artificial sequences.

When a read has a hit to a reference whose superkingdom is Unknown, we can’t evaluate if the hit is a virus or non-viral, so we ignore/skip the hit; however, we quantify and report how many hits are affected in this manner from all of the reviewed sequences. The other sequences category has reference entries with partial viral sequences submitted as part of cloning vectors etc., which can lead to false negatives during the best hit logic phase of the pipeline. If a hit’s lineage matches NCBI’s other sequences category, we ignore the hit, but quantify and report how many hits are affected. These categories are now partitioned during ViroMatch classification reports as uknown and ambiguous hit counts, respectively.

Ambiguous hit counts are reported for reference, but are never considered in the pipeline past the point of their initial taxonomic identification.

It is important to note that a single read can have hits to multiple references. During the best hit logic portion of the pipeline, we evaluate all of the hits for a single read as a read block, or all of the hits for the read. Removing hits for either other sequences or unknown taxonomy removes only those hits from the read block and leaves the other hits for further evaluation; therefore, ambiguous hit removal does not necessarily mean the read has been failed.

The ambiguous nucleotide hit counts report files consist of the following sections:

Section	Description
Other Sequences Header	The header section captures the ViroMatch configuration for the process that produced the attached report information. Pipeline parameters are listed here.
Other Sequences Lineage (R1 + R2)	The lineage section provides a breakdown of hit counts related to other sequences at the full lineage level — i.e. all taxonomic categories as provided by NCBI. Fields for this table are hit count, percent of reads represented, and full lineage. A cumulative total is also provided. For this section, counts are derived from both read pairs (R1 & R2) combined.
Other Sequences Lineage (R1)	The same information as outlined in the Other Sequences Lineage (R1 + R2) section, but representative of only the contributing R1 reads.
Other Sequences Lineage (R2)	The same information as outlined in the Other Sequences lineage (R1 + R2) section, but representative of only the contributing R2 reads.
Unknown Header	The header section captures the ViroMatch configuration for the process that produced the attached report information. Pipeline parameters are listed here.
Unknown Lineage (R1 + R2)	The lineage section provides a breakdown of hit counts related to unknown at the full lineage level — i.e. all taxonomic categories as provided by NCBI. Fields for this table are hit count, percent of reads represented, and full lineage. A cumulative total is also provided. For this section, counts are derived from both read pairs (R1 & R2) combined.
Unknown Lineage (R1)	The same information as outlined in the Unknown Lineage (R1 + R2) section, but representative of only the contributing R1 reads.
Unknown Lineage (R2)	The same information as outlined in the Unknown lineage (R1 + R2) section, but representative of only the contributing R2 reads.