Overview

What is VAPER?
Inputs
Reference Selection
1. Automated Reference Selection
  1. Reference Selection Modes
  2. Reference Set Filtering
    1. Sample-Level Filters
    2. Run-Level Filters
2. Manual Reference Selection
  1. Sample-Level Reference Files
  2. Run-Level Reference Files
Genome Assembly
Metagenomic Classification
1. Example Outputs

What is VAPER?

Flowchart

Key Features

VAPER (Viral Assembly from Probe-based Enrichment) is a viral (meta-)assembly pipeline that can:

🧬 Assemble genomes from complex samples, supporting multiple assemblies per sample (e.g., co-infections)
🧬 Automatically detect and select reference genomes
🧬 Predict the taxonomy of each assembly, with an optional viral metagenomic summary
🧬 Export reads associated with each assembly for downstream use

While VAPER was originally designed for hybrid capture data, it has also been used with shotgun metagenomic and tile-amplicon data. It comes stock with a comprehensive reference set of viral taxa, including all species targeted by the Illumina VSP v2.0 panel (see the full list here). Keep on reading to learn more!

Contributors

VAPER was originally created by the Washington State Department of Health (WA DOH) as part of the Pathogen Genomics Center of Excellence (PGCoE). Check out the links below to learn more:

Inputs

Read Downsampling

Reads can be optionally downsampled using the --max_reads parameter (default 2_000_000). This is accomplished using seqtk (https://github.com/lh3/seqtk) and is primarily intended to control workflow efficiency. That said, this improved efficiency comes with a potential loss of sensitivity, particularly during the reference selection stage. If you anticipate that your target organism is present in a sample at low relative read depth, it is recommended that you use --ref_mode sensitive, which increases max_reads to 2 trillion reads (among other things).
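To make "downsampling to a fixed number of reads" concrete, here is a minimal sketch that keeps a uniform random subset of read pairs using reservoir sampling. It is not seqtk's or VAPER's code: the function name, the in-memory list of pairs, the seed, and the choice to count pairs rather than individual reads are all assumptions made for illustration.

```python
import random

def downsample_pairs(read_pairs, max_reads=2_000_000, seed=11):
    """Keep at most max_reads read pairs, chosen uniformly at random.

    Hypothetical helper for illustration only; VAPER delegates this to seqtk.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, pair in enumerate(read_pairs):
        if i < max_reads:
            # Fill the reservoir with the first max_reads pairs.
            reservoir.append(pair)
        else:
            # Replace an existing pair with probability max_reads / (i + 1).
            j = rng.randint(0, i)
            if j < max_reads:
                reservoir[j] = pair
    return reservoir

# Ten made-up read pairs, downsampled to four.
pairs = [(f"read{i}/1", f"read{i}/2") for i in range(10)]
subset = downsample_pairs(pairs, max_reads=4)
print(len(subset))  # 4
```

Mates are sampled together so that the R1 and R2 files stay in sync. A low-abundance target may lose most of its reads this way, which is the sensitivity trade-off described above.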

Human Read Scrubbing

Human reads can be optionally removed using the --scrub_reads parameter (default false). This is accomplished using the SRA Human Read Scrubber. Scrubbed reads can be found in each sample output directory: ${outdir}/${sample}/reads/

Read Quality

Read quality is evaluated and managed using FastQC and fastp. Fastp metrics are reported in the final summary. FastQC metrics can be found in the MultiQC report.

Downloading Reads from SRA

VAPER can download paired-end reads from the NCBI Sequence Read Archive (SRA) (see example below).

samplesheet.csv

sample,sra
sample01,SRR28460430
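
If you have many accessions, a samplesheet in this format can be generated with a few lines of Python. This is only a convenience sketch: the helper name is made up, and only the sample and sra columns are taken from the example above.

```python
import csv
import io

def build_sra_samplesheet(accessions):
    """Return samplesheet text with a sample column and an sra column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sample", "sra"])
    for sample, accession in accessions.items():
        writer.writerow([sample, accession])
    return buffer.getvalue()

sheet = build_sra_samplesheet({"sample01": "SRR28460430"})
print(sheet, end="")
```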

Reference Selection

VAPER can automatically select references for you and/or you can tell VAPER which references to use. Learn more about how to adjust reference selection parameters here. VAPER comes stock with a default reference set that includes thousands of viral species (see the full list here).

Automated Reference Selection

VAPER comes with multiple automated reference selection modes: --ref_mode ( standard | sensitive | kitchen-sink ). standard and sensitive modes use a reference set supplied using the --ref_set parameter. kitchen-sink mode downloads references from NCBI using the metagenomic summary.

Learn more about how reference sets are created here

Reference Selection Modes

Standard Mode

standard reference selection mode works as follows:

A de novo assembly is created using Shovill

Contigs are mapped to the reference set using Minimap2 (reference filters are applied at this stage - see next section)

References covered by at least one contig are compared to each other with Sourmash and then clustered using DBSCAN and the nucleotide divergence threshold set by ref_dist.

The reference with the greatest sample coverage is selected for each cluster and returned if its genome fraction meets the minimum set by ref_genfrac.
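
The last two steps can be sketched in plain Python. This is a simplified model, not VAPER's implementation: connected components at the distance threshold stand in for DBSCAN (assuming every reference may seed a cluster), genome fraction stands in for sample coverage, and the reference names, distances, and the ref_dist value of 0.1 are made up for the example.

```python
def cluster_refs(names, dist, ref_dist):
    """Group references whose pairwise distance is <= ref_dist.

    Connected components stand in for DBSCAN here (an assumption).
    """
    clusters = []
    unvisited = set(names)
    for name in names:
        if name not in unvisited:
            continue
        unvisited.discard(name)
        stack, members = [name], [name]
        while stack:
            current = stack.pop()
            near = sorted(n for n in unvisited
                          if dist[frozenset((current, n))] <= ref_dist)
            for n in near:
                unvisited.discard(n)
                members.append(n)
                stack.append(n)
        clusters.append(members)
    return clusters

def select_refs(genfrac, dist, ref_dist, ref_genfrac=0.5):
    """Pick the best-covered reference per cluster, then apply the minimum."""
    selected = []
    for members in cluster_refs(sorted(genfrac), dist, ref_dist):
        best = max(members, key=genfrac.get)
        if genfrac[best] >= ref_genfrac:
            selected.append(best)
    return selected

# Made-up genome fractions and pairwise distances for three references.
genfrac = {"fluA_seg4_a": 0.92, "fluA_seg4_b": 0.85, "rsv_a": 0.30}
dist = {
    frozenset(("fluA_seg4_a", "fluA_seg4_b")): 0.02,
    frozenset(("fluA_seg4_a", "rsv_a")): 0.90,
    frozenset(("fluA_seg4_b", "rsv_a")): 0.90,
}
print(select_refs(genfrac, dist, ref_dist=0.1))  # ['fluA_seg4_a']
```

The two closely related references collapse into one cluster and only the better-covered one is kept. The third reference forms its own cluster but is dropped because its genome fraction is below ref_genfrac.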

parameter	value
max_reads	2_000_000
ref_genfrac	0.5
ref_denovo_contigcov	10
ref_denovo_contiglen	300
ref_denovo_depth	30

Sensitive Mode

sensitive reference selection uses the same method as standard mode but with parameters adjusted to improve the detection of low-abundance targets. This increased sensitivity often comes at the cost of a large decrease in efficiency (run times may increase dramatically! ⚠️).

parameter	value
max_reads	2_000_000_000_000
ref_genfrac	0.1
ref_denovo_contigcov	0
ref_denovo_contiglen	0
ref_denovo_depth	1000

Kitchen-Sink Mode 🚽

kitchen-sink mode attempts to build assemblies for each taxon identified in the metagenomic summary. Genome assemblies associated with each taxon are downloaded from NCBI using the NCBI datasets tool. Only assemblies that are listed as complete are included and those containing multiple contigs are split into individual FASTA files to avoid concatenating segmented viruses. As you can imagine, this method of reference selection can be very unreliable and should therefore be used with caution ⚠️.
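
The splitting step can be pictured with the short sketch below. It is an assumption about the general idea, not VAPER's code: the function name and the toy two-segment input are hypothetical, and records are returned as strings rather than written to files.

```python
def split_fasta(text):
    """Split multi-record FASTA text into one FASTA string per record."""
    records = {}
    name = None
    for line in text.splitlines():
        if line.startswith(">"):
            # Use the first word of the header as the record name.
            name = line[1:].split()[0]
            records[name] = [line]
        elif name is not None and line.strip():
            records[name].append(line)
    return {n: "\n".join(lines) + "\n" for n, lines in records.items()}

# Toy two-segment genome with made-up sequences.
multi = ">seg1 PB2\nACGT\nAC\n>seg2 PB1\nGGCC\n"
parts = split_fasta(multi)
print(list(parts))  # ['seg1', 'seg2']
```

Keeping each contig as its own reference means that a segmented virus yields one assembly per segment rather than a single concatenated sequence.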

The accurate and fast reference selection modes used in VAPER v1.0 are now deprecated. standard mode is like accurate mode but with more bells and whistles 🔔.

Reference Set Filtering

You can specify which references to select from using the ref_* filter columns / parameters. These filters can be set in the samplesheet (applied per sample) or on the command line (applied to all samples). Below are examples of each method:

Sample-Level Filters

The example below shows how you would direct VAPER to only perform reference selection using Alphainfluenzavirus influenzae references for sample01. This filter would not be applied to sample02. You could likewise filter by taxon, segment, and reference name using the ref_taxon, ref_segment, and ref_name columns.

samplesheet.csv:

sample,fastq_1,fastq_2,ref_species
sample01,sample01_R1.fq.gz,sample01_R2.fq.gz,Alphainfluenzavirus influenzae
sample02,sample02_R1.fq.gz,sample02_R2.fq.gz,

Run-Level Filters

The example below shows how you would direct VAPER to only perform reference selection using Alphainfluenzavirus influenzae references for all samples on the run. You could filter by taxon, segment, or reference name using the --ref_taxon, --ref_segment, or --ref_name parameters.
nextflow run doh-jdj0303/vaper \
  -r v2.0 \
  -profile docker \
  --input samplesheet.csv \
  --outdir results \
  --ref_species "Alphainfluenzavirus influenzae"

The ref_taxon, ref_species, and ref_segment filters control which references are considered during the selection process, whereas references specified by the ref_name filter will always be included, regardless of whether there are reads to create an assembly.

Manual Reference Selection

References can also be supplied manually as individual file paths, with or without an existing reference set. Like the reference set filters, this can be accomplished via the samplesheet (sample level) or the command line (run level). Multiple references can be supplied as a semicolon-separated list. See below for examples:

Sample-Level Reference Files

samplesheet.csv:

sample,fastq_1,fastq_2,ref_file
sample01,sample01_R1.fq.gz,sample01_R2.fq.gz,/path/to/reference.fa.gz
sample02,sample02_R1.fq.gz,sample02_R2.fq.gz,
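
To show how a semicolon-separated ref_file value breaks down per sample, here is a small parsing sketch. It only illustrates the samplesheet format: the helper name and the two reference paths are hypothetical, and this is not how VAPER itself reads the file.

```python
import csv
import io

# Same layout as the example samplesheet, with two made-up reference paths.
samplesheet = """sample,fastq_1,fastq_2,ref_file
sample01,sample01_R1.fq.gz,sample01_R2.fq.gz,/path/to/ref_a.fa.gz;/path/to/ref_b.fa.gz
sample02,sample02_R1.fq.gz,sample02_R2.fq.gz,
"""

def refs_per_sample(text):
    """Map each sample to its list of manually supplied reference paths."""
    refs = {}
    for row in csv.DictReader(io.StringIO(text)):
        value = (row.get("ref_file") or "").strip()
        # An empty cell means no manual references for that sample.
        refs[row["sample"]] = [p for p in value.split(";") if p]
    return refs

print(refs_per_sample(samplesheet))
```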

Run-Level Reference Files

nextflow run doh-jdj0303/vaper \
  -r v2.0 \
  -profile docker \
  --input samplesheet.csv \
  --outdir results \
  --ref_file /path/to/reference.fa.gz

Genome Assembly

VAPER creates genome assemblies by aligning reads to one or more references and calling the consensus at each reference position. This is accomplished using BWA MEM, Samtools, and iVar.

VAPER v2.0 no longer supports the CDC IRMA assembler.

Learn more about how to adjust assembly parameters here.

Assembly Modes

VAPER comes with multiple default assembly modes. These modes primarily differ in how they handle reference positions with mixed read support.

The -t parameter used by iVar to control the minimum frequency threshold to make a consensus call does not behave as described in the documentation. Instead of controlling the frequency of bases needed to call a consensus, it actually controls the frequency of reads supported by the consensus, as described in this issue. Future versions of VAPER will work to address this limitation.

Standard Mode

standard mode returns the most common base observed among the reads (the plurality).

parameter	value
cons_allele_qual	20
cons_allele_ratio	0.0
cons_allele_depth	10
cons_max_depth	100

Mixed Mode

mixed mode includes minor allelic variation in the form of mixed IUPAC codes. This mode may be useful for environmental or pooled samples. The minimum and maximum read depth parameters are increased to account for multiple nucleotide calls at each position.

parameter	value
cons_allele_qual	20
cons_allele_ratio	0.8
cons_allele_depth	20
cons_max_depth	200
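
The difference between the two modes at a single position can be sketched as follows. This is a simplified model and not iVar's implementation: it assumes cons_allele_ratio acts as the minimum fraction of reads the call must account for and cons_allele_depth as the minimum depth, and it ignores base quality (cons_allele_qual) and the depth cap (cons_max_depth). The read counts are made up.

```python
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}

def call_position(counts, allele_ratio=0.0, allele_depth=10):
    """Call one consensus base from per-base read counts at a position."""
    depth = sum(counts.values())
    if depth == 0 or depth < allele_depth:
        return "N"  # too few reads to make a call
    # Most common base first; ties broken alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    chosen, covered = [], 0
    for base, n in ranked:
        chosen.append(base)
        covered += n
        # Stop once the chosen bases account for enough of the reads.
        if covered / depth >= allele_ratio:
            break
    return IUPAC[frozenset(chosen)]

site = {"A": 70, "G": 30}
print(call_position(site, allele_ratio=0.0, allele_depth=10))  # A
print(call_position(site, allele_ratio=0.8, allele_depth=20))  # R
```

With a ratio of 0.0 the loop stops after the first base, which gives the plurality call. With 0.8, the most common base alone covers only 70% of reads, so the second base is added and the pair is reported as the IUPAC code R.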

You can remove terminal Ns from each assembly using --cons_prune_termini. This is required when submitting some viral species to NCBI.

Assembly Quality

Assembly quality is evaluated using a custom script, vaper_stats.py. Quality metrics are reported relative to the reference genome used to create the assembly. Assemblies are automatically classified as PASS or FAIL based on the QC thresholds set using the --qc_depth and --qc_genfrac parameters.
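
As a rough picture of the PASS/FAIL decision, the sketch below applies both thresholds to one assembly. It is not taken from vaper_stats.py: the function name, the rule that both metrics must meet their thresholds, and the threshold values are assumptions, not VAPER defaults.

```python
def classify_assembly(mean_depth, genome_fraction, qc_depth, qc_genfrac):
    """Return PASS only when both metrics meet their thresholds (assumed rule)."""
    passed = mean_depth >= qc_depth and genome_fraction >= qc_genfrac
    return "PASS" if passed else "FAIL"

# Threshold values are placeholders, not VAPER defaults.
print(classify_assembly(250.0, 0.97, qc_depth=30, qc_genfrac=0.8))  # PASS
print(classify_assembly(12.0, 0.97, qc_depth=30, qc_genfrac=0.8))   # FAIL
```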

Assemblies will still be saved if they fail QC!

Condensing Duplicate Assemblies

VAPER will occasionally produce multiple, near-identical assemblies (often >99.9% identity). This generally occurs when fragmented contigs from the de novo assembly map to multiple, closely related references. These assemblies are identified using a simple clustering approach (Sourmash + DBSCAN) and the minimum distance set by --cons_condist. Only the best assembly is returned, based on read coverage × depth.
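
The "best assembly" rule can be illustrated with a few lines of Python. Cluster labels are taken as given here (in VAPER they come from Sourmash + DBSCAN), and the assembly names, coverage values, and depths are made up. Treat this as a sketch of the scoring idea rather than the pipeline's code.

```python
def condense(assemblies, clusters):
    """Keep the assembly with the highest coverage x depth in each cluster."""
    best = {}
    for name, (coverage, depth) in assemblies.items():
        cluster = clusters[name]
        score = coverage * depth
        if cluster not in best or score > best[cluster][1]:
            best[cluster] = (name, score)
    return sorted(name for name, _ in best.values())

# (coverage, depth) per assembly; asm_a and asm_b are near-identical.
assemblies = {"asm_a": (0.99, 500.0), "asm_b": (0.97, 480.0), "asm_c": (0.90, 60.0)}
clusters = {"asm_a": 0, "asm_b": 0, "asm_c": 1}
print(condense(assemblies, clusters))  # ['asm_a', 'asm_c']
```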

Metagenomic Classification

VAPER performs an optional viral metagenomic analysis using sourmash gather and sourmash tax metagenome with the 21-mer viral NCBI database (Jan 2025). You can skip the metagenomic analysis using --metagenome false or supply alternative database files using --sm_db and --sm_taxa.

The static image and summary exclude unclassified sequences when calculating relative abundance. Sequences with relative abundance ≤ 1% are grouped into Other.
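
The effect of these two rules on the reported percentages can be sketched as follows. The function name, the read counts, and the "rare virus" taxon are made up for illustration, and this is not the code VAPER uses to build its summary.

```python
def summarize(abundance, min_pct=1.0):
    """Rescale classified taxa to 100% and group minor taxa into Other."""
    classified = {t: v for t, v in abundance.items() if t != "unclassified"}
    total = sum(classified.values())
    if total == 0:
        return {}
    summary, other = {}, 0.0
    for taxon, value in classified.items():
        pct = 100.0 * value / total
        if pct <= min_pct:
            other += pct  # taxa at or below the cutoff are pooled
        else:
            summary[taxon] = pct
    if other > 0:
        summary["Other"] = other
    return summary

# Made-up counts; half of the sequences are unclassified.
raw = {"Influenza A virus": 900, "Escherichia virus P1": 99,
       "rare virus": 1, "unclassified": 1000}
for taxon, pct in summarize(raw).items():
    print(f"{pct:.1f}% {taxon}")
```

Because unclassified sequences are dropped before rescaling, the percentages describe the classified fraction only. A taxon reported at 90% here makes up 45% of all sequences in this toy input.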

Example Outputs

Below are examples of the metagenomic output for a Twist Bioscience synthetic RNA control for Influenza A H1N1.

Krona Plot

Explore the interactive plot here.

Static Image

VAPER Summary

SPECIES_SUMMARY
91.2% Influenza A virus; 6.4% Escherichia virus P1; 6.4% uncultured human fecal virus