Overview

  1. What is VAPER?
    1. Flowchart
    2. Key Features
    3. Contributors
  2. Inputs
    1. Read Quality
    2. Read Downsampling
    3. Downloading Reads from SRA
  3. Reference Selection
    1. Automated Reference Selection
      1. accurate mode
      2. fast mode
      3. kitchen-sink mode 🚽
    2. Manual Reference Selection
      1. Example: File path
      2. Example: Reference name
      3. Example: Multiple references
  4. Genome Assembly
    1. Assembly Parameters
    2. Assembly Quality
    3. Genome Assemblers
      1. ivar assembler
      2. irma assembler
    4. Condensing Duplicate Assemblies
  5. Metagenomic Classification
    1. Example Outputs
      1. Krona Plot
      2. Static Image
      3. VAPER Summary

What is VAPER?

Flowchart

Key Features

VAPER (Viral Assembly from Probe-based Enrichment) is a viral (meta-)assembly pipeline that can:

🧬 Assemble genomes from complex samples, supporting multiple assemblies per sample (e.g., co-infections)
🧬 Automatically detect and select reference genomes
🧬 Predict the taxonomy of each assembly, with an optional viral metagenomic summary
🧬 Export reads associated with each assembly for downstream use

While VAPER was originally designed for hybrid capture data (e.g., Illumina VSP or Twist CVRP), it has also been used with shotgun metagenomic and tile-amplicon data. It comes stock with comprehensive reference sets for [loading] taxa, including influenza A, SARS-CoV-2, and Monkeypox (full list). Keep on reading to learn more!

Contributors

VAPER was originally created by the Washington State Department of Health (WA DOH) as part of the Pathogen Genomics Center of Excellence (PGCoE). Check out the links below to learn more:

Inputs

VAPER performs basic quality control on the reads prior to reference selection and genome assembly. References are not quality controlled.

Read Quality

Read quality is evaluated and managed using FastQC and fastp. Fastp metrics are reported in the final summary. FastQC metrics can be found in the MultiQC report.

Read Downsampling

Reads can be optionally downsampled prior to read quality using the --max_reads parameter (default: 2000000 reads). This is accomplished using seqtk sample.

Downloading Reads from SRA

VAPER can download reads automatically from the NCBI Sequence Read Archive (SRA) (see example below).

samplesheet.csv

sample,sra
sample01,SRR28460430

These must be paired-end Illumina reads.

Reference Selection

VAPER can automatically select references for you and/or you can tell VAPER which references to use.  

Automated Reference Selection

VAPER comes with multiple automated reference selection modes (--refs_mode ( accurate | fast | kitchen-sink )). accurate and fast modes use a reference set supplied using the --refs parameter. kitchen-sink mode downloads references from NCBI using the metagenomic summary.

Learn more about how reference sets are created here

accurate mode

As the name implies, reference selection using accurate mode is more accurate but slower. References are selected by mapping contigs from a de novo assembly to the entire reference set using minimap2 -x asm5 --secondary=no. The -x asm5 flag means that contigs will only map to references that share approx. 95% nucleotide identity. The --secondary=no flag means that contigs will only map to the closest matching reference (no multi-mapping). Together, these parameters allow VAPER to choose reference(s) that best match the sample. Tweaking the parameters below may improve results if you run into any issues:

  • --ref_genfrac: controls the minimum percent of a reference that must be mapped by one or more de novo contig for it to be selected for consensus generation.
  • --denovo_assembler: controls which tool is used for de novo assembly. Options include megahit, spades, velvet, and skesa.
  • --denovo_contigcov: controls the minimum coverage required for a contig to be included in the de novo assembly.
  • --denovo_contiglen: controls the minimum contig length required for a contig to be included in the de novo assembly.

fast mode

Also aptly named, fast mode is faster but less accurate (how much faster is up for debate). This approach uses sourmash gather to determine which reference(s) in the reference set are best represented in the raw reads. This mode has not been thoroughly tested. You can adjust this primarily using the --ref_genfrac parameter.

kitchen-sink mode 🚽

kitchen-sink mode attempts to build assemblies for each taxon identified in the metagenomic summary. Genome assemblies associated with each taxon are downloaded from NCBI using the NCBI datasets tool. Only assemblies that are listed as complete are included. Assemblies containing multiple contigs are split into individual FASTA files to avoid concatenating segmented viruses. As you can imagine, this method of reference selection can be very unreliable and should therefore be used with caution ⚠️.

Manual Reference Selection

You can tell VAPER which reference(s) to use for each sample by supplying a file path and/or the name of a reference in a reference set. This information is included in the references column of the samplesheet for each sample (see the examples below).

Example: File path

samplesheet.csv:

sample,fastq_1,fastq_2,reference
sample01,sample01_R1.fastq.gz,sample01_R2.fastq.gz,/path/to/ref.fa.gz

Example: Reference name

The example below would match reference Betacoronavirus-wg-1.fa.gz in the default reference set.

samplesheet.csv:

sample,fastq_1,fastq_2,reference
sample01,sample01_R1.fastq.gz,sample01_R2.fastq.gz,Betacoronavirus-wg-1

Example: Multiple references

Multiple references can be supplied for each sample using a semicolon (;). You can supply file paths and reference names at the same time 🙌!

samplesheet.csv:

sample,fastq_1,fastq_2,reference
sample01,sample01_R1.fastq.gz,sample01_R2.fastq.gz,Betacoronavirus-wg-1;/path/to/ref.fa.gz

Genome Assembly

VAPER creates genome assemblies by aligning reads to a reference genome and calling the consensus or plurality at each reference position (default: consensus). This can be accomplished using either ivar or irma (default: --cons_assembler ivar).

The de novo assembly used for ‘accurate’ reference selection is not used in the final reference-based assembly.

Assembly Parameters

Learn more about how to adjust assembly parameters here.

You can fine tune your IRMA assembly by adjusting the multitude of parameters located here: assets/IRMA_MODULE/init.sh. This file will be automatically copied into the work directory and used to build the IRMA module. However, MIN_AMBIG, MIN_CONS_SUPPORT, MIN_CONS_QUALITY, and SKIP_E must be adjusted from the command line using --cons_allele_ratio, --cons_allele_depth, --cons_allele_qual, and --cons_assembly_elong, respectively.

Assembly Quality

Assembly quality is evaluated using Nextclade and custom scripts. Quality metrics are reported relative to the reference genome used to create the assembly. Assemblies are automatically classified as PASS or FAIL based on the QC thresholds set using the --qc_depth and --qc_genfrac parameters. Assemblies will still be saved if they fail QC!

VAPER reports the number of bases in the sample missing from the termini of the reference in a separate column (ASSEMBLY_TERMINI_GAPS) because this is not included in the missing bases count by Nextclade.

Genome Assemblers

ivar assembler

iVar is the default assembler used by VAPER. Reads are aligned to the reference genome using BWA MEM and the alignment pileup is passed to ivar for consensus generation.

irma assembler

IRMA is a CDC-developed viral assembler with some nifty features 🌼. The main selling point with IRMA is that it can iteratively adjust the reference genome to more closely match the sample reads, therefore resulting in a more accurate assembly. This is accomplished using a liberal initial read alignment approach that allows for greater sample-reference divergence. IRMA can also elongate the reference during the refinement process, ideally resulting in a more “complete” assembly (--cons_elong true; use with caution ⚠️). Out of the box, IRMA is limited to a select number of species: influenza, ebolavirus, and coronavirus. This is because IRMA requires species-specific “modules” that contain a defined set of reference genomes. We can get around this by building these modules on the fly using the VAPER references 😎.

Another important difference with IRMA is that is produces multiple different assembly types, including a consensus assembly, like iVar, along with a plurality assembly and a padded assembly. Each have their own considerations (read more here). VAPER returns the consensus assembly by default (--cons_assembly_type consensus).

iVar and IRMA can perform differently depending on the input data. Learn more here.

Condensing Duplicate Assemblies

VAPER will occasionally produce multiple, near-identical assemblies (often >99.9% identity). This generally occurs when fragmented contigs from the de novo assembly map to multiple, closely related references (--ref_mode accurate only). VAPER uses a custom process (see vaper-condense.py) to identify and merge assemblies that are highly similar, helping reduce redundancy in the final output. This process uses MinHash-based comparisons (via Sourmash) to calculate pairwise distances between assemblies and groups them using hierarchical clustering. Assemblies within a user-defined similarity threshold (default 98% ANI; --cons_condist) are grouped, and the most complete and supported representative (based on coverage × depth) is selected from each cluster. The result is a condensed set of assemblies that preserves diversity while removing near-identical entries. This improves efficiency and interpretability in downstream analyses.

Metagenomic Classification

VAPER performs a basic viral metagenomic analysis using sourmash gather and sourmash tax metagenome with the 21-mer viral GenBank database. You can supply alternative database files using --sm_db and --sm_taxa.

  • The static image and summary exclude unclassified sequences when calculating relative abundance. Sequences with relative abundance ≤ 1% are grouped into Other.
  • The term metagenomic is used liberally here. A better term would probably be a metaviromic analysis, as it only includes viral species.

Example Outputs

Below are examples of the metagenomic output for a Twist Bioscience synthetic RNA control for Influenza A H1N1.

Krona Plot

Explore the interactive plot here.

Static Image

VAPER Summary

|SPECIES_SUMMARY| |:-| |91.2% Influenza A virus; 6.4% Escherichia virus P1; 6.4% uncultured human fecal virus|