Outputs

  1. Overview
  2. Alignment Files
  3. Assembly Files
  4. Metagenome Files
    1. Sourmash Outputs
    2. Visualizations
  5. Quality Files
    1. Read QC
    2. Read Alignment
    3. Assembly Selection
  6. Mapped Reads
  7. Reference Selection
  8. Summary Files
    1. VAPER Summary Columns

Overview

Below is an overview of the standard outputs produced by VAPER.

`${outdir}` 
├── `${sample}`
│   ├── align
│   │   ├── `${sample}`.bam
│   │   ├── `${sample}`.bam.bai
│   │   ├── refs.fa
│   │   └── refs.fa.fai
│   ├── assembly
│   │   └── `${sample}`_`${reference_name}`.fa.gz
│   ├── assembly_tidy
│   │   └── `${sample}`_`${reference_name}`.tidy.fa.gz
│   ├── metagenome
│   │   ├── `${sample}`.all-taxa.csv.gz
│   │   ├── `${sample}`.html
│   │   ├── `${sample}`.sm-meta.csv
│   │   └── `${sample}`.taxa-plot.jpg
│   ├── qc
│   │   ├── `${sample}`-`${reference_name}`.stats.txt
│   │   ├── `${sample}`-condensed.csv
│   │   ├── `${sample}`.coverage.txt
│   │   ├── `${sample}`.fastp.json
│   │   ├── `${sample}`_1_fastqc.html
│   │   └── `${sample}`_2_fastqc.html
│   ├── reads
│   │   ├── `${sample}`-`${reference}`_R1.fastq.gz
│   │   ├── `${sample}`-`${reference}`_R2.fastq.gz
│   │   ├── `${sample}`_R1.scrubbed.fastq.gz
│   │   └── `${sample}`_R2.scrubbed.fastq.gz
│   ├── ref-select
│   │   ├── `${sample}`.denovo.fa
│   │   └── `${sample}`.ref-summary.csv
│   └── summary
│       └── `${sample}`.json
├── VAPER-summary.csv
├── VAPER-summary.json
├── multiqc
│   └── ...
└── pipeline_info
    └── ...
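To check which of these outputs exist for a given sample, the layout above can be walked with a few lines of Python. This is a minimal sketch: the function name `list_sample_outputs` is hypothetical and not part of VAPER, and it assumes only the per-sample subdirectory names shown in the tree.

```python
from pathlib import Path

# Per-sample subdirectories, as listed in the overview tree.
SAMPLE_SUBDIRS = (
    "align", "assembly", "assembly_tidy", "metagenome",
    "qc", "reads", "ref-select", "summary",
)


def list_sample_outputs(outdir, sample):
    """Map each per-sample subdirectory that exists to its sorted file names."""
    sample_dir = Path(outdir) / sample
    found = {}
    for sub in SAMPLE_SUBDIRS:
        directory = sample_dir / sub
        # Subdirectories that were not produced (e.g. assembly_tidy) are skipped.
        if directory.is_dir():
            found[sub] = sorted(p.name for p in directory.iterdir() if p.is_file())
    return found
```

For example, `list_sample_outputs("results", "sampleA")` would return a dictionary keyed by subdirectory name, where `results` and `sampleA` stand in for your own `${outdir}` and `${sample}`.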

Alignment Files

Read alignments and the corresponding reference(s) and index files are provided for visualization tools like IGV.

├── align
│   ├── `${sample}`.bam
│   ├── `${sample}`.bam.bai
│   ├── refs.fa
│   └── refs.fa.fai

Assembly Files

Final assemblies based on selected references.

├── assembly
│   └── `${sample}-${reference_name}`.fa.gz

A tidy version of the assembly is also saved if you run VAPER with --cons_prune_termini or --cons_no_mixed_sites.

├── assembly_tidy
│   └── `${sample}-${reference_name}`.tidy.fa.gz

Metagenome Files

Multiple metagenomic outputs are provided. See descriptions below:

├── metagenome
│   ├── `${sample}`.all-taxa.csv.gz
│   ├── `${sample}`.html
│   ├── `${sample}`.sm-meta.csv
│   └── `${sample}`.taxa-plot.jpg

Sourmash Outputs

  • ${sample}.all-taxa.csv.gz is the summary file generated by sourmash gather
  • ${sample}.sm-meta.csv is the summary file generated by sourmash tax metagenome

Visualizations

  • ${sample}.html is a Krona plot generated from ${sample}.sm-meta.csv
  • ${sample}.taxa-plot.jpg shows the relative abundance of all classified sequences (excludes unclassified). Classifications with < 1% relative abundance are grouped under “Other”.
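The grouping used for the taxa plot can be sketched as follows. This is an illustration of the rule described above, not VAPER's own plotting code: the function name `group_minor_taxa` and its input (a mapping from taxon name to abundance for classified sequences only) are assumptions.

```python
def group_minor_taxa(abundance, threshold=0.01):
    """Rescale classified abundances to sum to 1 and pool taxa below `threshold`.

    `abundance` maps taxon name -> abundance of classified sequences only;
    unclassified sequences are assumed to have been dropped already.
    """
    total = sum(abundance.values())
    if total == 0:
        return {}
    grouped = {}
    other = 0.0
    for taxon, value in abundance.items():
        fraction = value / total
        if fraction < threshold:
            # Below 1% relative abundance: pooled under "Other".
            other += fraction
        else:
            grouped[taxon] = fraction
    if other > 0:
        grouped["Other"] = other
    return grouped
```

Because the abundances are rescaled over classified sequences only, the returned fractions always sum to 1, regardless of how much of the sample was unclassified.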

Quality Files

Files related to quality control are saved to a common directory. See descriptions below:

├── qc
│   ├── `${sample}`-`${reference_name}`.stats.txt
│   ├── `${sample}`-condensed.csv
│   ├── `${sample}`.coverage.txt
│   ├── `${sample}`.fastp.json
│   ├── `${sample}`_1_fastqc.html
│   └── `${sample}`_2_fastqc.html

Read QC

  • ${sample}.fastp.json is the QC summary of the raw and quality-filtered reads provided by fastp.
  • ${sample}_{1,2}_fastqc.html are the FastQC reports for the fastp-filtered reads.
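The fastp report can be read directly if you need read-level metrics outside of the VAPER summary. The sketch below relies on the `summary`, `before_filtering`, and `after_filtering` sections of the fastp JSON report; the function name `read_fastp_summary` is hypothetical, and the `raw_`/`filtered_` output keys are chosen here only to mirror the summary column names, which is an assumption about how those columns are derived.

```python
import json


def read_fastp_summary(path):
    """Pull read counts, Q30 rate, and mean read lengths from a fastp JSON report."""
    with open(path) as handle:
        summary = json.load(handle)["summary"]
    metrics = {}
    for label, key in (("raw", "before_filtering"), ("filtered", "after_filtering")):
        section = summary[key]
        metrics[f"{label}_total_reads"] = section["total_reads"]
        # fastp reports q30_rate as a fraction between 0 and 1.
        metrics[f"{label}_q30_rate"] = section["q30_rate"]
        metrics[f"{label}_read1_mean_length"] = section["read1_mean_length"]
        # read2 values are only present for paired-end data.
        metrics[f"{label}_read2_mean_length"] = section.get("read2_mean_length")
    return metrics
```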

Read Alignment

${sample}.coverage.txt and ${sample}-${reference_name}.stats.txt describe read mapping quality.

Assembly Selection

${sample}.condense_dist.csv and ${sample}.condense_summary.csv are the outputs generated by vaper-condense.py and used to determine whether two assemblies should be merged.

Mapped Reads

Reads that map to each reference genome are exported for downstream use. Human-scrubbed reads are also saved here when using --scrub_reads true.

├── reads
│   ├── `${sample}`-`${reference}`_R1.fastq.gz
│   ├── `${sample}`-`${reference}`_R2.fastq.gz
│   ├── `${sample}`_R1.scrubbed.fastq.gz
│   └── `${sample}`_R2.scrubbed.fastq.gz

Reference Selection

Files related to reference selection are saved to a common directory. See descriptions below:

├── ref-select
│   ├── `${sample}`.denovo.fa
│   └── `${sample}`.ref-summary.csv

  • ${sample}.denovo.fa is the de novo assembly generated by shovill when using accurate reference selection mode.
  • ${sample}.ref-summary.csv is a summary of the genome fraction of each reference covered by the de novo assembly contigs.

Summary Files

Summary files are provided both individually (by sample) and collectively (by run).

│   └── summary
│       └── `${sample}`.json
├── VAPER-summary.csv
├── VAPER-summary.json

  • ${sample}.json is a summary for each sample.
  • VAPER-summary.csv and VAPER-summary.json are the combined summaries for all samples included in a run. VAPER-summary.json contains more information than VAPER-summary.csv.
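A common first step with the run-level summary is to keep only the assemblies that passed QC. The sketch below assumes that VAPER-summary.csv is comma-delimited with a header row using the column names documented in this section (`id`, `assembly_qc`); the function name `passing_assemblies` is hypothetical.

```python
import csv


def passing_assemblies(summary_csv):
    """Return the rows of VAPER-summary.csv whose assembly_qc value is PASS."""
    with open(summary_csv, newline="") as handle:
        return [
            row for row in csv.DictReader(handle)
            if row.get("assembly_qc") == "PASS"
        ]
```

Each returned row is a dictionary keyed by column name, so `row["id"]` and `row["reference"]` can be used to locate the matching files under `${outdir}`.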

VAPER Summary Columns

| Column Name | Description |
|---|---|
| id | The sample ID provided in the samplesheet followed by the sample replicate number (e.g., _T1). |
| species | Reference species |
| segment | Reference segment |
| reference | Reference name |
| assembly_qc | Automated reporting of assembly quality (PASS or FAIL) based on reference genome fraction (--qc_genfrac) and reference depth of coverage (--qc_depth). |
| assembly_qc_reason | Explanation for QC failure. |
| assembly_variant | Numbered version of an assembly. This should be 1 of 1 unless there are multiple assemblies associated with a single species-segment combination for a sample. Multiple assemblies may arise from same-species co-infections or fragmented de novo assemblies affecting reference selection. |
| assembly_length | Number of nucleotides in the sample assembly (includes non-ATCG). |
| assembly_read_depth | Average number of reads supporting each position in the sample assembly. |
| assembly_genome_fraction | The fraction of reference positions with a declared nucleotide in the sample. |
| assembly_identity | Fraction of assembly nucleotides that match the reference. |
| assembly_ins | Number of nucleotide insertions in the sample assembly compared to the reference. |
| assembly_del | Number of nucleotide deletions in the sample assembly compared to the reference. |
| assembly_sub | Number of nucleotide substitutions in the sample compared to the reference. |
| assembly_missing | Number of unassigned nucleotides (N) in the sample assembly compared to the reference. |
| assembly_mixed | Number of nucleotides assigned as something other than A, C, G, T, or N in the sample assembly compared to the reference. |
| assembly_pct_bases_mapped | Percent of clean bases that mapped to the assembly reference genome. |
| assembly_reads_mapped | Total number of clean reads that mapped to the assembly reference genome. |
| assembly_mean_mapped_read_length | Average length of the clean reads that mapped to the assembly reference genome. |
| assembly_mean_mapped_read_quality | Average Phred score of the clean reads that mapped to the sample reference genome. |
| filtered_q30_rate | Percentage of bases in the clean reads that have a Phred score of 30 or greater. |
| filtered_read1_mean_length | Average length of the clean forward reads. |
| filtered_read2_mean_length | Average length of the clean reverse reads. |
| filtered_total_reads | Number of clean reads (R1 + R2). |
| raw_q30_rate | Percentage of bases in the raw reads that have a Phred score of 30 or greater. |
| raw_read1_mean_length | Average length of the raw forward reads. |
| raw_read2_mean_length | Average length of the raw reverse reads. |
| raw_total_reads | Number of raw reads (R1 + R2). |
| species_summary | Relative abundance of viral species detected in the sample reads using Sourmash. Species with ≤ 1% relative abundance are categorized as Other. Relative abundance is calculated using only the classified sequences. |
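
The count-based columns assembly_length, assembly_missing, and assembly_mixed can be illustrated on a single sequence. This is a sketch of the definitions given above, not VAPER's implementation, which may derive these values from an alignment to the reference. The function name `assembly_base_counts` is hypothetical, and the input is assumed to contain only IUPAC nucleotide codes (no gap characters).

```python
def assembly_base_counts(sequence):
    """Count total, missing (N), and mixed (non-ACGTN) bases in one sequence."""
    seq = sequence.upper()
    missing = seq.count("N")
    # Anything other than A, C, G, T, or N is treated as a mixed site.
    mixed = sum(1 for base in seq if base not in "ACGTN")
    return {
        "assembly_length": len(seq),  # includes N and mixed bases
        "assembly_missing": missing,
        "assembly_mixed": mixed,
    }
```

For the input `"ACGTNNRY"`, this counts 8 bases in total, 2 missing (N), and 2 mixed (R and Y).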