Reference Sets

This page provides an overview of the use, structure, and creation of VAPER reference sets.

Using Reference Sets

Reference sets can be specified using the --ref_set parameter (learn more here).

Creating Reference Sets

VAPER accepts reference sets in comma-separated values (CSV) or JSON Lines (JSONL) format. Each format is described below, with an example.


CSV Format

The table below shows the column names that can be supplied in a comma-separated reference set. Only the assembly column is required; all others are optional. Additional columns can also be included, but their values are always interpreted as strings.

| Column Name | Description |
|-------------|-------------|
| assembly | Path to reference assembly (Required) |
| taxon | Taxonomic classification (default reference set uses genus level) |
| species | Species-level classification (default reference set uses NCBI taxonomy names) |
| segment | Genome segment name (defaults to wg) |

All extra columns are placed in the metadata field of each reference, and all of their values are interpreted as strings.

Example:

references.csv:

assembly,taxon,species,segment,serotype
/path/to/assembly.fa.gz,Orthoreovirus,Orthoreovirus mammalis,3,T2

Reference sets supplied as CSV files are converted to JSONL format and saved to the outdir.
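
The column rules above can be sketched in Python. This is a minimal illustration, not VAPER's own converter: the function name parse_reference_csv and the layout of the returned dictionaries are assumptions made for this example. It enforces the required assembly column, applies the wg default for segment, and collects any extra columns as strings under metadata.

```python
import csv
import io

# Column names documented in the table above; anything else is treated as metadata.
KNOWN_COLUMNS = ("assembly", "taxon", "species", "segment")


def parse_reference_csv(text):
    """Parse reference-set CSV text into a list of dicts (hypothetical layout)."""
    reader = csv.DictReader(io.StringIO(text))
    if "assembly" not in (reader.fieldnames or []):
        raise ValueError("reference set CSV must contain an 'assembly' column")

    references = []
    for row in reader:
        if not row["assembly"]:
            raise ValueError("every row needs a non-empty 'assembly' path")
        references.append({
            "assembly": row["assembly"],
            "taxon": row.get("taxon") or None,
            "species": row.get("species") or None,
            # The documented default when no segment is given.
            "segment": row.get("segment") or "wg",
            # Extra columns are kept as strings; csv never converts types.
            "metadata": {k: v for k, v in row.items() if k not in KNOWN_COLUMNS},
        })
    return references


example = """assembly,taxon,species,segment,serotype
/path/to/assembly.fa.gz,Orthoreovirus,Orthoreovirus mammalis,3,T2
"""
refs = parse_reference_csv(example)
print(refs[0]["segment"], refs[0]["metadata"])
```

Running a check like this before launching VAPER can catch a missing assembly column early. How VAPER itself reads the assembly file and builds the final JSONL record is not shown here.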


JSONL Format

The JSONL format takes a bit more setup but is easier for programs to consume and easier to share. In this format, each line represents a single reference genome and includes both the reference sequence and any associated metadata. As with the CSV format, only one field is required (here, sequence); however, several other fields can be included to improve reference filtering and post-assembly interpretation. The example below shows a reference line created using EPITOME:

Example:

references.jsonl:

{
  "taxon": "Orthoreovirus",
  "segment": "3",
  "variant": "1",
  "timestamp": "2025-11-22 00:31:27",
  "sequence": "GCTAATCGTCAGGATGAAGCGGATTCCAAGGAAGACAAAAGGCAAATCCAGCGGAAAAGGTAATGATTCAACATCGAGGTCCGACGATGGCTCGAGTCAATTGCGAGATAAGCAGAGCAATAAGGCTAGTCCGTCTACTGTGGAGCCTGGTACATCCAGTCGTGAGCAATATAAGGCTCGACCAGGAATTGCGTCCGTACAGAAAGCTACTGAGAGCGCAGAACTACCTATGAAAAACAATGATGAAGGGACGCCCGACAAGAAGGGAAATACCAAAGGCGAGTTAGTTAATGAGCATGTTGAGGCCAAAGATGAGGCAGATGATGCGACGAAGAAACAGGCGAAGGATACGGACAAAACCAAAGCACAAGTTACATATTCAGACACTGGCATTAATAATGCGAACGAACTGTCAAGATCAGGAAACGTGGATAACGAGGGCGGTAGTAATCAGAAGCCAATGTCTACTAGGATAGCTGAGGCGACATCTGCTATAGTATCAAAACATCCTGCGCGTGTGGGATTGCCGCCTACAGCCAGCAGTGGCCATGGATATCAATGTCATGTGTGCTCAGCAGTTCTATTCAGTCCTTTAGATTTGGATGCTCACGTTGCTTCACATGGCTTGCACGGGAACATGACACTGACATCGAGCGAGATTCAACGGCATATAACTGAATTTATTAGTTCATGGCAAAATCATCCTATCGTTCAGGTTTCAGCTGACGTCGAAAACAAGAAAACCGCTCAATTGCTGCATGCTGATACTCCTCGACTTGTTACTTGGGATGCAGGCTTATGCACCTCATTCAAGATCGTACCGATCGTGCCGGCACAGGTGCCTCAAGACGTACTGGCGTATACATTCTTCACCTCTTCGTACGCTATTCAGTCTCCGTTTCCTGAGGCTGCAGTGTCTAGAATTGTGGTGCATACGAGATGGGCATCTAACGTTGATTTCGACCGAGACTCATCCGTTATCATGGCGCCGCCCACGGAGAACAACATACATCTGTTTAAACAGTTATTGAATACTGAGACTTTGTCGGTGCGAGGCGCCAATCCGTTAATGTTCCGAGCGAATGTACTGCATATGCTATTAGAATTTGTATTGGACAACTTGTATCTGAATAGGCACACGGGGTTTTCCCAAGATCATACGCCATTCACTGAGGGTGCCAACTTACGCTCGCTTCCAGGCCCTGACGCTGAAAAATGGTACTCGATTATGTATCCAACGCGCATGGGTACGCCGAACGTGTCAAAGATATGCAATTTCGTCGCATCTTGTGTGAGAAATAGGGTTGGGCGATTCGATCGAGCTCAGATGATGAATGGAGCCATGTCGGAATGGGTGGATGTCTTCGAGACCTCAGACGCGTTAACCGTTTCTATTCGAGGCAGATGGATGGCTAGATTGGCTCGGATGAATATAAACCCAACAGAAATTGAGTGGGCGTTAACCGAGTGTGCTCAAGGATATGTAACTGTCACGAGTCCCTATGCGCCTAGCGTAAATAGACTGATGCCGTACCGGATTTCCAATGCCGAGCGACAGATCTCCCAGATAATCAGAATCATGAATATTGGCAATAATGCGACTGTGATACAACCCGTTCTGCAAGACATTTCGGTGCTTCTTCAACGCATATCACCACTCCAGATAGATCCAACCATCATTTCCAACACGATGTCAACAGTCTCTGAATCTACTACCCAAACTCTTAGCCCTGCATCGTCAATTTTGGGTAAATTGCGGCCAAGTAACTCGGATTTCTCTAGCTTCAGGGTCGCATTGGCCGGATGGCTTTATAATGGAGTCGTGACTACGGTGATTGACGATAGTTCATACCCCAAGGATGGTGGTAGCGTGACTTCGCTAGAAAATCTGTGGGATTTCTTCATTCTTGCACTTGCCCTGCCATTGACGACTGATCCGTGCGCTCCTGTGAAAGCGTTTATGACGTTGGCAAACATGATGGTTG
GTTTTGAAACGATTCCTATGGATAATCAGATTTATACTCAGTCGCGACGTGCGAGCGCTTTCTCGACGCCTCATACTTGGCCGAGATGCTTCATGAACATTCAATTGATTTCTCCAATCGATGCTCCAATCTTGCGGCAGTGGGCTGAAATCATCCATCGATACTGGCCTAATCCCTCTCAGATTCGTTATGGCGCCCCGAATGTCTTCGGCTCGGCTAATCTGTTCACGCCACCTGAGGTATTGCTGCTACCCATTGACCATCAGCCAGCCAATGTAACTACACCGACTCTGGATTTCACCAATGAGCTGACTAATTGGCGTGCTCGTGTCTGCGAGCTGATGAAGAATCTCGTTGATAATCAACGGTATCAACCTGGATGGACGCAGAGCTTGGTTTCGTCAATGCGCGGAACGCTGGATAAATTGAAGCTGATCAAATCGATGACACCAATGTATCTACAACAGCTCGCTCCAGTGGAATTGGCTGTGATAGCTCCGATGCTGCCTTTTCCACCCTTCCAGGTGCCATACGTCCGTCTTGATCGTGATAGAGTACCCACAATGGTTGGAGTCACCCGTCAGTCCCGAGATACCATTACTCAACCCGCACTATCACTTTCAACAACTAATACTACTGTTGGTGTGCCATTAGCCCTGGATGCGAGAGCCATCACTGTTGCGTTATTATCAGGGAAGTATCCACCGGATCTGGTGACAAATGTGTGGTACGCTGATGCCATCTATCCAATGTATGCTGATACTGAAGTGTTTTCAAACCTTCAGCGAGACATGATTACCTGCGAGGCGGTTCAGACACTGGTGACCCTTGTGGCACAAATATCAGAGACTCAGTACCCCGTGGATAGATATCTTGATTGGATCCCATCATTGAGGGCATCAGCAGCGACAGCGGCGACTTTTGCTGAGTGGGTCAACACTTCGATGAAAACGGCTTTTGACTTGTCTGATATGCTGTTGGAGCCTCTATTGAGCGGTGATCCGAGGATGACTCAATTAGCTATTCAGTACCAGCAATACAATGGCCGGACGTTTAATGTTATACCTGAGATGCCTGGATCAGTTATCGCTGACTGCGTTCAACTGACAGCAGAAGTTTTTAATCATGAATATAATCTGTTCGGGATTGCACGAGGTGACATCATCATCGGACGTGTTCAGTCGACGCATTTGTGGTCACCGCTGGCTCCCCCACCTGATCTGGTCTTCGATCGTGACACACCAGGTGTTCATATTTTTGGGCGAGATTGTCGCATATCGTTTGGAATGAACGGCGCCGCCCCCATGATTAGAGATGAGACTGGCATGATGGTGCCTTTTGAAGGAAACTGGATCTTTCCACTAGCGCTCTGGCAAATGAACACGCGATACTTCAACCAGCAGTTCGATGCATGGATTAAGACGGGAGAACTGCGAATACGTATTGAGATGGGCGCCTACCCGTACATGCTGCATTATTACGATCCGCGTCAGTATGCCAACGCGTGGAACCTGACGTCCGCCTGGCTTGAGGAAATCACGCCGACGAGCATACCGTCTGTGCCTTTTATGGTGCCTATCTCCAGTGATCATGACATCTCCTCCGCTCCCGCTGTTCAATACATCATTTCAACTGAATACAACGATCGATCCCTGTTCTGTACTAACTCCTCATCTCCTCAGACCATCGCTGGACCAGATAAACATATTCCCGTCGAAAGGTACAACATTCTGACCAATCCTGACGCTCCGCCTACGCAAATACAGCTGCCTGAGGTTGTTGACTTGTATAACGTTGTCACACGCTATGCCTATGAGACTCCTCCCATCACCGCTGTTGTTATGGGTGTTCCTTGATCCTCATCCTCCCAACGGGTGCTAGAGCATCGCGCTCGATGCTAGTTGGGCCGATTCATC",
  "method": "consensus",
  "metadata": {
    "collection_date": {
      "min": 2011,
      "max": 2011
    },
    "geographic_region": [
      "Asia"
    ],
    "host": [
      "Hipposideros"
    ],
    "segment": [
      "3"
    ],
    "serotype": [
      "T2"
    ],
    "species": [
      "Orthoreovirus mammalis"
    ],
    "accessions": [
      "KT444574.1"
    ]
  }
}

The JSON line in this example has been expanded to a multi-line view for readability. In an actual reference set, each record must be provided on a single line.
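
The single-line requirement can be illustrated with Python's standard json module. This is a sketch under stated assumptions: the helper names to_jsonl_line and read_reference_jsonl are hypothetical, the sequence is a shortened placeholder, and skipping blank lines is a choice made for this example rather than documented VAPER behavior.

```python
import json


def to_jsonl_line(record):
    """Serialize one reference record as a single JSON line."""
    if "sequence" not in record:
        raise ValueError("'sequence' is the only required field")
    # json.dumps emits no line breaks unless `indent` is set.
    return json.dumps(record)


def read_reference_jsonl(lines):
    """Parse an iterable of JSONL lines, checking each record for 'sequence'."""
    records = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored in this sketch
        record = json.loads(line)  # raises json.JSONDecodeError on a partial record
        if "sequence" not in record:
            raise ValueError(f"line {number}: missing required 'sequence' field")
        records.append(record)
    return records


record = {
    "taxon": "Orthoreovirus",
    "segment": "3",
    "sequence": "GCTAATCG",  # shortened placeholder, not a real reference
    "metadata": {"segment": ["3"], "serotype": ["T2"]},
}
line = to_jsonl_line(record)
parsed = read_reference_jsonl([line])
print(parsed[0]["segment"], parsed[0]["metadata"]["segment"])
```

A record that has been pretty-printed across several lines fails this check, because each physical line is then only a fragment of a JSON object.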

All metadata fields in the example above hold the raw values as they appeared in NCBI or GISAID. This is why segment appears twice: the top-level segment is the standardized form, whereas metadata.segment holds the raw values.


The Default Reference Set

The default reference set used by VAPER was created using EPITOME v2.0 with a 2% divergence threshold. This threshold was selected based on results from varcraft, which showed that assembly quality tends to degrade when sample-reference divergence exceeds 5%.

EPITOME-derived references include rich metadata about each source sequence, such as species, collection_date, geographic_region, and serotype. Because this information is sourced from public databases, its accuracy is not guaranteed, so please interpret it with caution ⚠️. Visit the reference search page to explore available references.