Reference Sets
This page provides an overview of the use, structure, and creation of VAPER reference sets.
Using Reference Sets
Reference sets can be specified using the --ref_set parameter (learn more here).
Creating Reference Sets
VAPER accepts reference sets in comma-separated (CSV) or JSON lines (JSONL) format. See below for examples of each:
CSV Format
The table below shows the column names that can be supplied when using comma-separated reference sets. The only required column is assembly; all others are optional. Additional columns can also be included but will always be interpreted as strings.
| Column Name | Description |
|---|---|
| assembly | Path to reference assembly (Required) |
| taxon | Taxonomic classification (default reference set uses genus level) |
| species | Species-level classification (default reference set uses NCBI taxonomy names) |
| segment | Genome segment name (defaults to wg) |
| * | All extra columns are placed in the metadata field of each reference. All values are interpreted as strings |
Example:
references.csv
assembly,taxon,species,segment,serotype
/path/to/assembly.fa.gz,Orthoreovirus,Orthoreovirus mammalis,3,T2
Reference sets supplied as CSV files are converted to JSONL format and saved to the outdir.
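The column rules in the table above can be sketched in Python. This is only an illustration of how the columns are described here, not VAPER's own conversion code: the function name `parse_reference_csv` and the shape of the returned dictionaries are assumptions, and the JSONL that VAPER actually writes to the outdir may differ.

```python
import csv
import io

# Columns with a defined meaning in the table above; anything else is metadata.
KNOWN_COLUMNS = ("assembly", "taxon", "species", "segment")


def parse_reference_csv(text):
    """Parse CSV text into one dict per reference (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text))
    if "assembly" not in (reader.fieldnames or []):
        raise ValueError("reference set is missing the required 'assembly' column")

    references = []
    for row in reader:
        if not row["assembly"]:
            raise ValueError(f"line {reader.line_num}: 'assembly' is empty")
        ref = {"assembly": row["assembly"]}
        # Optional classification columns are copied only when present.
        for col in ("taxon", "species"):
            if row.get(col):
                ref[col] = row[col]
        # The segment column defaults to "wg" when absent or empty.
        ref["segment"] = row.get("segment") or "wg"
        # Extra columns go into metadata; csv already yields strings.
        ref["metadata"] = {
            key: value
            for key, value in row.items()
            if key is not None and value is not None and key not in KNOWN_COLUMNS
        }
        references.append(ref)
    return references


example = (
    "assembly,taxon,species,segment,serotype\n"
    "/path/to/assembly.fa.gz,Orthoreovirus,Orthoreovirus mammalis,3,T2\n"
)
refs = parse_reference_csv(example)
print(refs[0])
```

Running a check like this before launching a run catches a missing assembly column early, rather than partway through the pipeline.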
JSONL Format
Using the JSONL format takes a bit more setup but is much more program-friendly and easier to share. In this format, each line represents a single reference genome and includes both the reference sequence and any associated metadata. As with the CSV format, only one field is required (here, sequence); however, several other fields can be included to improve reference filtering and post-assembly interpretation. The example below shows a reference line created using EPITOME:
Example
references.jsonl
{
"taxon": "Orthoreovirus",
"segment": "3",
"variant": "1",
"timestamp": "2025-11-22 00:31:27",
"sequence": "GCTAATCGTCAGGATGAAGCGGATTCCAAGGAAGACAAAAGGCAAATCCAGCGGAAAAGGTAATGATTCAACATCGAGGTCCGACGATGGCTCGAGTCAATTGCGAGATAAGCAGAGCAATAAGGCTAGTCCGTCTACTGTGGAGCCTGGTACATCCAGTCGTGAGCAATATAAGGCTCGACCAGGAATTGCGTCCGTACAGAAAGCTACTGAGAGCGCAGAACTACCTATGAAAAACAATGATGAAGGGACGCCCGACAAGAAGGGAAATACCAAAGGCGAGTTAGTTAATGAGCATGTTGAGGCCAAAGATGAGGCAGATGATGCGACGAAGAAACAGGCGAAGGATACGGACAAAACCAAAGCACAAGTTACATATTCAGACACTGGCATTAATAATGCGAACGAACTGTCAAGATCAGGAAACGTGGATAACGAGGGCGGTAGTAATCAGAAGCCAATGTCTACTAGGATAGCTGAGGCGACATCTGCTATAGTATCAAAACATCCTGCGCGTGTGGGATTGCCGCCTACAGCCAGCAGTGGCCATGGATATCAATGTCATGTGTGCTCAGCAGTTCTATTCAGTCCTTTAGATTTGGATGCTCACGTTGCTTCACATGGCTTGCACGGGAACATGACACTGACATCGAGCGAGATTCAACGGCATATAACTGAATTTATTAGTTCATGGCAAAATCATCCTATCGTTCAGGTTTCAGCTGACGTCGAAAACAAGAAAACCGCTCAATTGCTGCATGCTGATACTCCTCGACTTGTTACTTGGGATGCAGGCTTATGCACCTCATTCAAGATCGTACCGATCGTGCCGGCACAGGTGCCTCAAGACGTACTGGCGTATACATTCTTCACCTCTTCGTACGCTATTCAGTCTCCGTTTCCTGAGGCTGCAGTGTCTAGAATTGTGGTGCATACGAGATGGGCATCTAACGTTGATTTCGACCGAGACTCATCCGTTATCATGGCGCCGCCCACGGAGAACAACATACATCTGTTTAAACAGTTATTGAATACTGAGACTTTGTCGGTGCGAGGCGCCAATCCGTTAATGTTCCGAGCGAATGTACTGCATATGCTATTAGAATTTGTATTGGACAACTTGTATCTGAATAGGCACACGGGGTTTTCCCAAGATCATACGCCATTCACTGAGGGTGCCAACTTACGCTCGCTTCCAGGCCCTGACGCTGAAAAATGGTACTCGATTATGTATCCAACGCGCATGGGTACGCCGAACGTGTCAAAGATATGCAATTTCGTCGCATCTTGTGTGAGAAATAGGGTTGGGCGATTCGATCGAGCTCAGATGATGAATGGAGCCATGTCGGAATGGGTGGATGTCTTCGAGACCTCAGACGCGTTAACCGTTTCTATTCGAGGCAGATGGATGGCTAGATTGGCTCGGATGAATATAAACCCAACAGAAATTGAGTGGGCGTTAACCGAGTGTGCTCAAGGATATGTAACTGTCACGAGTCCCTATGCGCCTAGCGTAAATAGACTGATGCCGTACCGGATTTCCAATGCCGAGCGACAGATCTCCCAGATAATCAGAATCATGAATATTGGCAATAATGCGACTGTGATACAACCCGTTCTGCAAGACATTTCGGTGCTTCTTCAACGCATATCACCACTCCAGATAGATCCAACCATCATTTCCAACACGATGTCAACAGTCTCTGAATCTACTACCCAAACTCTTAGCCCTGCATCGTCAATTTTGGGTAAATTGCGGCCAAGTAACTCGGATTTCTCTAGCTTCAGGGTCGCATTGGCCGGATGGCTTTATAATGGAGTCGTGACTACGGTGATTGACGATAGTTCATACCCCAAGGATGGTGGTAGCGTGACTTCGCTAGAAAATCTGTGGGATTTCTTCATTCTTGCACTTGCCCTGCCATTGACGACTGATCCGTGCGCTCCTGTGAAAGCGTTTATGACGTTGGCAAACATGATGGTTGGT
TTTGAAACGATTCCTATGGATAATCAGATTTATACTCAGTCGCGACGTGCGAGCGCTTTCTCGACGCCTCATACTTGGCCGAGATGCTTCATGAACATTCAATTGATTTCTCCAATCGATGCTCCAATCTTGCGGCAGTGGGCTGAAATCATCCATCGATACTGGCCTAATCCCTCTCAGATTCGTTATGGCGCCCCGAATGTCTTCGGCTCGGCTAATCTGTTCACGCCACCTGAGGTATTGCTGCTACCCATTGACCATCAGCCAGCCAATGTAACTACACCGACTCTGGATTTCACCAATGAGCTGACTAATTGGCGTGCTCGTGTCTGCGAGCTGATGAAGAATCTCGTTGATAATCAACGGTATCAACCTGGATGGACGCAGAGCTTGGTTTCGTCAATGCGCGGAACGCTGGATAAATTGAAGCTGATCAAATCGATGACACCAATGTATCTACAACAGCTCGCTCCAGTGGAATTGGCTGTGATAGCTCCGATGCTGCCTTTTCCACCCTTCCAGGTGCCATACGTCCGTCTTGATCGTGATAGAGTACCCACAATGGTTGGAGTCACCCGTCAGTCCCGAGATACCATTACTCAACCCGCACTATCACTTTCAACAACTAATACTACTGTTGGTGTGCCATTAGCCCTGGATGCGAGAGCCATCACTGTTGCGTTATTATCAGGGAAGTATCCACCGGATCTGGTGACAAATGTGTGGTACGCTGATGCCATCTATCCAATGTATGCTGATACTGAAGTGTTTTCAAACCTTCAGCGAGACATGATTACCTGCGAGGCGGTTCAGACACTGGTGACCCTTGTGGCACAAATATCAGAGACTCAGTACCCCGTGGATAGATATCTTGATTGGATCCCATCATTGAGGGCATCAGCAGCGACAGCGGCGACTTTTGCTGAGTGGGTCAACACTTCGATGAAAACGGCTTTTGACTTGTCTGATATGCTGTTGGAGCCTCTATTGAGCGGTGATCCGAGGATGACTCAATTAGCTATTCAGTACCAGCAATACAATGGCCGGACGTTTAATGTTATACCTGAGATGCCTGGATCAGTTATCGCTGACTGCGTTCAACTGACAGCAGAAGTTTTTAATCATGAATATAATCTGTTCGGGATTGCACGAGGTGACATCATCATCGGACGTGTTCAGTCGACGCATTTGTGGTCACCGCTGGCTCCCCCACCTGATCTGGTCTTCGATCGTGACACACCAGGTGTTCATATTTTTGGGCGAGATTGTCGCATATCGTTTGGAATGAACGGCGCCGCCCCCATGATTAGAGATGAGACTGGCATGATGGTGCCTTTTGAAGGAAACTGGATCTTTCCACTAGCGCTCTGGCAAATGAACACGCGATACTTCAACCAGCAGTTCGATGCATGGATTAAGACGGGAGAACTGCGAATACGTATTGAGATGGGCGCCTACCCGTACATGCTGCATTATTACGATCCGCGTCAGTATGCCAACGCGTGGAACCTGACGTCCGCCTGGCTTGAGGAAATCACGCCGACGAGCATACCGTCTGTGCCTTTTATGGTGCCTATCTCCAGTGATCATGACATCTCCTCCGCTCCCGCTGTTCAATACATCATTTCAACTGAATACAACGATCGATCCCTGTTCTGTACTAACTCCTCATCTCCTCAGACCATCGCTGGACCAGATAAACATATTCCCGTCGAAAGGTACAACATTCTGACCAATCCTGACGCTCCGCCTACGCAAATACAGCTGCCTGAGGTTGTTGACTTGTATAACGTTGTCACACGCTATGCCTATGAGACTCCTCCCATCACCGCTGTTGTTATGGGTGTTCCTTGATCCTCATCCTCCCAACGGGTGCTAGAGCATCGCGCTCGATGCTAGTTGGGCCGATTCATC",
"method": "consensus",
"metadata": {
"collection_date": {
"min": 2011,
"max": 2011
},
"geographic_region": [
"Asia"
],
"host": [
"Hipposideros"
],
"segment": [
"3"
],
"serotype": [
"T2"
],
"species": [
"Orthoreovirus mammalis"
],
"accessions": [
"KT444574.1"
]
}
}
The JSON line in this example has been expanded to a multi-line view to make it readable. Each JSON line in the reference set must be provided as a single line.
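One way to collapse an expanded record back into a single line is sketched below using Python's standard `json` module. The helper name `to_jsonl_line` is hypothetical, and the sequence is shortened to a few bases purely for illustration.

```python
import json


def to_jsonl_line(record):
    """Serialize one reference record as a single JSON line (hypothetical helper)."""
    # sequence is the only required field in the JSONL format.
    if not record.get("sequence"):
        raise ValueError("each reference needs a non-empty 'sequence' field")
    # json.dumps never emits raw newlines, so the result is always one line.
    return json.dumps(record, separators=(",", ":"), ensure_ascii=False)


# Expanded, human-readable form (sequence truncated for illustration only).
expanded = """
{
  "taxon": "Orthoreovirus",
  "segment": "3",
  "sequence": "GCTAATCG",
  "metadata": {
    "serotype": ["T2"]
  }
}
"""

line = to_jsonl_line(json.loads(expanded))
print(line)
```

Writing one such line per reference to a file, each followed by a newline, gives a reference set in the single-line form described above.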
All metadata fields in the example above hold the raw data as it appeared in NCBI or GISAID. This is why segment appears twice: the top-level segment is the standardized form, whereas metadata.segment holds the raw values.
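The difference between the two segment fields matters when filtering a reference set programmatically. The sketch below selects references by the standardized top-level segment and leaves the raw metadata values untouched. The helper names and the sample records are hypothetical, including the raw value "L1", which is invented here to show how a raw label could differ from the standardized one.

```python
import json


def load_references(lines):
    """Parse an iterable of JSONL lines into a list of dicts, skipping blanks."""
    return [json.loads(line) for line in lines if line.strip()]


def select_segment(refs, segment):
    """Keep references whose standardized (top-level) segment matches."""
    return [ref for ref in refs if ref.get("segment") == segment]


# Hypothetical records with truncated sequences, for illustration only.
sample = [
    '{"taxon": "Orthoreovirus", "segment": "3", "sequence": "GCTA", "metadata": {"segment": ["3"]}}',
    '{"taxon": "Orthoreovirus", "segment": "1", "sequence": "ACGT", "metadata": {"segment": ["L1"]}}',
]

refs = load_references(sample)
for ref in select_segment(refs, "1"):
    # The top-level field is standardized; metadata.segment keeps the raw labels.
    print(ref["segment"], ref["metadata"]["segment"])
```

Filtering on the top-level field avoids having to account for every naming variant that may appear in the raw metadata.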
The Default Reference Set
The default reference set used by VAPER was created using EPITOME v2.0 with a 2% divergence threshold. This threshold was selected based on results from varcraft, which showed that assembly quality tends to degrade when sample-reference divergence exceeds 5%.
EPITOME-derived references include rich metadata about each source sequence, such as species, collection_date, geographic_region, and serotype. Because this information is sourced from public databases, its accuracy is not guaranteed, so please interpret it with caution ⚠️. Visit the reference search page to explore available references.