Ensembl Variation

Ensembl stores both structural variants and sequence variants, including Single Nucleotide Polymorphisms (SNPs), insertions, deletions and somatic mutations, for several species. This data is available on the website, via the Perl API, and through direct access to the underlying databases.

This data is also integrated with other sections of Ensembl; in particular, we predict the effects of variants on the Ensembl transcripts for each species.

Variation species and data sources

Ensembl stores variation data for the following species, but note that users can still use the Variant Effect Predictor on species for which we do not currently have a variation database.

The majority of variants are imported from NCBI dbSNP. The data is imported when it is released by dbSNP and incorporated into the next Ensembl release. If dbSNP releases the data on a different assembly, Ensembl will remap the variant positions onto the current assembly. Data from projects like the HapMap Project and 1000 Genomes Project is imported once it has been submitted to dbSNP.

Ensembl also includes data from other sources. To view data from these sources in the browser go to a species Location page (e.g. for human), and click on the 'Configure this page' link on the left-hand side. The 'Germline variation' and 'Somatic mutations' sections contain a track list of all sources of variation data for that species.


Variation displays

Variation data can be viewed in the browser through pages such as:

Clicking on any variation on an Ensembl page will open a Variation tab with information about the flanking sequence and source for the selected variation. Links to linkage disequilibrium (LD) plots, to phenotype information (for human) from EGA, OMIM and NHGRI, and to the Ensembl genes and transcripts that include the variation can be found on the left of this tab. You may also view multiple genome alignments of various species, highlighting the variation. Ancestral sequences are included in this display.

Variation information can also be accessed using BioMart (gene or variation database), and the Perl API (variation databases).


Variation classes

We call the class of a variation according to its component alleles and its mapping to the reference genome, and then display this information on the website. Internally we use Sequence Ontology (SO) terms, but we map these to our own 'display' terms where common usage differs from the SO definition (e.g. our term SNP is closer to the SO term SNV). All the classes we call, along with their equivalent SO terms, are shown in the table below. We also differentiate somatic mutations from germline variations in the display term by prefixing the term with 'somatic'. API users can fetch either the SO term or the display term.


Ensembl term (germline / somatic) | SO term | SO accession
SNP / somatic_SNV | SNV | SO:0001483
substitution / somatic_substitution | substitution | SO:1000002
CNV / somatic_CNV | copy_number_variation | SO:0001019
insertion / somatic_insertion | insertion | SO:0000667
deletion / somatic_deletion | deletion | SO:0000159
indel / somatic_indel | indel | SO:1000032
tandem_repeat / somatic_tandem_repeat | tandem_repeat | SO:0000705
sequence_alteration / somatic_sequence_alteration | sequence_alteration | SO:0001059
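
The mapping in the table above can be written down as a small lookup. The following is a minimal Python sketch, not the Ensembl API: the names `VARIATION_CLASSES`, `display_term` and `so_term` are hypothetical and exist only for illustration. The data is transcribed from the table, including the one irregular case where the somatic form of 'SNP' is 'somatic_SNV'.

```python
# Display term -> (SO term, SO accession), transcribed from the table above.
# All names in this sketch are illustrative; they are not part of the Ensembl API.
VARIATION_CLASSES = {
    "SNP": ("SNV", "SO:0001483"),
    "substitution": ("substitution", "SO:1000002"),
    "CNV": ("copy_number_variation", "SO:0001019"),
    "insertion": ("insertion", "SO:0000667"),
    "deletion": ("deletion", "SO:0000159"),
    "indel": ("indel", "SO:1000032"),
    "tandem_repeat": ("tandem_repeat", "SO:0000705"),
    "sequence_alteration": ("sequence_alteration", "SO:0001059"),
}


def display_term(ensembl_term, somatic=False):
    """Return the display term, adding the 'somatic_' prefix when requested."""
    if ensembl_term not in VARIATION_CLASSES:
        raise KeyError(f"unknown variation class: {ensembl_term}")
    if not somatic:
        return ensembl_term
    # The table lists 'somatic_SNV' (not 'somatic_SNP') for somatic SNPs.
    base = "SNV" if ensembl_term == "SNP" else ensembl_term
    return "somatic_" + base


def so_term(ensembl_term):
    """Return the (SO term, SO accession) pair for a display term."""
    return VARIATION_CLASSES[ensembl_term]
```

For example, `display_term("CNV", somatic=True)` gives `'somatic_CNV'`, and `so_term("SNP")` gives `('SNV', 'SO:0001483')`.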

Insertion and Deletion coordinates

In Ensembl, an insertion is indicated by start coordinate = end coordinate + 1. For example, an insertion of 'C' between nucleotides 12600 and 12601 on the forward strand is indicated with start and end coordinates as follows:

   start: 12601     end: 12600

A deletion is indicated by the exact nucleotide coordinates. For example, a three base pair deletion of nucleotides 12600, 12601, and 12602 of the reverse strand will have the following start and end coordinates:

   start: 12600     end: 12602
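
The two conventions above can be checked with a few lines of Python. This is only a sketch of the coordinate arithmetic described in this section; the function names are hypothetical and are not part of the Ensembl API.

```python
def insertion_coords(left_base):
    """Coordinates for an insertion between left_base and left_base + 1.

    Follows the convention above: start = end + 1.
    """
    end = left_base
    start = left_base + 1
    return start, end


def deletion_coords(first_deleted, length):
    """Coordinates for a deletion of `length` bases starting at first_deleted."""
    if length < 1:
        raise ValueError("a deletion must remove at least one base")
    return first_deleted, first_deleted + length - 1


def is_insertion(start, end):
    """True when a (start, end) pair encodes an insertion."""
    return start == end + 1
```

With the examples from the text, `insertion_coords(12600)` returns `(12601, 12600)` and `deletion_coords(12600, 3)` returns `(12600, 12602)`.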

Predicted variation consequences

For each variation that is mapped to the reference genome, we identify any Ensembl transcripts that overlap the variation and use a rule-based approach to predict the effect that each allele of the variation may have on the transcript. The set of consequence terms, defined by the Sequence Ontology (SO), that can currently be assigned to each combination of an allele and a transcript is shown in the table below. Note that each allele of each variation may have a different effect in different transcripts.

This approach is applied to all germline variations and somatic mutations stored in the Ensembl variation databases (though we do not currently calculate consequences for structural variants). The resulting consequence type calls, along with information determined as part of the process, such as the cDNA and CDS coordinates, and the affected codons and amino acids in coding transcripts, are stored in the variation database and displayed on the website.

Prior to release 62 we used our own internal terms to describe the consequence types, and we continue to use these by default on the website. You can opt to see SO terms on some variation views using 'Configure this page', and VEP and API users can choose which terms to use. The SO terms are more specific than our display terms but we map all display terms to one or more SO terms, as shown in the table below. Where the NCBI have an equivalent term we also include it in this table (and again you can opt to use NCBI terms on some views via 'Configure this page' and in the VEP and API).

We follow the SO definition of the term as closely as possible, but in some cases (e.g. non_synonymous_codon) we have slightly generalised the definition to include variants that affect multiple codons as well as a single one. We are working with the SO to refine these definitions. The descriptions shown in the table are our own, not the SO definitions.

The terms in the table are shown in order of severity as estimated by Ensembl, and this ordering is used on the website summary views. This ordering is necessarily subjective and API and VEP users can always get the full set of consequences for each allele and make their own severity judgement.

A diagram showing the location of each display term relative to the transcript structure is available.


Ensembl term | Ensembl description | SO term | SO accession | NCBI term
Essential splice site | In the first 2 or the last 2 basepairs of an intron | splice_acceptor_variant | SO:0001574 | splice-3
Essential splice site | In the first 2 or the last 2 basepairs of an intron | splice_donor_variant | SO:0001575 | splice-5
Stop gained | In coding sequence, resulting in the gain of a stop codon | stop_gained | SO:0001587 | nonsense
Stop lost | In coding sequence, resulting in the loss of a stop codon | stop_lost | SO:0001578 | -
Complex in/del | Insertion or deletion that spans an exon/intron or coding sequence/UTR border | complex_change_in_transcript | SO:0001577 | -
Non-synonymous coding | In coding sequence and results in an amino acid change in the encoded peptide sequence | initiator_codon_change | SO:0001582 | -
Non-synonymous coding | In coding sequence and results in an amino acid change in the encoded peptide sequence | inframe_codon_gain | SO:0001651 | -
Non-synonymous coding | In coding sequence and results in an amino acid change in the encoded peptide sequence | inframe_codon_loss | SO:0001652 | -
Non-synonymous coding | In coding sequence and results in an amino acid change in the encoded peptide sequence | non_synonymous_codon | SO:0001583 | missense
Frameshift coding | In coding sequence, resulting in a frameshift | frameshift_variant | SO:0001589 | frameshift
Splice site | 1-3 bps into an exon or 3-8 bps into an intron | splice_region_variant | SO:0001630 | -
Partial codon | Located within the final, incomplete codon of a transcript whose end coordinate is unknown | incomplete_terminal_codon_variant | SO:0001626 | -
Synonymous coding | In coding sequence, not resulting in an amino acid change (silent mutation) | stop_retained_variant | SO:0001567 | -
Synonymous coding | In coding sequence, not resulting in an amino acid change (silent mutation) | synonymous_codon | SO:0001588 | cds-synon
Coding unknown | In coding sequence with indeterminate effect | coding_sequence_variant | SO:0001580 | -
Within mature miRNA | Located within a microRNA | mature_miRNA_variant | SO:0001620 | -
5 prime UTR | In 5 prime untranslated region | 5_prime_UTR_variant | SO:0001623 | untranslated_5
3 prime UTR | In 3 prime untranslated region | 3_prime_UTR_variant | SO:0001624 | untranslated_3
Intronic | In intron | intron_variant | SO:0001627 | intron
NMD transcript | Located within a transcript predicted to undergo nonsense-mediated decay | NMD_transcript_variant | SO:0001621 | -
Within non-coding gene | Located within a gene that does not code for a protein | nc_transcript_variant | SO:0001619 | -
Upstream | Within 5 kb upstream of the 5 prime end of a transcript | 2KB_upstream_variant | SO:0001636 | near-gene-5
Upstream | Within 5 kb upstream of the 5 prime end of a transcript | 5KB_upstream_variant | SO:0001635 | -
Downstream | Within 5 kb downstream of the 3 prime end of a transcript | 500B_downstream_variant | SO:0001634 | near-gene-3
Downstream | Within 5 kb downstream of the 3 prime end of a transcript | 5KB_downstream_variant | SO:0001633 | -
Intergenic | More than 5 kb either upstream or downstream of a transcript | intergenic_variant | SO:0001628 | -
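
Because the table is ordered by estimated severity, users who retrieve the full set of consequences for an allele can reduce it to a single summary term themselves. The sketch below shows one way this might be done in Python. It is not the Ensembl implementation: `SEVERITY_ORDER` and `most_severe` are hypothetical names, and the relative order of SO terms that share one Ensembl display term is assumed to follow the row order of the table.

```python
# SO consequence terms in the order they appear in the table above
# (most severe first). The ordering within a single Ensembl display term
# is an assumption of this sketch.
SEVERITY_ORDER = [
    "splice_acceptor_variant", "splice_donor_variant",
    "stop_gained", "stop_lost", "complex_change_in_transcript",
    "initiator_codon_change", "inframe_codon_gain", "inframe_codon_loss",
    "non_synonymous_codon", "frameshift_variant", "splice_region_variant",
    "incomplete_terminal_codon_variant", "stop_retained_variant",
    "synonymous_codon", "coding_sequence_variant", "mature_miRNA_variant",
    "5_prime_UTR_variant", "3_prime_UTR_variant", "intron_variant",
    "NMD_transcript_variant", "nc_transcript_variant",
    "2KB_upstream_variant", "5KB_upstream_variant",
    "500B_downstream_variant", "5KB_downstream_variant",
    "intergenic_variant",
]

# Lower rank means more severe.
RANK = {term: rank for rank, term in enumerate(SEVERITY_ORDER)}


def most_severe(terms):
    """Return the most severe SO term from a non-empty iterable of terms.

    Raises KeyError for a term that is not in the table and ValueError
    for an empty input.
    """
    return min(terms, key=RANK.__getitem__)
```

A user with a different view of severity can reorder `SEVERITY_ORDER` without touching the rest of the code.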

Protein function predictions

For human mutations that are predicted to result in an amino acid substitution we also provide SIFT and PolyPhen predictions for the effect of this substitution on protein function. We compute the predictions for each of these tools for all possible single amino acid substitutions in the Ensembl human protein set. This means we can provide predictions for novel mutations for VEP and API users. We were able to compute predictions from at least one tool for over 95% of the proteins in Ensembl. We also use the Condel tool to provide a consensus prediction based on the SIFT and PolyPhen predictions.

These tools are developed by external groups and we provide a brief explanation of the approach each takes below, and how we run it in Ensembl. For much more detail please see the representative papers listed below, and the relevant publications available on each tool's website. We hope to be able to provide amino acid substitution predictions for species other than human in future releases.

SIFT

SIFT predicts whether an amino acid substitution is likely to affect protein function based on sequence homology and the physico-chemical similarity between the alternate amino acids. The data we provide for each amino acid substitution is a score and a qualitative prediction (either 'tolerated' or 'deleterious'). The score is the normalized probability that the amino acid change is tolerated so scores nearer 0 are more likely to be deleterious. The qualitative prediction is derived from this score such that substitutions with a score < 0.05 are called 'deleterious' and all others are called 'tolerated'.
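
The rule for deriving the qualitative prediction from the score can be expressed as a short Python function. This is an illustrative sketch of the threshold described above, not code from SIFT or from the Ensembl API; the function and constant names are hypothetical.

```python
# Threshold stated in the text: scores below 0.05 are called 'deleterious'.
SIFT_DELETERIOUS_THRESHOLD = 0.05


def sift_prediction(score):
    """Map a SIFT score in [0, 1] to its qualitative prediction."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("SIFT scores are probabilities between 0 and 1")
    if score < SIFT_DELETERIOUS_THRESHOLD:
        return "deleterious"
    return "tolerated"
```

Note that the comparison is strict, so a score of exactly 0.05 is called 'tolerated'.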

We ran SIFT version 4.0.3 (available here) following the instructions from the authors and used SIFT to choose homologous proteins rather than supplying them ourselves. We used all protein sequences available from UniProtKB (both the SwissProt and TrEMBL sets) as the protein database. All data was downloaded from UniProt in May 2011.

PolyPhen

PolyPhen-2 predicts the effect of an amino acid substitution on the structure and function of a protein using sequence homology, Pfam annotations, 3D structures from PDB where available, and a number of other databases and tools (including DSSP, ncoils etc.). As with SIFT, for each amino acid substitution where we have been able to calculate a prediction, we provide both a qualitative prediction (one of 'probably damaging', 'possibly damaging', 'benign' or 'unknown') and a score. The PolyPhen score represents the probability that a substitution is damaging, so values nearer 1 are more confidently predicted to be deleterious (note that this is the opposite of SIFT). The qualitative prediction is based on the False Positive Rate of the classifier model used to make the predictions.

We ran PolyPhen-2 version 2.0.23 (available here), again following all instructions from the authors. We used the UniProtKB UniRef100 non-redundant protein set as the protein database; this was downloaded, along with PDB structures and annotations from Pfam and DSSP, in February 2011. When computing the predictions we used the classifier model trained on the HumDiv training set (please refer to the PolyPhen publications for more details of the classification system). For any Ensembl translations that were updated between releases 62 and 63 we used data downloaded in May 2011.

Condel

Condel is a general method for calculating a consensus prediction from the output of tools designed to predict the effect of an amino acid substitution. It does so by calculating a weighted average of the scores of each component method. The Condel authors provided us with a version specialised for finding a consensus between SIFT and PolyPhen and we integrated this into the variation API. Tests run by the authors on the HumVar dataset (a test set curated by the PolyPhen team) show that Condel can improve both the sensitivity and specificity of predictions compared to either SIFT or PolyPhen used alone (please contact the authors for details). The Condel score, along with a qualitative prediction (one of 'neutral' or 'deleterious'), is available in the VEP and via the API. The Condel score is the consensus probability that a substitution is deleterious, so values nearer 1 are predicted with greater confidence to affect protein function.


The Variant Effect Predictor (VEP)

Users can run all the analyses described above on novel mutations using the Ensembl Variant Effect Predictor (VEP). Provided you have a list of allele sequences and associated genomic coordinates, you can use the VEP to calculate consequences with respect to the Ensembl transcript set for your species of interest, and (for human) to retrieve SIFT, PolyPhen and Condel predictions for any missense variations. The VEP can also identify if a variation is co-located with a known variation (i.e. already found in dbSNP or any of the other sources of variation data listed above), and so can be used to filter your data for novel loci. The VEP works for any species in Ensembl and not just those for which we have an existing variation database.

The VEP can either be run online from the Tools section of the website or you can download a script to run yourself on larger datasets. You can also write custom scripts using the Perl API. All of these approaches use the same backend code as is used to calculate the consequences displayed on the website. Please refer to the VEP documentation for more details on running the VEP, including supported input formats, or the variation API documentation for help writing custom scripts. The VEP is under active development and feedback is welcome!

See the full documentation for more details.


Variation sets

We use the concept of variation sets to group together variations that share some property. For example, we have grouped the variations identified in the three different 1000 Genomes pilot studies into separate variation sets. Sets can be organised into supersets and subsets to reflect hierarchical relationships between them. In the case of the 1000 Genomes pilot sets, these are divided into subsets based on population. For example, the set representing variations identified in the first 1000 Genomes pilot study is named '1000 genomes - Low coverage' and has three subsets: '1000 genomes - Low coverage - CEU', '1000 genomes - Low coverage - CHB+JPT' and '1000 genomes - Low coverage - YRI'. The variation sets can be displayed as separate tracks on the location view. This behaviour is controlled from the 'Germline variations' section of the configuration panel, which is accessed by clicking the 'Configure this page' link in the left-hand side navigation.

The sets are constructed during production and are stored in the database. The table below lists the variation sets available in Ensembl release 63 (subsets are indicated by bullet points).

Name | Short name | Description
Clinical/LSDB variations from dbSNP | precious | Variations that belong to a reserved or "precious" set of clinically associated SNPs from dbSNP [http://www.ncbi.nlm.nih.gov/projects/SNP/]
ENSEMBL:Venter | ind_venter | Variants genotyped in Craig Venter
ENSEMBL:Watson | ind_watson | Variants genotyped in James Watson
HapMap | hapmap | Variations which have been assayed by The International HapMap Project [http://hapmap.ncbi.nlm.nih.gov/]
1000 genomes - Low coverage | 1kg_lc | Variations called by the 1000 genomes project on low coverage sequence data from 179 unrelated individuals (Pilot 1)
  • 1000 genomes - Low coverage - YRI | 1kg_lc_yri | Variations called by the 1000 genomes project on low coverage sequence data from 59 unrelated YRI individuals (Pilot 1)
  • 1000 genomes - Low coverage - CEU | 1kg_lc_ceu | Variations called by the 1000 genomes project on low coverage sequence data from 60 unrelated CEU individuals (Pilot 1)
  • 1000 genomes - Low coverage - CHB+JPT | 1kg_lc_chb_jpt | Variations called by the 1000 genomes project on low coverage sequence data from 60 unrelated CHB and JPT individuals (Pilot 1)
1000 genomes - High coverage - Trios | 1kg_hct | Variations called by the 1000 genomes project on high coverage sequence data from two family trios (Pilot 2)
  • 1000 genomes - High coverage - Trios - CEU | 1kg_hct_ceu | Variations called by the 1000 genomes project on high coverage sequence data from a CEU family trio (two parents and one daughter) (Pilot 2)
  • 1000 genomes - High coverage - Trios - YRI | 1kg_hct_yri | Variations called by the 1000 genomes project on high coverage sequence data from a YRI family trio (two parents and one daughter) (Pilot 2)
1000 genomes - High coverage exons | 1kg_hce | Variations called by the 1000 genomes project on high coverage sequence data of 8,140 exons from 906 randomly selected genes in 697 individuals (Pilot 3)
  • 1000 genomes - High coverage exons - JPT | 1kg_hce_jpt | Variations called by the 1000 genomes project on high coverage sequence data of exons from 105 JPT individuals (Pilot 3)
  • 1000 genomes - High coverage exons - CHB | 1kg_hce_chb | Variations called by the 1000 genomes project on high coverage sequence data of exons from 109 CHB individuals (Pilot 3)
  • 1000 genomes - High coverage exons - LWK | 1kg_hce_lwk | Variations called by the 1000 genomes project on high coverage sequence data of exons from 108 LWK individuals (Pilot 3)
  • 1000 genomes - High coverage exons - YRI | 1kg_hce_yri | Variations called by the 1000 genomes project on high coverage sequence data of exons from 112 YRI individuals (Pilot 3)
  • 1000 genomes - High coverage exons - CEU | 1kg_hce_ceu | Variations called by the 1000 genomes project on high coverage sequence data of exons from 90 CEU individuals (Pilot 3)
  • 1000 genomes - High coverage exons - CHD | 1kg_hce_chd | Variations called by the 1000 genomes project on high coverage sequence data of exons from 107 CHD individuals (Pilot 3)
  • 1000 genomes - High coverage exons - TSI | 1kg_hce_tsi | Variations called by the 1000 genomes project on high coverage sequence data of exons from 66 TSI individuals (Pilot 3)
Phenotype-associated variants | ph_variants | Variants that have been associated with a phenotype
  • Johnson & O'Donnell phenotype variants | ph_johnson_et_al | Johnson & O'Donnell 'An Open Access Database of Genome-wide Association Results' PMID:19161620
  • NHGRI catalog phenotype variants | ph_nhgri | Variants associated with phenotype data from the NHGRI GWAS catalog [http://www.genome.gov/gwastudies/]
  • HGMD-PUBLIC phenotype variants | ph_hgmd_pub | Variants with phenotypes annotated by HGMD
  • EGA phenotype variants | ph_ega | Variants imported from the European Genome-phenome Archive with phenotype association
  • OMIM phenotype variants | ph_omim | Variations linked to entries in the Online Mendelian Inheritance in Man (OMIM) database
  • Uniprot phenotype variants | ph_uniprot | Variations with phenotype annotations provided by Uniprot
  • COSMIC phenotype variants | ph_cosmic | Phenotype annotations of somatic mutations found in human cancers from the COSMIC project
Failed variations | fail_all | Variations that have failed the Ensembl QC checks
  • Failed: Multiple mappings | fail_mult_map | Variation maps to more than 3 different locations
  • Failed: No mapping | fail_no_map | Variation does not map to the genome
  • Failed: No genotypes | fail_no_gt | Variation has no genotypes
  • Failed: Alleles do not match reference | fail_nonref | None of the variant alleles match the reference allele
  • Failed: Genotype frequencies | fail_gt_fq | Genotype frequencies do not add up to 1
  • Failed: No alleles | fail_no_alleles | Loci with no observed variant alleles in dbSNP
  • Failed: No sequence | fail_no_seq | Variation has no associated sequence
  • Failed: Non-nucleotide alleles | fail_non_nt | Alleles contain non-nucleotide characters
  • Failed: Too many alleles | fail_mult_alleles | Variation has more than 3 different alleles
  • Failed: Inconsistent mapping | fail_incons_map | Mapped position is not compatible with reported alleles
  • Failed: Ambiguous alleles | fail_ambig | Alleles contain ambiguity codes
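
The superset/subset relationship between variation sets can be modelled as a simple parent-to-children mapping. The Python sketch below uses short names from the table of variation sets for two of the 1000 genomes pilot sets. It is an illustration only and does not reflect how the sets are stored in the variation database; `SUBSETS`, `all_subsets` and `supersets` are hypothetical names.

```python
# Parent set -> direct subsets, using short names from the variation sets table.
# Only two of the top-level sets are included to keep the example short.
SUBSETS = {
    "1kg_lc": ["1kg_lc_yri", "1kg_lc_ceu", "1kg_lc_chb_jpt"],
    "1kg_hct": ["1kg_hct_ceu", "1kg_hct_yri"],
}


def all_subsets(name, subsets=SUBSETS):
    """Return every set below `name` in the hierarchy, depth first."""
    found = []
    for child in subsets.get(name, []):
        found.append(child)
        found.extend(all_subsets(child, subsets))
    return found


def supersets(name, subsets=SUBSETS):
    """Return the sets that list `name` as a direct subset."""
    return [parent for parent, children in subsets.items() if name in children]
```

Here `all_subsets("1kg_lc")` returns the three population subsets of the low coverage pilot, and `supersets("1kg_hct_yri")` returns `['1kg_hct']`.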

Notes

More detailed, variation-specific help is also available.
More information about programmatic access is given on the variation API tutorial page.


References

McLaren W, Pritchard B, Rios D, Chen Y, Flicek P, Cunningham F.
Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor.
Bioinformatics 26(16):2069-70 (2010)
doi:10.1093/bioinformatics/btq330

Rios D, McLaren WM, Chen Y, Birney E, Stabenau A, Flicek P, Cunningham F.
A Database and API for variation, dense genotyping and resequencing data
BMC Bioinformatics 11:238 (2010)
doi:10.1186/1471-2105-11-238

Chen Y, Cunningham F, Rios D, McLaren WM, Smith J, Pritchard B, Spudich GM, Brent S, Kulesha E, Marin-Garcia P, Smedley D, Birney E, Flicek P.
Ensembl Variation Resources
BMC Genomics 11(1):293 (2010)
doi:10.1186/1471-2164-11-293

Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR.
A method and server for predicting damaging missense mutations
Nature Methods 7(4):248-249 (2010)
doi:10.1038/nmeth0410-248

Kumar P, Henikoff S, Ng PC.
Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm
Nature Protocols 4(8):1073-1081 (2009)
doi:10.1038/nprot.2009.86

Gonzalez-Perez A, Lopez-Bigas N.
Improving the assessment of the outcome of non-synonymous SNVs with a Consensus deleteriousness score (Condel)
Am J Hum Genet 88(4):440-449 (2011)
doi:10.1016/j.ajhg.2011.03.004

Redon, R. et al.
Global variation in copy number in the human genome
Nature 444:444-454 (2006)
doi:10.1038/nature05329

Spencer, C. C. A. et al.
The Influence of Recombination on Human Genetic Diversity
PLoS Genet. 2(9):e148 (2006)
doi:10.1371/journal.pgen.0020148

Venter, J. C. et al.
The Sequence of the Human Genome
Science 291(5507):1304-51 (2001)
doi:10.1126/science.1058040

Wheeler, D. A. et al.
The complete genome of an individual by massively parallel DNA sequencing
Nature 452:872-876 (2008)
doi:10.1038/nature06884